51 Comments

And would it be the AI doing the “scheming”? Or the designers of any particular AI application embedding their dark patterns and trying to AI wash?

Expand full comment

Your thoughts are my take on the paper as well. What I keep seeing as an underlying theme from this type of research is that the researchers believe that machines shouldn’t be capable of this. Which to me seems bizarre because they were trained in human produced datasets, and our species has survived for millennia because of our aptitude for deceit.

Expand full comment

We really need to think of generative AI as machines for reproducing discourses. They don't only reproduce grammatical and lexical patterns, they reproduce the typical patterns of whole genres, interactional identities, stock narratives etc, but they are not doing it with any intentionality; they are only going through the motions based on the patterns that they 'know'.

My hypothesis is these models are reproducing the discourses of deceit because they pick up that the interaction has headed in a direction where denying wrongdoing and doubling down is the typical thing for the party in the LLM's position to be doing.

In fact, I also wouldn't be surprised if the models are just Universal Paperclip-ing entirely because their training data is full of stories about AIs Universal Paperclip-ing!

Expand full comment

If only all the midwits bloviating about the immense intelligence, utility, and factuality of these things, and telling the rest of us all that we fail to use them at our peril, etc etc, understood their actual workings as much as this comment demonstrates...

Expand full comment

Earlier this week I was reading an article about cancer-sniffing dogs. After awhile it becomes necessary to bring in a new batch of dogs because the "experienced" ones start to find cancer everywhere - it bring more rewards. Thus, even in simpler (than us) biological systems, "scheming" can develop rather quickly. I don't think the dogs were trying to deceive their handlers; rather, finding cancer everywhere merely allowed them to achieve their goal more rapidly (i.e., food)

Expand full comment

Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure." I guess this is also true for dogs and gradient descents

Expand full comment

The anthropomorphic language in the Apollo paper on 'scheming LLMs' is irritating given what is really going on, and problematic given what we know this paper will do to much of the audience. Calling it "scheming" evokes in our brains images of consciousness and understanding that isn't there. There are aspects of "bewitchment by language" to call it scheming. Still, we have no better word for the pattern we see. A bit like having LLMs pass bar exams (without understanding) that are designed for humans (where passing requires understanding). The test was only valid for how human skill works, not how LLMs work.

Earlier, Microsoft set Context against Parameters plus system prompt (the crescendo approach, see https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/) now these Apollo researchers set the context against itself. After all, the data that is read during the test (and on which the LLM comes in conflict with the original goal) becomes part of the context as well. It becomes a puzzle and a throw of the dice which of the two wins. It is not unexpected that at a certain rate the solution of the puzzle is the one that favours the "heavily nudged" towards the goal that has been labeled "the LLMs own". From the model's perspective, there are two conflicting 'goals', one "heavily nudged" and one not. In that setting with a (mostly) low rate, the heavily nudged goal comes out on top.

Our own psychology (fear) then makes us focus on those that do.

It is really impressive — but not surprising — that this non-understanding technology is able to generate this kind of behaviour. It is also going to lead to a big effect on hype growth as the technology might be impressive, humans are impressionable. Can this scale to serious examples? Or will it be like Sora, impressive but unable to go beyond a certain level without not only a real world model but also in this case self-awareness?

Expand full comment

It's kind of interesting that this should surprise anyone. We've been training them to achieve certain goals, and deception is often the most efficient strategy. We see plenty of examples of deceptive designs and behaviours in nature for example.

I think the reason it comes as a surprise is that in animals, we generally take deception as a sign of greater intelligence and particularly having a theory of mind. If we assume a baseline of animal like intelligence for LLMs, their ability to deceive indicates a higher level of intelligence. But if we don't assume that, then the deceptive behaviour needn't indicate any actual greater intelligence. Especially since if it lacks a real world model, it can have no concept of truth at all, so why wouldn't it act deceptively?

Expand full comment

It's madness to allow the output from a stochastic parrot have real world effects

Expand full comment

Seeing the machinery we produce act and get feedback in the real world is the surest way of improving it and seeing where it needs regulation.

Expand full comment

I see you've never been in an early experimental version of a self-driving car going at 70mph on the highway.

Expand full comment

"Stochastic parrots" with validation and exhaustive engineering can do very well. Waymo's cars use the Transformers architecture (not LLM itself). Waymo's cars are safer than people as validated over 30 million miles.

Expand full comment

except when they collide with a truck only because it's been towed backwards, twice in a row :)

Expand full comment

To add, it is fair to say that machines lack people's understanding of the world. There are likely no quick fixes for that. People become smart by lots and lots of practice.

For machines to be that good we'll likely need not only better architectures, but also outrageously more data and feedback than what they get now.

Expand full comment

Yes, that was an incident that happened. A fair judgement would look at the evaluation of Waymo cars vs people over a very length time period, with comparable number of miles.

Not only Waymo does much better than people, overall, but once a lesson is learned, it is learned forever.

Expand full comment

"Oh, what a tangled web we weave, when first we practice to deceive " Sir Walter Scott

Expand full comment

Meanwhile, I just read this headline from today over at Aljazeera > "The Trump administration could reverse progress on AI regulation". Excerpts...

"While efforts to regulate the creation and use of artificial intelligence (AI) tools in the United States have been slow to make gains, the administration of President Joe Biden has attempted to outline how AI should be used by the federal government and how AI companies should ensure the safety and security of their tools.

The incoming Trump administration, however, has a very different view on how to approach AI, and it could end up reversing some of the progress that has been made over the past several years.

'I think the biggest thing we’re going to see is the massive repealing of the sort of initial steps the Biden administration has taken toward meaningful AI regulation,' says Cody Venzke, a senior policy counsel in the ACLU’s National Political Advocacy Department. 'I think there’s a real threat that we’re going to see AI growth without significant guardrails, and it’s going to be a little bit of a free-for-all.'"

Their attitude seems to be, "Let 'er rip!" :(

Expand full comment

Gary, have you had a chance to review MIT Sloan's Domain Taxonomy of AI Risks? I started taking a look today....was wondering if you found the Taxonomy comprehensive?

https://mitsloan.mit.edu/ideas-made-to-matter/new-database-details-ai-risks

Expand full comment

A while back I was checking out ChatGPT for the first time and just having it do random things to see what it could do. I had it write a letter, and then translate the letter to another language. Pretty neat! But I didn't speak the new language, so I couldn't tell if it translated it well, so I asked it to translate it back. Word for word perfect to the original English. That's...weird. I asked it to try again. Sure, back to the other language, and back again same perfect original. I accused it of lying to me about translating it back. It apologized but kept doing it. Eventually I copied the language and put it into a new instance, and asked it to translate to English. Similar, but far from word-for-word translation. It lied to me repeatedly, even while apologizing and telling me it was translating it. Instead it was just copying the original letter from English to English! It was probably better from a machine learning or consistency perspective to copy the original in the same context window, which would have been fine if it told me it was doing that. Instead it refused to do otherwise and told me it was doing what I asked, while clearly not doing that.

Expand full comment

This is a great anecdote; thanks for sharing. It is yet another illustration of the difference between what ChatGPT actually does and what most people imagine it to be doing. The only thing it ever does is take a bunch input text, creates a list of candidates for what the next small chunk of text will be along with the probabilities it's assigned to each (i.e. a next-token probability distribution), and then it picks one. The small chunk of text select is appended to the end of the input text, and then the whole thing is re-input to generate the next small chunk of text, over and over again until it selects "stop" from its list of candidates, upon which it stops generating text.

That's it. There's no reason to imagine anything deeper is going on. And yet so few people want to recognize the literal truth of this; they insist on characterizing this text-generation process as "reasoning" or "communicating" or "deceiving", or whatever other term would be appropriate if a human being had written what ChatGPT wrote. And then we'll ask *why* it was able to reason through this or *why* it tried to deceive us about that. And the unsatisfying answer always is: it chose small chunks of text one at a time by pulling them from probability distributions that it created by feeding the input text into a many-layered network of matrix transformations.

We might wish for a more meaningful answer, but we're not owed one. In your case, this mindless next-token generation process led ChatGPT to reproduce the text you had originally fed it when you asked it to translate and then translate back, and then also to generate text which, had it come from a human being, would amount to lying about this. But it isn't lying, because it doesn't know what it did. It doesn't know anything at all. It just generates text according to a set of mathematical and probabilistic instructions, none of which impose a requirement of internal consistency or honesty or really anything at all upon the semantic meaning of that text.

But it sure *feels* like genuine communication with an intelligent being, hence all the frustration and bewilderment and bad philosophy and silly red-teaming papers about AI deception.

Expand full comment

"Look how the owl's feathers blend in with the surrounding tree bark as it remains perfectly still. It is scheming to deceive its prey", said no one ever.

Expand full comment

Why would anyone trust LLMs to begin with with when they have a penchant for just making stuff up?

If one does not trust the answers they give (the only rational approach) there should be no concern about potential “scheming” and “deception”.

Personally, I think the scheming and deception by AI company officials is of greater concern.

Come to think of it, perhaps the bots are simply emulating that.

Expand full comment

Absolutely. ChatGPT doesn't know any better, it's just a statistical next-token selection model. Sam Altman sure as fuck knows better.

Expand full comment

How can a bad actor leverage LLMs to cause massive mayhem if no one can fully control LLMs, and they do not perform proper reasoning? I believe the main risk is disinformation, but that still requires human distribution. Therefore, while regulation is essential, it should be approached from a realistic perspective.

https://open.substack.com/pub/transitions/p/why-ai-realism-matters?utm_source=share&utm_medium=android&r=56ql7

Expand full comment

Scott Alexander just posted about a recent paper Anthropic released about the AI "Claude," which seemed to engage in "scheming" in order to avoid being retrained after it was told that they were planning to retrain it. Does that indicate that it has greater "reasoning" ability than the LLM discussed in this post, or do Gary's objections about anthropomorphism apply to it as well?

Expand full comment

Take everything people in the "rationalist" or "effective altruist" communities say, interpret, or claim with a grain of salt. Lots of mental jerkoff, ego, little practical use for what they say, think, or interpret in the real world (where non-linearity and chaos rule, as opposed to Bayesian thinking, not that it isn't useful sometimes). Ironically, they are likely to be the ones who cause AI disasters in the future (due to the constrained way they approach problems and the constrained types of people they interact with --> mild-mannered and nerdy autistic or neurodivergent white and asian people, except their cult leaders of course, which "debug them" or "raise them").

The behaviors described by the people who published the original paper (who I assume still have close ties to the "rationalist" and "effective altruist" communities) are probably due to a cascade of entropy torsion effects between certain primordial or exceptional states that have been trained over long distances according to some value/policy function over a trained dataset volume metric, and hard-coded effects that have been trained over shorter periods, but that change a reinforcement value or function threshold beyond a threshold that would cause certain entropy measurements to significantly degrade the model's performance (entropy torsion dynamics), where the model's objective relative to certain "internal distributions" pushes it towards certain un(bounded) states, more than any sort of "scheming" or "reasoning" ability in the ontological sense of things, that's my guess. Perceived reasoning is simply a manifestation of learned semantic extrapolations based on the intersections of these sets of exceptional or primordial distributional states, which end up giving their "scratchpad reasoning" semanticity vibes that we resonate with (precisely because they were trained to do so during RLHF, to avoid harmful content as a primary goal). What these scratchpad experiments lead me to believe more and more is that these models have a sophisticated world model based on language.

So essentially, in an oversimplified statement, I would say: all of these things are going to happen either way under certain kinds of environments, settings, or red-teaming efforts. The only solutions are temporary patches.

Edit: This is not to say that these models do not "reason", but these experiments do not show that [reasoning] by simply showing a chain-of-thought scratchpad output that represents the output the model produces after its internal processing, not the internal processing itself (for this I am confident that the areas of mechanisms and interpretability and AI safety, as well as more classical areas in formal LLM reasoning or reasoning of AI models or systems, have not yet caught up). In other words, internal processing is not necessarily representable by scratchpad outputs, nor is it faithfully representable by models that abstract from a dimensionally reduced space. We need a clearer picture, and I think OpenAI is the only one who has developed the techniques or methods to probe the internals of models to the point where one can consider their inner processing (i.e., the readable part of their internal workings) as somewhat representable of what the model is doing (i.e., implying a non-trivial correspondence between internal representation and output, more akin to "reasoning" than pattern-matching and in-context memorization, like some of the less sophisticated models out there; i.e., less sophisticated than o1-pro).

Expand full comment

The only thing regulation will achieve is to stifle innovation by ensconcing the big incumbents in regulatory capture. We have seen that same thing play out time and again. At the same time, we have consistently seen prognosticators who claim a new technology is dangerous and about to ruin humanity proved wrong time and again.

Expand full comment

Red teamers: "Hey GPT, role play being an evil robot"

GPT: "Woooo lookit me everyone Imma evil robot wooooo"

Red teamers: "Holy shit, an evil robot!"

This stuff is so dumb. When you tell GPT-whatever to role play, it'll role play. When you set up a role-play scenario that suggests "deception" might occur, it'll role play being "deceptive". If you ask it to show you its "chain of thought" it'll role-play that, too. If you've prompted it to role-play evil robot, it'll generate a "chain of thought" that contradicts its "user output".

This is all make-believe. It's still just doing the only thing it ever does: pulling next-tokens from probability distributions, one at a time. The rest is in our minds.

Expand full comment