42 Comments

And would it be the AI doing the “scheming”? Or the designers of any particular AI application embedding their dark patterns and trying to AI wash?

Expand full comment

Your thoughts are my take on the paper as well. What I keep seeing as an underlying theme from this type of research is that the researchers believe that machines shouldn’t be capable of this. Which to me seems bizarre because they were trained in human produced datasets, and our species has survived for millennia because of our aptitude for deceit.

Expand full comment

We really need to think of generative AI as machines for reproducing discourses. They don't only reproduce grammatical and lexical patterns, they reproduce the typical patterns of whole genres, interactional identities, stock narratives etc, but they are not doing it with any intentionality; they are only going through the motions based on the patterns that they 'know'.

My hypothesis is these models are reproducing the discourses of deceit because they pick up that the interaction has headed in a direction where denying wrongdoing and doubling down is the typical thing for the party in the LLM's position to be doing.

In fact, I also wouldn't be surprised if the models are just Universal Paperclip-ing entirely because their training data is full of stories about AIs Universal Paperclip-ing!

Expand full comment

If only all the midwits bloviating about the immense intelligence, utility, and factuality of these things, and telling the rest of us all that we fail to use them at our peril, etc etc, understood their actual workings as much as this comment demonstrates...

Expand full comment

Earlier this week I was reading an article about cancer-sniffing dogs. After awhile it becomes necessary to bring in a new batch of dogs because the "experienced" ones start to find cancer everywhere - it bring more rewards. Thus, even in simpler (than us) biological systems, "scheming" can develop rather quickly. I don't think the dogs were trying to deceive their handlers; rather, finding cancer everywhere merely allowed them to achieve their goal more rapidly (i.e., food)

Expand full comment

Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure." I guess this is also true for dogs and gradient descents

Expand full comment

The anthropomorphic language in the Apollo paper on 'scheming LLMs' is irritating given what is really going on, and problematic given what we know this paper will do to much of the audience. Calling it "scheming" evokes in our brains images of consciousness and understanding that isn't there. There are aspects of "bewitchment by language" to call it scheming. Still, we have no better word for the pattern we see. A bit like having LLMs pass bar exams (without understanding) that are designed for humans (where passing requires understanding). The test was only valid for how human skill works, not how LLMs work.

Earlier, Microsoft set Context against Parameters plus system prompt (the crescendo approach, see https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/) now these Apollo researchers set the context against itself. After all, the data that is read during the test (and on which the LLM comes in conflict with the original goal) becomes part of the context as well. It becomes a puzzle and a throw of the dice which of the two wins. It is not unexpected that at a certain rate the solution of the puzzle is the one that favours the "heavily nudged" towards the goal that has been labeled "the LLMs own". From the model's perspective, there are two conflicting 'goals', one "heavily nudged" and one not. In that setting with a (mostly) low rate, the heavily nudged goal comes out on top.

Our own psychology (fear) then makes us focus on those that do.

It is really impressive — but not surprising — that this non-understanding technology is able to generate this kind of behaviour. It is also going to lead to a big effect on hype growth as the technology might be impressive, humans are impressionable. Can this scale to serious examples? Or will it be like Sora, impressive but unable to go beyond a certain level without not only a real world model but also in this case self-awareness?

Expand full comment

It's kind of interesting that this should surprise anyone. We've been training them to achieve certain goals, and deception is often the most efficient strategy. We see plenty of examples of deceptive designs and behaviours in nature for example.

I think the reason it comes as a surprise is that in animals, we generally take deception as a sign of greater intelligence and particularly having a theory of mind. If we assume a baseline of animal like intelligence for LLMs, their ability to deceive indicates a higher level of intelligence. But if we don't assume that, then the deceptive behaviour needn't indicate any actual greater intelligence. Especially since if it lacks a real world model, it can have no concept of truth at all, so why wouldn't it act deceptively?

Expand full comment

It's madness to allow the output from a stochastic parrot have real world effects

Expand full comment

Seeing the machinery we produce act and get feedback in the real world is the surest way of improving it and seeing where it needs regulation.

Expand full comment

"Oh, what a tangled web we weave, when first we practice to deceive " Sir Walter Scott

Expand full comment

Meanwhile, I just read this headline from today over at Aljazeera > "The Trump administration could reverse progress on AI regulation". Excerpts...

"While efforts to regulate the creation and use of artificial intelligence (AI) tools in the United States have been slow to make gains, the administration of President Joe Biden has attempted to outline how AI should be used by the federal government and how AI companies should ensure the safety and security of their tools.

The incoming Trump administration, however, has a very different view on how to approach AI, and it could end up reversing some of the progress that has been made over the past several years.

'I think the biggest thing we’re going to see is the massive repealing of the sort of initial steps the Biden administration has taken toward meaningful AI regulation,' says Cody Venzke, a senior policy counsel in the ACLU’s National Political Advocacy Department. 'I think there’s a real threat that we’re going to see AI growth without significant guardrails, and it’s going to be a little bit of a free-for-all.'"

Their attitude seems to be, "Let 'er rip!" :(

Expand full comment

Gary, have you had a chance to review MIT Sloan's Domain Taxonomy of AI Risks? I started taking a look today....was wondering if you found the Taxonomy comprehensive?

https://mitsloan.mit.edu/ideas-made-to-matter/new-database-details-ai-risks

Expand full comment

The only thing regulation will achieve is to stifle innovation by ensconcing the big incumbents in regulatory capture. We have seen that same thing play out time and again. At the same time, we have consistently seen prognosticators who claim a new technology is dangerous and about to ruin humanity proved wrong time and again.

Expand full comment

Red teamers: "Hey GPT, role play being an evil robot"

GPT: "Woooo lookit me everyone Imma evil robot wooooo"

Red teamers: "Holy shit, an evil robot!"

This stuff is so dumb. When you tell GPT-whatever to role play, it'll role play. When you set up a role-play scenario that suggests "deception" might occur, it'll role play being "deceptive". If you ask it to show you its "chain of thought" it'll role-play that, too. If you've prompted it to role-play evil robot, it'll generate a "chain of thought" that contradicts its "user output".

This is all make-believe. It's still just doing the only thing it ever does: pulling next-tokens from probability distributions, one at a time. The rest is in our minds.

Expand full comment

A while back I was checking out ChatGPT for the first time and just having it do random things to see what it could do. I had it write a letter, and then translate the letter to another language. Pretty neat! But I didn't speak the new language, so I couldn't tell if it translated it well, so I asked it to translate it back. Word for word perfect to the original English. That's...weird. I asked it to try again. Sure, back to the other language, and back again same perfect original. I accused it of lying to me about translating it back. It apologized but kept doing it. Eventually I copied the language and put it into a new instance, and asked it to translate to English. Similar, but far from word-for-word translation. It lied to me repeatedly, even while apologizing and telling me it was translating it. Instead it was just copying the original letter from English to English! It was probably better from a machine learning or consistency perspective to copy the original in the same context window, which would have been fine if it told me it was doing that. Instead it refused to do otherwise and told me it was doing what I asked, while clearly not doing that.

Expand full comment

"Look how the owl's feathers blend in with the surrounding tree bark as it remains perfectly still. It is scheming to deceive its prey", said no one ever.

Expand full comment

Would some expert PLEASE explain how we intend to regulate all the AI developers in the world? Are Western AI experts aware that America and Europe combined make up only 10 percent of the world's population????

Expand full comment

Why would anyone trust LLMs to begin with with when they have a penchant for just making stuff up?

If one does not trust the answers they give (the only rational approach) there should be no concern about potential “scheming” and “deception”.

Personally, I think the scheming and deception by AI company officials is of greater concern.

Come to think of it, perhaps the bots are simply emulating that.

Expand full comment

How can a bad actor leverage LLMs to cause massive mayhem if no one can fully control LLMs, and they do not perform proper reasoning? I believe the main risk is disinformation, but that still requires human distribution. Therefore, while regulation is essential, it should be approached from a realistic perspective.

https://open.substack.com/pub/transitions/p/why-ai-realism-matters?utm_source=share&utm_medium=android&r=56ql7

Expand full comment