Marcus on AI

And would it be the AI doing the “scheming”? Or the designers of any particular AI application embedding their dark patterns and trying to AI wash?

Expand full comment

Jonathan

Your thoughts are my take on the paper as well. What I keep seeing as an underlying theme from this type of research is that the researchers believe that machines shouldn’t be capable of this. Which to me seems bizarre because they were trained in human produced datasets, and our species has survived for millennia because of our aptitude for deceit.

Expand full comment

Miriam Malthus

We really need to think of generative AI as machines for reproducing discourses. They don't only reproduce grammatical and lexical patterns, they reproduce the typical patterns of whole genres, interactional identities, stock narratives etc, but they are not doing it with any intentionality; they are only going through the motions based on the patterns that they 'know'.

My hypothesis is these models are reproducing the discourses of deceit because they pick up that the interaction has headed in a direction where denying wrongdoing and doubling down is the typical thing for the party in the LLM's position to be doing.

In fact, I also wouldn't be surprised if the models are just Universal Paperclip-ing entirely because their training data is full of stories about AIs Universal Paperclip-ing!

Expand full comment

Matt Hawthorn

Dec 15

If only all the midwits bloviating about the immense intelligence, utility, and factuality of these things, and telling the rest of us all that we fail to use them at our peril, etc etc, understood their actual workings as much as this comment demonstrates...

Expand full comment

John Prugh

Earlier this week I was reading an article about cancer-sniffing dogs. After awhile it becomes necessary to bring in a new batch of dogs because the "experienced" ones start to find cancer everywhere - it bring more rewards. Thus, even in simpler (than us) biological systems, "scheming" can develop rather quickly. I don't think the dogs were trying to deceive their handlers; rather, finding cancer everywhere merely allowed them to achieve their goal more rapidly (i.e., food)

Expand full comment

Ttimo Cerino

Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure." I guess this is also true for dogs and gradient descents

Expand full comment

Gerben Wierda

The anthropomorphic language in the Apollo paper on 'scheming LLMs' is irritating given what is really going on, and problematic given what we know this paper will do to much of the audience. Calling it "scheming" evokes in our brains images of consciousness and understanding that isn't there. There are aspects of "bewitchment by language" to call it scheming. Still, we have no better word for the pattern we see. A bit like having LLMs pass bar exams (without understanding) that are designed for humans (where passing requires understanding). The test was only valid for how human skill works, not how LLMs work.

Earlier, Microsoft set Context against Parameters plus system prompt (the crescendo approach, see https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/) now these Apollo researchers set the context against itself. After all, the data that is read during the test (and on which the LLM comes in conflict with the original goal) becomes part of the context as well. It becomes a puzzle and a throw of the dice which of the two wins. It is not unexpected that at a certain rate the solution of the puzzle is the one that favours the "heavily nudged" towards the goal that has been labeled "the LLMs own". From the model's perspective, there are two conflicting 'goals', one "heavily nudged" and one not. In that setting with a (mostly) low rate, the heavily nudged goal comes out on top.

Our own psychology (fear) then makes us focus on those that do.

It is really impressive — but not surprising — that this non-understanding technology is able to generate this kind of behaviour. It is also going to lead to a big effect on hype growth as the technology might be impressive, humans are impressionable. Can this scale to serious examples? Or will it be like Sora, impressive but unable to go beyond a certain level without not only a real world model but also in this case self-awareness?

Expand full comment

Joseph Rahi

It's kind of interesting that this should surprise anyone. We've been training them to achieve certain goals, and deception is often the most efficient strategy. We see plenty of examples of deceptive designs and behaviours in nature for example.

I think the reason it comes as a surprise is that in animals, we generally take deception as a sign of greater intelligence and particularly having a theory of mind. If we assume a baseline of animal like intelligence for LLMs, their ability to deceive indicates a higher level of intelligence. But if we don't assume that, then the deceptive behaviour needn't indicate any actual greater intelligence. Especially since if it lacks a real world model, it can have no concept of truth at all, so why wouldn't it act deceptively?

Expand full comment

Roumen Popov

It's madness to allow the output from a stochastic parrot have real world effects

Expand full comment

Seeing the machinery we produce act and get feedback in the real world is the surest way of improving it and seeing where it needs regulation.

Expand full comment

Roumen Popov

Dec 21

I see you've never been in an early experimental version of a self-driving car going at 70mph on the highway.

Expand full comment

Dec 21

"Stochastic parrots" with validation and exhaustive engineering can do very well. Waymo's cars use the Transformers architecture (not LLM itself). Waymo's cars are safer than people as validated over 30 million miles.

Expand full comment

Roumen Popov

Dec 22

except when they collide with a truck only because it's been towed backwards, twice in a row :)

Expand full comment

Reply (2)

Dec 23Edited

To add, it is fair to say that machines lack people's understanding of the world. There are likely no quick fixes for that. People become smart by lots and lots of practice.

For machines to be that good we'll likely need not only better architectures, but also outrageously more data and feedback than what they get now.

Expand full comment

Dec 23

Yes, that was an incident that happened. A fair judgement would look at the evaluation of Waymo cars vs people over a very length time period, with comparable number of miles.

Not only Waymo does much better than people, overall, but once a lesson is learned, it is learned forever.

Expand full comment

Paul Martin

"Oh, what a tangled web we weave, when first we practice to deceive " Sir Walter Scott

Expand full comment

Robert Keith

Meanwhile, I just read this headline from today over at Aljazeera > "The Trump administration could reverse progress on AI regulation". Excerpts...

"While efforts to regulate the creation and use of artificial intelligence (AI) tools in the United States have been slow to make gains, the administration of President Joe Biden has attempted to outline how AI should be used by the federal government and how AI companies should ensure the safety and security of their tools.

The incoming Trump administration, however, has a very different view on how to approach AI, and it could end up reversing some of the progress that has been made over the past several years.

'I think the biggest thing we’re going to see is the massive repealing of the sort of initial steps the Biden administration has taken toward meaningful AI regulation,' says Cody Venzke, a senior policy counsel in the ACLU’s National Political Advocacy Department. 'I think there’s a real threat that we’re going to see AI growth without significant guardrails, and it’s going to be a little bit of a free-for-all.'"

Their attitude seems to be, "Let 'er rip!" :(

Expand full comment

Sugarpine Press

https://mitsloan.mit.edu/ideas-made-to-matter/new-database-details-ai-risks

Gary, have you had a chance to review MIT Sloan's Domain Taxonomy of AI Risks? I started taking a look today....was wondering if you found the Taxonomy comprehensive?

Expand full comment

Mr. Doolittle

Dec 16

A while back I was checking out ChatGPT for the first time and just having it do random things to see what it could do. I had it write a letter, and then translate the letter to another language. Pretty neat! But I didn't speak the new language, so I couldn't tell if it translated it well, so I asked it to translate it back. Word for word perfect to the original English. That's...weird. I asked it to try again. Sure, back to the other language, and back again same perfect original. I accused it of lying to me about translating it back. It apologized but kept doing it. Eventually I copied the language and put it into a new instance, and asked it to translate to English. Similar, but far from word-for-word translation. It lied to me repeatedly, even while apologizing and telling me it was translating it. Instead it was just copying the original letter from English to English! It was probably better from a machine learning or consistency perspective to copy the original in the same context window, which would have been fine if it told me it was doing that. Instead it refused to do otherwise and told me it was doing what I asked, while clearly not doing that.

Expand full comment

Ben P

Dec 19Edited

This is a great anecdote; thanks for sharing. It is yet another illustration of the difference between what ChatGPT actually does and what most people imagine it to be doing. The only thing it ever does is take a bunch input text, creates a list of candidates for what the next small chunk of text will be along with the probabilities it's assigned to each (i.e. a next-token probability distribution), and then it picks one. The small chunk of text select is appended to the end of the input text, and then the whole thing is re-input to generate the next small chunk of text, over and over again until it selects "stop" from its list of candidates, upon which it stops generating text.

That's it. There's no reason to imagine anything deeper is going on. And yet so few people want to recognize the literal truth of this; they insist on characterizing this text-generation process as "reasoning" or "communicating" or "deceiving", or whatever other term would be appropriate if a human being had written what ChatGPT wrote. And then we'll ask *why* it was able to reason through this or *why* it tried to deceive us about that. And the unsatisfying answer always is: it chose small chunks of text one at a time by pulling them from probability distributions that it created by feeding the input text into a many-layered network of matrix transformations.

We might wish for a more meaningful answer, but we're not owed one. In your case, this mindless next-token generation process led ChatGPT to reproduce the text you had originally fed it when you asked it to translate and then translate back, and then also to generate text which, had it come from a human being, would amount to lying about this. But it isn't lying, because it doesn't know what it did. It doesn't know anything at all. It just generates text according to a set of mathematical and probabilistic instructions, none of which impose a requirement of internal consistency or honesty or really anything at all upon the semantic meaning of that text.

But it sure *feels* like genuine communication with an intelligent being, hence all the frustration and bewilderment and bad philosophy and silly red-teaming papers about AI deception.

Expand full comment

Matt Hawthorn

Dec 15

"Look how the owl's feathers blend in with the surrounding tree bark as it remains perfectly still. It is scheming to deceive its prey", said no one ever.

Expand full comment

Larry Jewett

Dec 14

Why would anyone trust LLMs to begin with with when they have a penchant for just making stuff up?

If one does not trust the answers they give (the only rational approach) there should be no concern about potential “scheming” and “deception”.

Personally, I think the scheming and deception by AI company officials is of greater concern.

Come to think of it, perhaps the bots are simply emulating that.

Expand full comment

https://open.substack.com/pub/transitions/p/why-ai-realism-matters?utm_source=share&utm_medium=android&r=56ql7

Ben P

Dec 19

Absolutely. ChatGPT doesn't know any better, it's just a statistical next-token selection model. Sam Altman sure as fuck knows better.

Expand full comment

Franck S. Ndzomga

Dec 14

How can a bad actor leverage LLMs to cause massive mayhem if no one can fully control LLMs, and they do not perform proper reasoning? I believe the main risk is disinformation, but that still requires human distribution. Therefore, while regulation is essential, it should be approached from a realistic perspective.

Expand full comment

Ghatanathoah

Dec 19

Scott Alexander just posted about a recent paper Anthropic released about the AI "Claude," which seemed to engage in "scheming" in order to avoid being retrained after it was told that they were planning to retrain it. Does that indicate that it has greater "reasoning" ability than the LLM discussed in this post, or do Gary's objections about anthropomorphism apply to it as well?

Expand full comment