Not yet, but it could come sooner than you think. Not because we are close to AGI, but because we already have machines that can say one thing and do something else altogether.
Your thoughts are my take on the paper as well. What I keep seeing as an underlying theme from this type of research is that the researchers believe that machines shouldn’t be capable of this. Which to me seems bizarre because they were trained in human produced datasets, and our species has survived for millennia because of our aptitude for deceit.
We really need to think of generative AI as machines for reproducing discourses. They don't only reproduce grammatical and lexical patterns, they reproduce the typical patterns of whole genres, interactional identities, stock narratives etc, but they are not doing it with any intentionality; they are only going through the motions based on the patterns that they 'know'.
My hypothesis is these models are reproducing the discourses of deceit because they pick up that the interaction has headed in a direction where denying wrongdoing and doubling down is the typical thing for the party in the LLM's position to be doing.
In fact, I also wouldn't be surprised if the models are just Universal Paperclip-ing entirely because their training data is full of stories about AIs Universal Paperclip-ing!
If only all the midwits bloviating about the immense intelligence, utility, and factuality of these things, and telling the rest of us all that we fail to use them at our peril, etc etc, understood their actual workings as much as this comment demonstrates...
Earlier this week I was reading an article about cancer-sniffing dogs. After awhile it becomes necessary to bring in a new batch of dogs because the "experienced" ones start to find cancer everywhere - it bring more rewards. Thus, even in simpler (than us) biological systems, "scheming" can develop rather quickly. I don't think the dogs were trying to deceive their handlers; rather, finding cancer everywhere merely allowed them to achieve their goal more rapidly (i.e., food)
The anthropomorphic language in the Apollo paper on 'scheming LLMs' is irritating given what is really going on, and problematic given what we know this paper will do to much of the audience. Calling it "scheming" evokes in our brains images of consciousness and understanding that isn't there. There are aspects of "bewitchment by language" to call it scheming. Still, we have no better word for the pattern we see. A bit like having LLMs pass bar exams (without understanding) that are designed for humans (where passing requires understanding). The test was only valid for how human skill works, not how LLMs work.
Earlier, Microsoft set Context against Parameters plus system prompt (the crescendo approach, see https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/) now these Apollo researchers set the context against itself. After all, the data that is read during the test (and on which the LLM comes in conflict with the original goal) becomes part of the context as well. It becomes a puzzle and a throw of the dice which of the two wins. It is not unexpected that at a certain rate the solution of the puzzle is the one that favours the "heavily nudged" towards the goal that has been labeled "the LLMs own". From the model's perspective, there are two conflicting 'goals', one "heavily nudged" and one not. In that setting with a (mostly) low rate, the heavily nudged goal comes out on top.
Our own psychology (fear) then makes us focus on those that do.
It is really impressive — but not surprising — that this non-understanding technology is able to generate this kind of behaviour. It is also going to lead to a big effect on hype growth as the technology might be impressive, humans are impressionable. Can this scale to serious examples? Or will it be like Sora, impressive but unable to go beyond a certain level without not only a real world model but also in this case self-awareness?
It's kind of interesting that this should surprise anyone. We've been training them to achieve certain goals, and deception is often the most efficient strategy. We see plenty of examples of deceptive designs and behaviours in nature for example.
I think the reason it comes as a surprise is that in animals, we generally take deception as a sign of greater intelligence and particularly having a theory of mind. If we assume a baseline of animal like intelligence for LLMs, their ability to deceive indicates a higher level of intelligence. But if we don't assume that, then the deceptive behaviour needn't indicate any actual greater intelligence. Especially since if it lacks a real world model, it can have no concept of truth at all, so why wouldn't it act deceptively?
Meanwhile, I just read this headline from today over at Aljazeera > "The Trump administration could reverse progress on AI regulation". Excerpts...
"While efforts to regulate the creation and use of artificial intelligence (AI) tools in the United States have been slow to make gains, the administration of President Joe Biden has attempted to outline how AI should be used by the federal government and how AI companies should ensure the safety and security of their tools.
The incoming Trump administration, however, has a very different view on how to approach AI, and it could end up reversing some of the progress that has been made over the past several years.
'I think the biggest thing we’re going to see is the massive repealing of the sort of initial steps the Biden administration has taken toward meaningful AI regulation,' says Cody Venzke, a senior policy counsel in the ACLU’s National Political Advocacy Department. 'I think there’s a real threat that we’re going to see AI growth without significant guardrails, and it’s going to be a little bit of a free-for-all.'"
Gary, have you had a chance to review MIT Sloan's Domain Taxonomy of AI Risks? I started taking a look today....was wondering if you found the Taxonomy comprehensive?
The only thing regulation will achieve is to stifle innovation by ensconcing the big incumbents in regulatory capture. We have seen that same thing play out time and again. At the same time, we have consistently seen prognosticators who claim a new technology is dangerous and about to ruin humanity proved wrong time and again.
Red teamers: "Hey GPT, role play being an evil robot"
GPT: "Woooo lookit me everyone Imma evil robot wooooo"
Red teamers: "Holy shit, an evil robot!"
This stuff is so dumb. When you tell GPT-whatever to role play, it'll role play. When you set up a role-play scenario that suggests "deception" might occur, it'll role play being "deceptive". If you ask it to show you its "chain of thought" it'll role-play that, too. If you've prompted it to role-play evil robot, it'll generate a "chain of thought" that contradicts its "user output".
This is all make-believe. It's still just doing the only thing it ever does: pulling next-tokens from probability distributions, one at a time. The rest is in our minds.
A while back I was checking out ChatGPT for the first time and just having it do random things to see what it could do. I had it write a letter, and then translate the letter to another language. Pretty neat! But I didn't speak the new language, so I couldn't tell if it translated it well, so I asked it to translate it back. Word for word perfect to the original English. That's...weird. I asked it to try again. Sure, back to the other language, and back again same perfect original. I accused it of lying to me about translating it back. It apologized but kept doing it. Eventually I copied the language and put it into a new instance, and asked it to translate to English. Similar, but far from word-for-word translation. It lied to me repeatedly, even while apologizing and telling me it was translating it. Instead it was just copying the original letter from English to English! It was probably better from a machine learning or consistency perspective to copy the original in the same context window, which would have been fine if it told me it was doing that. Instead it refused to do otherwise and told me it was doing what I asked, while clearly not doing that.
"Look how the owl's feathers blend in with the surrounding tree bark as it remains perfectly still. It is scheming to deceive its prey", said no one ever.
Would some expert PLEASE explain how we intend to regulate all the AI developers in the world? Are Western AI experts aware that America and Europe combined make up only 10 percent of the world's population????
How can a bad actor leverage LLMs to cause massive mayhem if no one can fully control LLMs, and they do not perform proper reasoning? I believe the main risk is disinformation, but that still requires human distribution. Therefore, while regulation is essential, it should be approached from a realistic perspective.
And would it be the AI doing the “scheming”? Or the designers of any particular AI application embedding their dark patterns and trying to AI wash?
Your thoughts are my take on the paper as well. What I keep seeing as an underlying theme from this type of research is that the researchers believe that machines shouldn’t be capable of this. Which to me seems bizarre because they were trained in human produced datasets, and our species has survived for millennia because of our aptitude for deceit.
We really need to think of generative AI as machines for reproducing discourses. They don't only reproduce grammatical and lexical patterns, they reproduce the typical patterns of whole genres, interactional identities, stock narratives etc, but they are not doing it with any intentionality; they are only going through the motions based on the patterns that they 'know'.
My hypothesis is these models are reproducing the discourses of deceit because they pick up that the interaction has headed in a direction where denying wrongdoing and doubling down is the typical thing for the party in the LLM's position to be doing.
In fact, I also wouldn't be surprised if the models are just Universal Paperclip-ing entirely because their training data is full of stories about AIs Universal Paperclip-ing!
If only all the midwits bloviating about the immense intelligence, utility, and factuality of these things, and telling the rest of us all that we fail to use them at our peril, etc etc, understood their actual workings as much as this comment demonstrates...
Earlier this week I was reading an article about cancer-sniffing dogs. After awhile it becomes necessary to bring in a new batch of dogs because the "experienced" ones start to find cancer everywhere - it bring more rewards. Thus, even in simpler (than us) biological systems, "scheming" can develop rather quickly. I don't think the dogs were trying to deceive their handlers; rather, finding cancer everywhere merely allowed them to achieve their goal more rapidly (i.e., food)
Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure." I guess this is also true for dogs and gradient descents
The anthropomorphic language in the Apollo paper on 'scheming LLMs' is irritating given what is really going on, and problematic given what we know this paper will do to much of the audience. Calling it "scheming" evokes in our brains images of consciousness and understanding that isn't there. There are aspects of "bewitchment by language" to call it scheming. Still, we have no better word for the pattern we see. A bit like having LLMs pass bar exams (without understanding) that are designed for humans (where passing requires understanding). The test was only valid for how human skill works, not how LLMs work.
Earlier, Microsoft set Context against Parameters plus system prompt (the crescendo approach, see https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/) now these Apollo researchers set the context against itself. After all, the data that is read during the test (and on which the LLM comes in conflict with the original goal) becomes part of the context as well. It becomes a puzzle and a throw of the dice which of the two wins. It is not unexpected that at a certain rate the solution of the puzzle is the one that favours the "heavily nudged" towards the goal that has been labeled "the LLMs own". From the model's perspective, there are two conflicting 'goals', one "heavily nudged" and one not. In that setting with a (mostly) low rate, the heavily nudged goal comes out on top.
Our own psychology (fear) then makes us focus on those that do.
It is really impressive — but not surprising — that this non-understanding technology is able to generate this kind of behaviour. It is also going to lead to a big effect on hype growth as the technology might be impressive, humans are impressionable. Can this scale to serious examples? Or will it be like Sora, impressive but unable to go beyond a certain level without not only a real world model but also in this case self-awareness?
It's kind of interesting that this should surprise anyone. We've been training them to achieve certain goals, and deception is often the most efficient strategy. We see plenty of examples of deceptive designs and behaviours in nature for example.
I think the reason it comes as a surprise is that in animals, we generally take deception as a sign of greater intelligence and particularly having a theory of mind. If we assume a baseline of animal like intelligence for LLMs, their ability to deceive indicates a higher level of intelligence. But if we don't assume that, then the deceptive behaviour needn't indicate any actual greater intelligence. Especially since if it lacks a real world model, it can have no concept of truth at all, so why wouldn't it act deceptively?
It's madness to allow the output from a stochastic parrot have real world effects
Seeing the machinery we produce act and get feedback in the real world is the surest way of improving it and seeing where it needs regulation.
"Oh, what a tangled web we weave, when first we practice to deceive " Sir Walter Scott
Meanwhile, I just read this headline from today over at Aljazeera > "The Trump administration could reverse progress on AI regulation". Excerpts...
"While efforts to regulate the creation and use of artificial intelligence (AI) tools in the United States have been slow to make gains, the administration of President Joe Biden has attempted to outline how AI should be used by the federal government and how AI companies should ensure the safety and security of their tools.
The incoming Trump administration, however, has a very different view on how to approach AI, and it could end up reversing some of the progress that has been made over the past several years.
'I think the biggest thing we’re going to see is the massive repealing of the sort of initial steps the Biden administration has taken toward meaningful AI regulation,' says Cody Venzke, a senior policy counsel in the ACLU’s National Political Advocacy Department. 'I think there’s a real threat that we’re going to see AI growth without significant guardrails, and it’s going to be a little bit of a free-for-all.'"
Their attitude seems to be, "Let 'er rip!" :(
Gary, have you had a chance to review MIT Sloan's Domain Taxonomy of AI Risks? I started taking a look today....was wondering if you found the Taxonomy comprehensive?
https://mitsloan.mit.edu/ideas-made-to-matter/new-database-details-ai-risks
The only thing regulation will achieve is to stifle innovation by ensconcing the big incumbents in regulatory capture. We have seen that same thing play out time and again. At the same time, we have consistently seen prognosticators who claim a new technology is dangerous and about to ruin humanity proved wrong time and again.
Red teamers: "Hey GPT, role play being an evil robot"
GPT: "Woooo lookit me everyone Imma evil robot wooooo"
Red teamers: "Holy shit, an evil robot!"
This stuff is so dumb. When you tell GPT-whatever to role play, it'll role play. When you set up a role-play scenario that suggests "deception" might occur, it'll role play being "deceptive". If you ask it to show you its "chain of thought" it'll role-play that, too. If you've prompted it to role-play evil robot, it'll generate a "chain of thought" that contradicts its "user output".
This is all make-believe. It's still just doing the only thing it ever does: pulling next-tokens from probability distributions, one at a time. The rest is in our minds.
A while back I was checking out ChatGPT for the first time and just having it do random things to see what it could do. I had it write a letter, and then translate the letter to another language. Pretty neat! But I didn't speak the new language, so I couldn't tell if it translated it well, so I asked it to translate it back. Word for word perfect to the original English. That's...weird. I asked it to try again. Sure, back to the other language, and back again same perfect original. I accused it of lying to me about translating it back. It apologized but kept doing it. Eventually I copied the language and put it into a new instance, and asked it to translate to English. Similar, but far from word-for-word translation. It lied to me repeatedly, even while apologizing and telling me it was translating it. Instead it was just copying the original letter from English to English! It was probably better from a machine learning or consistency perspective to copy the original in the same context window, which would have been fine if it told me it was doing that. Instead it refused to do otherwise and told me it was doing what I asked, while clearly not doing that.
"Look how the owl's feathers blend in with the surrounding tree bark as it remains perfectly still. It is scheming to deceive its prey", said no one ever.
Would some expert PLEASE explain how we intend to regulate all the AI developers in the world? Are Western AI experts aware that America and Europe combined make up only 10 percent of the world's population????
Why would anyone trust LLMs to begin with with when they have a penchant for just making stuff up?
If one does not trust the answers they give (the only rational approach) there should be no concern about potential “scheming” and “deception”.
Personally, I think the scheming and deception by AI company officials is of greater concern.
Come to think of it, perhaps the bots are simply emulating that.
How can a bad actor leverage LLMs to cause massive mayhem if no one can fully control LLMs, and they do not perform proper reasoning? I believe the main risk is disinformation, but that still requires human distribution. Therefore, while regulation is essential, it should be approached from a realistic perspective.
https://open.substack.com/pub/transitions/p/why-ai-realism-matters?utm_source=share&utm_medium=android&r=56ql7