Humanity’s “Oh shit!” AI moment?
Not yet, but it could come sooner than you think. Not because we are close to AGI, but because we already have machines that can say one thing and do something else altogether.
Just read a new paper that you should really know about:
Despite being based on what I consider to be a conceptual fallacy, Apollo Research's new paper, "Frontier Models are Capable of In-context Scheming," is smart, truly wild, and disconcerting. (OpenAI reports the same work in its o1 system card.)
On a quick first read, several things strike me.
• I wouldn’t take seriously any of its anthropomorphic language about LLMs “knowing” things, scheming, or having intentions to deceive, etc. Take all that language with a grain of salt, but don’t give up on the paper.
• The important thing is that the paper makes it very clear that nobody should ever take LLMs at their word. They can easily tell you one thing and (especially if hooked up as agents) do another, possibly quite contrary to what they claim to be doing. The paper is filled with disconcerting examples.
• Yes, LLMs can do all this even if they are essentially just analogizing regurgitation machines with no real world model. They have lots of examples of deception in their training data, and little capacity to compute the internal consistency of their own actions. Worse, they do not actually calculate the consequences of their actions at all. They can’t reliably sanity-check themselves for hallucinations, and they can’t reliably sanity-check whether what they are doing is harmful.
• The more power we give them (by hooking them up to agents, people, etc.), the more risk we are incurring.
• Even if machines don’t ever rise up a la the Terminator, a bad actor could still leverage LLMs to cause massive mayhem. Possibly soon.
• Expect a lot of GenAI-driven cybercrime in the coming years.
• Congress has achieved little, and anti-regulation AI folks like Andreessen are having a good run. That may leave all of us extremely vulnerable to this mess.
There’s a lot of anthropomorphism in this paper that I find hard to take, but it is in other ways a smart and carefully considered paper, and it ought to be a wake-up call.
The longer we delay in regulation, the more risk we are incurring.
Gary Marcus is the author of Taming Silicon Valley. He is not a “doomer”, and doesn’t think we will all die, but his p(chaos) just went up.
And would it be the AI doing the “scheming”? Or the designers of any particular AI application embedding their dark patterns and trying to AI-wash?
Your take on the paper matches mine. What I keep seeing as an underlying theme in this type of research is that the researchers believe machines shouldn’t be capable of this, which seems bizarre to me, because they were trained on human-produced datasets, and our species has survived for millennia because of our aptitude for deceit.