Yesterday Sam Altman claimed in a new blog post that “We are now confident we know how to build AGI as we have traditionally understood it”, suggesting that OpenAI (which has not demonstrated AGI) is now moving on to new and better things.
I do not share his confidence. Here is a thread I wrote about it last night on X, which quickly went viral:
§
An MIT professor just asked me: how can you be so confident that OpenAI isn’t very close to AGI?
Here is a thread of links to several recent observations that I think show how far we are from robust intelligence.
Importantly, all of them are *known* problems, many of which I have noted repeatedly, for years and even decades.
There is no *principled* solution thus far to any of them.
The problem of distribution shift, which goes back to my own 1998 work and was documented again in Apple’s 2024 reasoning paper, continues.
LLMs, like their predecessors, generalize well to items similar to what they have been trained on, but struggle with reliability in unfamiliar regimes; even minor variations sometimes break them.
A recent paper on Putnam math problems showed a similar failure under distribution shift: a 30% decrement in o1’s performance on problems with minor variations, e.g. to variable names. (A sketch of this kind of perturbation appears below.)
Same old story, different set of problems.
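To make concrete what “minor variations” means here, below is a minimal sketch of the kind of surface-level perturbation the Putnam-variations work describes: rename the variables in a problem statement while leaving its mathematical content untouched. This is my own illustrative reconstruction, not the paper’s code; the function and example are hypothetical.

```python
import re

# Illustrative sketch only (not the Putnam-variations paper's actual code):
# perturb the surface form of a math problem while leaving its content intact.
# A genuinely robust reasoner should score the same on both versions; a large
# gap between them is a signature of distribution shift / memorization.

def rename_variables(problem: str, mapping: dict) -> str:
    """Replace whole-word variable names, e.g. {'n': 'm'}."""
    for old, new in mapping.items():
        problem = re.sub(rf"\b{re.escape(old)}\b", new, problem)
    return problem

original = ("Prove that for every positive integer n, "
            "1 + 2 + ... + n = n*(n + 1)/2.")
variant = rename_variables(original, {"n": "m"})

print(original)
print(variant)
# The two statements are mathematically identical; only the variable name differs.
```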
Commonsense reasoning is still dodgy, as it has long been, as Ernie Davis and I just reviewed here.
o1 probably isn’t as robust as people might have hoped. Even on certain benchmarks, results may be fragile or difficult to replicate:
In the cases where o1 performed best, there may have been heavy data augmentation of a kind that is not possible in more open-ended domains. In domains where such augmentation isn’t possible, o1 may not even be that much better than GPT-4 (e.g., on some language tasks), as you can see if you read OpenAI’s reports carefully.
The presentation of o3 was problematic, and no clean test of out-of-domain generalization was provided, suggesting that we may see the same kinds of distribution-shift problems as ever outside of semi-closed domains where synthetic data can readily be generated.
Many leading figures in the field have acknowledged that we may have reached a period of diminishing returns for pure LLM scaling, much as I anticipated in 2022. It’s anybody’s guess what happens next.
[Note also that the currently favored idea, test-time “scaling”, is very expensive, and will likely prove unreliable outside of domains where data augmentation is adequate, in which case it will get comparatively little commercial traction. Even o1 is losing money; o3 isn’t even available yet, and perhaps cannot be run at a profit.]
As I noted in 2001, the lack of distinct, accessible, reliable database-style records leads to hallucinations. Despite many promises to the contrary, this still routinely produces inaccurate news summaries, defamation, fictional sources, incorrect advice, and general unreliability.
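To see what I mean by database-style records, here is a minimal, purely illustrative sketch (hypothetical names; it does not correspond to any real system’s API): a record store either returns exactly what was stored or fails explicitly, whereas a purely generative system emits a fluent answer whether or not it actually has the underlying fact.

```python
# Illustrative contrast only, with made-up names: database-style retrieval
# versus generative recall. The point is architectural, not about any
# particular product.

RECORDS = {
    "capital_of_france": "Paris",
}

def lookup(key: str) -> str:
    """Database-style retrieval: returns a stored record or fails explicitly."""
    if key not in RECORDS:
        raise KeyError(f"No record for {key!r}")  # the system knows it doesn't know
    return RECORDS[key]

def generate(prompt: str) -> str:
    """Stand-in for generative recall: always emits *something* fluent, with no
    internal distinction between a remembered fact and a confabulation."""
    return f"A plausible-sounding answer to: {prompt}"

print(lookup("capital_of_france"))        # 'Paris'
print(generate("capital of Atlantis?"))   # fluent, but ungrounded
# lookup("capital_of_atlantis") would raise KeyError instead of hallucinating.
```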
Does all this mean that I am absolutely 100% certain that AGI is not in clear sight? No; I am a scientist; if new evidence comes in, I will look at it, as I always have. But so far I have not seen a principled solution to even one of these longstanding problems, and that tells me a lot.
Gary Marcus recently wrote about his track record of accurate predictions. To the extent that those predictions have largely been on-target for years, it is mostly because he has continued to focus on the implications of the 8 problems above.
It doesn't matter what the actual evidence is. Sam Altman (and OpenAI) have painted themselves into a corner. They've chosen the LLM route; they're committed to it. They've been telling their investors for years that they're on the path to AGI; there's no way out for them now, no way back, no possible way that they can now say "ah, well, actually, it turns out we were wrong". And so, whatever the evidence is, whatever the performance of their next N models, they're going to spin it into something optimistic, something positive, something that says "stick with us, we're getting there, every year will be even more amazing than the last" etc. They have no other realistic choice.
Altman might as well have said, “AGI is essentially a solved problem, since we can solve benchmark problems one by one, with a few months in between each.”