In May, in a tweet that gave rise to this very Substack, DeepMind executive Nando de Freitas declared AGI victory, possibly prematurely, shouting “It’s all about scale now! The Game is Over!”:
de Freitas was arguing that AI doesn’t need a paradigm shift; it just needs more data, more efficiencies, bigger servers. I called this hypothesis—that AGI might arise from larger scale without fundamental new innovation— “scaling-über-alles”. I pointed out many problems; de Freitas never replied.
His hypothesis, now generally called scaling maximalism, remains extremely popular, in no small part because bigger and bigger models have indeed continued to do ever more impressive things.
So far.
The trouble, of course, is that months or even years of going up and up on some measures still does not remotely entail that scale is all we need. Ponzi schemes go up and up until they explode. Scaling is an empirical observation, not a law of nature guaranteed to continue.
This week I saw not one but three striking premonitions of how the scaling maximalism hypothesis might end.
There might not be enough data in the world to make scaling maximalism work. A bunch of people have already worried about this. This week saw a formal proof by William Merrill, Alex Warstadt, and Tal Linzen arguing that “current neural LMs are not well suited” to extracting natural language semantics “without an infeasible amount of data”. The proof makes too many assumptions to be taken as gospel, but if it is even close to correct, there may soon be real trouble in Scaling City.
There might not be enough available compute in the world to make scaling maximalism feasible. Also this very week, Miguel Solano sent me a manuscript (to which I am now contributing, along with Maria Elena Solano) that argues that scaling the current meta-benchmark du jour, Big Bench, would require just over one-fourth of the U.S.’s entire electricity consumption in 2022.
Some important tasks might simply not scale. The most vivid illustration of this is a linguistics task by Ruis, Khan, Biderman, Hooker, Rocktäschel, and Grefenstette examining the pragmatic implications of language (e.g., quoting from their paper, “we intuitively understand the response ‘I wore gloves’ to the question ‘Did you leave fingerprints?’ as meaning ‘No.’”). As I have long argued, capturing this without cognitive models and common sense is really hard. Scaling here was largely AWOL; even the best model reached only 80.6%, and for most models, scaling had at best a negligible effect. As the lead author, Laura Ruis, has pointed out to me, a more complex version of the task can easily be imagined; performance there would presumably be even lower. What hit me hard, as I was reading the paper, is that asymptotic 80% performance on even a single important task like this might spell game over for scaling. If you get syntax and semantics but fail on pragmatics or commonsense reasoning, you don’t have AGI you can trust.
Moore’s Law didn’t carry us as far and as fast as some people initially hoped, because it is not actually a causal law of the universe. Scaling maximalism is an interesting hypothesis, but I stand by my prediction that it won’t get us to AGI. This week rendered vivid three possible failure modes. Any one of them would mean we need a real paradigm shift, if we are to get to AGI.
All of this points to the same conclusion: if we want to reach AGI we shouldn’t keep putting so many eggs in the scaling-über-alles basket.
Indeed. It seems that scaling maximalism relies on the ambiguity of terms like 'big' and 'more'. Training sets for, e.g., language in deep learning are very big compared to what humans use when learning language. But they are still minute compared to the 'performance' set of human language, which is on the order of 10^20 sentences or more.
It would take about 10 billion people (agents) producing and recording one sentence every second for 300 years to get a training set of this size. It's fair to say we are not there yet.
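A quick sanity check on that back-of-the-envelope figure (a sketch only; the 10 billion speakers, one sentence per second, and 300 years are simply the assumptions stated above):

```python
# Back-of-the-envelope check of the comment's arithmetic.
# All figures are the assumptions stated above, not measurements.
speakers = 10e9                     # 10 billion people (agents)
sentences_per_second = 1            # one sentence produced and recorded per second
seconds_per_year = 365 * 24 * 3600  # roughly 3.15e7 seconds in a year
years = 300

total_sentences = speakers * sentences_per_second * seconds_per_year * years
print(f"{total_sentences:.1e} sentences")  # ~9.5e19, i.e. on the order of 10^20
```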
Also, even if we had a substantial subset, it would most likely be unevenly distributed: maybe a lot about today’s weather but not very much about galaxies far, far away (or perhaps the other way around). So even with a set of this size, there is no guarantee that it would be distributed well enough statistically to cover all the relations found in the performance set.
Deep learning is sometimes very impressive, and it could provide the backbone of a semantic system for AGI. But the fact that humans learn language without anything like the training sets deep learning requires strongly suggests that the boundary conditions needed to achieve human-level cognition, and with them the underlying architecture, are fundamentally different from those underlying deep learning (see, e.g., https://arxiv.org/abs/2210.10543).
Let's see...
After seventy years we still have not the slightest clue how to make ourselves safe from the first existential-scale technology, nuclear weapons. And so, based on that experience, because we are brilliant, we decided to create another existential-scale technology, AI, which we also have no idea how to make safe. And then Jennifer Doudna comes along and says, let's make genetic engineering as easy, cheap, and accessible to as many people as possible, as fast as possible, because we have no idea how to make that safe either.
It's a bizarre experience watching this unfold. All these very intelligent, highly educated, accomplished, articulate experts celebrating their wild leap into civilization-threatening irrationality. The plan seems to be to create ever more, ever larger existential-threat technologies at an ever-accelerating rate, to discover what happens. As if simple common sense couldn't predict that already.
Ok, I'm done, and off to watch Don't Look Up again.