Does AI really need a paradigm shift?
Probably so. A response to Scott Alexander’s essay, “Somewhat Contra Marcus On AI Scaling”
The provocative public conversation I am having with Scott Alexander, of SlateStarCodex fame, already a meme, continues!
In a fresh reply to my “What does it mean when AI fails?”, Alexander has put forward his second stimulating critique of the week, “Somewhat Contra Marcus On AI Scaling”.
Which is not to say it’s perfect; on the other hand, one doesn’t need perfection in order to be provocative. I will respond briefly to the disappointing part, and then get to the good stuff.
Strawman, Steelman
In general, Alexander is known for being exceptionally fair to ideas he doesn’t particularly like, even if at times that makes him unpopular. An entry on Quora perfectly distills this laudable aspiration:
“[Scott Alexander] optimizes for enlightenment, rather than for winning arguments. Thus, when considering an issue, he will actively seek out and often formulate the strongest arguments (that is, steelman) [for] both sides.”
I only wish he had extended the same courtesy to my position.
To take one example, Alexander puts words like prove or proven in my mouth:
“Marcus says GPT’s failures prove that purely statistical AI is a dead end”
and
GPT certainly hasn’t yet proven that statistical AI can do everything the brain does. But it hasn’t proven the opposite, either [as if Marcus said that it had].
But that’s a strawman. In reality I would never say that I have proven anything; what I do as a scientist is to weigh evidence and suggest research directions. I say that we have given the scaling hypothesis a really good look (with a larger budget than all but a handful of projects in history), and as such the failures of large scale systems are evidence (not proof) that we ought to seriously consider alternatives, e.g., here:
Rather than supporting the Lockean, blank-slate view, GPT-2 appears to be an accidental counter-evidence to that view […]
GPT-2 is both a triumph for empiricism, and, in light of the massive resources of data and computation that have been poured into them, a clear sign that it is time to consider investing in different approaches.
In making it sound like I have declared proof when I have never made such declarations, Alexander paints me as an extremist, rather than a scientist who weighs evidence and uncertainty. Where has his steelperson aspiration gone?1
Another rhetorical trick is to paint me as a lone, lunatic voice, as if I were the only person doubting that scaling will get us to AGI (whatever that is) when in fact there are loads of people with similar concerns.
Melanie Mitchell, for example, has repeatedly emphasized the importance of representing meaning, above and beyond anything that we currently know GPT-3 to do. Emily Bender, Margaret Mitchell and Timnit Gebru have derided large language models as stochastic parrots. (Not entirely fair to the parrots, but you get the idea.)
Rising star Abebe Birhane has written a withering criticism of the ways in which LLMs rely on objectionable data scraped from the internet. Ernie Davis, I mentioned last time; almost all of our joint rejoinders draw on his deep work on common sense. Judea Pearl has shouted over and over that we need deeper understanding and written a whole book about the importance of causality and how it is missing from current models. Meredith Broussard and Kate Crawford have written recent books sharply critical of current AI. Meta researcher Dieuwke Hupkes has been exposing limits in the abilities of current LLMs to generalize.
As an important aside, a lot of those other voices are women, and while I am certainly flattered by all the attention Alexander has been giving me lately, it’s not a good look to make this whole discussion sound like one more white guy-on-white guy debate when (a) so many strong female (and minority) voices have participated, and (b) so many of the unfortunate consequences of a webscraping/big data approach to AI are disproportionately borne by women and minorities.
In inaccurately portraying me as a lone crusader, Alexander has not given the “scaling is not enough for AGI” view the “steelman” treatment that he is known for delivering.
Phew. Now for the good part!
What’s the chance that AI needs a paradigm shift?
Bets are evidently in the air. I bet Elon Musk $100,000 that we wouldn’t have AGI by 2029 (no reply), and in a similar vein tried to get Alexander to go in on a sucker bet about the capabilities of GPT-4. Alexander wisely declined, but countered with five bets of his own:
On the first, we are basically in agreement. I in no way doubt that there is at least a little bit more headroom left for large language models. The real controversy is whether that’s enough.
On the second, the notion of “deep learning based model” is too vague; it might apply to a pure deep-learning model, but e.g., also to any kind of neurosymbolic hybrid in which deep learning was just one of a dozen mechanisms. It’s just not clear that anything serious is excluded.
There is also some softness in the formulation of the third bet, where the key word is “descendant”. If, for example, The World’s First Successful AGI were a 50:50 hybrid of large language models and something symbolic like CYC, it might overall look relatively little like GPT-3, but its champions might still be tempted to declare victory. At the same time, I would (rightly) be entitled to declare moral victory for neurosymbolic AI. Both symbols and LLMs could see their genes in the grandchild. Hybrid vigor for the win!
But then things get interesting.
Paradigm shift (raised in 4 and 5) is exactly what this whole discussion is really about. Thank Kuhn, Alexander was brave enough to say it out loud.
What we all really want to know, as a global research community, is this: are we approaching things properly right now, or should we shift in some way?
Personally I’d put the probability that we need to shift at 90%, well above the 60% that Alexander suggests, and I would put the probability that we will need to embrace symbol-manipulation as part of the mix at 80%, more than double Alexander’s 34%. Others may put the probability on some kind of paradigm shift (perhaps not yet known) even higher. Just yesterday Stanford PhD student Andrey Kurenkov put the probability at nearly 100%, based on arguments he gave last year about GPT-3 lacking external memory:
A day or two earlier, the most important empirical article of the week dropped: Big Bench [link], a massive investigation of massive language models that has a massive list of 442 authors. Large language models; even larger papers! (How massive is the author list? I sent the paper to Scott Aaronson, who said he would read it on the plane; half an hour later he wrote back: “I've been reading this paper for the past 10 minutes but haven't yet made it past the author list.”)
The basic finding was: loads of things scale, but not all (as foreseen both by Kurenkov in his essay last year and in my much-lampooned but basically accurate essay, Deep learning is hitting a wall). Every one of the 442 authors signed off on a paper that contains a conclusion statement that I excerpt here:
The massive paper looks at scaling on many measures, and sees progress on some, but not others. The stuff that isn’t scaling is the “wall”.
The emphasis here is of course on the words “will require new approaches, rather than scale alone.” That’s why we need new approaches, exactly as I have been arguing.
Meanwhile, if you were to believe what you read on Twitter, you’d think that my biggest rival in the universe is Yann LeCun. While there’s no denying that the two of us have often clashed, on this point, that scale alone is not likely to be enough and that some new discovery (i.e., a paradigm shift) is needed, we are actually in complete agreement.
For example, LeCun recently posted this sequence of tweets (excerpted from a long, excellent thread that I discuss here):
Amen.
§
Ok, now for the hard part: what should count as a paradigm shift?
The best that I have seen on that is a thread from a deep learning/NLP postdoc at Edinburgh, Antonio Valerio Miceli-Barone, who asked the field for some tangible, falsifiable predictions. When he invited me into the thread, I made some predictions, tied not to time but to architecture:
(That’s basically what Alexander’s #4 is about)
Miceli-Barone, however, insisted on something more; I asked him to define his terms, and in a short tweet he characterized what we might count as the current regime:
“Paradigm shift” then becomes operationally defined as anything not in Miceli-Barone’s first sentence. E.g., if the field were to turn from simply scaling (GPT-3 is pretty much just GPT-2 but bigger) to using large language models as only one component in a larger architecture, with things like writable/readable discrete memory for symbolic propositions, I certainly think we should view that as a paradigm shift.
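To make that concrete, here is a minimal, purely illustrative Python sketch of what “a large language model as only one component, alongside a writable/readable discrete memory for symbolic propositions” might look like. All class and function names here are invented for exposition; this is not drawn from any existing system.

```python
# Illustrative sketch only: a hypothetical hybrid in which a language model is
# just one component, alongside an explicit, writable/readable store of
# symbolic propositions. All names are made up for exposition.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Proposition:
    """A discrete symbolic fact, e.g. ('capital_of', ('France', 'Paris'))."""
    predicate: str
    args: tuple


@dataclass
class SymbolicMemory:
    """Long-term store that can be written, queried, and inspected directly."""
    facts: List[Proposition] = field(default_factory=list)

    def write(self, prop: Proposition) -> None:
        if prop not in self.facts:
            self.facts.append(prop)

    def query(self, predicate: str) -> List[Proposition]:
        return [p for p in self.facts if p.predicate == predicate]


@dataclass
class HybridAgent:
    """The LLM is one module among several, not the whole architecture."""
    llm: Callable[[str], str]   # stand-in for any large language model
    memory: SymbolicMemory

    def answer(self, question: str) -> str:
        # 1. Retrieve relevant symbolic facts (here, naively, all of them).
        facts = "; ".join(f"{p.predicate}{p.args}" for p in self.memory.facts)
        # 2. Let the LLM draft an answer conditioned on the retrieved facts.
        return self.llm(f"Facts: {facts}\nQuestion: {question}")


if __name__ == "__main__":
    memory = SymbolicMemory()
    memory.write(Proposition("capital_of", ("France", "Paris")))
    agent = HybridAgent(llm=lambda prompt: f"[LLM would answer using] {prompt}",
                        memory=memory)
    print(agent.answer("What is the capital of France?"))
```

The point of the sketch is only the division of labor: the propositions live in a store that can be inspected and edited directly, rather than being smeared across billions of weights.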
I, as a long-term neurosymbolic advocate [link], would of course feel particularly vindicated if that shift were about building bridges to traditional symbolic tools (like storing, retrieving, and comparing propositions) from long-term memory, as it is with the explicitly neurosymbolic MRKL paper from AI21 that I mentioned a few days ago.
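For readers who haven’t seen the MRKL paper, the general flavor of that kind of “route each query to the right expert module” design, sketched here with invented names and a deliberately crude routing rule rather than anything from the actual AI21 system, looks something like this:

```python
# Hedged sketch of the general "route to the right expert" idea behind
# MRKL-style systems; names and the routing rule are invented for illustration
# and are not taken from the AI21 paper or codebase.

import re
from typing import Callable, Dict


def calculator(expression: str) -> str:
    """A traditional symbolic tool: exact arithmetic, no statistics involved."""
    # eval() is acceptable here only because input is restricted to digits/operators.
    if not re.fullmatch(r"[\d\s\+\-\*/\(\)\.]+", expression):
        raise ValueError("not a pure arithmetic expression")
    return str(eval(expression))


def fake_llm(prompt: str) -> str:
    """Stand-in for a large language model handling open-ended language."""
    return f"[LLM free-text answer to: {prompt}]"


def route(query: str, experts: Dict[str, Callable[[str], str]]) -> str:
    """Crude router: send arithmetic to the symbolic module, the rest to the LLM."""
    match = re.search(r"[\d\s\+\-\*/\(\)\.]{3,}", query)
    if match and any(op in match.group() for op in "+-*/"):
        return experts["calculator"](match.group().strip())
    return experts["llm"](query)


if __name__ == "__main__":
    experts = {"calculator": calculator, "llm": fake_llm}
    print(route("What is 12345 * 6789?", experts))                # exact symbolic answer
    print(route("Why did the chicken cross the road?", experts))  # LLM answer
```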
That said, I, of course, could be right about foreseeing the need for a paradigm shift, but wrong about what that paradigm shift turns out to be.
§
LeCun’s May 17 thread lays out some of the many challenges ahead, any one of which, on its own, might in fact demand an innovation radical enough to count as a paradigm shift, neurosymbolic or otherwise:
To all this I would add the capacity to build, interrogate, and reason about long-term cognitive models of an ever-changing world [link next decade], stored in some kind of long-term memory that allows for trustworthy storage and retrieval.
So: can we get to AGI and reliable, no-goof reasoning, handling all the challenges that LeCun and I have been discussing, with scaling and CNNs + RNNs + Transformers alone?
In my view, no way; I think the actual odds are less than 20%.
So Scott, challenge accepted, I am on for your bets #4 and #5.
§
The other Scott, Aaronson, said yesterday: “From where I stand, though, the single most important thing you could do in your reply is to give examples of tasks or benchmarks where, not only does GPT-3 do poorly, but you predict that GPT-10 will do poorly, if no new ideas are added.”
As it happens, Miceli-Barone posted precisely the same question [link] in May; here’s what I said then, and still stand by:
If GPT-10 is just like GPT-3, no extra bells and whistles, just bigger, and able to read whole novels and watch whole movies and answer subtle and open-ended questions about characters and their conflicts and motivation, and tell us when to laugh, I promise to post a YouTube video admitting I was wrong.
§
I give LeCun the last word, once again from his May 17th manifesto:
- Gary Marcus
1. Something similar happens when Alexander sets up what AGI is. There are two extreme views of AGI that one might imagine: one (easily defeated) in which the whole point is to mimic humans exactly, and the other (the steelman) in which AGI tries to do better than humans in the respects in which humans are particularly flawed. For example, humans are lousy at arithmetic. But you shouldn’t be able to declare “I have made an AGI” by showing you can build a machine that makes arithmetic errors. Alexander’s long digression on Luria’s famous cross-cultural reasoning work on untutored subjects applies only to the strawman, not the steelman.
Of course, as Miceli-Barone pointed out to me, one could ask whether a particular kind of error might reveal something about whether a given model was a good model of human cognition; that’s actually what my dissertation work with Steven Pinker was about. That’s a story for another day :)