The Road to AI We Can Trust

Does AI really need a paradigm shift?

Probably so. A response to Scott Alexander’s essay, “Somewhat Contra Marcus On AI Scaling”

Gary Marcus
Jun 11

The provocative public conversation I am having with Scott Alexander, of SlateStarCodex fame, already a meme, continues!

In a fresh reply to my “What does it mean when AI fails?”, Alexander has put forward his second stimulating critique of the week, “Somewhat Contra Marcus On AI Scaling”.

Which is not to say it’s perfect; on the other hand, one doesn’t need perfection in order to be provocative. I will respond briefly to the disappointing part, and then we will get to the good stuff.

Strawman, Steelman

In general, Alexander is known for being exceptionally fair to ideas he doesn’t particularly like, even if at times that makes him unpopular. An entry on Quora perfectly distills this laudable aspiration:

“[Scott Alexander] optimizes for enlightenment, rather than for winning arguments. Thus, when considering an issue, he will actively seek out and often formulate the strongest arguments (that is, steelman) [for] both sides.”

I only wish he had extended the same courtesy to my position.

To take one example, Alexander puts words like prove or proven in my mouth:

“Marcus says GPT’s failures prove that purely statistical AI is a dead end”

and

GPT certainly hasn’t yet proven that statistical AI can do everything the brain does. But it hasn’t proven the opposite, either [as if Marcus said that it had].

But that’s a strawman. In reality I would never say that I have proven anything; what I do as a scientist is to weigh evidence and suggest research directions. I say that we have given the scaling hypothesis a really good look (with a larger budget than all but a handful of projects in history), and as such the failures of large scale systems are evidence (not proof) that we ought to seriously consider alternatives, e.g., here:

Rather than supporting the Lockean, blank-slate view, GPT-2 appears to be an accidental counter-evidence to that view […]

GPT-2 is both a triumph for empiricism, and, in light of the massive resources of data and computation that have been poured into them, a clear sign that it is time to consider investing in different approaches.

In making it sound like I have declared proof when I have never made such declarations, Alexander paints me as an extremist, rather than a scientist who weighs evidence and uncertainty. Where has his steelperson aspiration gone?[1]

Another rhetorical trick is to paint me as a lone, lunatic voice, as if I were the only person doubting that scaling will get us to AGI (whatever that is) when in fact there are loads of people with similar concerns.

Melanie Mitchell, for example, has repeatedly emphasized the importance of representing meaning, above and beyond anything that we currently know GPT-3 to do. Emily Bender, Margaret Mitchell and Timnit Gebru have derided large language models as stochastic parrots. (Not entirely fair to the parrots, but you get the idea.)

Rising star Abeba Birhane has written a withering criticism of the ways in which LLMs rely on objectionable data scraped from the internet. Ernie Davis I mentioned last time; almost all of our joint rejoinders draw on his deep work on common sense. Judea Pearl has shouted over and over that we need deeper understanding, and has written a whole book about the importance of causality and how it is missing from current models. Meredith Broussard and Kate Crawford have written recent books sharply critical of current AI. Meta researcher Dieuwke Hupkes has been exposing limits in the abilities of current LLMs to generalize.

As an important aside, a lot of those other voices are women, and while I am certainly flattered by all the attention Alexander has been giving me lately, it’s not a good look to make this whole discussion sound like one more white guy-on-white guy debate when (a) so many strong female (and minority) voices have participated, and (b) so many of the unfortunate consequences of a webscraping/big data approach to AI are disproportionately borne by women and minorities.

In inaccurately portraying me as a lone crusader, Alexander has not given the “scaling is not enough for AGI” view the steelman treatment that he is known for delivering.

Phew. Now for the good part!

What’s the chance that AI needs a paradigm shift?

Bets are evidently in the air. I bet Elon Musk $100,000 that we wouldn’t have AGI by 2029 (no reply), and in a similar vein tried to get Alexander to go in on a sucker bet about the capabilities of GPT-4. Alexander wisely declined, but countered with five bets of his own.

On the first, we are basically in agreement. I in no way doubt that there is at least a little bit more headroom left for large language models. The real controversy is whether that’s enough.

On the second, the notion of a “deep learning based model” is too vague; it might apply to a pure deep-learning model, but also, e.g., to any kind of neurosymbolic hybrid in which deep learning was just one of a dozen mechanisms. It’s just not clear that anything serious is excluded.

There is also some softness in the formulation of the third bet, where the key word is “descendant”. If, for example, The World’s First Successful AGI were a 50:50 hybrid of large language models and something symbolic like CYC, it might overall look relatively little like GPT-3, but its champions might still be tempted to declare victory. At the same time, I would (rightly) be entitled to declare moral victory for neurosymbolic AI. Both symbols and LLMs could see their genes in the grandchild. Hybrid vigor for the win!

But then things get interesting.

Paradigm shift (raised in bets #4 and #5) is exactly what this whole discussion is really about. Thank Kuhn, Alexander was brave enough to say it out loud.

What we all really want to know, as a global research community, is this: are we approaching things properly right now, or should we shift in some way?

Personally I’d put the probability that we need to shift at 90%, well above the 60% that Alexander suggests, and I would put the probability that we will need to embrace symbol-manipulation as part of the mix at 80%, more than double Alexander’s 34%. Others may put the probability of some kind of paradigm shift (perhaps not yet known) even higher. Just yesterday, Stanford PhD student Andrey Kurenkov put the probability at nearly 100%, based on arguments he gave last year about GPT-3 lacking external memory:

Andrey Kurenkov 🇺🇦 (@andrey_kurenkov), June 10, 2022:

“It's pretty obvious that 'simply scaling' a GPT-style LLM will not lead to AGI by virtue of the inherent limits of the model architecture and training paradigm.”

A day or two earlier, the most important empirical article of the week dropped: Big Bench [link], a massive investigation of massive language models that has a massive list of 442 authors. Large language models; even larger papers! (How massive is the author list? I sent the paper to Scott Aaronson, who said he would read it on the plane; half an hour later he wrote back: “I've been reading this paper for the past 10 minutes but haven't yet made it past the author list.”)

The basic finding was: loads of things scale, but not all (as foreseen both by Kurenkov in his essay last year and in my much-lampooned but basically accurate essay, Deep learning is hitting a wall). Every one of the 442 authors signed off on a paper whose conclusion I excerpt here:

The massive paper looks at scaling on many measures, and sees progress on some, but not others. The stuff that isn’t scaling is the “wall”.

The emphasis here is, of course, on the words “will require new approaches, rather than scale alone.” That’s why we need new approaches, exactly as I have been arguing.

Meanwhile, if you were to believe what you read on Twitter, you’d think that my biggest rival in the universe is Yann LeCun. While there’s no denying that the two of us have often clashed, on this point, that scale alone is not likely to be enough and that some new discovery (i.e., a paradigm shift) is needed, we are actually in complete agreement.

For example, LeCun recently posted this sequence of tweets (excerpted from a long, excellent thread that I discuss here):

Yann LeCun (@ylecun), May 17, 2022:

“(1) the research community is making *some* progress towards HLAI (2) scaling up helps. It's necessary but not sufficient, because.... (3) we are still missing some fundamental concepts”

“(4) some of those new concepts are possibly "around the corner" (e.g. generalized self-supervised learning) (5) but we don't know how many such new concepts are needed. We just see the most obvious ones. (6) hence, we can't predict how long it's going to take to reach HLAI.”

Amen.

§

OK, now for the hard part: what should we count as a paradigm shift?

The best that I have seen on that is a thread from a deep learning/NLP postdoc at Edinburgh, Antonio Valerio Miceli-Barone, who asked the field for some tangible, falsifiable predictions. When he invited me into the thread, I made some predictions, tied not to time but to architecture:

Gary Marcus 🇺🇦 (@GaryMarcus), May 17, 2022:

“@AVMiceliBarone @FelixHill84 predictions: whenever AGI comes:
👉 large-scale symbolic knowledge will be crucial
👉 explicit cognitive models will be crucial
👉 operations over variables (including storing, retrieving and comparing values) will be crucial
👉 an explicit type/token distinction will be crucial”

(That’s basically what Alexander’s #4 is about)

Miceli-Barone, however, insisted on something more; I asked him to define his terms, and in a short tweet he characterized what we might count as the current regime:

Antonio Valerio Miceli Barone (@AVMiceliBarone), May 17, 2022:

“@GaryMarcus @FelixHill84 Let's say CNNs+RNNs+Transformers, no writable latent discrete memory. I'd consider "discrete attractors" models (e.g. capsules, slot attention, clustering) as innovations, since while they already exist to some extent they are not applied at scale.”

“Paradigm shift” then becomes operationally defined as anything not in Miceli-Barone’s first sentence. E.g., if the field were to turn from simply scaling (GPT-3 is pretty much just GPT-2 but bigger) to using large language models as only one component in a larger architecture, with things like writable/readable discrete memory for symbolic propositions, I certainly think we should view that as a paradigm shift.
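To make that operational definition a bit more concrete, here is a minimal sketch in Python of what “a large language model as only one component, next to writable/readable discrete memory for symbolic propositions” might look like. Everything in it is illustrative, not any real system’s API: the class names, the triple-store interface, and the stubbed-out call_llm function are all assumptions of the sketch.

```python
# Minimal sketch: an LLM as one component in a larger architecture,
# alongside a writable/readable discrete memory for symbolic propositions.
# All names here are hypothetical illustrations, not a real system.

from typing import Optional


class SymbolicMemory:
    """A writable/readable discrete store of symbolic propositions (triples)."""

    def __init__(self):
        self.facts = set()  # e.g. ("Paris", "capital_of", "France")

    def write(self, subject: str, relation: str, obj: str) -> None:
        self.facts.add((subject, relation, obj))

    def read(self, subject: str, relation: str) -> Optional[str]:
        for s, r, o in self.facts:
            if s == subject and r == relation:
                return o
        return None  # nothing stored; caller must fall back to something else


def call_llm(prompt: str) -> str:
    """Placeholder for a large language model; in a real hybrid this would be
    GPT-style text generation."""
    return f"[LLM guess for: {prompt}]"


class HybridAgent:
    """Answers from discrete memory when it can, and defers to the statistical
    model otherwise; the LLM is one component, not the whole system."""

    def __init__(self, memory: SymbolicMemory):
        self.memory = memory

    def answer(self, subject: str, relation: str) -> str:
        stored = self.memory.read(subject, relation)
        if stored is not None:
            return stored  # trustworthy retrieval of a stored proposition
        return call_llm(f"{subject} {relation} ?")  # fall back to pattern completion


memory = SymbolicMemory()
memory.write("Paris", "capital_of", "France")
agent = HybridAgent(memory)
print(agent.answer("Paris", "capital_of"))   # -> "France", read back from memory
print(agent.answer("Ottawa", "capital_of"))  # -> LLM fallback; no stored fact
```

The point is not this particular data structure; it is that the system can deposit a proposition once and retrieve it reliably later, rather than hoping the right answer falls out of pattern completion.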

As a long-term neurosymbolic advocate [link], I would of course feel particularly vindicated if that shift were about building bridges to traditional symbolic tools (like storing, retrieving, and comparing propositions in long-term memory), as it is in the explicitly neurosymbolic MRKL paper from AI21 that I mentioned a few days ago.
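The MRKL idea, roughly, is routing: let the neural network handle open-ended language, but hand structured sub-tasks to dedicated symbolic modules that are guaranteed to get them right. Here is a toy sketch of that routing pattern; the keyword-based router and the two modules are made up for illustration, and the real system is far more sophisticated.

```python
# Toy sketch of MRKL-style neurosymbolic routing (illustrative only).
import re


def calculator_module(query: str) -> str:
    """Symbolic module: exact arithmetic, something pure LLMs routinely fumble."""
    expr = re.sub(r"[^0-9+\-*/(). ]", "", query)  # keep only arithmetic characters
    return str(eval(expr))  # acceptable in a toy; never eval untrusted input in real code


def llm_module(query: str) -> str:
    """Placeholder for free-form language generation by an LLM."""
    return f"[LLM response to: {query}]"


def route(query: str) -> str:
    """Toy router: arithmetic goes to the symbolic module, everything else to
    the LLM. A real router would itself be learned, with many expert modules."""
    if re.search(r"\d+\s*[-+*/]\s*\d+", query):
        return calculator_module(query)
    return llm_module(query)


print(route("123456 * 789"))                       # exact answer from the symbolic module
print(route("Summarize the plot of Middlemarch"))  # deferred to the LLM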

That said, I, of course, could be right about foreseeing the need for a paradigm shift, but wrong about what that paradigm shift turns out to be.

§

LeCun’s May 17 thread lays out some of the many challenges ahead, any one of which, on its own, might in fact demand an innovation radical enough to count as a paradigm shift, neurosymbolic or otherwise:

Yann LeCun (@ylecun), May 17, 2022:

“I believe we need to find new concepts that would allow machines to:
- learn how the world works by observing like babies.
- learn to predict how one can influence the world through taking actions.
- learn hierarchical representations that allow long-term predictions in abstract spaces.
- properly deal with the fact that the world is not completely predictable.
- enable agents to predict the effects of sequences of actions so as to be able to reason & plan
- enable machines to plan hierarchically, decomposing a complex task into subtasks.
- all of this in ways that are compatible with gradient-based learning.
The solution is not just around the corner. We have a number of obstacles to clear, and we don't know how.”

To all this I would add the capacity to build, interrogate, and reason about long-term cognitive models of an ever-changing world [link next decade], stored in some kind of long-term memory that allows for trustworthy storage and retrieval.
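To give a flavor of what I mean by trustworthy storage and retrieval over a cognitive model, here is a deliberately tiny sketch, again purely illustrative (nobody thinks a Python dictionary is the answer): it keeps an explicit type/token distinction and can be updated and interrogated as the world changes.

```python
# Toy sketch of a cognitive model of a changing world (illustrative only).
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A token: a particular individual, distinguished from its type (kind)."""
    name: str
    kind: str                              # the type, e.g. "cup"
    properties: dict = field(default_factory=dict)


class WorldModel:
    """Tracks individual entities over time and answers queries from stored
    state rather than from text statistics."""

    def __init__(self):
        self.entities = {}

    def add(self, name: str, kind: str, **props) -> None:
        self.entities[name] = Entity(name, kind, dict(props))

    def update(self, name: str, **props) -> None:
        self.entities[name].properties.update(props)  # the world changed; revise the model

    def query(self, name: str, prop: str):
        return self.entities[name].properties.get(prop)


world = WorldModel()
world.add("cup_1", kind="cup", location="table", full=True)   # one particular cup (a token)
world.add("cup_2", kind="cup", location="shelf", full=False)  # another token of the same type
world.update("cup_1", location="sink", full=False)            # events change the state
print(world.query("cup_1", "location"))  # -> "sink", retrieved reliably from memory
```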

So: can we get to AGI and reliable, no-goof reasoning, handling all the challenges that LeCun and I have been discussing, with scaling and CNNs + RNNs + Transformers alone?

In my view, no way; I think the actual odds are less than 20%.

So, Scott, challenge accepted: I am on for your bets #4 and #5.

§

The other Scott, Aaronson, said yesterday, “From where I stand, though, the single most important thing you could do in your reply is to give examples of tasks or benchmarks where, not only does GPT-3 do poorly, but you predict that GPT-10 will do poorly, if no new ideas are added.”

As it happens, Miceli-Barone posted precisely the same question [link] in May; here’s what I said then, and stand by it:

Gary Marcus 🇺🇦 (@GaryMarcus), May 17, 2022:

“@AVMiceliBarone @FelixHill84 I think pure deep learning so defined will fail at Comprehension Challenge, proposed here: newyorker.com/tech/annals-of… and developed a bit further here: ojs.aaai.org//index.php/aim…. Working w some folks to try to implement for real.”

If GPT-10 is just like GPT-3, no extra bells and whistles, just bigger, and able to read whole novels and watch whole movies and answer subtle and open-ended questions about characters and their conflicts and motivation, and tell us when to laugh, I promise to post a YouTube video admitting I was wrong.

§

I give LeCun the last word, once again from his May 17th manifesto:

Yann LeCun (@ylecun), May 17, 2022:

“I really don't think it's just a matter of scaling things up. We still don't have a learning paradigm that allows machines to learn how the world works, like human and many non-human babies do.”

- Gary Marcus

[1] Something similar happens when Alexander sets up what AGI is. There are two extreme views of AGI that one might imagine: one (easily defeated) in which the whole point is to mimic humans exactly, and the other (the steelman) in which AGI tries to do better than humans in some of the respects in which humans are particularly flawed. For example, humans are lousy at arithmetic. But you shouldn’t be able to declare “I have made an AGI” by showing you can build a machine that makes arithmetic errors. Alexander’s long digression on Luria’s famous cross-cultural reasoning work on untutored subjects applies only to the strawman, not the steelman.

Of course, as Miceli-Barone pointed out to me, one could ask whether a particular kind of error might reveal something about whether a given model was a good model of human cognition; that’s actually what my dissertation work with Steven Pinker was about. That’s a story for another day :)

Comments


derifatives
Jun 12

Do you consider works like "Memorizing Transformers" (https://arxiv.org/abs/2203.08913), which augments transformers with an explicit external memory allowing much longer contexts, to already represent a "paradigm shift?"

Alexander Naumenko
Jul 1

From the height of my ivory tower where I am slowly going artificially wise, I can say that "explicit cognitive models" may not be so crucial. What I mean by that is that we already possess a powerful tool for modeling - language. We know how to figure out who "they" are if one word changes, we know what makes a journey a good metaphor for love, we can compare 6 feet to 180 cm, we know if a joke is funny or that it was funny the first time we heard it, we know how to teach all that to a machine, stop, do we? One way of looking at neurons is as if those are summators, the other way is that those are <key,value> storages. Will it constitute a paradigm shift?
