The conventional wisdom, well captured recently by Ethan Mollick, is that LLMs are advancing exponentially. A few days ago, in a very popular blog post, Mollick claimed that “the current best estimates of the rate of improvement in Large Language models show capabilities doubling every 5 to 14 months”:
I find the initial statement strange and a tell of sorts. What would "doubling of capabilities" really even mean? Will LLMs double their score on some test? Will they do twice as many things in a given time period? Consume half the power for a given task? It all sounds like BS coming right out of the gate.
Doubling of capabilities means cooking the planet twice as fast or draining investors' pockets twice as fast.
Let's go for emptying pockets! VCs might learn something one day.
I wondered that as well. If you are at 85% accuracy, it's not possible to double that. Perhaps you could double the areas of interest in which you can get 85%+ accuracy, but other than that, as Gary points out, there is no room past 100%.
I think a reasonable transform to apply would be the odds ratio, p / (1 - p), where p is in this case the probability of a correct answer. To use some sample numbers from Gary's post, 54% correct gives an odds ratio of ~1.17; 85% gives ~5.7, which is more than 4x better; but then 87% gives ~6.7, which is only a slight further improvement. But if we got to, say, 93%, the odds ratio would be ~13.3 -- almost another doubling. I think it would be fair to say that 93% is about twice as good as 87%.
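A minimal Python sketch of that transform, using the same sample numbers (nothing here beyond the p / (1 - p) formula above):

```python
# Odds ratio of a correct answer: p / (1 - p).
def odds_ratio(p: float) -> float:
    return p / (1 - p)

for accuracy in (0.54, 0.85, 0.87, 0.93):
    print(f"{accuracy:.0%} correct -> odds ratio ~{odds_ratio(accuracy):.2f}")

# 54% -> ~1.17, 85% -> ~5.67, 87% -> ~6.69, 93% -> ~13.29:
# 85% has ~4.8x the odds of 54%, and 93% roughly doubles the odds of 87%.
```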
Yeah, I was thinking of error rate reduction, which could go on halving forever, very similarly to odds ratio doubling - basically the dynamics you would expect from logistic growth, which rises roughly exponentially at first but then slows, approaching an asymptote as resource constraints set in. But still, I don't think the hype guy who said the vague phrase "doubling of capabilities" had any such rigorous definition in mind.
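To illustrate that point with a quick sketch (the 87% starting point is just borrowed from the example above): once accuracy is high, halving the error rate and doubling the odds ratio amount to nearly the same thing.

```python
# Compare "halve the error rate" with "double the odds ratio", starting at 87%.
p = 0.87
for step in range(4):
    error = 1 - p
    odds = p / (1 - p)
    print(f"step {step}: accuracy {p:.4f}, error {error:.4f}, odds ~{odds:.1f}")
    p = 1 - error / 2  # halve the error rate for the next step

# Accuracy goes 0.87 -> 0.935 -> 0.9675 -> ~0.984, while the odds go
# ~6.7 -> ~14.4 -> ~29.8 -> ~60.5, i.e. roughly doubling at each halving.
```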
Simply invent new, harder tests that "old" AIs will score low on, like 40% for GPT-4, and then the new model gets 80% or something.
Given Moore's Law, they will consume half as much power every 18 months.
Given the Jevons paradox, they will consume more power every 18 months.
You recently wrote about a paper that I think better makes the case you are raising here. https://arxiv.org/abs/2404.04125. Exponential increases in data are needed for linear improvements in models.
I was trying to find that post, but funnily enough I cannot find it. Gary?
You might define the measure of capabilities of an LLM to be the number of tasks that it solves correctly. It is finite due to the length restriction of the LLM input ("context"). Many different input texts can describe the same task; hence each task is a set of essentially equivalent texts.
Of course there is no linear effect of capacity increases on benchmark scores. However, given the purpose of LLM benchmarks, the relationship should be close to monotonic. Hence the approach taken by Gary Marcus in this article, using benchmark scores to show the lack of recent LLM capability improvements, seems valid.
What does "compute" matter when we've run out of training data, and the models have the same theoretical holes, such as not being able to distinguish correlation from causation, that they have always had?
The silence from the major vendors about diminishing returns is deafening (I would have expected them to come out swinging) 🦗.
Does this mean that they are implicitly in agreement?
Interesting. Following closely.
Wouldn't you expect the companies flogging these to just start exploring other use cases until something gives them a return, instead of throwing everything at maintaining their lead (if there is such a thing)?
It seems to me that the sunk costs of development and operations will force these corporations to double and triple down, lest they fall behind.
Once you step off the exponential escalator, it's pretty difficult to catch up, don't you think?
The vast majority of the world's problems can't be solved with the "most likely token" paradigm. That's why LLMs were not taken seriously by almost anybody, including Google.
Given that, LLMs have been a runaway success.
The simplest thing to patch them up will be to close the loop. An evaluator must check what the LLM does and give it feedback. I think we'll see a lot of work towards such "grounding" and "augmentation"; a rough sketch of such a loop is given after this comment.
I know many folks want something better. The problem is that we have no idea how we build representations in our heads. The best solution so far is to give AI many examples and let it figure things out.
So, with the low-hanging fruit picked, companies will have to do better modeling on top of LLMs, or do something else if they can.
An interesting direction I read about is hyper-specialized smaller bots, each trained on a lot of focused data, and hopefully with some mechanism for doing correction and invoking add-on world models.
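A rough sketch of the "close the loop" idea above: an outer loop that keeps asking the model to revise its answer until an external evaluator accepts it. The generate() and evaluate() functions are hypothetical placeholders, not any particular vendor's API.

```python
# Hedged sketch of an evaluator-in-the-loop setup. `generate` stands in for
# any LLM call; `evaluate` for any external checker (unit tests, a calculator,
# a retrieval check, a human reviewer, ...). Both are hypothetical placeholders.
def generate(prompt: str) -> str:
    raise NotImplementedError("call an LLM of your choice here")

def evaluate(answer: str) -> tuple[bool, str]:
    raise NotImplementedError("return (passed, feedback) from an external check")

def answer_with_feedback(task: str, max_rounds: int = 3) -> str:
    answer = generate(task)
    for _ in range(max_rounds):
        passed, feedback = evaluate(answer)
        if passed:
            return answer
        # Feed the evaluator's critique back into the next attempt.
        answer = generate(f"{task}\n\nPrevious attempt:\n{answer}\n\nFix this issue:\n{feedback}")
    return answer  # best effort after max_rounds
```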
Thank you for an excellent article. I agree that the underlying GPT-based LLMs are probably reaching a point of diminishing returns, but I would suggest LLMs will show big improvements in future, just not using the GPT architecture, which learns very slowly indeed. It will take some time for new architectures to be developed and for researchers to get a better theoretical understanding of how these structuralist models work. I would give it 5-10 years before the next step change in performance.
The implications certainly include a collapse in OpenAI's mega-unicorn valuation, which was basically only ever notional. (As a former VC myself, I am assuming the latest VCs got in more as a marketing and access-to-deals play than out of a genuine expectation of VC-sized upside.) But that was always on the cards.
None of this has particular implications for AGI which remains a distant dream/nightmare*.
*Delete for preference
This might be a bit obvious, but I think “GPT 4 Turbo” was supposed to be a fast version of “GPT 4”. They’re using numbers to indicate the quality of the base model, and so it’s no surprise GPT4T doesn’t show much of an improvement….
If you want to make a claim that performance is saturating on increasingly larger models, it feels like you'd need to make a claim about how GPT-5 would perform on the unsaturated benchmarks… Would love to see where you think that's going benchmark-wise!
I really liked your article, and I think you make some excellent points. Although it's hard to guess what might be happening behind closed doors, I think everyone gets the feeling LLM improvements are slowing down. It is always easier to improve something from 50 to 90% than it is to go from 90 to 100%. However, in this case we would preferably like it to go even 'past 100%' (figuratively speaking), as in many views AI would need to outperform us (and thus our benchmarks!) to become AGI.
If you are interested, I just wrote a blog post on one of the methods we might try to alleviate this slump in AI improvement. And it is inspired by the human thought process, so it seems fitting to your expertise. I'd be honored if you'd take a look:
https://medium.com/ai-advances/think-before-you-speak-5611bcbbbd4c
Tried the 17-dot test with Copilot...failed miserably.
"...OpenAI’s $86 billion valuation could look in hindsight like a WeWork moment for AI."
Oof!
I think that your analysis here corroborates the results published recently (https://arxiv.org/abs/2404.04125), which you referred to in posts from April 8 (“Breaking news: Scaling will never get us to AGI” and “Corrected link, re: new paper on scaling and AGI”). According to the cited paper, for models based on neural networks, an exponential rise in the volume of training data (and so in the volume of computing) provides only a linear increase in accuracy. That means that if the data and computing volumes are increasing only linearly, which is probably the case at present for the LLMs, the improvement could be very slow: near the margin of performance-estimation error, nearly a plateau rather than a significant effect.
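To make that relationship concrete, here is a sketch with made-up illustrative constants (not numbers from the cited paper): if accuracy grows roughly like a + b * log10(N) in the number of training examples N, every further fixed gain in accuracy costs another multiplication of the dataset size.

```python
import math

# Illustrative log-linear scaling curve; a and b are invented for the example.
a, b = 0.20, 0.08

def accuracy(n_examples: float) -> float:
    return a + b * math.log10(n_examples)

for n in (10**6, 10**7, 10**8, 10**9):
    print(f"{n:,} examples -> accuracy ~{accuracy(n):.2f}")

# Each extra +0.08 of accuracy (0.68 -> 0.76 -> 0.84 -> 0.92) needs 10x the data,
# so merely linear growth in data buys only logarithmic gains in accuracy.
```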
All the evidence looks like S-curves: a slow start, then rapid growth that is often interpreted as exponential, then the inevitable slowdown and plateau.
As we have seen with a different type of AI, face recognition and image classification, it was quite easy to get to 50% or 70% accuracy, but then it gets ever more difficult and expensive to improve accuracy, especially in situations that need 99.99% accuracy or better.
Mind you, both LLMs and image categorisation are pattern-finding algorithms, so they are in many ways very similar.
And then there is that "Lies, Big Lies, Statistics, Benchmarks" thing. Like the fact that most benchmark results are reported multiple-shot (i.e. give 5 examples of good answers in the prompt, then report on the success of the 6th), or, in the case of Gemini, multi-run (e.g. 32 complete 'drafts' from the LLM followed by selecting one to present with non-LLM methods); a rough sketch of both tricks follows at the end of this comment. See https://ea.rna.nl/2023/12/08/state-of-the-art-gemini-gpt-and-friends-take-a-shot-at-learning/
These systems with huge context (prompt) sizes provide options to 'engineer around' the LLM's fundamental limitations (but they also open up problems, like that huge prompt being used to slowly jailbreak safety fine-tuning; see Crescendo: https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/).
It has been clear from the last actual paper by OpenAI on GPT that the scaling behaviour of the LLMs themselves is logarithmic, or even log-log, or even log-log-log: https://ea.rna.nl/2024/02/13/will-sam-altmans-7-trillion-ai-plan-rescue-ai/
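For what it's worth, a rough sketch of the two reporting tricks described at the top of this comment, with generate() and score_without_llm() as hypothetical placeholders rather than any specific vendor's API:

```python
def generate(prompt: str) -> str:
    # Hypothetical placeholder for one sampled LLM completion.
    raise NotImplementedError

def score_without_llm(draft: str) -> float:
    # Hypothetical placeholder for the non-LLM selection step
    # (e.g. a consensus or heuristic scorer over the drafts).
    raise NotImplementedError

def few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    # "Multiple-shot": put several worked examples in the prompt (5 in the
    # example above), then ask for the answer to the next question.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

def best_of_n(prompt: str, n: int = 32) -> str:
    # "Multi-run": draw n complete drafts and present only the best-scoring one.
    drafts = [generate(prompt) for _ in range(n)]
    return max(drafts, key=score_without_llm)
```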