38 Comments
Apr 13 · Liked by Gary Marcus

I find the initial statement strange and a tell of sorts. What would "doubling of capabilities" really even mean? Will LLMs double their score on some test? Will they do twice as many things in a given time period? Consume half the power for a given task? It all sounds like BS coming right out of the gate.

Apr 13 · Liked by Gary Marcus

You recently wrote about a paper that I think better makes the case you are raising here. https://arxiv.org/abs/2404.04125. Exponential increases in data are needed for linear improvements in models.

Apr 14 · Liked by Gary Marcus

You might define the measure of an LLM's capabilities as the number of tasks it solves correctly. This number is finite because of the length restriction on the LLM's input (its "context"). Many different input texts can describe the same task; hence each task is a set of essentially equivalent texts.

Of course, a capability increase has no linear effect on benchmark scores. However, given the purpose of LLM benchmarks, the relationship should be close to monotonic. Hence the approach Gary Marcus takes in this article, using flat benchmark scores to show the lack of recent capability improvements, seems valid.
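
To make the measure concrete, here is a toy sketch in which a task counts as solved only if the model answers every equivalent phrasing correctly. The tasks, phrasings, and model stub are all invented stand-ins, and "solve all phrasings" is just one strict choice among several possible definitions.

```python
# Toy capability measure: tasks are equivalence classes of prompts, and a
# task counts as solved only if every phrasing gets the right answer.
tasks = {
    "add_2_3": {
        "What is 2 + 3?": "5",
        "Compute the sum of two and three.": "5",
    },
    "capital_fr": {
        "What is the capital of France?": "Paris",
        "Name France's capital city.": "Paris",
    },
}

def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return "5" if "2" in prompt or "two" in prompt else "Paris"

def capability(model) -> int:
    # Count the tasks for which all equivalent phrasings are answered correctly.
    return sum(
        all(model(q).strip() == answer for q, answer in phrasings.items())
        for phrasings in tasks.values()
    )

print(capability(stub_llm))  # -> 2 tasks solved
```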

Apr 14 · Liked by Gary Marcus

What does “compute” matter when we’ve run out of training data, and the models have the same theoretical holes, such as not being able to distinguish correlation from causation, that they have always had?

Apr 13 · edited Apr 13 · Liked by Gary Marcus

The silence from the major vendors about diminishing returns is deafening (I would have expected them to come out swinging) 🦗.

Does this mean that they are implicitly in agreement?

Apr 13 · Liked by Gary Marcus

Interesting. Following closely.


Wouldn’t you expect the companies flogging these models to just start exploring other use cases until something gives them a return, instead of throwing everything at maintaining their lead (if there is such a thing)?

It seems to me that the sunk costs of development and operations will necessitate these corporations to double and triple down, lest they fall behind.

Once you step off the exponential escalator, it’s pretty difficult to catch up, don’t you think?


The vast majority of the world's problems can't be solved with the "most likely token" paradigm. That's why LLMs were not taken seriously by almost anybody, including Google.

Given that, LLMs have been a runaway success.

The simplest way to patch them up will be to close the loop: an evaluator must check what the LLM does and give it feedback. I think we'll see a lot of work towards such "grounding" and "augmentation", something like the sketch below.
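
A minimal sketch of such a loop, assuming hypothetical generate() and evaluate() functions as stand-ins for a real model call and a real checker:

```python
# Closing the loop: an evaluator checks the LLM's output and feeds criticism
# back until the output passes or we give up. Everything here is a stand-in.

def generate(prompt: str) -> str:
    # Stand-in for a real model call.
    return "final answer" if "Fix:" in prompt else "draft answer"

def evaluate(output: str) -> tuple[bool, str]:
    # Stand-in evaluator; real ones might run tests, checkers, or retrieval.
    return ("draft" not in output, "remove the placeholder wording")

def grounded_answer(prompt: str, max_rounds: int = 3) -> str:
    output = generate(prompt)
    for _ in range(max_rounds):
        passed, feedback = evaluate(output)
        if passed:
            break
        # Feed the critique back to the generator and try again.
        output = generate(f"{prompt}\n\nPrevious attempt:\n{output}\n\nFix: {feedback}")
    return output

print(grounded_answer("What is 2 + 2?"))  # -> "final answer"
```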

I know many folks want something better. The problem is that we have no idea how we build representations in our heads. The best solution so far is to give the AI many examples and let it figure things out.

So, with the low-hanging fruit picked, companies will have to do better modeling on top of LLMs, or do something else if they can.


Thank you for an excellent article. I agree that the underlying GPT-based LLMs are probably reaching a point of diminishing returns, but I would suggest LLMs will show big improvements in the future, just not using the GPT architecture, which learns very slowly indeed. It will take some time for new architectures to be developed and for researchers to get a better theoretical understanding of how these structuralist models work. I would give it 5-10 years before the next step change in performance.

The implication is certainly a collapse in OpenAI's mega-unicorn valuation, which was basically only ever notional. (As a former VC myself, I am assuming the latest VCs got in more as a marketing and access-to-deals type step rather than with a genuine expectation of VC-sized upside.) But that was always on the cards.

None of this has particular implications for AGI which remains a distant dream/nightmare*.

*Delete for preference


This might be a bit obvious, but I think “GPT 4 Turbo” was supposed to be a fast version of “GPT 4”. They’re using numbers to indicate the quality of the base model, so it’s no surprise GPT4T doesn’t show much of an improvement…

If you want to claim that performance is saturating on increasingly larger models, it feels like you’d need to make a claim about how GPT-5 would perform on the unsaturated benchmarks… Would love to see where you think that’s going, benchmark-wise!


Tried the 17-dot test with Copilot... it failed miserably.


"...OpenAI’s $86 billion valuation could look in hindsight like a WeWork moment for AI."

Oof!


I think your analysis here corroborates well the results published recently (https://arxiv.org/abs/2404.04125), which you referred to in your posts from April 8 (“Breaking news: Scaling will never get us to AGI” and “Corrected link, re: new paper on scaling and AGI”). According to the cited paper, for models based on neural networks an exponential rise in the volume of training data (and hence of compute) provides only a linear increase in accuracy. That means that if data and compute are increasing only linearly, which is probably the case for LLMs at present, the improvement could be very slow: near the margin of performance-estimation error, nearly a plateau rather than a significant effect.
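
A small numerical illustration of that relationship, assuming a hypothetical log-linear scaling law; the coefficients are invented for illustration, not fitted to any real model:

```python
import math

# Hypothetical log-linear scaling: accuracy ~ a + b * log10(training_examples).
a, b = 0.20, 0.05  # illustrative coefficients only

for n in (10**6, 10**7, 10**8, 10**9, 10**10):
    accuracy = a + b * math.log10(n)
    print(f"{n:>14,} examples -> accuracy {accuracy:.2f}")

# Every 10x (exponential) increase in data buys the same fixed +0.05 (linear)
# gain, so merely linear data growth yields near-plateau improvement.
```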


All the evidence looks like S-curves: a slow start, rapid growth that is often mistaken for exponential growth, then the inevitable slowdown and plateau.

As we have seen with a different type of AI, in face recognition and image classification, it was quite easy to get to 50% or 70% accuracy, but it gets ever more difficult and expensive to improve beyond that, especially in situations that need 99.99% accuracy or better.

Mind you, both LLMs and image categorisation are pattern-finding algorithms, so they are in many ways very similar.
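
For illustration, a logistic (S-) curve whose early stretch looks exponential while the later stretch crawls toward a ceiling; the function and parameters are purely illustrative:

```python
import math

def logistic(x: float, ceiling: float = 1.0, midpoint: float = 0.0,
             steepness: float = 1.0) -> float:
    """S-curve: roughly exponential well below the midpoint, flat near the ceiling."""
    return ceiling / (1.0 + math.exp(-steepness * (x - midpoint)))

for x in (-6, -4, -2, 0, 2, 4, 6):
    print(f"x={x:>2}  progress={logistic(x):.4f}")

# Early steps roughly multiply progress (looks exponential); late steps
# barely move it (the plateau).
```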


And then there is that "Lies, Big Lies, Statistics, Benchmarks" thing. Like the fact that most benchmark results are reported multiple-shot (i.e. give 5 examples of good answers in the prompt, then report the success on the 6th), or, in the case of Gemini, multi-run (e.g. 32 complete 'drafts' from the LLM, followed by selecting one to present using non-LLM methods). See https://ea.rna.nl/2023/12/08/state-of-the-art-gemini-gpt-and-friends-take-a-shot-at-learning/

These systems with huge context (prompt) sizes provide options to 'engineer around' the LLMs' fundamental limitations (but they also open up problems, like that huge prompt being used to slowly jailbreak safety fine-tuning; see Crescendo: https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/).

It has been clear from the last actual paper by OpenAI on GPT that the scaling behaviour of the LLMs themselves is log, or even log-log, or even log-log-log: https://ea.rna.nl/2024/02/13/will-sam-altmans-7-trillion-ai-plan-rescue-ai/
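
The multi-run effect is easy to quantify: if a single sample solves a task with probability p, the chance that at least one of n independent samples solves it is 1 - (1 - p)^n. A quick illustration, where p is an invented single-shot success rate, not any vendor's real figure:

```python
# Best-of-n reporting inflates the apparent score to 1 - (1 - p)**n.
p = 0.30  # hypothetical single-shot success rate

for n in (1, 5, 32):
    best_of_n = 1 - (1 - p) ** n
    print(f"best of {n:>2} samples: {best_of_n:.5f}")

# best of 32 ~ 0.99999: a weak single-shot model can look near-perfect.
```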


The market for models might shrink to the point of not supporting a single "unicorn" because open source models have already "flooded the basement," as they say.
