The conventional wisdom, well captured recently by Ethan Mollick, is that LLMs are advancing exponentially. A few days ago, in a very popular blog post, Mollick claimed that “the current best estimates of the rate of improvement in Large Language models show capabilities doubling every 5 to 14 months”:
I find the initial statement strange and a tell of sorts. What would "doubling of capabilities" really even mean? Will LLMs double their score on some test? Will they do twice as many things in a given time period? Consume half the power for a given task? It all sounds like BS coming right out of the gate.
Doubling of capabilities means cooking the planet twice as fast or draining investors' pockets twice as fast.
Let's go for emptying pockets! VCs might learn something one day.
I wondered that as well. If you are at 85% accuracy, it's not possible to double that. Perhaps you could double the areas of interest in which you can get 85%+ accuracy, but other than that, as Gary points out, there is no room past 100%.
I think a reasonable transform to apply would be the odds ratio, p / (1 - p), where p is in this case the probability of a correct answer. To use some sample numbers from Gary's post, 54% correct gives an odds ratio of ~1.17; 85% gives ~5.7, which is more than 4x better; but then 87% gives ~6.7, which is only a slight further improvement. But if we got to, say, 93%, the odds ratio would be ~13.3 -- almost another doubling. I think it would be fair to say that 93% is about twice as good as 87%.
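A minimal Python sketch of that transform, using the same sample numbers (nothing here beyond the p / (1 - p) formula above):

```python
# Odds ratio of a correct answer: p / (1 - p).
def odds_ratio(p: float) -> float:
    return p / (1 - p)

for accuracy in (0.54, 0.85, 0.87, 0.93):
    print(f"{accuracy:.0%} correct -> odds ratio ~{odds_ratio(accuracy):.2f}")

# 54% -> ~1.17, 85% -> ~5.67, 87% -> ~6.69, 93% -> ~13.29:
# 85% has ~4.8x the odds of 54%, and 93% roughly doubles the odds of 87%.
```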
Yeah, I was thinking of error rate reduction, which could go on halving forever, very similarly to odds ratio doubling - basically the dynamics you would expect from logistic growth, which rises roughly exponentially at first but then slows, approaching an asymptote as resource constraints set in. But still, I don't think the hype guy who said the vague phrase "doubling of capabilities" had any such rigorous definition in mind.
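To illustrate that point with a quick sketch (the 87% starting point is just borrowed from the example above): once accuracy is high, halving the error rate and doubling the odds ratio amount to nearly the same thing.

```python
# Compare "halve the error rate" with "double the odds ratio", starting at 87%.
p = 0.87
for step in range(4):
    error = 1 - p
    odds = p / (1 - p)
    print(f"step {step}: accuracy {p:.4f}, error {error:.4f}, odds ~{odds:.1f}")
    p = 1 - error / 2  # halve the error rate for the next step

# Accuracy goes 0.87 -> 0.935 -> 0.9675 -> ~0.984, while the odds go
# ~6.7 -> ~14.4 -> ~29.8 -> ~60.5, i.e. roughly doubling at each halving.
```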
Simply invent new, harder tests that "old" AIs will score low on, like 40% for GPT-4, and then the new model gets 80% or something.
Given Moore's Law, they will consume half as much power every 18 months.
Given the Jevons paradox, they will consume more power every 18 months.
You recently wrote about a paper that I think better makes the case you are raising here. https://arxiv.org/abs/2404.04125. Exponential increases in data are needed for linear improvements in models.
I was trying to find that post, but funnily enough I cannot find it. Gary?
You might define the measure of capabilities of an LLM to be the number of tasks that it solves correctly. It is finite due to the length restriction of the LLM input ("context"). Many different input texts can describe the same task; hence each task is a set of essentially equivalent texts.
Of course there is no linear effect of capacity increases on benchmark scores. However, given the purpose of LLM benchmarks, the relationship should be close to monotonic. Hence the approach taken by Gary Marcus in this article, using benchmark scores to show the lack of recent LLM capability improvements, seems valid.
What does "compute" matter when we've run out of training data, and the models have the same theoretical holes, such as not being able to distinguish correlation from causation, that they have always had?
The silence from the major vendors about diminishing returns is deafening (I would have expected them to come out swinging) 🦗.
Does this mean that they are implicitly in agreement?
Interesting. Following closely.
Wouldn't you expect the companies flogging these to just start exploring other use cases until something gives them a return, instead of throwing everything at maintaining their lead (if there is such a thing)?
It seems to me that the sunk costs of development and operations will force these corporations to double and triple down, lest they fall behind.
Once you step off the exponential escalator, it's pretty difficult to catch up, don't you think?
The vast majority of the world's problems can't be solved with the "most likely token" paradigm. That's why LLMs were not taken seriously by almost anybody, including Google.
Given that, LLMs have been a runaway success.
The simplest thing to patch them up will be to close the loop. An evaluator must check what the LLM does and give it feedback. I think we'll see a lot of work towards such "grounding" and "augmentation"; a rough sketch of such a loop is given after this comment.
I know many folks want something better. The problem is that we have no idea how we build representations in our heads. The best solution so far is to give AI many examples and let it figure things out.
So, with the low-hanging fruit picked, companies will have to do better modeling on top of LLMs, or do something else if they can.
An interesting direction I read about is hyper-specialized smaller bots, each trained on a lot of focused data, and hopefully with some mechanism for doing correction and invoking add-on world models.
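A rough sketch of the "close the loop" idea above: an outer loop that keeps asking the model to revise its answer until an external evaluator accepts it. The generate() and evaluate() functions are hypothetical placeholders, not any particular vendor's API.

```python
# Hedged sketch of an evaluator-in-the-loop setup. `generate` stands in for
# any LLM call; `evaluate` for any external checker (unit tests, a calculator,
# a retrieval check, a human reviewer, ...). Both are hypothetical placeholders.
def generate(prompt: str) -> str:
    raise NotImplementedError("call an LLM of your choice here")

def evaluate(answer: str) -> tuple[bool, str]:
    raise NotImplementedError("return (passed, feedback) from an external check")

def answer_with_feedback(task: str, max_rounds: int = 3) -> str:
    answer = generate(task)
    for _ in range(max_rounds):
        passed, feedback = evaluate(answer)
        if passed:
            return answer
        # Feed the evaluator's critique back into the next attempt.
        answer = generate(f"{task}\n\nPrevious attempt:\n{answer}\n\nFix this issue:\n{feedback}")
    return answer  # best effort after max_rounds
```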
Thank you for an excellent article. I agree that the underlying GPT-based LLMs are probably reaching a point of diminishing returns, but I would suggest LLMs will show big improvements in future, just not using the GPT architecture, which learns very slowly indeed. It will take some time for new architectures to be developed and for researchers to get a better theoretical understanding of how these structuralist models work. I would give it 5-10 years before the next step change in performance.
The implications certainly include a collapse in OpenAI's mega-unicorn valuation, which was basically only ever notional. (As a former VC myself, I am assuming the latest VCs got in more as a marketing and access-to-deals play than out of a genuine expectation of VC-sized upside.) But that was always on the cards.
None of this has particular implications for AGI which remains a distant dream/nightmare*.
*Delete for preference
This might be a bit obvious, but I think “GPT 4 Turbo” was supposed to be a fast version of “GPT 4”. They’re using numbers to indicate the quality of the base model, and so it’s no surprise GPT4T doesn’t show much of an improvement….
If you want to make a claim that performance is saturating on increasingly larger models, it feels like you'd need to make a claim about how GPT-5 would perform on the unsaturated benchmarks… Would love to see where you think that's going benchmark-wise!
I really liked your article, and I think you make some excellent points. Although it's hard to guess what might be happening behind closed doors, I think everyone gets the feeling LLM improvements are slowing down. It is always easier to improve something from 50 to 90% than it is to go from 90 to 100%. However, in this case we would preferably like it to go even 'past 100%' (figuratively speaking), as in many views AI would need to outperform us (and thus our benchmarks!) to become AGI.
If you are interested, I just wrote a blog post on one of the methods we might try to alleviate this slump in AI improvement. And it is inspired by the human thought process, so it seems fitting to your expertise. I'd be honored if you'd take a look:
https://medium.com/ai-advances/think-before-you-speak-5611bcbbbd4c
Tried the 17-dot test with Copilot...failed miserably.
"...OpenAI’s $86 billion valuation could look in hindsight like a WeWork moment for AI."
Oof!
I think that your analysis here corroborates the results published recently (https://arxiv.org/abs/2404.04125), which you referred to in posts from April 8 (“Breaking news: Scaling will never get us to AGI” and “Corrected link, re: new paper on scaling and AGI”). According to the cited paper, for models based on neural networks, an exponential rise in the volume of training data (and so in the volume of computing) provides only a linear increase in accuracy. That means that if the data and computing volumes are increasing only linearly, which is probably the case at present for the LLMs, the improvement could be very slow: near the margin of performance-estimation error, nearly a plateau rather than a significant effect.
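To make that relationship concrete, here is a sketch with made-up illustrative constants (not numbers from the cited paper): if accuracy grows roughly like a + b * log10(N) in the number of training examples N, every further fixed gain in accuracy costs another multiplication of the dataset size.

```python
import math

# Illustrative log-linear scaling curve; a and b are invented for the example.
a, b = 0.20, 0.08

def accuracy(n_examples: float) -> float:
    return a + b * math.log10(n_examples)

for n in (10**6, 10**7, 10**8, 10**9):
    print(f"{n:,} examples -> accuracy ~{accuracy(n):.2f}")

# Each extra +0.08 of accuracy (0.68 -> 0.76 -> 0.84 -> 0.92) needs 10x the data,
# so merely linear growth in data buys only logarithmic gains in accuracy.
```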
All the evidence looks like S-curves: a slow start, then rapid growth that is often interpreted as exponential, then the inevitable slowdown and plateau.
As we have seen with a different type of AI, face recognition and image classification, it was quite easy to get to 50% or 70% accuracy, but then it gets ever more difficult and expensive to improve accuracy, especially in situations that need 99.99% accuracy or better.
Mind you, both LLMs and image categorisation are pattern-finding algorithms, so they are in many ways very similar.
And then there is that "Lies, Big Lies, Statistics, Benchmarks" thing. Like the fact that most benchmark results are reported multiple-shot (i.e. give 5 examples of good answers in the prompt, then report on the success of the 6th), or, in the case of Gemini, multi-run (e.g. 32 complete 'drafts' from the LLM followed by selecting one to present with non-LLM methods); a rough sketch of both tricks follows at the end of this comment. See https://ea.rna.nl/2023/12/08/state-of-the-art-gemini-gpt-and-friends-take-a-shot-at-learning/
These systems with huge context (prompt) sizes provide options to 'engineer around' the LLM's fundamental limitations (but they also open up problems, like that huge prompt being used to slowly jailbreak safety fine-tuning; see Crescendo: https://ea.rna.nl/2024/04/12/microsoft-lays-the-limitations-of-chatgpt-and-friends-bare/).
It has been clear from the last actual paper by OpenAI on GPT that the scaling behaviour of the LLMs themselves is logarithmic, or even log-log, or even log-log-log: https://ea.rna.nl/2024/02/13/will-sam-altmans-7-trillion-ai-plan-rescue-ai/
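For what it's worth, a rough sketch of the two reporting tricks described at the top of this comment, with generate() and score_without_llm() as hypothetical placeholders rather than any specific vendor's API:

```python
def generate(prompt: str) -> str:
    # Hypothetical placeholder for one sampled LLM completion.
    raise NotImplementedError

def score_without_llm(draft: str) -> float:
    # Hypothetical placeholder for the non-LLM selection step
    # (e.g. a consensus or heuristic scorer over the drafts).
    raise NotImplementedError

def few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    # "Multiple-shot": put several worked examples in the prompt (5 in the
    # example above), then ask for the answer to the next question.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

def best_of_n(prompt: str, n: int = 32) -> str:
    # "Multi-run": draw n complete drafts and present only the best-scoring one.
    drafts = [generate(prompt) for _ in range(n)]
    return max(drafts, key=score_without_llm)
```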