“We can only shake our heads.”
Indeed. Never mind that exponential growth only happens until it doesn’t. Which means that extrapolating an exponential curve without understanding what is driving the growth and when that drive might run out is nothing more than a wet finger in the wind.
damn, i meant to say that and forgot. quite right.
I knew what you meant. Got your back.
You can also shake your botty.
Shake shake shake.
Confirmation bias and the echo chamber amplify the hype. I attended an AI conference, and the trend I noticed is that this is more prevalent among investors than builders. Builders see the reality when they build applications (even if not all of them do), but investors only seem to see what they want to see.
Setting the bar to 50% correct seems too low. What conventional project can live with less than 95%?
Using the 50% success point is based on standard practice for measuring human ability levels ("Item Response Theory"). See e.g. https://www.researchgate.net/figure/The-goal-of-item-response-theory-IRT-is-to-predict-the-probability-that-an-examinee_fig2_220249772
I can't wait for all this LLM nonsense to burn itself out.
As Neil Young likes to sing
“A, A, I, I. LLMs will never die. There’s more to the picture, than meets the eye. A, A, I, I.
I, I, A, A. LLMs are here to stay. It’s better to burn out, than to fade away. I, I, A, A.”
I think you meant THE OTHER Neil Young song about GPT/LLM-primary AI:
Saw social's from a "friend"
On influence we depend
But found out in the end
It was a piece of crap
Saw it on YouTube
Put it on your phone
Now you see it's shown
It's just a piece of crap
I typed a prompt right in
Then tried to fix "the wrong"
Hours to get it going
It's still a piece of crap
Should there be cause for concern for the future of software engineering? Could we see companies fire engineers en masse, with the thought being “oh, the AI can do it because the VC told us so”?
In short: probably not much cause for concern.
What I think is a more likely heuristic is that companies that relegate software engineering predominantly to AI with minimal oversight will quickly lose their market edge / cease to be effective and competitive, leaving companies that keep focused on the strategic value of humans in software to take better positions / enjoy better outcomes.
I think that's a heuristic because it presupposes the presence of humans with above-average skill, where those humans make a bigger difference than a generative AI. It's also a heuristic because - at least in some cases, where software uniqueness and market differentiation isn't a key business factor - it might be at least a justifiable (if not sensible) approach. In those cases, we might see the biggest impact on well-understood, boring, business-as-usual, off-the-shelf packaged software and services: software for ERP, accounting, finance, human resources, etc.
However, the most challenging, complex and unique / unprecedented problem-solution contexts are significantly less likely to benefit from an approach that predominantly uses generative AI: that's at least in part because the corpus / training set an LLM-GPT-based solution operates on will contain significantly fewer pre-existing examples on which to base verifiable responses.
Sorry, what kind of usefulness is getting a result right 50% of the time? Such a measure effectively hides that nasty key issue in the background.
And don't forget: the start of an S curve looks suspiciously like an exponential one.
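A toy numeric illustration of that point (my own made-up example, nothing from the report): well before its inflection point, a logistic curve and a pure exponential are almost indistinguishable.

```python
# Toy comparison: a logistic (S) curve vs. a pure exponential with matching
# early behaviour. Well before the inflection point the two are nearly
# identical -- which is why early data alone can't tell them apart.
import math

K, r, t0 = 1000.0, 0.5, 20.0   # carrying capacity, growth rate, inflection time

def logistic(t):
    return K / (1 + math.exp(-r * (t - t0)))

def exponential(t):
    # Exponential matched to the logistic's early phase: K * exp(-r*t0) * exp(r*t)
    return K * math.exp(-r * t0) * math.exp(r * t)

for t in [0, 4, 8, 12, 16, 20, 24]:
    print(f"t={t:2d}  logistic={logistic(t):8.2f}  exponential={exponential(t):8.2f}")
```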
Fully agree. Who would board a plane with only 50% odds of reaching its destination ...
This is also what most of my colleagues tell me about generated code (or meeting minutes, or sales pitches, for that matter): "Sure, it's not perfect, but you have to check ... that's normal." Well, if you have to check, why not just start on your own, upfront? Having had at times to troubleshoot someone else's code, I prefer whenever possible to restart from scratch. All the more so, as not practicing (code development, for example) leads to total skill loss in the medium term (and this is true for many abilities).
I suppose the old adage "every hour spent in thinking/design prevents seven hours in troubleshooting/debugging" still holds.
Definitely! And not only this adage but also the (more general) one that the earlier an issue is found (design, implementation, production, ...), the less costly its impact is on the project ...
which — it must be said — is really different from "move fast and break things".
Both approaches have their merits, but the problem is that people are not always very discerning about when they can "break things" and when they have to proceed in a more orderly, planned fashion ...
Using the 50% success point is based on standard practice for measuring human ability levels ("Item Response Theory"). See e.g. https://www.researchgate.net/figure/The-goal-of-item-response-theory-IRT-is-to-predict-the-probability-that-an-examinee_fig2_220249772
You can use whatever threshold you like - although the time horizon obviously will be lower for higher success probability
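For readers who want to see what that looks like concretely, here is a toy sketch (my own illustration with made-up data, not METR's actual code): fit a logistic curve of agent success against log human task length, then read off the task length at whatever success threshold you prefer. It also shows why a stricter threshold gives a shorter horizon.

```python
# Toy "item response"-style fit (made-up data, not METR's code): model agent
# success as a logistic function of log(human task length), then solve for the
# task length at which predicted success equals a chosen threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
agent_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])  # invented outcomes

X = np.log(human_minutes).reshape(-1, 1)        # log scale, as in the METR plot
model = LogisticRegression().fit(X, agent_success)

def time_horizon(threshold=0.5):
    """Task length (minutes) at which predicted success equals `threshold`."""
    b0 = model.intercept_[0]
    b1 = model.coef_[0][0]
    # Solve sigmoid(b0 + b1 * log(t)) = threshold for t.
    log_t = (np.log(threshold / (1 - threshold)) - b0) / b1
    return np.exp(log_t)

print(f"50% horizon: {time_horizon(0.5):.0f} min")
print(f"80% horizon: {time_horizon(0.8):.0f} min")  # stricter threshold -> shorter horizon
```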
This is interesting, I am going to dig in later. Thanks.
Thanks very much for taking the time to comment and respond/contribute to the conversation here, Beth: I think your input sheds important additional light on the report and helpfully supports the discussions here. From my perspective it's appreciated: thanks again.
Hi Gerben, fellow Enterprise Architect ;-), I didn't notice your name when I reacted to your comment. I did notice that you reacted to previous articles from Gary, but just not this time. Looks like we are quite frequently aligned ... ;-)
Another brilliant post.
At uni in the 80s we distinguished between second-hand reports and first-hand observations. Reports may be intriguing, but your duty is to read the original report, because that way you get information close to the source, and you avoid believing what is error, bias, or hype.
And a further thought ... from a trained scientist. All AI gives you is second-hand reports. They may be well written, but reports are never to be considered reliable: it is our duty to read the original research and make our own minds up.
AI may help lead you to that research ... but it doesn't absolve us of our duty to see for ourselves and make our own minds up.
And its suppliers mightily wish that it did, and that we'd simply accept AI output and forget all about our duty to see for ourselves.
400 years of sound scholarly principle, abandoned.
"Finally, the only thing METR looked at was “software tasks”. Software might be very different from other domains, in which case the graph (even it did make sense) might not apply."
Yes: *and* software tasks can be very different *between* software problem-solution domains. A task in one software language, hardware environment and even human problem-solution domain can be exponentially more or less complex than a similar "task" in a different one.
So: unless you're going to give very specific caveats or a clear description of the specific context in which a "software task" is being undertaken, both the underlying task findings and the graph may or may not even be relevant, let alone more or less applicable.
I use the Wordle test — seeing how well AI does on the NYT Games Wordle, Strands, Connections, Letter-Boxed, and Spelling Bee. As it’s ingested more game results, it’s gotten moderately better, but it’s still maddeningly mistake prone. A good way to stay humble about all AI’s advanced capabilities.
Appreciate the deep dive. Presented well! I had seen this on X and found it interesting, yet quickly moved on. To me the only takeaway from the report seemed to be that by adding tools, individual tasks can be better combined to accomplish longer tasks. But I also watched the GitHub "agent" struggle to make a directory on a Windows PC because it leads with bash rather than PowerShell. Plus it needs to be reminded when the codebase is larger than the context. Progress has been made; whether it's measurable is questionable.
wait, what?! doesn't mkdir work the ~same on both?
The Bash `mkdir` is a standalone utility that creates one or more directories and, with `-p`, will build missing parent paths (returning a non-zero exit code on failure). In PowerShell, `mkdir` is just an alias for `New-Item -ItemType Directory`: it also makes multiple folders but has no `-p`—you’d use `[IO.Directory]::CreateDirectory()` or custom logic for parents—and it throws errors (silenced only with `-Force`) instead of exit codes.
ah right, one of those "advanced" use cases that I also never remember for more than one OS, since I'm sure googling/asking a chatbot would point me to the right direction (after remembering to mention WINDOWS!!! on a second try), but that doing it 1 folder at a time (or python -c something something trial error os.makedirs) is probably faster 🤔
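For what it's worth, the portable way to sidestep the bash-vs-PowerShell question entirely is the Python route mentioned above (just a sketch, with a made-up path):

```python
# Cross-platform directory creation from Python: behaves the same on Windows,
# macOS and Linux, and creates missing parents (roughly bash's `mkdir -p`).
import os

os.makedirs("project/src/utils", exist_ok=True)  # no error if it already exists
```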
People putting questionable dots on a graph and connecting them to show "exponentials" has been a charlatan's endeavor since Kurzweil's 1999 prediction of self-driving cars by 2009.
That said, for the first time in history, the machinery does have what it takes to bring us to AGI. It just won't be so quick.
I would not be as positive as you sound, Oleg. I don't see anything in what is currently demonstrated that can lead to any form of AGI ... No understanding, no reasoning, no AGI (regardless of the definition one gives to the concept).
LLMs represent a massive advance when it comes to figuring out patterns in very poorly structured data when no model exists. That looked for decades like an intractable problem.
Reasoning by imitation is something chatbots can do well. Nowadays they likely also have internal tools for doing math and other simple modeling.
Not saying LLMs alone will become AGI. But building on current techniques, with more modeling of physics, symbolic approaches in places, and other refinements, can likely take things quite a long way.
Thoughts on Moore's Law?
Sorry for the ambiguity - the labels (like "count words in passage") refer to the specific instantiations of the tasks in our suite. We're not attempting to make claims about how long it takes humans to do *any* instantiation of that task.
I don't think it's a problem for our method if there can be different versions of the same task that take humans different lengths of time. I'm a little confused why you focus on this so much - maybe I'm missing something? Many of our tasks actually come in multiple difficulty variants which take different amounts of time.
Maybe I'm misunderstanding what you're saying, but the claim that "the graph tells us nothing about which is which or when any specific question will be solved" seems wrong - we see an imperfect but fairly strong correlation between time needed for a human to complete the task and the probability that the model succeeds. Of course this isn't a perfect predictor, and there are lots of edge-cases and fiddly things about how to count human time, but it seems to be at least moderately predictive and I'd guess this relationship is robust to methodological choices about how exactly to measure human time.
I agree that you can probably construct task suites that will give you any desired time horizon, but our claim would be that if you select your tasks in any kind of reasonable and non-contrived way then you'll see this relationship between time for human and difficulty for model. And even if the absolute time horizon number is not that meaningful, the trend is more meaningful.
I agree that the Y-axis is in some sense not that good of a metric, but it was the best we've been able to come up with. I'm interested if you have thoughts on alternative approaches that could be more meaningful? It seems like your implicit y axis is something like "how easy is it to find examples of tasks that are trivial for most humans where models fail", which is not unreasonable, but it's hard to get a quantitative metric from it. I think it's also evidence that this metric is imperfect that applying it to visually-impaired humans would conclude that they're not generally intelligent (given it's easy to find tasks that are trivial for most humans but ~impossible if you're blind)
Beth --
Thanks very much for the thoughtful answer.
There are a couple of issues with tasks like "count words in a passage". First, there isn't any "natural" length of a passage, and the time it takes people to count words in a passage (in a familiar script with easy word-break conventions) is pretty much proportional to the length of the passage. Your graph indicates that GPT-4, released in March 2023, could, with a 50% chance of being correct, count the words in a passage that a person could count in 2 minutes --- about 200 words, say. (I just ran the experiment on myself.) And you claim a 7-month doubling rate. Is it the case that SotA LLMs could do 100 words in August 2022, 50 words in January 2022, 400 words in November 2023, 800 words in May 2024, and 2,600 words now, and are you predicting that they will be able to do 100,000 words somewhere around April 2028? None of these variants is a less "natural" task than the one on your graph.
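To spell out the arithmetic behind those numbers (a back-of-envelope sketch of the extrapolation just described, taking the ~200-word / March 2023 anchor and the claimed 7-month doubling as given; dates are approximate):

```python
# Back-of-envelope version of the extrapolation above: assume ~200 words at the
# March 2023 point and a 7-month doubling time, then project forwards and back.
import math

words_at_anchor = 200          # passage a person can count in about 2 minutes
doubling_months = 7

def words_after(months_from_march_2023):
    return words_at_anchor * 2 ** (months_from_march_2023 / doubling_months)

# Offsets in months from March 2023 (dates approximate)
for label, months in [("Jan 2022", -14), ("Aug 2022", -7), ("Mar 2023", 0),
                      ("late 2023", 7), ("mid 2024", 14), ("now-ish", 26)]:
    print(f"{label}: ~{words_after(months):,.0f} words")

# How long until the trend predicts a 100,000-word passage?
months_to_100k = doubling_months * math.log2(100_000 / words_at_anchor)
print(f"100,000 words: ~{months_to_100k:.0f} months after March 2023, i.e. around 2028")
```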
Second, LLMs with the ability to invoke external code, which have existed since at least July 2023, could count essentially arbitrarily long passages with essentially perfect accuracy, and increasingly this is done in ways that are invisible to the user. It's not clear why these should be excluded from your graph. The idea of a "pure" LLM is increasingly a myth when what the end user is using is a transformer architecture with all kinds of extra bells and whistles --- hidden prompts, fine-tuning, RLHF, tuned tokenizers, interfaces with external software, and what not.
I don't have a better suggestion for the Y-axis, but I think that this whole approach to the question of progress in AI with a unitary scale for one particular species of AI is misguided. It seems to me that the right way to think about it is that LLMs, and more generally AI programs, and still more generally computer technology, have a variety of capabilities which increase over time but can't be usefully compared on any particular numerical scale. In the early 1940s the ENIAC and its kin could carry out highly complex physical calculations. In the late 1950s the Cobol compiler could translate code "written in English" (so it was claimed) into machine language. In the 1960s, computers were increasingly doing record keeping for banks and such. In 1976 the Kurzweil machine could read for the blind. In 2002 the Roomba was a useful, inexpensive, autonomous vacuum cleaner. What is the y-axis? It seems to me that that is the right context in which to think about the trajectory of the technology.
I certainly expect that computer technology and AIs will continue to gain in functionality. Transformer technology and next-token prediction may be a large part of that, or a small part of that, or may become as obsolete as drum memory and the OS/2 operating system. Moore's law was amazingly successful, but it was focused on one quite specific aspect of one corner of the technology. I don't think you can replicate that in the larger space of functionalities and technology capabilities generally.
Thanks for the response! By the way, would you be interested in being added to our "list of sceptics" to reach out to for feedback on drafts of future papers?
The "count words" task is indeed a coding task. All of our tasks are designed to test LLM agents that can run code and use tools to achieve the task on a virtual machine.
I agree that there are many ways to select tasks (such as reading increasingly long passages of text) where human time horizon is not a good predictor of model success. The claim is not that this is a perfect predictor across all task distributions - the claim is that if you take a reasonably diverse set of tasks, human time will be correlated enough with agent performance that you can use it to say something about the rate of progress on distributions of interest (such as autonomy-requiring software and ML research tasks).
I also agree that using time horizon as a general characterization of "what can computers do" is compressing a hugely multidimensional thing into one number in a way that may not be informative. And I agree that it's not a great predictor of progress in "what can computers do" - indeed, Moravec's paradox highlights that sometimes the relationship actually goes the opposite way!
However, firstly, as the saying goes, "all models are wrong but some models are useful". We still want to be able to help humanity understand what to expect from AI and decide what preparations or mitigations we should or shouldn't make. So anything that lets us make forecasts - even in a very lossy way that throws out a lot of the detail - can still be useful if we can use it to inform questions like "should we be preparing as if there's a >5% chance of LLM agents being able to dramatically accelerate computer-based research and engineering in the next 10 years?"
Secondly, I think that the LLM agents of the last few years actually have quite similar skill profiles to each other and to humans (relative to e.g. a distribution that encompasses ENIAC, DeepBlue, Roombas, + convnets), if we define a distribution of tasks (research and engineering tasks that you can do from a computer, that don't require too much GUI use or visual acuity) that is quite practical for predicting real-world impacts. Yes, there will still be some 1-minute tasks models fail at and some 8-hour tasks that they succeed at, but if all you know about a coding task is that it took a human 1 minute vs 8 hours, you should bet heavily that an LLM agent is more likely to succeed at the 1-minute one.
I'm not sure whether you actually disagree with any of the above, and whether you think this methodology has such big issues that it's not worth pursuing and can't help us be wiser about what to expect from LLMs in the future, or whether you are mostly just objecting to the oversimplified claim along the lines of "soon computers will be able to do anything a human can do in a month"
Beth --
By all means add me to your list of skeptics. davise@cs.nyu.edu
As an alternative approach to predicting progress in computer technology, you might want to look at Rod Brooks' annual predictions. He's been doing it since 2018, and so far his batting average is pretty fair. Anyway, I like his methodology.
https://rodneybrooks.com/predictions-scorecard-2025-january-01/
I wrote a poem grumbling about the overuse and misuse of the epigram "All models are wrong but some models are useful" a couple of years ago.
https://cs.nyu.edu/~davise/Verses/Models.html
A benchmark of getting things 50% correct is really concerning to me -- if a person only got these problems correct 50% of the time, we would not be impressed. Really, it highlights to me a deep problem in consistency for these models; it shows us exactly how they *aren't* reasoning. Another example came up when I was talking to a friend about this: he was explaining to me some of the math olympiad tests that were being given to these models. What struck me was that even when a model could answer a difficult instance of a particular problem, it could not always answer easier instances of it. To me, this really shows just how little "reasoning" these systems do, and that any "reasoning" seems much more like going down different avenues of various training sets than understanding what the question is asking and applying relevant information. I would expect any math college student who could solve, say, a 2-dimensional statistics problem to be able to solve basically any 1-dimensional version of the same problem. The fact that we can't expect these models to do the same says a lot.
As someone who has been interviewing software engineers for the last 20+ years, I can tell you that I look for precisely TWO things:
1. Ability to solve simple tasks without supervision and without stupid mistakes (that way I can dump some work on them and not worry about them sabotaging me at every turn).
2. Ability to recognize their limitations and ASK ME when in doubt and when a task is too complex for them (that way I don't waste time second-guessing them and proactively offering help… I can rely on them asking me when they are in a bind).
And that benchmark is purposefully arranged so as to not even attempt to measure these two most important things.
IOW, their benchmark deliberately defines AGI as “some human whom no sane employer would ever want to hire”… this may be useful for something or someone, but I would say it's as far from AGI as they can imagine: if something pretends to be an AGI and has to be treated the way an employer would treat a human, then the criteria should be similar to how we evaluate employees, shouldn't they?
I remember the most egregious case, when someone genuinely bright was rejected by the hiring committee after I interviewed them, because the interview question was of the form “can you always do X, or is it sometimes impossible” – and the candidate kept insisting that X is impossible, with me asking “why” around 10 times: “X is impossible because you need Y, and that's obviously impossible”, “but why is Y impossible?”, “because to do Y you need Z, which is obviously impossible”, “but why is Z impossible?”, … after 10 steps we arrived at something that was “obviously possible”, and the candidate was rejected… solely because the candidate refused to show any shred of self-doubt.
And now they throw away all that wisdom to invent something that would allow them to rubber-stamp AI as AGI? Gosh…
Using the 50% success point is based on standard practice for measuring human ability levels ("Item Response Theory"). See e.g. https://www.researchgate.net/figure/The-goal-of-item-response-theory-IRT-is-to-predict-the-probability-that-an-examinee_fig2_220249772
You can use whatever success threshold you like, the methodology still works
I believe that a “benchmark of getting things 50% correct” would be called a “stenchmark”.
Some see the glass half full.
Others see it half empty.
But anyway you look at it, it’s half assed.
It all reminds me of the Greenspan-coined term “irrational exuberance”.
You gotta love the “error bars” on that graph.
What complete BS.
“BS uncertainty”?
The mere fact that they are emulating human software engineers and their tools, compilers/interpreters, etc. boggles my mind. AGI? If there were any possibility of LLMs writing, debugging and maintaining code effectively and efficiently, it would be in machine code. Even then, assembly language and CPUs are the way they are for humans. When the investors pivot to chip design for software creation by “AI”, then I might start believing the hype. Just to add: if actual software engineers and their output were as bad as this (50%?), they would have long since been moved on to other things. AI hype on X, perhaps?