The 50% versus reliability distinction is the most important methodological point in this piece and it maps directly onto something I keep seeing in practice with teams deploying these tools.
A system that succeeds at 16-hour tasks 50% of the time is simultaneously two completely different products depending on who's using it and how. In a research setting where a human checks the output before anything ships, 50% success is genuinely transformative. You run the agent, review the result, keep the wins and discard the rest. Most developers using Claude Code right now work exactly this way, and it's why the product feels revolutionary to them even though the raw success rate would terrify an enterprise buyer.
In a production deployment where the output goes directly to a customer, 50% is a disaster. The reliability threshold for autonomous commercial use is somewhere between 95% and 99.9% depending on the domain, and the METR graph doesn't tell you anything about how fast that gap is closing because it only measures the 50% line. The gap between 50% and 99% is where the entire "are we close to AGI" debate actually lives, and I think that's the single most useful framing anyone can take from this piece.
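To make the size of that gap concrete, here is a back-of-the-envelope sketch (my own illustrative arithmetic, nothing from METR): if an autonomous workflow chains several agent steps together and each step succeeds independently, per-step reliability compounds fast.

```python
# Back-of-the-envelope sketch with made-up numbers (not METR data):
# end-to-end success of a workflow chaining n independent steps,
# each of which succeeds with probability p.

def end_to_end_success(p: float, n_steps: int) -> float:
    """Probability that every one of n_steps succeeds, assuming independence."""
    return p ** n_steps

for p in (0.50, 0.95, 0.99):
    row = ", ".join(f"{n} steps: {end_to_end_success(p, n):.1%}" for n in (1, 5, 10))
    print(f"per-step {p:.0%} -> {row}")

# per-step 50% -> 1 steps: 50.0%, 5 steps: 3.1%, 10 steps: 0.1%
# per-step 95% -> 1 steps: 95.0%, 5 steps: 77.4%, 10 steps: 59.9%
# per-step 99% -> 1 steps: 99.0%, 5 steps: 95.1%, 10 steps: 90.4%
```

At 50% per step, anything longer than a couple of chained steps is essentially guaranteed to need a human in the loop; at 99% the degradation is merely gradual.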
The log-scale point needs to be said louder. Linear progress on a log scale looks exponential, and the visual design of that graph is doing enormous persuasive work that the underlying data doesn't support. Plot the same improvements on a linear y-axis and you get steady incremental gains, impressive but not the hockey stick that triggered the panic.
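Anyone who wants to see how much work the axis choice is doing can replot a doubling trend both ways and judge for themselves; here is a minimal sketch with synthetic numbers (not METR's actual data).

```python
# Minimal sketch (synthetic numbers, not METR's data) of how the same series
# reads on a log y-axis versus a linear y-axis.
import matplotlib.pyplot as plt

# Hypothetical task-length horizon doubling every release period, in hours.
periods = list(range(8))
horizon_hours = [0.25 * 2 ** t for t in periods]

fig, (ax_log, ax_lin) = plt.subplots(1, 2, figsize=(9, 3.5))

ax_log.plot(periods, horizon_hours, marker="o")
ax_log.set_yscale("log")
ax_log.set_title("Log y-axis")

ax_lin.plot(periods, horizon_hours, marker="o")
ax_lin.set_title("Linear y-axis, same points")

for ax in (ax_log, ax_lin):
    ax.set_xlabel("release period")
    ax.set_ylabel("task horizon (hours)")

plt.tight_layout()
plt.show()
```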
Your observation about symbolic tools is the one that will age best, though. If a big chunk of the improvement comes from models learning to use search and code execution tools rather than from the neural network getting fundamentally smarter, then the progress curve hits a ceiling wherever tool integration saturates. And once it does, the bottleneck shifts back to core reasoning, which is exactly where the limitations you've been documenting for years reassert themselves. Both readings of this graph are probably correct simultaneously, and the honest response is somewhere between the panic and the dismissal.
Outstanding comment! Especially the point on different modes of use, autonomous vs. not.
<Your observation about symbolic tools is the one that will age best though. If a big chunk of the improvement comes from models learning to use search>
I was recently using Google Gemini to try to find movies whose themes matched a target movie, e.g., "Find me movies that are like the 'African Queen' (1954) with Bogart and Hepburn". I assumed it would trawl through the movies, read the brief synopsis and other info, and return some movies. To make sure it wasn't hallucinating, I had the AI provide details so that I could verify the output. On the first run, it just used Google search. So on the next run I asked it to only use IMDB. It returned some good movies with explanations of why they were similar. BUT, it turned out it was using IMDB's Advanced Search to find similar movies. Clever, but it was just using a [symbolic] tool, probably an SQL query, to find the movies. IOW, it wasn't doing what I was looking for, which was to test whether it could reason by analogy; it simply tapped into the movie database using a search function. Gemini even showed me how to do the search (which was nice - I learned something), but it was clearly not trying to find movies by analogy to the target.
I think the mistake is treating “task duration” as if it were a clean proxy for autonomy. A 16-hour software task can still be strangely narrow. Most real work is not like that: the hard part is not just doing longer chains of steps. It is knowing which problem matters, when the stated problem is wrong, which constraints are political rather than technical, when to stop, when to escalate, and what kind of mistake the organization can actually tolerate.
That is why I find the “longer tasks = imminent autonomy” framing misleading.
Excellent observation. In the real world, work toward a complex task is usually non-linear, and a human has to constantly adapt as new information is incorporated. That an AI can run for 16 hours straight without breaking might matter in some narrow domains, but not in most.
Riddle me this, Batman: if Mythos is now said to be the best programmer in the world, or in the top ten, how is it that it can only get 50 percent of programming work done correctly under any circumstances? Wouldn't that also mean that the best programmer in the world would only get 50 percent of a task done correctly?
You can make an exponential chart of anything if you get to choose any arbitrary points in time for the analysis…
One other question that none of the boosters answer is how much it costs. How much of a datacenter is required to run, say, Mythos on some large, complex codebase? How many kWh does that consume? Etc. If it takes a gigawatt-hour to run the task, then it is likely cheaper to have a team of coders.
Having said that, I thought Mozilla's results show promise - see https://stiennon.substack.com/p/more-mythos-and-mozilla - but it is quite clear that Mozilla is, like the ESR example I mentioned in a reply here somewhere, using Mythos as part of a process and carefully supervising it. That's what you have to do when a tool is nowhere close to 90% accurate.
Human cognition is severely flawed. LLM-based chatbot cognition is severely flawed. It is therefore unsurprising that many humans' perception of LLM-based chatbot cognition is severely flawed.
Isn't the name "Mythos" a tell?
I don't know. I've never had a "thos," so I can't tell.
In almost any human endeavor I can think of, for routine or non-emergency work, the expectation is a very high level of success. If you go to a lawyer, you expect the professional output to be very precise. You might not win the case in a conflict, but for routine contracts, wills, and trusts, you certainly expect more than 90% perfection. You go to a surgeon, and you expect very low complication rates and more than 80 or 90% success. Therefore, when I read of a benchmark that is 50% success for what are really routine tasks, that is not acceptable. That is a fiction. It implies that much more development work is needed.
Thank you for always bringing reasonable views to the soup of hype.
Lately whenever I see these “AGI is here!” posts it just makes me feel like it’s all propaganda paid for by the marketing teams of the foundation model companies.
They need this steady drumbeat of hype to justify their spending, lack of real progress, and regular product flops. I wouldn’t be surprised if one day, people look back and wonder at how much corporate money was paid to cultural influencers.
But who knows. Maybe in the cycle of time, this generation will just be a strange period of mass-hysteria/delusion. One stop on a long train line.
A rapidly advancing technology needs rapidly advancing terminology. For decades after ignominious AI Winters, the term “AI” was not used much in polite public discourse or grant proposals, replaced by office automation, expert systems, knowledge-based systems, neural networks, workflow management, semantic networks, machine learning, deep learning, etc. Ultra-intelligence and the Singularity were also retired. AGI came along around 1990 to cover quiet research efforts in the area.

In 2022/2023 LLMs brought “AI” back into respectability: Generative AI and AGI. AGI was ill-defined, so companies could say it was very close. But there was a problem. If a system is super smart in some topics but makes mistakes a child wouldn’t in others, saying it has general intelligence isn’t credible. Discussion of its imminence seemed to recede. Another challenge is that it was long assumed that AGI would lead to self-educating machines immediately vastly surpassing human intelligence: the Singularity. But our weaker albeit undefined AGI won’t. Hence, for example, Ray Kurzweil predicts AGI by 2029 and the Singularity by 2045.

In this post I first encountered ASI, not defined but presumably Artificial Superintelligence. It need not be AGI or the Singularity. For example, Deep Blue had artificial superintelligence in chess, yet it wasn’t especially intelligent in anything else. We may get ASI in some aspects of coding, but that may no more end software engineering than Deep Blue ended people playing chess. “Narrow Superintelligence” may be cooler than “Agents” and excite investors or accelerate public antagonism to AI, but could also be a welcome step toward focusing just on what is actually useful.
What do you think about the ARC-AGI-3 benchmarks and the work being done by Symbolica and Ben Goertzel?
🤣 "Trillion Pound Baby Fallacy" - that's gold. I am going to start using that.
Where is my 5 minute AGI?! lol
While (some?) math can be checked formally for correctness, code generally cannot. If code could be formally verified, Uncle Edsger would probably rise from his grave. One is about as realistic as the other.
Claude Code is amazing in some ways but in the end suffers from exactly the same limitations as other uses of stochastic pattern generation.
Even the host of MLST (Tim S.) has said that he has been impressed by how much generalization has been possible via the “pattern matching” approach—and he is far from someone who engages in AI hype. I am regularly impressed and then disappointed by my interactions with these models! I actually find them quite useful; however, though I could be wrong, I suspect that robust generalization will remain elusive until AI starts taking on characteristics associated with living organisms. Arguably, our dogs have a higher degree of general intelligence than the frontier LLMs.
The thing that absolutely fascinates me about the METR graph is just how big the gap between 50% and 80% is.
The Mythos preview went up to 18 hours on the 50% line, but only 3 hours on the 80% line. Which means there's a 15-hour difference between the two! That's going to leave people with some very inconsistent results!
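One way to see why those two horizons can sit so far apart: as I understand it, METR fits something like a logistic curve of success probability against log task duration, and the 50% and 80% horizons are just two crossings of that curve. A fixed horizontal gap on a log-time axis becomes a large multiplicative gap in hours. The sketch below uses made-up parameters, chosen only so the toy curve lands near the 18 h / 3 h numbers quoted above; it is not METR's actual fit.

```python
# Rough sketch, not METR's actual fit: a logistic model of success probability
# versus log2(task duration), with made-up parameters, to show why the 50% and
# 80% horizons can differ by a large multiple.
import math

def success_prob(duration_hours: float, h50_hours: float, slope: float) -> float:
    """P(success) for a task of the given duration, logistic in log2(duration)."""
    x = math.log2(duration_hours) - math.log2(h50_hours)
    return 1.0 / (1.0 + math.exp(slope * x))

def horizon(target_prob: float, h50_hours: float, slope: float) -> float:
    """Task duration at which the modeled success probability equals target_prob."""
    # Invert the logistic: solve for duration given the target probability.
    x = math.log(1.0 / target_prob - 1.0) / slope
    return h50_hours * 2.0 ** x

H50 = 18.0    # hypothetical 50% horizon, in hours
SLOPE = 0.55  # hypothetical steepness of the success curve

print(f"50% horizon: {horizon(0.50, H50, SLOPE):.1f} h")   # 50% horizon: 18.0 h
print(f"80% horizon: {horizon(0.80, H50, SLOPE):.1f} h")   # 80% horizon: 3.1 h
print(f"check: P(success) at 3.1 h = {success_prob(3.1, H50, SLOPE):.0%}")  # 80%
```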
50% successful at carefully selected tasks that AI might be able to do, eesh.