Discussion about this post

Scenarica

The 50% versus reliability distinction is the most important methodological point in this piece and it maps directly onto something I keep seeing in practice with teams deploying these tools.

A system that succeeds at 16-hour tasks 50% of the time is simultaneously two completely different products depending on who's using it and how. In a research setting where a human checks the output before anything ships, 50% success is genuinely transformative. You run the agent, review the result, keep the wins and discard the rest. Most developers using Claude Code right now work exactly this way, and it's why the product feels revolutionary to them even though the raw success rate would terrify an enterprise buyer.

In a production deployment where the output goes directly to a customer, 50% is a disaster. The reliability threshold for autonomous commercial use is somewhere between 95% and 99.9% depending on the domain, and the METR graph doesn't tell you anything about how fast that gap is closing because it only measures the 50% line. The gap between 50% and 99% is where the entire "are we close to AGI" debate actually lives, and I think that's the single most useful framing anyone can take from this piece.
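One way to make the 50%-versus-99% gap concrete (a toy sketch with made-up numbers, not anything from the METR data): if a long task decomposes into a chain of subtasks whose failures compound independently, even very high per-step reliability erodes fast over a long chain.

```python
# Illustrative sketch, hypothetical numbers: a long task modeled as a chain
# of n independent subtasks, each succeeding with probability p.

def end_to_end(p: float, n: int) -> float:
    """Probability that all n independent subtasks succeed."""
    return p ** n

# A 50-subtask job at 99% per-step reliability still fails ~4 times in 10:
print(round(end_to_end(0.99, 50), 3))   # 0.605
# Reaching ~95% end-to-end on those 50 steps needs ~99.9% per step:
print(round(end_to_end(0.999, 50), 3))  # 0.951
```

The independence assumption is generous to the agent; correlated errors or error recovery change the numbers, but the compounding is why "50% at long horizons" and "commercially deployable" are such different milestones.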

The log-scale point needs to be said louder. A straight line on a log-scale axis represents exponential growth, and the visual design of that graph is doing enormous persuasive work that the underlying data doesn't support. Plot the same improvements on a linear y-axis and you get steady incremental gains, impressive but not the hockey stick that triggered the panic.

Your observation about symbolic tools is the one that will age best, though. If a big chunk of the improvement comes from models learning to use search and code execution tools rather than from the neural network getting fundamentally smarter, then the progress curve hits a ceiling wherever tool integration saturates. And once it does, the bottleneck shifts back to core reasoning, which is exactly where the limitations you've been documenting for years reassert themselves. Both readings of this graph are probably correct simultaneously, and the honest response is somewhere between the panic and the dismissal.

TheAISlop

Riddle me this, Batman: if mythos is now said to be the best programmer in the world, or in the top ten, how is it that it can only get 50 percent of programming work done correctly under any circumstances? Wouldn't that also mean that the best programmer in the world would only get 50 percent of a task done correctly?

12 more comments...
