Discussion about this post

Scenarica

The distinction between a 50% success rate and reliability is the most important methodological point in this piece, and it maps directly onto something I keep seeing in practice with teams deploying these tools.

A system that succeeds at 16-hour tasks 50% of the time is simultaneously two completely different products, depending on who's using it and how. In a research setting where a human checks the output before anything ships, 50% success is genuinely transformative. You run the agent, review the result, keep the wins, and discard the rest. Most developers using Claude Code right now work exactly this way, and it's why the product feels revolutionary to them even though the raw success rate would terrify an enterprise buyer.
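A toy calculation makes the review-loop point concrete. Treat each agent run as an independent attempt with a 50% chance of producing something usable (a simplifying assumption; real agent failures correlate), and ask what a human who reviews and retries actually experiences:

```python
# Toy model: a reviewer runs the agent repeatedly and keeps the first usable
# result. Assumes attempts are independent, which real failures are not.
def p_usable_within(per_attempt: float, attempts: int) -> float:
    """Probability that at least one of `attempts` runs is usable."""
    return 1 - (1 - per_attempt) ** attempts

for k in (1, 2, 3, 5):
    print(f"{k} reviewed attempt(s) at 50% per run: {p_usable_within(0.5, k):.1%}")
# 1 -> 50.0%, 2 -> 75.0%, 3 -> 87.5%, 5 -> 96.9%
```

With a reviewer filtering the output, a few retries push the effective reliability into the 90s; without one, the raw 50% is what the customer sees, which is exactly the gap the next point is about.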

In a production deployment where the output goes directly to a customer, 50% is a disaster. The reliability threshold for autonomous commercial use is somewhere between 95% and 99.9% depending on the domain, and the METR graph tells you nothing about how fast that gap is closing because it only measures the 50% line. The gap between 50% and 99% is where the entire "are we close to AGI" debate actually lives, and I think that's the single most useful framing anyone can take from this piece.

The log-scale point needs to be said louder. Linear-looking progress on a log-scale axis reads as exponential, and the visual design of that graph is doing enormous persuasive work that the underlying data doesn't support. Plot the same improvements on a linear y-axis and you get steady incremental gains: impressive, but not the hockey stick that triggered the panic.
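If you want to check the visual effect yourself, a few lines of matplotlib will replot the same series on both axes. The numbers here are invented for illustration (a task duration that doubles every seven months, loosely METR-shaped), not METR's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(0, 60)                  # five years of monthly points
duration_min = 2.0 * 2 ** (months / 7.0)   # illustrative: doubles every ~7 months

fig, (ax_log, ax_lin) = plt.subplots(1, 2, figsize=(9, 3.5))
ax_log.plot(months, duration_min)
ax_log.set_yscale("log")
ax_log.set_title("log y-axis")
ax_lin.plot(months, duration_min)
ax_lin.set_title("linear y-axis")
for ax in (ax_log, ax_lin):
    ax.set_xlabel("months")
    ax.set_ylabel("task duration (minutes)")
plt.tight_layout()
plt.show()
```

Same data, two very different impressions, which is the point: how alarming the trend looks depends heavily on which axis you happen to be staring at.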

Your observation about symbolic tools is the one that will age best, though. If a big chunk of the improvement comes from models learning to use search and code-execution tools rather than from the neural network getting fundamentally smarter, then the progress curve hits a ceiling wherever tool integration saturates. And once it does, the bottleneck shifts back to core reasoning, which is exactly where the limitations you've been documenting for years reassert themselves. Both readings of this graph are probably correct simultaneously, and the honest response is somewhere between the panic and the dismissal.

Ex-Consultant in Tech

I think the mistake is treating “task duration” as if it were a clean proxy for autonomy. A 16-hour software task can still be strangely narrow. Most real work is not like that: the hard part is not just doing longer chains of steps. It is knowing which problem matters, when the stated problem is wrong, which constraints are political rather than technical, when to stop, when to escalate, and what kind of mistake the organization can actually tolerate.

That is why I find the “longer tasks = imminent autonomy” framing misleading.

