118 Comments
Kathleen Weber

When I learned that an LLM is simply an average of the Internet, I knew it could never be anything more than a large but shallow pool of mediocrity.

Larry Jewett

So, THAT’s what AI stands for? Average Internet?

Larry Jewett

Or is it “Average Idiot”?

Is there a difference?

Kathleen Weber

Ding! Ding! Ding! 🏆🏆🏆

Jack

If you view an LLM merely as a tool (as one should), this is important to keep in mind. These tools are only as accurate as a Google search, albeit quicker and more tailored to your specific question. They can be jacks of all trades but will never have deep (human-beating) expertise.

It's telling that in the areas where we see true superhuman performance out of AI (chess, go, certain robotics applications) the training data is largely synthetic, i.e. not from humans.

C. King

Kathleen Weber: Hugh (in a note below this one) is most likely correct. The distinction to be maintained, however, is between the intelligible that is only that, as in the natural and physical sciences, and the intelligible that is also intelligent, as in human things, e.g., the social sciences, arts, culture, history, religion, even economics (and sports).

The first is pretty standard--like medicine, which is off-the-charts-helpful. And the standardization, even of the laws, is, in part, not intelligent but commonly, functionally consistent--much more amenable and really helpful as AI trained.

The second, however, is intelligibility that is also intelligent (human). And it is in those areas that your "shallow pool of mediocrity" comment comes in, at least for now. Intelligence and history, where it lives, always has a heavy-duty non-systematic component, an empirical residue, so to speak, which is also the source of (or A source of) human questions and creativity. Exit: AI coherence.

So, the distinction is also between (a) the somewhat static and (b) the constitutionally dynamic, which is not sorted out (yet) for AI or (if I understand it well enough) for AGI.

There is also the element of normativity, which is underpinned by a value system that I seriously doubt can be copied, especially when related to the non-systematic element of human thinking and activities in real-time history.

Larry Jewett

Relying on LLMs for medical advice is just foolish.

The answers can not only be “mediocre” but just plain wrong. Dangerous even.

Larry Jewett

Medical “research” provided by an LLM can be just as wrong as medical advice.

LLMs are simply unreliable, so everything they return needs to be independently verified.

C. King

Larry Jewett: If you are referring to my comment about medicine, I was referring to research abilities, not specific medical advice.

Sebastian Tangen Alfsen

Actually, on all current medical benchmarks LLMs vastly outperform human doctors; see, for example, https://cpcbench.com/, which benchmarks LLMs on a set of extremely difficult medical cases published over the last 100 years.

This is not to say that this means that LLMs outperform a skilled specialist in their field of choice, a GP that has known you for a decade, or an entire hospital system at diagnostic information processing. However, if you have some question about your health, these benchmarks seem to imply that you will get a better answer if you approach an LLM, rather than a randomly sampled doctor.

hugh

That benchmark is so absurd. How well an LLM does on coding tasks has everything to do with what it’s been trained and fine-tuned on. Putting time to complete a task on the y-axis is silly.

Coding LLMs are all fine-tuned on popular stacks (ex: React/Next, Tailwind, FastAPI) and common patterns, so you can have an LLM cook up a React component that would have taken a dev a few days, even weeks or months, as long as what you need isn’t too far away from the training distribution (think dropdowns, profile pages, comment sections, CRUD endpoints, etc.). If not, then it’s mostly garbage code that will need to be rewritten.

It’s also quite hard to tell where the edge of the distribution is. In my experience I’ve been surprised how many basic tasks Claude Code falls apart on.

Of course the irony is that if you don’t know what you’re doing (most vibe coders don’t) then you’ll be generating code much lower quality than the equivalent open source library and taking way longer to get it working right.

Even the idea that you can accurately measure the time a coding task will take is laughable to any professional software engineer. SWE work isn’t like construction; it’s very hard to estimate timing.

Larry Jewett

I question the very use of “coding” as a valid gauge of machine “intelligence” (implicit in the graphical “extrapolation”).

That a computer can reproduce (and even recombine) computer code that it has trained on says, in itself, little (if anything) about the “intelligence” of the machine.

AI generated code has the imprint of intelligence due to the intelligence of the people who originally produced it even if the machine itself is utterly lacking in intelligence.

Meefburger

I often take your side of this disagreement, but to push back a little:

1) I would be pretty surprised if there's a non-contrived task that takes weeks or months for a dev to do properly that current models can do, even for a narrowly-scoped project (a contrived task would be something that's pointless, but easy for LLMs and slow for humans). I think the time horizon reflects, among other things, the model's ability to keep track of what it was doing and what it needs to do next. I think most of us who've played with these things have seen them just totally lose the plot halfway through solving a problem. So staying in-distribution can only help so much.

2) I think they measure the time directly. They have software developers do the task and see how long it takes them. I realize that, like any laboratory experiment, it's not a perfect reflection of the real world, but it is a measurement, not an estimate. And I'm pretty sure there are software dev tasks that most experienced devs would agree are, say, "a couple hours, not a couple days" or "something that takes at least four hours" or whatever.

I do think there are issues with the time horizon graph, but I don't think it's because they're making silly mistakes.

hugh

1) I mentioned this: “an LLM cook up a React component that would have taken a dev a few days, even weeks or months, as long as what you need isn’t too far away from the training distribution” — I’ve seen it happen a couple times. Most of the time I can find a better open source solution that’s actively maintained though.

2) I’m sure that they get a very wide distribution of task completion times on a decent-sized sample. I wouldn’t be surprised if any task under a week in their tests has a standard deviation larger than the mean. This is why the y-axis here is so laughable to any serious software engineer.

Meefburger

The error bars on their measurements are indeed huge. Still, given the huge dynamic range there is somewhat of a trend (e.g. 4s to 2min is a huge range, but it's clearly shorter than 17min to 2hrs). Like I said, I am dubious of the usefulness of that metric so far, but I don't think it's completely meaningless. You can see the data with error bars here: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Also, some of the authors on the METR paper, where the data from AI 2027 came from, are serious software engineers (e.g. former engineers at major AI labs), and they do not think it's laughable.
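
(For readers wondering what the disputed y-axis actually measures: below is a rough sketch of how a "50% time horizon" can be computed from per-task measurements. It follows the general logistic-fit idea described in the METR post linked above, but it is a toy reconstruction with invented numbers, not METR's code or data.)

import math

# Toy data: (human completion time in minutes, did the model succeed? 1/0).
# These numbers are invented purely for illustration.
tasks = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
         (60, 1), (60, 0), (120, 0), (240, 0), (480, 0), (960, 0)]

# Fit p(success) = sigmoid(a + b * log2(minutes)) by plain gradient ascent.
a, b = 0.0, 0.0
lr = 0.01
for _ in range(20000):
    grad_a = grad_b = 0.0
    for minutes, succeeded in tasks:
        x = math.log2(minutes)
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        grad_a += succeeded - p
        grad_b += (succeeded - p) * x
    a += lr * grad_a
    b += lr * grad_b

# The "50% time horizon" is the human task length at which the fitted
# success probability crosses 0.5, i.e. where a + b * log2(t) = 0.
horizon = 2 ** (-a / b)
print(f"estimated 50% time horizon: {horizon:.0f} human-minutes")

The wide error bars under discussion come from exactly this kind of fit being run on noisy, highly variable per-task data.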

Evan Wayne Miller

*Gasp* Gary, you mean to tell me that the doomer fanfiction written by a philosophy dropout, a guy who spreads debunked scientific racism, and a guy who specializes in “AGI Forecasting” is wrong and off target and will probably get pushed back again when 2030 comes around!?!?!?

This is news to me! I had absolutely no idea that this would happen. This is a real curveball I tell you.

Kathleen Weber

I share your shock and disappointment!

PH

It is human biodiversity, please. 😉

But seriously—yikes!

In principle, rationalism sounds great, but the hubris, the shallowness, and the fundamental lack of self-examination of their assumptions that it magically attracts make this club just unbearable.

And now, like a bizarre skepticism-fundamentalism horseshoe theory, they even end up where religious fundamentalists are: a doomsday cult.

It's probably like Immanuel Kant wrote: when we have freed ourselves from the chains of conventional wisdom, or what today is “political correctness,” we are still unaccustomed to such free movement. And so many people who strive to be independent, free thinkers just end up as cranks.

Larry Jewett

These folks obviously fashion themselves as today’s version of America’s Founding Fathers.

Maybe they are — just sans the intelligence, rationality, wisdom and foresight.

Alex Tolley

I was watching Melanie Mitchell talking about her work testing LLMs and analogies. [She has 2 very good papers about this, using the Chollet test examples.] The upshot was that although the larger LLMs do better, their performance quickly degraded as the test examples got harder, whilst humans maintained about the same average 85-90% score regardless of difficulty.

This is interesting, as these analogies are relatively "simple" visual ones, and not ones about concepts, e.g., waves in water and sound waves, or the idea that the [old] model of the atom, with electrons circling the nucleus, was like the planets orbiting the sun.

Without such an ability to build analogies, I see no real hope that AIs will "solve all our problems" as the best they can do is reanalyze data and apply different reasoning. There cannot be real breakthroughs without the capability to create new concepts and models to further progress. This is in addition to Dr. Marcus's insistence that AIs need to have internalized world models to be able to do tasks well without making silly mistakes.

I think humans are not going to be replaceable any time soon, except for some tasks that are best suited to computers - e.g., tasks requiring lots of working memory, staying vigilant during repetitive work, or tasks kept separate from the real-world dexterity needed to handle objects.

Larry Jewett

Perhaps an analog computer is fundamentally better than a digital one at producing analogies (including world models).

Alex Tolley

Based on what information or experiments? We do have analog computers, so it would be easy to test that hypothesis, although perhaps very expensive to scale it up to current data servers for AI. However, since letter-string analogies [e.g. ABCD->ABBCD, IJKL -> ?] were being solved in Lisp decades ago, it should be easy to test this level of analogy on small analog computers.
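
(For concreteness, the letter-string example above is trivial on a digital machine. Here is a minimal Python sketch - not Lisp, and nothing like Hofstadter and Mitchell's Copycat program - that handles only the single "duplicate one letter" rule implied by ABCD->ABBCD; the function names are made up for illustration.)

def infer_duplicated_index(src, dst):
    # Find the position whose letter was doubled to turn src into dst, if any.
    if len(dst) != len(src) + 1:
        return None
    for i in range(len(src)):
        if src[:i] + src[i] + src[i:] == dst:
            return i
    return None

def solve_analogy(src, dst, target):
    # ABCD : ABBCD :: target : ?  (covers only the "double one letter" rule)
    i = infer_duplicated_index(src, dst)
    if i is None or i >= len(target):
        raise ValueError("no simple duplication rule found")
    return target[:i] + target[i] + target[i:]

print(solve_analogy("ABCD", "ABBCD", "IJKL"))  # IJJKL

The open question in this exchange is whether an analog machine would do anything qualitatively different on such tasks, not whether a digital one can do them.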

TheAISlop

Gary! 🤣🤣 I just purchased my survival bunker with assurances that it would be installed by October of 2026. Now this. What’s an AI believer to do?

Kathleen Weber

Said believer just might wake up and smell the bullshit.

TheAISlop

’Twas sarcasm. 🙃🙃🙃🙃

Kathleen Weber

I knew that--I just kept the joke going. In the world of comedy, when a question is asked, someone has to answer.

Gary Marcus

Yes, and!

TheAISlop

BREAKING.... so social!

Larry Jewett

You can surely rest easy knowing that your bunker will still be useful in 1 (2 (5 (10 (50 (100 (…)))))) years when AGI finally inevitably occurs.

It would nonetheless be useful if there were an “AGI Index” to estimate the likelihood of AGI arriving on any given day, so that we might at least put our papers in order.

Larry Jewett

“The Second Coming of AGI”

The AGI is coming

(The second time this year)

The Valley Boyz are drumming

And leading with a cheer

The grapher’s madly plotting

Predicting the arrival

The rest of us allotting

The rice for our survival

Victoria Sponge

That caption - "doublings may get faster due to fewer new skills being needed at higher timescales" - just goes to show that these people have the two-dimensional understanding of intelligence that you'd expect of a child. At their very best they are stunted fan-fiction writers, if not outright stooges.

Jared

I really wish people like that co-author would just stop lying. It is hard to hear people so obviously lie about things you find deeply interesting and are knowledgeable about.

That co-author knew what he did and is backpedaling now that his lies have become too obvious for him to maintain.

Jimmy Roe

Oh good. I’m still considering quitting my hospital job rather than helping them build a gatekept AI system that they can then charge patients for access to. Since AI is a business model built on theft and one of the biggest scams ever perpetrated on America, I find its use in healthcare grotesque and unethical. I no longer trust the judgment of the leadership team at the hospital where I work. In fact, I’m acutely aware of their contempt for all of us employees and would like to move on. I also think the community would be outraged to know they decided to sell out every private interaction with a healthcare provider to Microsoft. But we haven’t gotten to that stage yet.

Mehdididit

Take heart! When our local hospital tested out computer models to replace hospital workers, the only thing that happened was that the patients ordered them out of their rooms and told them not to come back. They abandoned the plan before they even finished implementing it. Who knows how much money they spent at our publicly funded, nonprofit hospital.

Larry Jewett

The patients ordered the computer models out of their rooms and told them not to come back?

“Bad, bad AI! Leave now! And don’t come back!”

Former Philosopher

And so begins a promising, decades-long grift of making a career out of perpetually back-pedalling a nothing burger of a "seminal publication" with completely bs "predictions" 🥳🍾

Larry Jewett

I agree, but I believe you are mixing your metaphors.

Former Philosopher

Yes, so do most great authors.

George Burch

Gary, what is clear is that none of the AI failures you have described can be addressed by the architecture of genAI. You described the solution in terms of symbolic systems like CYC. It is not architecture; it is knowledge acquisition. Knowledge like common-sense facts must be acquired automatically, since knowledge acquisition by hand is just not feasible. The genAI guys know that but don't know how to solve the problem. Unless the knowledge acquisition problem is solved, no progress toward AGI is likely.

Chris McElroy

Been saying that from the beginning. We're not anywhere close to what they call AGI. And it may never happen, let alone by some particular date or year. If I had to bet on never versus within 10 years, I'd pick never.

C. King

"We" also need to know more about human consciousness as well as human foibles, not necessarily to copy either, but as a gauge, and to understand what cannot be copied, on principle.

But again, the starry-eyed, self-serving hyper-capitalists who are also anti-competitors (regardless of what they say they are) are as good at putting the cart before the horse as anyone in history has ever been. And if the cost of AI use goes up to "cover their investment," they are just carving out the same old divisions in cultures between the haves and the have-nots that have been around for centuries, only now it concerns sources of information and getting educated--and what is to happen to the model of public education? Sounds like regression to me.

BTW, I read in one of those New York Times articles (that I post here) that the substance of the lawsuit about AI companies using (stealing) writers', reporters', and news organizations' work has been denied by the defendants. What a shock.

Norman Fischer

This reminds me of that old doctors' joke.

Nurse: The patient who only has 3 months to live can't pay his bill.

Doctor: Well, give him another 3 months.

HALucinate

The groundbreaking reasoning of Gemini 3 Pro Preview... this was the very first thing I tried:

User

How many r in strrrrawberrrrrrrrry

Model

Thoughts

Reviewing the Input

I'm currently reviewing the input string, "strrrrawberrrrrrrrry", to accurately count the occurrences of the letter 'r'. I've broken down the string into individual characters and am now verifying each 'r' to ensure a precise count. This initial analysis is crucial for achieving an accurate result, and I am being meticulous in my review.

Verifying the Count

I've meticulously recounted the 'r's in "strrrrawberrrrrrrrry" to confirm my initial assessment. Breaking the string into segments and double-checking each occurrence, I'm confident in the final count of 12. My attention has been on a methodical verification process, eliminating the possibility of any errors.

Clearly Defining the Result

I'm now stating the result: the letter 'r' appears precisely 12 times in "strrrrawberrrrrrrrry". I've finalized and validated this value, concluding the processing stage. It's now time to output the final result.

Thought for 9.2s

There are 12 r's in "strrrrawberrrrrrrrry".

Here is the breakdown:

st rrr awbe rrrrrrrrr y

(3) + (9) = 12
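
(For reference, the count is easy to check mechanically. A two-line Python check, assuming the string is exactly as pasted above:)

word = "strrrrawberrrrrrrrry"
print(word.count("r"))  # prints 13 (4 + 9) for the string as pasted, not the model's 12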

Larry Jewett

Something clearly broke down in the breakdown.

Saty Chary

Lol, Gary!

TL;DR: "The sky isn't falling - yet".

This isn't as bad as Kurzweil's saying it for decades, so there's that :)

Larry Jewett

The public celebrates the “psychics” who got it right as “soothsayers” and simply forgets the vast majority of them who got it wrong.

That’s even true of the ones who made wrong predictions time and time again. If their latest prediction came true, that’s all that matters.

Jim Skelton

Surprise, surprise...
