When I learned that an LLM is simply an average of the Internet, I knew it could NEVER be anything more than a large but shallow pool of mediocrity.
So, THAT’s what AI stands for? Average Internet?
Or is it “Average Idiot”?
Is there a difference?
Ding! Ding! Ding! 🏆🏆🏆
If you view an LLM merely as a tool (as one should), this is important to keep in mind. These tools are only as accurate as a Google search, albeit quicker and more tailored to your specific question. They can be jacks of all trades but will never have deep (human-beating) expertise.
It's telling that in the areas where we see true superhuman performance out of AI (chess, Go, certain robotics applications), the training data is largely synthetic, i.e. not from humans.
Kathleen Weber: Hugh (in a note below this one) is most likely correct. The distinction to be maintained, however, is between the intelligible that is only that, as in the natural and physical sciences, and the intelligible that is also intelligent, as in human things, e.g., the social sciences, arts, culture, history, religion, even economics (and sports).
The first is pretty standard--like medicine, which is off-the-charts helpful. And the standardization, even of the laws, is, in part, not intelligent but commonly, functionally consistent--which makes it much more amenable to AI training, and genuinely helpful once trained.
The second, however, is intelligibility that is also intelligent (human). And it is in those areas that your "shallow pool of mediocrity" comment comes in, at least for now. Intelligence and history, where it lives, always has a heavy-duty non-systematic component, an empirical residue, so to speak, which is also the source of (or A source of) human questions and creativity. Exit: AI coherence.
So the distinction is also between (a) the somewhat static and (b) the constitutionally dynamic, and the latter is not sorted out (yet) for AI or (if I understand it well enough) for AGI.
There is also the element of normativity, underpinned by a value system that I seriously doubt can be copied, especially where it touches the non-systematic side of human thinking and activity in real-time history.
Relying on LLMs for medical advice is just foolish.
The answers can not only be “mediocre” but just plain wrong. Dangerous even.
Medical “research” provided by an LLM can be just as wrong as medical advice.
LLMs are simply unreliable so everything they return needs to be independently verified.
Larry Jewett: If you are referring to my comment about medicine, I was referring to research abilities, not specific medical advice.
Actually, on all current medical benchmarks LLMs vastly outperform human doctors; see for example https://cpcbench.com/, which benchmarks LLMs on a set of extremely difficult medical cases published over the last 100 years.
This is not to say that LLMs outperform a skilled specialist in their field of choice, a GP who has known you for a decade, or an entire hospital system at diagnostic information processing. However, if you have some question about your health, these benchmarks seem to imply that you will get a better answer from an LLM than from a randomly sampled doctor.
That benchmark is so absurd. How well an LLM does on coding tasks has everything to do with what it’s been trained and fine-tuned on. Putting time-to-complete-task on the y-axis is silly.
Coding LLMs are all fine-tuned on popular stacks (e.g. React/Next, Tailwind, FastAPI) and common patterns, so you can have an LLM cook up a React component that would have taken a dev a few days, even weeks or months, as long as what you need isn’t too far away from the training distribution (think dropdowns, profile pages, comment sections, CRUD endpoints, etc.). If not, then it’s mostly garbage code that will need to be rewritten.
It’s also quite hard to tell where the edge of the distribution is. In my experience I’ve been surprised by how many basic tasks Claude Code falls apart on.
Of course the irony is that if you don’t know what you’re doing (most vibe coders don’t), you’ll be generating code of much lower quality than the equivalent open source library and taking way longer to get it working right.
Even the idea that you can accurately measure the time a coding task will take is laughable to any professional software engineer. SWE work isn’t like construction; it’s very hard to estimate timing.
I question the very use of “coding” as a valid gauge of machine “intelligence” (implicit in the graphical “extrapolation”).
That a computer can reproduce (and even recombine) computer code it has trained on says, in itself, little (if anything) about the “intelligence” of the machine.
AI-generated code bears the imprint of intelligence because of the intelligence of the people who originally produced it, even if the machine itself is utterly lacking in intelligence.
I often take your side of this disagreement, but to push back a little:
1) I would be pretty surprised if there's a non-contrived task that takes weeks or months for a dev to do properly that current models can do, even for a narrowly-scoped project (a contrived task would be something that's pointless, but easy for LLMs and slow for humans). I think the time horizon reflects, among other things, the model's ability to keep track of what it was doing and what it needs to do next. I think most of us who've played with these things have seen them just totally lose the plot halfway through solving a problem. So staying in-distribution can only help so much.
2) I think they measure the time directly. They have software developers do the task and see how long it takes them. I realize that, like any laboratory experiment, it's not a perfect reflection of the real world, but it is a measurement, not an estimate. And I'm pretty sure there are software dev tasks that most experienced devs would agree are, say, "a couple hours, not a couple days" or "something that takes at least four hours" or whatever.
I do think there are issues with the time horizon graph, but I don't think it's because they're making silly mistakes.
1) I mentioned this: “an LLM cook up a React component that would have taken a dev a few days, even weeks or months, as long as what you need isn’t too far away from the training distribution” — I’ve seen it happen a couple times. Most of the time I can find a better open source solution that’s actively maintained though.
2) I’m sure that they get a very wide distribution of task completion times on a decent-sized sample. I wouldn’t be surprised if any task under a week in their tests has a standard deviation larger than the mean. This is why the y-axis here is so laughable to any serious software engineer.
The error bars on their measurements are indeed huge. Still, given the huge dynamic range there is somewhat of a trend (e.g. 4s to 2min is a huge range, but it's clearly shorter than 17min to 2hrs). Like I said, I am dubious of the usefulness of that metric so far, but I don't think it's completely meaningless. You can see the data with error bars here: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
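To put a number on that intuition, here is a throwaway sketch (Python; the endpoints are just the ballpark figures quoted above, not METR's actual data) showing how ranges with huge internal spread can still separate cleanly on a log scale:

```python
# Rough illustration only: the endpoints below are the ballpark figures quoted in
# this thread, not METR's actual measurements.
import math

def geometric_midpoint(low_s: float, high_s: float) -> float:
    """Midpoint of a range on a log scale, in seconds."""
    return math.sqrt(low_s * high_s)

short_range = (4, 2 * 60)          # roughly 4 s to 2 min
long_range = (17 * 60, 2 * 3600)   # roughly 17 min to 2 hrs

print(geometric_midpoint(*short_range))  # ~22 s
print(geometric_midpoint(*long_range))   # ~2710 s, i.e. about 45 min

# Despite the huge spread within each range, the ranges themselves don't overlap:
print(short_range[1] < long_range[0])    # True
```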
Also, some of the authors on the METR paper, where the data from AI 2027 came from, are serious software engineers (e.g. former engineers at major AI labs), and they do not think it's laughable.
*Gasp* Gary, you mean to tell me that the doomer fanfiction written by a philosophy dropout, a guy who spreads debunked scientific racism, and a guy who specializes in “AGI Forecasting” is wrong and off target and will probably get pushed back again when 2030 comes around!?!?!?
This is news to me! I had absolutely no idea that this would happen. This is a real curveball I tell you.
I share your shock and disappointment!
It is human biodiversity, please. 😉
But seriously—yikes!
In principle, rationalism sounds great, but the hubris, the shallowness, and the basic unwillingness to examine their own assumptions that it magically attracts make this club just unbearable.
And now, like a bizarre skepticism-fundamentalism horseshoe theory, they even end up where religious fundamentalists are: a doomsday cult.
It's probably like Immanuel Kant wrote: when we have freed ourselves from the chains of conventional wisdom, or what today is “political correctness,” we are still unaccustomed to such free movement. And so many people who strive to be independent, free thinkers just end up as cranks.
These folks obviously fashion themselves as today’s version of America’s Founding Fathers.
Maybe they are — just sans the intelligence, rationality, wisdom and foresight.
I was watching Melanie Mitchell talking about her work testing LLMs and analogies. [She has 2 very good papers about this, using the Chollet test examples.] The upshot was that although the larger LLMs do better, their performance quickly degraded as the test examples got harder, whilst humans maintained about the same average 85-90% score regardless of difficulty.
This is interesting, as these analogies are relatively "simple" visual ones, and not ones on concepts, e.g., waves in water and sound waves, or the idea that the [old] model of the atom of electrons circling the nucleus was like the planets orbiting the sun.
Without such an ability to build analogies, I see no real hope that AIs will "solve all our problems" as the best they can do is reanalyze data and apply different reasoning. There cannot be real breakthroughs without the capability to create new concepts and models to further progress. This is in addition to Dr. Marcus's insistence that AIs need to have internalized world models to be able to do tasks well without making silly mistakes.
I think humans are not going to be replaceable any time soon, except for some tasks that are best suited to computers - e.g., tasks requiring lots of working memory, or staying vigilant through repetitive work - and that remain separate from real-world tasks needing dexterity to handle objects.
Perhaps an analog computer is fundamentally better than a digital one at producing analogies (including world models)
Based on what information or experiments? We do have analog computers, so it would be easy to test that hypothesis, although perhaps very expensive to scale up to the data centers currently used for AI. However, since letter-string analogies were done in Lisp decades ago [e.g. ABCD -> ABBCD, IJKL -> ?], it should be easy to test this level of analogy on small analog computers.
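For anyone unfamiliar with that letter-string format (presumably the Copycat-style domain of Hofstadter and Mitchell), here is a toy sketch; the rule is hand-coded, whereas the whole point of the original Lisp work was to discover it:

```python
# Toy illustration of the letter-string analogy quoted above: ABCD -> ABBCD.
# One plausible reading of that mapping is "double the second letter."

def double_second_letter(s: str) -> str:
    """Apply the rule inferred from ABCD -> ABBCD."""
    return s[0] + s[1] * 2 + s[2:]

assert double_second_letter("ABCD") == "ABBCD"
print(double_second_letter("IJKL"))  # IJJKL
```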
Gary! 🤣🤣 I just purchased my survival bunker with assurances that it would be installed by October of 2026. Now this. What’s an AI believer to do?
Said believer just might wake up and smell the bullshit...
’Twas sarcasm. 🙃🙃🙃🙃
I knew that--I just kept the joke going. In the world of comedy, when a question is asked, someone has to answer.
Yes, and!
BREAKING.... so social!
You can surely rest easy knowing that your bunker will still be useful in 1 (2 (5 (10 (50 (100 (…)))))) years when AGI finally, inevitably occurs.
It would nonetheless be useful if there were an “AGI Index” to estimate the likelihood of AGI arriving on any given day, so that we might at least put our papers in order.
“The Second Coming of AGI”
The AGI is coming
(The second time this year)
The Valley Boyz are drumming
And leading with a cheer
The grapher’s madly plotting
Predicting the arrival
The rest of us allotting
The rice for our survival
That caption - "doublings may get faster due to fewer new skills being needed at higher timescales" - just goes to show that these people have the two-dimensional understanding of intelligence you'd expect of a child. At their very best they are stunted fan-fiction writers, if not outright stooges.
I really wish people like that co-author would just stop lying. It is hard to hear people so obviously lie about things you find deeply interesting and are knowledgeable about.
That co-author knew what he was doing and is backpedaling now that his lies have become too obvious to maintain.
Oh good. I’m still considering quitting my hospital job rather than helping them build a gatekept AI system that they can then charge patients for access to. Since AI is a business model built on theft and one of the biggest scams ever perpetrated on America, I find its use in healthcare grotesque and unethical. I no longer trust the judgment of the leadership team at the hospital where I work. In fact, I’m acutely aware of their contempt for all of us employees and would like to move on. I also think the community would be outraged to know they decided to sell out every private interaction with a healthcare provider to Microsoft. But we haven’t gotten to that stage yet.
Take heart! When our local hospital tested out computer models to replace hospital workers the only thing that happened was the patients ordered them out of their rooms and told them not to come back. They abandoned the plan before they even finished implementing it. Who knows how much money they spent at our publicly funded, nonprofit hospital.
The patients ordered the computer models out of their rooms and told them not to come back?
“Bad, bad AI! Leave now! And don’t come back!”
And so begins a promising, decades-long grift of making a career out of perpetually back-pedalling a nothing burger of a "seminal publication" with completely bs "predictions" 🥳🍾
I agree, but I believe you are mixing your metaphors.
Yes, so do most great authors.
Gary, what is clear is that none of the AI failures you have described can be addressed by the architecture of genAI. You described the solution in terms of symbolic systems like CYC. It is not architecture; it is knowledge acquisition. Knowledge like common-sense facts must be acquired automatically, since knowledge acquisition by hand is just not feasible. The genAI guys know that but don't know how to solve the problem. Unless the knowledge-acquisition problem is solved, no progress toward AGI is likely.
Been saying that from the beginning. We're not anywhere close to what they call AGI. And it may never happen, let alone by a specific date or year. If I had to bet on never or within 10 years, I'd pick never.
"We" also need to know more about human consciousness as well as human foibles, not necessarily to copy either, but as a gauge, and to understand what cannot be copied, on principle.
But again, the starry-eyed, self-serving hyper-capitalists who are also anti-competitive (regardless of what they say they are) are as good at putting the cart before the horse as anyone in history has ever been. And if the cost of AI use goes up to "cover their investment," they are just carving out the same old divisions between the haves and the have-nots that have been around for centuries, only now it concerns sources of information and getting educated--and what is to happen to the model of public education? Sounds like regression to me.
BTW, I read in one of those New York Times articles (that I post here) that the defendants have denied the substance of the lawsuit about AI companies using (stealing) writers', reporters', and news organizations' work. What a shock.
This reminds me of that old doctor joke.
Nurse: The patient who only has 3 months to live can't pay his bill.
Doctor: Well, give him another 3 months.
The groundbreaking reasoning of Gemini 3 Pro Preview... this was the very first thing I tried:
User
How many r in strrrrawberrrrrrrrry
Model
Thoughts
Reviewing the Input
I'm currently reviewing the input string, "strrrrawberrrrrrrrry", to accurately count the occurrences of the letter 'r'. I've broken down the string into individual characters and am now verifying each 'r' to ensure a precise count. This initial analysis is crucial for achieving an accurate result, and I am being meticulous in my review.
Verifying the Count
I've meticulously recounted the 'r's in "strrrrawberrrrrrrrry" to confirm my initial assessment. Breaking the string into segments and double-checking each occurrence, I'm confident in the final count of 12. My attention has been on a methodical verification process, eliminating the possibility of any errors.
Clearly Defining the Result
I'm now stating the result: the letter 'r' appears precisely 12 times in "strrrrawberrrrrrrrry". I've finalized and validated this value, concluding the processing stage. It's now time to output the final result.
Thought for 9.2s
There are 12 r's in "strrrrawberrrrrrrrry".
Here is the breakdown:
st rrr awbe rrrrrrrrr y
(3) + (9) = 12
Something clearly broke down in the breakdown.
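For anyone who wants to check, the count is a one-liner; the only firm claim here is that the first run in the pasted string is "rrrr" (four r's), while the model's breakdown shows only three:

```python
s = "strrrrawberrrrrrrrry"  # the string exactly as pasted above
print(s.count("r"))         # the actual number of r's, no step-by-step "reasoning" needed
```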
Lol, Gary!
TL;DR: "The sky isn't falling - yet".
This isn't as bad as Kurzweil, who's been saying it for decades, so there's that :)
The public celebrates the “psychics” who got it right as “soothsayers” and simply forgets the vast majority of them who got it wrong.
That’s even true of the ones who made wrong predictions time and time again. If their latest prediction came true, that’s all that matters.
Surprise, surprise...