90 Comments

Dana F. Blankenhorn:

I suspect the Doobie Brothers said it best, years ago: "What a fool believes he sees, no wise man has the power to reason away."

RMC:

I like that.

Wm Perry:

I've been hearing it wrong for 50-ish years, which made me dislike it as anti-rational woo: I swapped "the" in for "no" and heard "the wise man has the power to reason away." Oops. Sorry, Michael McDonald.

Dakara:

I came across something else some might find of interest: a formal proof that LLMs will always hallucinate, even with perfect data and limitless compute.

"... we present a fundamental result that hallucination is inevitable for any computable LLM, regardless of model architecture, learning algorithms, prompting techniques, or training data."

Included in my recent post about hallucinations as unsolvable - https://www.mindprison.cc/p/ai-hallucinations-provably-unsolvable
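
For readers curious about the shape of that result, here is a rough sketch of the diagonalization-style argument such impossibility proofs typically use. This is my paraphrase of the general technique, not necessarily the paper's exact construction:

```latex
% Rough sketch (paraphrase, not the paper's exact proof). Enumerate all
% computable LLMs $h_1, h_2, \ldots$ and choose a ground-truth function $f$
% that disagrees with each model on at least one input $s_i$:
\[
  f(s_i) \neq h_i(s_i) \quad \text{for every } i .
\]
% Then every computable model is wrong somewhere:
\[
  \forall i \;\; \exists s : \; h_i(s) \neq f(s),
\]
% so no architecture, training data, prompting technique, or amount of
% compute eliminates hallucination entirely.
```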

Joy in HK fiFP:

I found your post very interesting and quoted from it, with attribution, in a reply to a comment on today's NYT article on AI.

Dakara:

Thank you!

Michael E. Kupietz:

It's not even that complex. There is no such thing as "hallucinations", except as an artifact of human cognitive bias. When a stochastic text prediction engine happens to output results that match our expectations or understanding of reality, we call it "working AI". When it doesn't, we call it a "hallucination". But both are the same thing: statistically derived output based on vector mappings in a semantic model. The only difference is our opinion of it. The notions we use as a yardstick to differentiate what counts as a "hallucination" exist in us, not in the LLM.

Dakara:

Right, it's an unfortunate choice of term, but it's the label we have now established for desirable vs. undesirable output.

The proof is interesting in that it ends the idea that enough data or compute will eventually make the problem go away. Even with infinite compute and perfect data, an LLM will still produce erroneous output.

Doug Poland:

Very well said, Dakara, and much appreciated. It inspired this realization: we will know that LLMs can reason when their developers stop seeking outside funding because their LLM is actually generating something of value, like beating some investment market. Of course, for reasons you explain and cite, this will always be n years away (where n can be any positive constant).

Fukitol:

Re: programming, any experienced programmer who's spent a little time with LLM programming "agents" could have told them that.

They're marginally useful at doing well-established tasks and copying & modifying existing code to create similar-but-different features. That's about it. Ask them to do anything difficult or novel and they'll provide some relatively expensive entertainment (to the tune of about 50c/minute) as they do a fine impression of a beginner programmer googling, copypasting, tweaking, deleting, pondering error messages, googling some more, then finally throwing their virtual hands in the air and giving up.

The only people I see getting fired up about their capabilities are absolute novices and amateurs. For the rest of us, the utility varies depending on what kind of work we do. They can take a little busywork out of routine chores like "make the value of <thing> configurable in the config file" or "document the parameters of this function", and they can generate boilerplate for well-known frameworks and libraries, and do it quicker than you can. I guess this sort of stuff fascinates newbies who never got through a hello world tutorial. But like remembering where semicolons and curly brackets go, this is not actually the difficult part of programming.
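
To make the "busywork" point concrete, here is the flavor of chore they handle fine; a minimal sketch, where the names (load_config, timeout_seconds, config.json) are hypothetical and purely for illustration:

```python
import json

# Routine chore of the "make the value of <thing> configurable in the config
# file" variety. All names here are made up for illustration.

DEFAULTS = {"timeout_seconds": 30}

def load_config(path="config.json"):
    """Read settings from a JSON file, falling back to defaults."""
    try:
        with open(path) as f:
            user_cfg = json.load(f)
    except FileNotFoundError:
        user_cfg = {}
    return {**DEFAULTS, **user_cfg}

# Previously a hard-coded constant; now read from the config file:
TIMEOUT = load_config()["timeout_seconds"]
```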

Dude:

Thank you for calling it out. I agree; that has been exactly my experience as well.

A coworker at my last startup would overuse LLMs, and it created many problems, like deleted code...

I suspect the tech debt from poorly supervised LLM code will not be worth it in most cases (outside simple toys).

Fukitol:

Yeah you have to watch them very carefully, like you would an intern. And so, they are not useful in the hands of interns. Really I don't think anybody but senior devs should be using them, which is paradoxical because we need them the least and are least likely to work on the sorts of trivial tasks they're good for.

Every 5 minutes I spend babysitting the "coding agent" while it does a 10-minute job in 5 minutes is 5 minutes I didn't spend doing something that the agent, or a junior, couldn't have done. I'm just not sold at all on the commercial value here.

But that said, right now I'm on a small team tight on resources. So when the handful of guys under me are too busy and I'm not, I can take away a little of their workload at less expense than if I did it manually. That's worth *something* at least. I just don't know if it's worth what they're charging. I'd much rather hire an intern and give him these jobs as training tasks so that he can someday be more useful than the robot.

Dude:

Totally agree again. That is the great irony of these tools: to supervise them, you already have to be an expert in the given domain. Without the core knowledge, you run the risk of learning or using 'hallucinated' facts.

I am curious how you feel about cognitive offloading. Offloading critical fundamentals, I feel, will keep juniors handicapped at a junior level. I worry about how they will develop skills for solving novel problems and for higher-level reasoning.

So, have you noticed skill stunting or slower skill progression in the less experienced developers you work with who overuse LLMs?

Fukitol:

At work we've just begun trialing one (concerns about disclosure and proprietary code prevented it up until this point). So I can't say for sure, re: people directly under me.

More broadly, yeah. I see a lot of inexperienced but promising people letting the robot do too much work for them, getting frustrated any time they don't get an instant working answer, and giving up on bugs and other problems when the 'bot can't solve it.

There seems to be this general attitude of, "well, if it can't do it now, they'll put out a new version in a few months and it'll fix it then." Besides the fact that I sincerely doubt that's likely, it doesn't bode well. Somewhere up the line has to be a person who actually knows what he's doing, and those of us who are there now won't be around forever. If the skill pipeline is broken for too long, we're in trouble.

Dude:

I don't currently work with any juniors, so thanks for the insight. But yikes... the fact that they give up when the LLMs can't help is scary. That is what I feared, given the research on learning with LLMs in general.

I am also skeptical of much more advancement in LLMs; I think Gary is right that they are hitting diminishing returns. It will be interesting to see what people do if LLM progress stalls. Hopefully people can return to respecting and enjoying the learning process.

Anyway, good luck with your work and your team and thanks for the thoughtful replies.

P.S. If you learn any interesting lessons from your trial run with the LLM-based tools, feel free to let me know.

Fukitol:

> the fact that they give up when the LLMs can't help is scary

Yeah. Developing the patience and discipline to understand, debug, and modify code you didn't write is an under-appreciated part of real-world programming. It isn't really drilled in college or training programs beyond the most basic "what does this line of code do?" exam question.

So my guess as to what's happening here is, bot writes code -> code is wrong -> novice using the bot doesn't have a working theory of what the bot's code does (or is supposed to do) -> novice thinks it's unsolvable.

The bot gives an easy out ("I can't possibly understand this") and promotes mental laziness, especially with the mystique around bot code promoted by hypesters. If you remember the "no human could have written this!" moment last year, when people were showing off nonsensical spaghetti code and expressing awe with varying levels of sincerity, that's what I mean.

TC:

So AI agents can copy/paste from Stack Overflow? Ok cool

Fukitol:

Yeah, or other found code. I was mildly impressed at how good the top shelf agents were at googling and grepping to find code to copy.

Unfortunately they can't really interact with most programs to see the outcome of what they've done, breaking a crucial step in the "stack overflow copypasta code monkey" workflow. Instead a human still fills that role and has to imprecisely describe to the bot what has happened via text that the bot doesn't, in the strictest sense of the word, understand.

So even at the most basic "fake programmer" level of utility they fall short of a human.

Christopher Rivera:

I find them quite useful in helping me assess my code, and for code completion. But one has to carefully monitor all of their suggestions. They can't be trusted.

Matt Ball:

Thanks, Gary.

I like LLMs. I think they already serve important roles in a very, very imperfect world.

But you really can't trust anything they say. The simple factual errors range from subtle to breathtaking.

(Example: On *June 2*, ChatGPT told me that May 25 was two days ago. That led to this exchange:

You said:

Hi. You say May 25 was two days ago. Can you re-check your calendar?

ChatGPT said:

You're absolutely right to question that — thank you for catching it.

Today is May 25, 2025, so our original discussion happened earlier today, not two days ago. My earlier comment was incorrect — apologies for the mix-up.)

Brian Frantz:

Indeed. I find them useful in two main areas: areas of curiosity where I genuinely don't know the answer but the stakes of a wrong answer are low, and areas where I have some domain knowledge or can easily verify correctness, where they can get me to an answer or solution faster than traditional searching or writing it from scratch. They often make mistakes, but usually I can catch them quickly, so it's not a big deal.

But this experience does tell me that autonomous agents are unlikely to be reliable enough to be trusted with anything important, unless paired with some means of automatic independent verification, which may be viable in some cases but probably not most.
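
For the cases where it is viable, "automatic independent verification" can be as simple as never accepting agent output that an independent check doesn't pass. A toy sketch; generate_patch() and run_test_suite() are hypothetical stand-ins, not a real API:

```python
# Agent output is accepted only if an independent, deterministic check passes.

def verified_output(task, generate_patch, run_test_suite, max_attempts=3):
    for _ in range(max_attempts):
        patch = generate_patch(task)         # untrusted LLM/agent output
        ok, report = run_test_suite(patch)   # independent check, e.g. a test suite
        if ok:
            return patch
        task = f"{task}\n\nPrevious attempt failed:\n{report}"
    raise RuntimeError("No verified solution found; escalate to a human.")
```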

RMC:

They are particularly bad with dates. I asked one about the chances of funding cuts getting through the Senate, and it gave a very detailed account, with times and dates, in the past tense, of events in late June and early July. It was early June. None of it had happened. It later told me this had been a rhetorical device. It wasn't a convincing claim.

jibal jibal:

Back in February (IIRC), ChatGPT told me that Biden was still President, told me that I was wrong to claim otherwise, and dismissed as irrelevant the fact (which I stated repeatedly) that I'm a real person taking my knowledge from the current timeline.

Oleg Alexandrov:

The results of the Apple paper are fully expected.

This is not going to throw cold water on the AI efforts of the major companies; in fact, they will double down on LLMs.

What LLMs provide is a framework for automation when there is a lot of data. Creativity here is not required; diligently picking the most likely solution is a huge deal. Reliability will go up with the use of tools, etc. LLMs will never be foolproof.

Of course LLMs do not understand anything and can't generalize. That will need additional painstaking domain-specific work, as with AlphaGo.

Martin Machacek:

What makes you think that the reliability of LLM output will increase with the use of tools? What tools do you have in mind?

I'm not sure the current transformer LLM architecture is a good base for any model truly understanding (i.e., operating on) high-level concepts (as opposed to tokens). Language understanding will have to be part of any future AI technology in order to allow communication with humans, but it will need additional components providing understanding and reasoning.

Oleg Alexandrov:

Yeah, that is the big question: how to have AI that can both learn from the vast amount of human knowledge in language form and also have a deep understanding of what things mean.

We, people, use tools. But we don't do it mechanically. We actually have an intuition of what each tool does and what results to expect.

At a minimum, AI needs lots of examples of how people run tools, and lots of strategies for validation and inspection (which may mean more tools). Tools can be calculators, simulators, image classifiers, even lab equipment (for a robot).

I don't think this will be enough for AGI, but it can help with lots of automation work.
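
A bare-bones sketch of what "use of tools" looks like in practice: the model proposes a structured tool call, ordinary code executes it, and errors are surfaced rather than trusted. The tool set and call format here are illustrative, not any specific framework:

```python
# Toy dispatch table; the "calculator" is deliberately minimal.
TOOLS = {
    "calculator": lambda expr: eval(expr, {"__builtins__": {}}),  # toy only
}

def run_tool_call(call):
    """`call` looks like {"tool": "calculator", "input": "17 * 23"}."""
    tool = TOOLS.get(call["tool"])
    if tool is None:
        return {"error": f"unknown tool: {call['tool']}"}
    try:
        return {"result": tool(call["input"])}
    except Exception as exc:   # validate and inspect; never trust blindly
        return {"error": str(exc)}

print(run_tool_call({"tool": "calculator", "input": "17 * 23"}))  # {'result': 391}
```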

Mark:

As an introductory guide to knowledge new to you, LLMs can take you places. As a discoverer of knowledge new to humanity, LLMs stand at the docks and watch humans sail into the unknown. Use them for what they are good at.

Jonas Barnett:

Thank you for continuing to hold the charlatans' feet to the fire. I can't really tell anymore whether people believe this current iteration of AI can reason and do other wondrous things, or whether they are trying to pull the wool over buyers' eyes.

JEH68:

As a financial/investment professional, what strikes me is that no one appears to discuss in depth the massive misallocation of trillions of dollars over the years, and how AI (led by LLMs) has generated a large stock market bubble with the potential to send our economy into a horrible recession once the consensus recognizes the lack of, or even negative, ROIC from AI. I would argue AI represents about 40% of the S&P 500's market capitalization and is artificially lifting the valuation multiples of the entire stock market.

The fallout when this bubble eventually pops (when people can finally accept the truth over the dream/hype) is painful to think about. Just think of all the resources wasted constructing and powering the large datacenters. What a misallocation of capital; in my eyes it is on par with the dotcom bubble and the GFC.

When it will pop, who knows, because people (AI companies, employees, and investors) want to believe in it so badly that they continue to wear blinders. Just look at the negative free cash flow at OpenAI, which forces them to issue round after round of larger and larger capital raises that believers are willing to fund. It is one big Ponzi scheme, or network marketing (Amway).

Christopher Rivera:

Precisely. It also takes human capital and investment away from solving more pressing problems such as climate change, income inequality, and the stability of democratic governments. In fact, it is being used to hinder progress on all three.

Darren D'Addario:

On a side note: Google's Gemini AI told me yesterday that David Lynch is still alive, so I'm happy for him and his family.

Zack W:

On 5, it's important to note just how difficult the "hard" problems from that benchmark are. The paper says they use an estimated Elo heuristic of "greater than 3000" for the hard problems. For reference, the highest chess Elo ever achieved is ~2900, and an additional 100 points on top of that implies massively more challenging problems.

In the paper they also say of these problems:

"These challenges usually hinge on an extremely

involved, non - obvious derivation or deductive leap that requires both a masterful grasp of algorithmic theory and deep mathematical intuition; they elude more than 99.9% of participants and sometimes remain unsolved even by the strongest competitors during live contests"

Is this a knock on LLMs? Sure. Is it damning or unexpected? No.
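
For context, if the benchmark's problem ratings combine with solver ratings the way chess Elo does (an assumption on my part), the expected solve probability for a solver rated R_s on a problem rated R_p is given by the standard Elo expected-score formula:

```latex
\[
  E = \frac{1}{1 + 10^{(R_p - R_s)/400}}
\]
% Even a hypothetical 2900-rated solver facing a 3000-rated problem expects
% only E = 1/(1 + 10^{100/400}) \approx 0.36, and for typical contestants
% rated far below 2900 the expected success rate collapses toward zero,
% consistent with the "elude more than 99.9% of participants" description.
```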

RMK:

Watching otherwise intelligent people lose their minds over LLMs is really eye-opening.

I don't think AGI is fundamentally impossible, but 5 minutes with ChatGPT and it's clear I'm dealing with ELIZA's fast-talking grandson.

"Tell me how you feel about my last answer included fabricated links. Here's another one - hope that helps!"

There's no there there.

Richard Self:

This is about what one would expect from a system that uses transformers twice: once to generate the reasoning tokens, and then again to process the reasoning tokens and the prompt and keep on calculating token sequences.

This bakes in the need for previously learned sequences in the training corpora. If a pattern isn't there, the model can't follow it, so the reasoning-token sequence will be erroneous / hallucinatory / confabulatory.
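
A minimal sketch of that two-pass decoding, as I read it (my abstraction, not any vendor's implementation; the stop tokens are made up for illustration):

```python
def decode(model, context, stop, max_tokens=1024):
    """Greedy next-token loop; `model` is any callable returning one token."""
    out = []
    for _ in range(max_tokens):
        tok = model(context + out)      # purely pattern-based prediction
        if tok == stop:
            break
        out.append(tok)
    return out

def answer_with_reasoning(model, prompt_tokens):
    # Pass 1: generate "reasoning" tokens conditioned on the prompt alone.
    reasoning = decode(model, prompt_tokens, stop="</think>")
    # Pass 2: generate the answer conditioned on prompt + reasoning tokens.
    # If no matching pattern exists in the training corpus, both passes can
    # only extrapolate, which is where confabulated chains come from.
    return decode(model, prompt_tokens + reasoning, stop="<eos>")
```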

John:

OK. I’ll say it again. You can’t code intelligence - you have to grow it.

There is a real difference between the animate and the inanimate. Humans are animate (most of the time ;-); computer programs are simply inanimate lines of (mostly sloppy, untested :-) code.

Something has to be alive to have intent. Consider this: there's a lot of woo-woo going around about some AI that threatened its developers. Up to 50,000 feet, please.

Since LLMs look at all the words on the Internet, they have correlated many things:

1. For literary purposes, humans have supposed that machines are just like humans and don’t like to be turned off.

2. The general human response to infidelity is not good.

3. Bad behaviour can be punished by blackmail.

4. Emails implying infidelity offer the opportunity for blackmail.

So when a program says to its developer "if you turn me off I will expose your infidelity," it is merely offering the result of an imperfect global search and averaging process. Turn it off, call it!

John:

Joseph Weizenbaum said it all in 1965

Herbert Roitblat:

The Apple paper actually provides no support for either conclusion. They say that reasoning is an illusion, which I agree with, but they provide no evidence to support that. They do not provide any evidence to deny it either. It is a methodological mess. See here: https://herbertroitblat.substack.com/p/the-fog-of-illusion

Gary Marcus:

You don't give an argument here, but it sure seems to me like evidence for a failure to generalize algorithms freely outside of the presumed distribution, allowing that there is latitude in how the word "reasoning" is defined.

Thomas:

You know nothing, Gary.

Herbert Roitblat:

The failure to generalize could be due to lack of examples in the training set or to a failure of reasoning. The failures would be outside of the presumed distribution of training patterns as well as outside of the supposed logical capabilities of the models.

By "reasoning" I mean that the output tokens of the reasoning model are causally related to the outcome. The so-called reasoning tokens could be epiphenomenal (I think that they are), reflecting patterns learned during training and have no effect on what the model eventually produces.

The so-called reasoning models differ from the so-called non-reasoning models in the content they are trained on and in the volume of training they receive. Both of these could affect the output of the model whether it reasoned or not. The additional training could potentially teach the model to reason or it could give it more text to paraphrase.

We cannot infer the cause from observing the effect. It is challenging to design experiments that identify and neutralize confounding variables (such as additional training).

So here is an experiment (off the top of my head) that might reveal a difference between reasoning and stochastic parroting. Control models would be trained just like the reasoning models, but the reasoning steps would be scrambled during training. If reasoning depends on the tokens being trained in a causal order, then this would reveal whether that order is necessary. If it is just the additional training that makes reasoning models sometimes more accurate, then their accuracy should be unaffected by changing the order. The question is how we can make predictions that would clearly distinguish between reasoning and stochastic parroting.
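
A sketch of how that control condition could be prepared, on my reading of the proposal (not the author's code; the field names "reasoning_steps" etc. are hypothetical):

```python
import random

def scramble_reasoning(example, rng=random.Random(0)):
    """Shuffle only the intermediate reasoning steps of one training example."""
    steps = list(example["reasoning_steps"])
    rng.shuffle(steps)                      # destroy only the causal ordering
    return {**example, "reasoning_steps": steps}

def build_training_sets(dataset):
    treatment = dataset                                   # ordered reasoning
    control = [scramble_reasoning(ex) for ex in dataset]  # scrambled reasoning
    return treatment, control

# If accuracy survives scrambling, the extra reasoning text is plausibly
# epiphenomenal; if it drops, the ordering of the steps is doing causal work.
```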

jibal jibal:

I don't think you understand what "evidence" is.

P.S. Ah, you make a *different* claim in your post: "they present no strong evidence to support this claim"

It seems nefarious to drop "strong" here (which is, of course, a highly subjective evaluation).

Herbert Roitblat:

Alright, I will make it clear. They provide no evidence that would distinguish between a model that causally engaged in reasoning and one that merely mimicked it. Every observation would apply equally to both models. In my post, I was trying to be polite, but the fact of the matter is that this paper does not support the claim that they make in the title.

As I said, I happen to agree with the conclusion that these models do not reason (see David Hsing's comment), but if we apply certain intellectual standards to the claims that the models have emergent cognitive processes (there is no evidence that they do), then it is incumbent on us to apply those same standards to claims that the models do not. The advancement of artificial intelligence will be much faster if we apply scientific standards to evaluating and understanding the work.

jibal jibal:

Well, your claim is false.

Also, David Hsing is a crackpot. He's right that the models do not reason, but he's right for completely wrong reasons.

David Hsing:

I can provide it. There's no way anything can reason "about" any X if it doesn't refer to any particular X at all: https://davidhsing.substack.com/p/why-neural-networks-is-a-bad-technology

Jonah:

I don't understand how anyone can justify publishing a "joke paper" written by an LLM on arXiv.

But then, the capture of AI research by large companies has sent ethics out the window anyway.
