117 Comments
Charles Giancarlo:

I generally object to the use of the word "hallucination" to describe how LLMs work - which would imply that when they do work correctly they mimic a perfectly normal human, but every now and then they suffer an aberration.

LLMs are effectively a statistical model of human written language. As such they are best described by a phrase repeated by Mark Twain (who credited it to Benjamin Disraeli): "There are Lies, Damn Lies, and Statistics."

Effectively, LLMs are providing us with statistical word sequences, without any true understanding of logic or of the world around us.

Zahid Bashir:

confabulation

jibal jibal:

The problem with both of those words is that they suggest mental states, but LLMs don't have mental states.

Ken Kovar:

that we know of.....😎

jibal jibal:

No, they don't have mental states. Suggesting otherwise shows a complete failure to understand the technology.

Larry Jewett:

I believe it is spelled “crackbotulation”

LLMs “crackbotulate” because that’s what crackbots do

Ken Kovar:

better!

Joe:

Artificial Information (not Intelligence)

Paul Jurczak:

LLMs are parroting tidbits from a huge text corpus. Often they happen to be right. Too frequently they are wrong. They are too primitive to hallucinate.

Ken Kovar:

I hate that word too... it implies some rudimentary consciousness which these dumb programs clearly do not have....

Andreas Schneider:

In other words, they’re bullshit

12345:

your summary discounts the effect of their training, which these days is the bulk of the 'ingredients' that are baked into them.

Shandon F.:

And yet, the quote was directed at the very human predilection to believe what we want to believe. So, what to make of the humans that proceed without a true understanding of logic or the world around them—or, much worse, a fine understanding but a desire to manipulate to their own advantage?

khimru:

Average? Normal? Typical?

The chilling truth is that LLMs have shown us the obvious fact that the “average human” behaves the way LLMs do, most of the time: they also regurgitate words without thinking, and, in fact, all marketing tricks exploit that propensity.

But humans usually have some area where they are experts – it could be the ability to cook delicious food, or to write programs, or to do many other things, depending on the human.

And the fallacy around LLMs is the assumption that, because we have managed to make them act the way an average human acts in an “average” situation, when the stakes are low… they should somehow graduate to the entirely different mode of operation that happens when an expert actually tries to do expert things. It doesn't work that way!

Shandon F.:

I'm not sure I understand your comment 100 percent but I think we basically agree, with the caveat that it sounds like you're giving AI a dominion that it doesn't have, i.e. "'average human' behaves like LLMs do"—it's the inverse, since humans precede AI and AI is trained solely on human-created knowledge. That's the broader point, that we're attaching all of this importance to AI "hallucinations" or, in non-jargon, lies. The question is intent—if it's a mistake of the AI's formula or if it's willful misinformation for the AI's own benefit...which is currently only known to be a human trait.

khimru:

What I'm saying is that AI does things the way humans NORMALLY do: receive words, then say some other words that are attached to those words in their subconscious, with zero fact-checking or understanding. Without even “turning on consciousness,” as they say. Even the article itself says: “as humans sometimes, when well motivated, do”.

But then… the fallacy: if we have already covered the 90% of what humans do… how hard could it be to make AI do the remaining 10%, too? Just make LLMs bigger and consciousness will magically appear!

But that is not happening – and we knew it from the beginning.

Consciousness and subconscious are not just different words, they are using physically different mechanisms in a human brain (as far as we know, anyway).

As for “intent”… dogs and cats don't have consciousness and can't program; however… they sure as hell have goals and intents.

Shandon F.:

If the sole focus is to get AI to 100% parity with human consciousness, then yes, we've all been lied to. But my issue is that Marcus' line of dissent is too self-centered—it's an academic engineer's view of the *project plan* of AI. So, "AI is hallucinating and that means that we've been lied to" is technically true. But it's an argument made as objective truth along a timeline of advancement that quickly renders it obsolete at best and willfully misleading at worst. I just wish that, rather than these aloof "I sat down with Harry Shearer and we talked about those silly hallucinations" pieces, we focused on the actual significance of that last 10%, and what impact trying to get there has had and will have as we approach 99%. And, importantly, I just wish there was more context and attention given to the "as humans sometimes, when well-motivated, do," which is carrying an awful lot of water for humans as rational actors.

Martin Machacek:

The main problem with LLMs is that current models cannot provide any information about their confidence in an answer … or simply refuse to provide an answer if there is not enough information. A human (unless they are a bullshitter) may put a qualification on the information they provide (something like AFAIK). LLMs provide definitive and (especially to non-experts) plausible-sounding statements which may sometimes be entirely wrong. For example, a scientist (unless malicious and stupid) won't fabricate an invalid link to supporting material.

The remaining 10% (or so) of making LLMs comparable to skilled humans is going to be hard, because it inevitably requires skills like generalization, and weighting facts by something other than frequency of occurrence :).

TheAISlop:

Which country was the actor Albert Einstein born in?

TheAISlop:

This is one of those prompts I've built to cut to the core of the issue. LLMs are built on probabilities. So even when given a hint like "actor", they ignore the hint and give more weight to "Albert Einstein". Now ask about a Michael Jordan who isn't the NBA legend and see what happens.

Larry Jewett:

But you must admit that he played the role of wild haired, absent minded professor perfectly.

Larry Jewett:

Had he not played the part of a physics genius, he could easily have played in Spaghetti Westerns.

Ken Kovar:

Swissghanistan obviously....😆

Notorious P.A.T.:

We just need LLMs to go up to eleven.

A.J. Sutter:

Stonehenge is a pretty good metaphor for LLM performance vs hype, now that you mention it.

MarkS:

But they can't carry the one. None of them have managed to learn the rules of arithmetic.

Larry Jewett:

If they can’t even carry a measly little 1, how can we ever expect them to carry civilization?

Larry Jewett:

Go to 11?

They can go to LL as far as I am concerned.

Joe:

That'll be another $Trillion please!

Jan Steen:

I always wonder how much of the confident, authoritative tone (encyclopedia-like, as you correctly call it) and the grovelling, apologetic one when you point out a mistake has been hard-coded. When ChatGPT says "I apologize", is this really something that it came up with spontaneously, or is it, as I suspect, something that the programmers added?

Who is the 'I' that replies to you in a conversation with an LLM? If my suspicion is correct, this is all rather deceptive, isn't it? You are made to believe that you are talking to an individual who can apologize. It is Eliza on steroids.

jibal jibal:

The tone is conditioned by the vendor.

Larry Jewett:

Yes, OpenAI prefers the sycophAIntic tone: “You are the smartest person who has ever walked on the earth — or any other planet in the universe.”

Gabriel Risterucci:

Little typo there: "But LLMs still cannot - and on their may own may never be able to — reliably do even things that basic."

Otherwise great piece. It really puts into words the "it's just a statistical machine" position.

Joy in HK fiFP:

The NYT had an article today about the lacklustre showing by the latest crop of so-called AIs. The article was quite different from the gushing fawning of the resident AI promoter, Kevin Roose.

I replied to a commenter questioning the use of "hallucinations" as a descriptor, suggested the better word was "B*llsh*t," and recommended your February article to them. If it gets past the NYT censors, I think they will be pleased with what you have to say.

Gerben Wierda:

"LLMs don’t actually know what a nationality, or who Harry Shearer is; they know what words are and they know which words predict which other words in the context of words." It's even worse. They don't even know anything about 'words'. They operate on *tokens* which are mostly meaningless fragments.

I have found that explaining 'tokens' clearly to people (e.g. https://youtu.be/9Q3R8G_W0Wc — video — or https://erikjlarson.substack.com/p/gerben-wierda-on-chatgpt-altman-and — text) makes it a lot easier for them to grasp that there is no 'understanding'. All the explanations that use 'word' trigger 'meaningful character sequence' in humans. But LLMs work on 'meaningless character sequences'. So, explaining by using 'words' is a (Pratchett) "Lie to Children". It is superficially OK, but it isn't true, and at some point the lie starts to bite you.
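
(For illustration, a minimal sketch of what "operating on tokens" looks like, assuming the open-source tiktoken library and its cl100k_base encoding; the exact splits and token IDs you see will depend on the tokenizer.)

```python
# Minimal sketch: how a BPE tokenizer fragments text before a model ever sees it.
# Assumes the open-source `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common BPE encoding

text = "Harry Shearer's nationality"
token_ids = enc.encode(text)

# Print each integer token id next to the character fragment it stands for.
for tid in token_ids:
    fragment = enc.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
    print(tid, repr(fragment))

# The model operates on these integer ids, not on "words"; the fragments are
# byte-pair artifacts, not meaningful units.
```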

These systems do not make errors (https://ea.rna.nl/2023/11/01/the-hidden-meaning-of-the-errors-of-chatgpt-and-friends/). Every continuation they produce is actually correct from the viewpoint of token statistics. They thus cannot discriminate between right and wrong continuations based on actual meaning/understanding.

The problem is that 'an approximation of the result of (human) understanding' produced by (non-understanding) token statistics isn't 'understanding'. These approximations can be good or bad, but the reason these issues are unsolvable is that the whole approach is *fundamentally* an approximation based on weakly related statistics.

Gerben Wierda:

From that textual explanation (2023):

The question of course is: does it matter that this is how LLMs work? They're still very impressive systems (I agree), after all.

The answer is: yes it matters. The systems are impressive, but we humans are impressionable. We see results that reflect our own qualities (such as linguistic quality), but in reality this quantity has its own quality. And in this case, no amount of scaling is going to solve the fundamental limitations. Simply said: guessing the outcome of logical reasonings based on word-fragment statistics is not really going to work.

Peter Dorman:

This is important: when we say an LLM commits an "error" or a "hallucination", what do we mean by that? I think it means its output contradicts an overarching rule that limits the space we designate as plausible. But to avoid such errors means being fed all the rules, and there are too many, and in any case that would mean having a process to scan potential rules, with all the opportunities for error in that. Is this all a dead end?

Of course, the practical question is why investors are willing to place immense bets on the performance and profitability of such systems when there is little logic or evidence to justify them.

Matt Kolbuc:

I've found the hallucinations so totally unreliable that it's mind-boggling how tens of billions of dollars and this much hype have been poured into the technology. I'm developing an NLU engine (free and open source, btw: https://cicero.sh/sophia/), for example.

At one point I was simply trying to sanitize a list of words I had curated, sending batch requests to LLMs saying: "here are 20 words; reply with each word along with a 1 beside it if it's part of conversational English, or a 0 if it's a non-English word / typo". I made sure it was a good, solid prompt.

It couldn't even do that with reliable confidence, so I had to throw the wordlist out. It did get the majority right, but it would then mark words like "run", "about", and "what" with a 0, while marking words like "svidnteezigpq" with a 1.
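
(For reference, a minimal sketch of the kind of batch check described above, assuming the current OpenAI Python client; the model name, prompt wording, and word list are illustrative assumptions, not the exact setup used.)

```python
# Minimal sketch of the batch word-check described above.
# Assumes the OpenAI Python client (pip install openai) and an API key in the
# environment; model name and prompt wording are illustrative only.
from openai import OpenAI

client = OpenAI()

words = ["run", "about", "what", "svidnteezigpq"]  # toy batch instead of 20 words
prompt = (
    "For each word below, reply with the word followed by 1 if it is part of "
    "conversational English, or 0 if it is a non-English word or typo. "
    "One word per line, no other text.\n\n" + "\n".join(words)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)

# Even with a prompt this constrained, the labels still need to be checked
# against an ordinary dictionary before they can be trusted.
```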

Considering these things are called Large Language Models - keyword being language - and they can't even consistently tell me what is and isn't an English word, they are relegated to the category of pure amusement.

Danko Nikolic:

I like this.

Dakara:

Great post! LLMs make it difficult for most people to understand their nature. No matter how many times examples such as this are posted, typically people will counter with "but humans hallucinate too" or something similar.

FYI, I published something this morning specifically for that type of rebuttal that I hope adds more clarity.

"It is such a magnificent pretender of capability. It is just good enough to fully elicit the imagination of what it might be able to do, but never will."

https://www.mindprison.cc/p/intelligence-is-not-pattern-matching-perceiving-the-difference-llm-ai-probability-heuristics-human

LLM Destiny:

Gary absolutely will own a pet chicken named Henrietta, he cannot escape destiny.

Jan Blok:

LLMs model language, not the world...

Sufeitzy:

LLMs are probabilistic, not deterministic. The “temperature” and the randomizing seed in most interaction models determine the range of random choices for the next token in fill-in-the-blank-style answers, such as prompt cycles or other interactions.

The “wrong” (statistically available, but deterministically incorrect) answer is simply part of the design of the GUI. All AIs I’ve used in software have a “deterministic” or repeatable mode which locks the seed and thus the generation result.
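
(A minimal sketch of what temperature and a fixed seed do at the sampling step, using made-up logits over a toy vocabulary; real models sample over tens of thousands of tokens.)

```python
# Minimal sketch of temperature sampling over next-token scores (logits).
# Vocabulary and logits are invented for illustration.
import numpy as np

vocab = ["Paris", "London", "Ulm", "banana"]
logits = np.array([2.0, 1.5, 1.0, -3.0])  # toy scores for the next token

def sample_next(logits, temperature, seed):
    rng = np.random.default_rng(seed)          # fixed seed -> repeatable draw
    scaled = logits / max(temperature, 1e-6)   # temperature near 0 -> near-greedy
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

print(sample_next(logits, temperature=1.0, seed=42))   # repeatable, but still a draw
print(sample_next(logits, temperature=0.01, seed=7))   # effectively the argmax

# Locking the seed makes the output repeatable; it does not make it right,
# it just freezes one path through the same statistics.
```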

Even when you force a model to be deterministic, all you get is the likeliest path threaded through the underlying Markov “blanket” structure, the mathematical representation of the weightings in a neural network.

LLMs, unlike systems of consciousness, don’t self-correct their errors, except at a very gross level - you’ll occasionally see output withdrawn if it violates a higher-level guardrail, like discussion of child abuse or murder. Some backtracking is in place, like penalties for repetition, but it is minor and doesn’t influence global results.

The correct word is confabulation, or false narratives. Hallucination is the product of systems of consciousness which replace sensed reality, not of linear generation of text, IMHO - the two are barely analogous.

Carter Edsall:

Welcome to the Misinformation Age.

Jim Hartman:

All this seems to confirm the idea that AGI is a long way off.

Jonah:

People below who are writing "But I put in this similar query and I got accurate or mostly accurate information" are missing the point, or at least, a very important part of the point.

These models are usually probabilistic and frequently highly sensitive to their input. Putting in the same query will not guarantee the same result. Putting in a similar query certainly will not.

That you can sometimes get a correct output is almost meaningless in itself. Picture a simple classification model for, say, whether someone was alive or dead at 70, one that took their most recent weight as an input and returned 0 or 1 with 50% probability. Or one that returned 0 or 1 depending on whether the last digit of the weight was even or odd.

Someone would write an article about all the times that this model was wrong, and the comments would be full of people saying that they had weighed themselves, and the model had correctly predicted that they were alive! Or that they put in their deceased relative's weight, and it correctly predicted that they were dead. Extraordinary!
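
(A toy simulation of the thought experiment above; the data and the "model" are made up, which is exactly the point.)

```python
# Toy simulation of the coin-flip "alive at 70" classifier: it ignores its
# input entirely, yet it looks right about half the time.
import random

random.seed(0)

def coin_flip_model(weight_kg):
    # The "prediction" has nothing to do with the input.
    return random.randint(0, 1)

# Made-up evaluation set: (weight, true label), labels split roughly 50/50.
cases = [(random.uniform(50, 110), random.randint(0, 1)) for _ in range(10_000)]

hits = sum(coin_flip_model(w) == truth for w, truth in cases)
print(f"apparent accuracy: {hits / len(cases):.1%}")  # ~50%, despite knowing nothing

# Plenty of individual "correct" anecdotes, zero predictive value - which is
# why single correct answers say almost nothing about a probabilistic model.
```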

Transformer models are much better, of course, but the principle is the same. Just knowing that you can get the correct result does not tell you much. If you need to know the truth to know whether the output is true, that is not too useful. Knowing how often you get correct results, and when, is what you really need to know to evaluate the model.

So don't post some correct response that you got. That is not a scientifically useful data point. Really, the bad results are not, either, except as proof of lack of perfection. Now, if you run the same query a few hundred times in a clean environment, ideally with small variations, and you can tell me how often the output was correct—that would be useful to know.
