Wowed by a new paper I just read and wish I had thought to write myself. Lukas Berglund and others, led by Owain Evans, asked a simple, powerful, elegant question: can LLMs trained on "A is B" automatically infer that "B is A"? The shocking (yet, given the historical context below, unsurprising) answer is no:
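A crude way to see why this might happen: next-token training rewards the A→B direction, and nothing forces the reverse mapping to exist. As a loose analogy only (not a claim about transformer internals), a one-directional lookup table behaves the same way:

```python
# Crude analogy: memorized next-token completions act like a
# one-directional lookup table. Training on "A is B" stores A -> B;
# the reverse direction B -> A is simply never written down.
memorized = {
    "Tom Cruise's mother is": "Mary Lee Pfeiffer",
}

def complete(prompt: str) -> str:
    # Only the trained direction can be retrieved.
    return memorized.get(prompt, "I don't know.")

print(complete("Tom Cruise's mother is"))      # Mary Lee Pfeiffer
print(complete("Mary Lee Pfeiffer's son is"))  # I don't know.
```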
"The will to believe in neural networks is frequently so strong that counterevidence is often dismissed or ignored, for much too long."
It is extra strong during hype cycles, such as the current 'GPT fever', but having stable convictions is actually a basic, necessary function of human intelligence, and there are good reasons for this (mostly energy efficiency and speed). The same effect, however, leads people into rabbit holes and conspiracy theories through confirmation bias. Such people are not crazy, they are ... human.
Human intelligence (such as the role that existing convictions play in our capacity for observation and reasoning) is a far more interesting subject than artificial intelligence when discussing artificial intelligence :-). See https://ea.rna.nl/2022/10/24/on-the-psychology-of-architecture-and-the-architecture-of-psychology/ where this is discussed against the background of convictions and attempts to change convictions when advising (business leaders, politicians, etc.) on tech facts.
I do think "if the memorized fact is included in the prompt, models succeed" is unintentionally misleading (the misleading nature of many descriptions is part of my talk in London on Oct 10). In this case: no 'facts' are retained/memorised in LLMs, neither through pre-training, nor through fine-tuning, not even in a prompt/reply sequence. None. So the finding you bring up is not surprising at all. It is entirely to be expected.
"All facts presented by Generative AI — even those that are true — are fictitious. All logic and math is estimated … poorly." — Proposed disclaimer for LLMs, modelled after South Park (also from that upcoming talk). Prompt engineering, plugins, and even the non-LLM safety measures are all signs that we have to work around the fact that LLMs have no 'understanding' whatsoever (for any reasonable value of 'understanding'). Syntactically, they are near-perfect 'stochastic parrots'; semantically, they are 'stochastically constrained hallucinators'.
Q: is 450 90% of 500?
ChatGPT: No, 450 is not 90% of 500. To calculate 90% of 500, you can multiply 500 by 0.90 (which represents 90% as a decimal):
500 * 0.90 = 450
So, 90% of 500 is indeed 450. My initial response was incorrect, and I apologize for any confusion.
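Conventional code, by contrast, computes rather than estimates; a two-line check settles, deterministically and without self-contradiction, the question the model flip-flopped on:

```python
# Deterministic arithmetic: 90% of 500, in exact integer math.
value = 500 * 90 // 100
print(value)         # 450
print(value == 450)  # True
```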
This is indeed a remarkable result. We are way too quick and eager to accept the appearance of thinking as evidence of reasoning ability, and too reluctant to admit the less exciting evidence and draw the consequences. This might also help explain why, for all the hype about their ability to crack MBA exams, LLMs have not yet cracked any of the truly hard problems that have so far defeated us. And it implies we need to take a much more sober view of the potential these models hold to transform our economy and boost productivity and economic growth.
This sentence nails it:
"In neural network discussion, people are often impressed by successes, and pay far too little regard to what failures are trying to tell them."
For LLMs we may need to add in "apparent successes". The LLM outputs something that looks impressive, and a bunch of people fall for the illusion. Then, when given a conceptually identical prompt the LLM outputs something deeply unimpressive that ought to undermine the original claim of success - but somehow that's not the interpretation. For example, the "GPT-4 displays theory of mind!" paper was utterly rubbished by two subsequent papers, but what is the response? It's the #1 go-to defense: "humans make mistakes, too".
This is the most popular rhetorical gambit in AI hype: interpret some AI output as a sign of some great advance (or "emergent" ability), and then when counter-examples are put forward push them aside with "humans make mistakes too". When it gets things right it's displaying human-like intelligence; when it gets things wrong it's displaying human-like flaws. Every piece of output is another sign that it's becoming more human-like!
I keep cautioning my friends and family to beware of the hype. What ChatGPT has done is a parlor trick, albeit a really sophisticated one that does have some serious applications in the real world, but with big limitations. This is one of those big limitations. Great article. Thanks!
Great work! Thanks for being the institutional memory for the current context. Much appreciated by folks like myself new to the field.
It's far simpler to say LLMs can't comprehend that mothers give birth to sons. They don't actually 'know' anything about the world. They've never experienced it and never will.
It's like giving a large book library a voice and asking it to describe the city outside without giving it eyes or legs, or money or family, or feelings.
Great piece. In summary, the ability to generalize is the key to solving AGI. Deep neural networks are inherently incapable of generalizing because function optimization (the gradient learning mechanism of deep learning) is the exact opposite of generalization. No add-on or modification to the deep learning model will solve this problem in my opinion. We need a completely new model of intelligence that is designed from the start with generalization in mind. Even sensors must be designed to generalize.
Unfortunately, generative AI (a DL derivative) is syphoning all the funding from generalization research. This must end.
To me the single most important step in AI that can be taken is to merge LLMs with a concept of "hard ground truth," i.e., the work that's been done at Cycorp and elsewhere. Currently LLMs simply "understand" which words or phrases statistically go with one another (i.e., "attention"). But if I understand the notion of Cycorp's body of knowledge correctly ("if someone is a woman's son, then the woman's children include that someone"), one could dramatically improve LLM accuracy by underpinning them with this BOK. Critically important.
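A toy sketch of that kind of grounding (the predicate names here are illustrative stand-ins, not Cyc's actual vocabulary): assert the fact in one direction, and let an explicit rule close it in the other:

```python
# Minimal forward-chaining sketch: one rule over a set of fact triples.
facts = {("TomCruise", "son_of", "MaryLeePfeiffer")}

def apply_rules(facts):
    # Rule: if X is Y's son, then X is among Y's children.
    derived = set(facts)
    for (x, rel, y) in facts:
        if rel == "son_of":
            derived.add((y, "has_child", x))
    return derived

kb = apply_rules(facts)
print(("MaryLeePfeiffer", "has_child", "TomCruise") in kb)  # True
```

The point is that the reverse direction is *derived* by an explicit rule, not hoped for as a side effect of training.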
This seems more a critique of the limits of attention-based architecture in transformers than of LLMs themselves, which are a bit more general.
Deep Learning should not really be associated with AGI though, full stop. I don't get why people believe this, unless they have discovered that human knowledge comes from computing softmax functions (which seems like a crazy view to me). What is the definition of AGI?
I say this as someone who only started teaching myself ML a few months ago though...
It’s the Clever Hans effect. These machines will appear intelligent to those who want to see intelligence in it.
Hi Gary! Neat that you had written about this, long before LLMs :)
An LLM cannot ever know what a single word means. People do. 'First', 'President', 'United States' — people know the meaning of these. "You just told me that!" is what a human would say in the Washington example. A system that numerically computes the output integer by integer (not even word by word), unsurprisingly, cannot answer the question.
Meaning doesn't reside in words alone. All that LLMs have access to, are words. That is the source of the disconnect.
The fix to this is for GPT and Bard to be able to invoke a knowledge graph. This should be rather straightforward to do, just as with other tool invocations. LLM itself can also be used to process new info daily and populate or update such knowledge graphs.
More reading on LLM and knowledge graphs: https://arxiv.org/abs/2306.08302
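A rough sketch of what such an invocation could look like (the triple store and query function here are hypothetical stand-ins, not any real GPT/Bard plugin API): route factual questions to a graph whose triples can be matched from either end, so forward and reverse questions hit the same stored fact:

```python
# Hypothetical knowledge-graph lookup an LLM could invoke as a tool.
# Each fact is stored once, but any field can be left open as a wildcard,
# so "parent of Tom Cruise?" and "children of Mary Lee Pfeiffer?" both work.
triples = [("Mary Lee Pfeiffer", "parent_of", "Tom Cruise")]

def query(subject=None, relation=None, obj=None):
    # None acts as a wildcard; return all matching triples.
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)]

print(query(relation="parent_of", obj="Tom Cruise"))          # forward lookup
print(query(subject="Mary Lee Pfeiffer", relation="parent_of"))  # reverse lookup
```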
I’m late to this party, but I hope this nonetheless comes to your attention.
I think your point that current LLMs have a difficult time with symmetric relations ("is" in this case) is noteworthy, but it's also indicative of an unfortunately common misunderstanding of LLMs and what we should expect of them.
In general, your post (and other similar critiques) tacitly assumes that LLMs should be logically infallible, unerringly accurate, and exceed human capabilities in every measurable way. (And by implication, if they fall short of this high standard, then they are dangerous, flawed, and need to be curtailed.)
I’ve come to a very different point of view. LLMs are remarkably sophisticated tools that reflect the intricacies of human language AS IT IS ACTUALLY USED, and offer us the ability to mine actionable insights from the accumulation of digital debris we leave behind.
The fact that LLMs aren’t everything to everyone, or can’t do some things that most people assume any computer program should be able to do, is a red herring that may prevent us from realizing the tremendous value they are likely to unlock for society.
These programs are linguistic objects, not reasoning or problem-solving machines. The fact that they can nonetheless frequently perform such tasks is a testament to their incredible depth and power, not a defect to be mocked or derided.
As a specific example, consider your critique that an LLM can answer "Who is Tom Cruise's mother?" but can't answer "Who is Mary Lee Pfeiffer's son?". This stems from a misunderstanding of how language is actually used in human conversation, and makes a false equivalence between the mathematical concept of "equals" and the much more subtle meaning of the word "is".
When you ask someone, including an LLM, “Who is Tom Cruise’s mother?”, the form of the question implies you believe that shared context between the conversants is sufficient to disambiguate who you are talking about. Since there’s a single famous person with this name, the LLM, like any other reasonably knowledgeable speaker, correctly assumes you are talking about the Hollywood actor. It (or they) can then try to answer the question using whatever factoids they know about him.
But when you ask “Who is Mary Lee Pfeiffer’s son?”, the assumption about shared context is violated. Virtually no one knows who she is, and any reasonable person’s first reaction is going to be “who?”, “which Mary Lee Pfeiffer are you talking about?”, or “why should I know her?”. This is basically how the LLM responded. I assert that THIS IS THE CORRECT RESPONSE, and in contrast to your conclusion, the pair of questions demonstrate the LLM’s remarkably refined linguistic sense, not to mention its depth of knowledge of popular culture. It doesn’t treat these two individuals as interchangeable variables in an equation, the straw man you set up to make your argument. (FYI my PhD in Computational Linguistics was on this point, may it rest in peace!)
It's worth noting that LLMs' only source of knowledge is what other people have said (or written) about the world, without the benefit of any direct information or experience. So their reasoning and problem-solving shortcomings are not necessarily a reflection of inherent flaws, but are more likely due to their lack of any real-world interaction to learn from. As you know, this is likely to change in the near future, rendering such concerns moot.
Today, most people expect computer programs to be precise, accurate, logical, and deterministic. But I believe this perception is about to change. Whether we are ready or not, Generative AI systems are intuitive, creative, as well as linguistically and artistically facile. They also suffer many of the same limitations as humans, in addition to their own unique peccadillos. I’ve been collecting examples of this strange new phenomenon, and hope to write more about this soon. I predict a new field of inquiry will shortly emerge that we might call “the psychology of machine intelligence”, or “machine psychology” for short.
I asked GPT-3.5-turbo "Is Mary Lee Pfeiffer the parent of Tom Cruise?" and it responded "No, Mary Lee Pfeiffer is not the parent of Tom Cruise. Tom Cruise's parents are Thomas Cruise Mapother III and Mary Lee Pfeiffer is likely someone else, potentially a different person with the same name."
I suspect the LLM understands the relationship, but is being extra cautious to not make incorrect inferences, perhaps because of RLHF.
Intriguingly, GPT-4 answered "No, Mary Lee Pfeiffer is not the parent of Tom Cruise. Tom Cruise's parents are named Thomas Mapother III and Mary Lee Pfeiffer." which feels much worse than GPT-3.5's answer.