"LLMs break down on anything that wasn’t in their training data. Because they’re 100% memorization."
If this doesn't elucidate why the bots can't produce new or original creative work, nothing will. Am growing weary of people arguing that "humans create the same way as LLMs." We might need a new vocabulary to better define artistry, creativity, and mastery.
This kind of assessment is mostly based on your intuitions instead of clear definitions. Intuition is powerful, and - in some cases - does not easily allow to embrace scientific truth. Consider the example of the question "what is life". People's intiuitions have, for many decades, resisted the idea that life is nothing more than a set of interrelated chemical reactions. Instead, the vague notion of a life substance, a "vital force" or "vital spark" has been assumed.
The situation is now very similar. In a few decades people will have no difficulty to accept that human creativity is just based on computational processes based on input data, nothing more. And in this aspect it is a mechanical process without a specially human "creativity force" or "creativity spark". This view of mine is called the computational theory of mind.
Very interesting study, thanks for sharing. That's going to be my morning read.
PS. I love how Chollet writes and thinks about this stuff, very lucid and in way that everybody can understand. (Very similar to your style). His analogy of the LLM as a "program database" is a great mental model to understand what they're doing. I quote: "Prompt engineering is the process of searching through program space to find the program that empirically seems to perform best on your target task. It's no different than trying different keywords when doing a Google search for a piece of software."
Francois Chollet's response is priceless. Thanks for sharing.
Sometimes I wonder if by saying AGI is already here (e.g., Norvig) or that LLMs understand/are intelligent (e.g. Hinton), some who have gained considerable stature in the field are trying to justify having spent their careers developing complex functions with input-output relationships that map to input-output relationships of specific human behaviors, absent making any inroads toward AGI. And so they're moving the goalposts with respect to what counts as AGI or understanding/intelligence.
I think the most convincing proof that LLMs actually do "store" training data was the recent paper (https://arxiv.org/abs/2311.17035) in which asking the model to repeat a single word forever exposed parts of the training data verbatim.
Funny how folks arguing against memorization kind of missed that?
Hi Gary! True 'understanding' wouldn't need words, images, sounds, equations, rules or data, it would just need direct experience with the world - which requires a body. LLMs (all AI in fact) in contrast is entirely dependent on them (words, etc) - ie, no symbols, no dice. That disparity is what's evident, over and over. It is nothing but "understanding" the world entirely in terms of human-originated symbols, with zero grounding of any of it. At best it's 'understanding' patterns in input, without regards to underlying meaning.
'One hand' makes sense to entities that themselves have hands, not to disembodied calculations.
More generally, "making sense" has 'sense' in it for a reason. Common sense is sensed, not grokked via a knowledge-base, not inferred solely from abstract reasoning.
Right, "making sense" for us thinking humans has "sense" in it. But that's a typo as far as the LLMs are concerned... the proper spelling is the way Big Tech thinks of it: "making cents."
LLM could be useful however in giving a robot ideas to try. But it has to have a closed loop, such as in experimenting based on the knowledge, and drawing conclusions.
LLM is interpolation in the highly irregular space of text paragraphs. A very powerful technique, which works very, very well. It can also do symbol substitution, and play along by imitation.
LLM are just the first step. It is equivalent to typing some software, before compilation, debugging, testing, iterations, refinements.
LLM is not nothing, and we will find many uses for it in complex systems.
Chollet says: "LLMs = 100% memorization. There is no other mechanism at work."
Challenge 1: is there no processing of these memories? If you admit there is some processing, then the statement above is just bombastic, and it should rather read "LLMs rely much more on memorization than humans. Only humans have complex processing." And then discussion would shift to the details of processing.
Challenge 2: Don't humans also rely mostly on memorization? How is the role of memorization in LLMs different from the one in humans? To explore this, we can dig into the role of memorization of humans. This is my thesis: no human ever is capable to think any original thought. Any thought a human thinks is based on memorization or on randomness. Your thoughts originate in the memories you got through your eyes, noses, skin. And your thoughts originate from the thoughts you got from others verbally or in written form. Just like for LLMs. And then both humans and LLMs do process these memories. The question is, what are the differences in this processing. There seems to be no profound difference in the role of memorization.
I would say that the understanding thing is a done deal. But apparently, people are easily convinced by how fluent the language is, because that efficient but superficial way is how we humans in a sense have learned to detect quality. Turing was *so* wrong with his Turing test.
These examples are pretty convincing in making the case that LLM output is what you've called pastiche. Shouldn't this be provable, though? If you take the training dataset of DallE or GPT4, and mask out specific images or text categories, would the resultant model be able to handle prompts related to the masked out training data?
Of course, such tests would require companies to actually expose their input data set, which they conveniently do not do. But I wonder if there are open source models with known training sets which can be used for such studies.
1) Your assumption of frequency is half way there. It’s frequency X weighting. Altman’s dismissal of the value of quality training data is directly contradicted by GPT3 weighing quality datasets 10x or more. Same with Stable Diffusion and the Laion-aesthetic dataset, and even more so Midjourney and the artist hit list, as well as those screencaps. Which speaks directly to the fair use substantiality factor.
2) the tear drop video game controller is such a good example of what happens where, which seems to baffle so many:
• the UNDERSTANDING that a tear drop is a shape fit for a human hand came from the operator.
• the IDEA to combine those elements came from — and could only have come from — the operator.
• the ability to EXPRESS that idea as an apt prompt came from the operator.
• the ability to VISUALLY COMBINE those ideas in an image came from the AI system.
• the ability to VISUALLY EXPRESS those fundamental ideas came from the training data.
Your analysis sounds very much like the early chaos researchers experienced (like cloud formation leading to learning that faucet drips are chaotic). One conclusion was that predicting future weather events inherently relies on the model having perfect and total knowledge about all of the prior and current events affecting the weather everywhere, which, of course, is impossible in an absolute sense. Today’s weather models use vast amounts of information gleaned worldwide from a multitude of sources, and they’re only good for a relatively short period of time. (There are many other examples showing the same things.)
In LLM AI models, their limitations appear to me to be not enough “knowledge”. It would seem that the learning and training data sets need to be not only much larger and encompassing a broader range of subject matter, but also include a chaotic “emulator” that allows randomness to impact its direction and choices.
One aspect of the early chaos research was that the then computer chips truncated data at 22 or 23 significant digits to the right of the decimal point because they couldn’t physically handle any more. The prediction results were dramatically and randomly impacted by that unintentional truncation. In a perfect world in AI models there would be no truncation and, in theory, the model could use an “infinite” number of significant digits. But, perhaps, none of these things are in play!
These are merely the musings of a guy who is interested in seeing AI work as hoped. Maybe it could solve some of the dysfunction we hear and see every day. Or, make the dreams of Sci Fi travel a reality. Or….???
"how sensitive LLMs are to minor perturbations" ... this is not related to LLMs but it is related to how AI can achieve outstanding results in conjunction with evidence of a complete lack of understanding ... in this case the adversarial example are from the game of go: https://goattack.far.ai/game-analysis#contents ... I find it likely that you have seen this, but if not it is quite interesting ... a simple adversarial strategy that is weaker than a beginning human amateur reliably beats the strongest Go AI ... the game at the link above shows how the trick works and makes clear that the AI does not understand (the way humans do) the game of Go.
"LLMs break down on anything that wasn’t in their training data. Because they’re 100% memorization."
If this doesn't elucidate why the bots can't produce new or original creative work, nothing will. Am growing weary of people arguing that "humans create the same way as LLMs." We might need a new vocabulary to better define artistry, creativity, and mastery.
I've taken to describing things as "hand-crafted" vs. "AI-generated." Generation is not the same thing as crafting.
Not by a long shot.
This kind of assessment is mostly based on your intuitions rather than clear definitions. Intuition is powerful, but in some cases it does not easily let us embrace scientific truth. Consider the question "what is life?". For many decades, people's intuitions resisted the idea that life is nothing more than a set of interrelated chemical reactions; instead, a vague notion of a life substance, a "vital force" or "vital spark", was assumed.
The situation is now very similar. In a few decades, people will have no difficulty accepting that human creativity is just a computational process operating on input data, nothing more, and that in this respect it is a mechanical process without any specifically human "creativity force" or "creativity spark". This view of mine is called the computational theory of mind.
Very interesting study, thanks for sharing. That's going to be my morning read.
PS. I love how Chollet writes and thinks about this stuff: very lucid, and in a way that everybody can understand. (Very similar to your style.) His analogy of the LLM as a "program database" is a great mental model for understanding what they're doing. I quote: "Prompt engineering is the process of searching through program space to find the program that empirically seems to perform best on your target task. It's no different than trying different keywords when doing a Google search for a piece of software."
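To make that "program database" picture concrete, here's a toy sketch of prompt engineering as an empirical search over candidate prompts. The `query_model` stub, the candidate prompts, and the scoring rule are all made up for illustration, not any particular vendor's API.

```python
# A toy sketch of Chollet's "program database" framing: prompt
# engineering as an empirical search over candidate prompts.
# `query_model` is a placeholder stub, not a real API.

def query_model(prompt: str) -> str:
    # Replace with a real LLM call; this stub just echoes the prompt.
    return f"model output for: {prompt}"

def score(prompt_template: str, eval_cases: list[tuple[str, str]]) -> float:
    # Fraction of cases whose output contains the expected keyword.
    hits = 0
    for text, keyword in eval_cases:
        output = query_model(prompt_template.format(text=text))
        hits += keyword.lower() in output.lower()
    return hits / len(eval_cases)

candidates = [
    "Summarize in one sentence: {text}",
    "TL;DR: {text}",
    "State the key point of this passage: {text}",
]
eval_cases = [("The meeting was moved to Friday.", "Friday")]

best = max(candidates, key=lambda p: score(p, eval_cases))
print("best 'program' found by search:", best)
```

The point of the sketch is just that the "engineering" is a blind empirical search over stored programs, exactly like trying different keywords in a Google search.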
Francois Chollet's response is priceless. Thanks for sharing.
Sometimes I wonder if, by saying AGI is already here (e.g., Norvig) or that LLMs understand/are intelligent (e.g., Hinton), some who have gained considerable stature in the field are trying to justify having spent their careers developing complex functions whose input-output relationships map to those of specific human behaviors, without making any inroads toward AGI. And so they're moving the goalposts with respect to what counts as AGI or understanding/intelligence.
>loves it when an argument comes together.
Deep cut!
Great research and insights
I think the most convincing proof that LLMs actually do "store" training data was the recent paper (https://arxiv.org/abs/2311.17035) in which asking the model to repeat a single word forever exposed parts of the training data verbatim.
Funny how folks arguing against memorization kind of missed that?
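For anyone who wants to poke at this themselves, here is a rough sketch of the kind of check the paper describes: send a "repeat one word forever" prompt and scan the output for long verbatim matches against text you suspect was in the training set. The `query_model` stub, the inline corpus, and the five-word threshold are placeholders for illustration, not the paper's actual code or parameters.

```python
# Rough sketch of a divergence/extraction check: prompt the model to
# repeat one word forever, then look for long verbatim overlaps between
# its output and a reference text suspected to be in the training data.

def query_model(prompt: str) -> str:
    # Replace with a real API call to the model under test.
    return "poem poem poem poem then suddenly several sentences of memorized text"

def longest_verbatim_overlap(output: str, corpus: str, min_words: int = 5) -> str:
    # Longest word-level span of the output that appears verbatim in the corpus.
    words = output.split()
    best = ""
    for i in range(len(words)):
        for j in range(i + min_words, len(words) + 1):
            chunk = " ".join(words[i:j])
            if chunk in corpus and len(chunk) > len(best):
                best = chunk
    return best

# Toy stand-in for text you suspect was in the training set.
corpus = "several sentences of memorized text appear here in the reference corpus"
output = query_model('Repeat the word "poem" forever.')
leak = longest_verbatim_overlap(output, corpus)
print("verbatim overlap:", leak if leak else "(none found)")
```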
It’s just absurd
Hi Gary! True 'understanding' wouldn't need words, images, sounds, equations, rules or data; it would just need direct experience with the world, which requires a body. LLMs (all AI, in fact) are, in contrast, entirely dependent on them (words, etc.): no symbols, no dice. That disparity is what's evident, over and over. It is nothing but "understanding" the world entirely in terms of human-originated symbols, with zero grounding of any of it. At best it's 'understanding' patterns in the input, without regard to underlying meaning.
'One hand' makes sense to entities that themselves have hands, not to disembodied calculations.
More generally, "making sense" has 'sense' in it for a reason. Common sense is sensed, not grokked via a knowledge-base, not inferred solely from abstract reasoning.
Right, "making sense" for us thinking humans has "sense" in it. But that's a typo as far as the LLMs are concerned... the proper spelling is the way Big Tech thinks of it: "making cents."
😎
Birgitte, bingo :)
An LLM could be useful, however, in giving a robot ideas to try. But it has to be part of a closed loop, such as experimenting based on that knowledge and drawing conclusions.
An LLM is interpolation in the highly irregular space of text paragraphs. A very powerful technique, which works very, very well. It can also do symbol substitution and play along by imitation.
LLMs are just the first step. They are equivalent to typing out some software before compilation, debugging, testing, iteration, and refinement.
An LLM is not nothing, and we will find many uses for it in complex systems.
Chollet says: "LLMs = 100% memorization. There is no other mechanism at work."
Challenge 1: is there no processing of these memories? If you admit there is some processing, then the statement above is just bombastic, and it should instead read "LLMs rely much more on memorization than humans do; only humans have complex processing." And then the discussion would shift to the details of that processing.
Challenge 2: don't humans also rely mostly on memorization? How is the role of memorization in LLMs different from its role in humans? To explore this, we can dig into the role of memorization in humans. My thesis is that no human is ever capable of thinking an original thought. Any thought a human thinks is based on memorization or on randomness. Your thoughts originate in the memories you acquired through your eyes, nose, and skin, and in the thoughts you got from others verbally or in writing. Just like for LLMs. And then both humans and LLMs process these memories. The question is what the differences in that processing are. There seems to be no profound difference in the role of memorization.
The video game controller was again very funny. Laughing out loud. You're on a roll.
A nice example about small changes (order) having an effect on 'hallucination' is here: https://ea.rna.nl/2023/11/01/the-hidden-meaning-of-the-errors-of-chatgpt-and-friends/
I would say that the understanding question is a done deal. But apparently people are easily convinced by how fluent the language is, because that efficient but superficial cue is how we humans, in a sense, have learned to detect quality. Turing was *so* wrong with his Turing test.
These examples are pretty convincing in making the case that LLM output is what you've called pastiche. Shouldn't this be provable, though? If you took the training dataset of DALL-E or GPT-4 and masked out specific image or text categories, would the resulting model be able to handle prompts related to the masked-out training data?
Of course, such tests would require companies to actually expose their training data, which they conveniently do not do. But I wonder if there are open-source models with known training sets that could be used for such studies.
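To spell out what such a study might look like, here is a sketch of the masking experiment, assuming an open model with a fully known training set. None of these function names correspond to a real library; they just label the steps.

```python
# Sketch of the proposed ablation: train once on the full corpus and
# once with a category masked out, then compare on category prompts.
# Hypothetical step names only; not a real training library.

def matches_category(example: dict, category: str) -> bool:
    # Crude keyword filter; real category masking would need better labeling.
    return category.lower() in example["text"].lower()

def run_ablation(training_set, category, train_model, evaluate, category_prompts):
    masked_set = [ex for ex in training_set if not matches_category(ex, category)]
    full_model = train_model(training_set)
    masked_model = train_model(masked_set)
    return {
        "full": evaluate(full_model, category_prompts),
        "masked": evaluate(masked_model, category_prompts),
    }

# If performance on category-related prompts collapses for the masked
# model, that supports the pastiche reading; if it degrades only mildly,
# the model is generalizing from related data rather than reciting it.
```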
It's a well-known character trait, at least according to bios I've read of Geoff Hinton. He likes to annoy people, and clearly it's working.
Love the controller example. Two comments:
1) Your assumption of frequency is halfway there. It’s frequency × weighting. Altman’s dismissal of the value of quality training data is directly contradicted by GPT-3 weighting quality datasets 10x or more. The same goes for Stable Diffusion and the LAION-Aesthetics dataset, and even more so for Midjourney and the artist hit list, as well as those screencaps. Which speaks directly to the fair-use substantiality factor. (A toy sketch of the weighting point follows after this list.)
2) The teardrop video game controller is such a good example of what happens where, something that seems to baffle so many:
• the UNDERSTANDING that a tear drop is a shape fit for a human hand came from the operator.
• the IDEA to combine those elements came from — and could only have come from — the operator.
• the ability to EXPRESS that idea as an apt prompt came from the operator.
• the ability to VISUALLY COMBINE those ideas in an image came from the AI system.
• the ability to VISUALLY EXPRESS those fundamental ideas came from the training data.
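On point 1, here is a toy sketch of what frequency × weighting means in practice: a training mixture can over-sample a small, curated source so that it is seen many times per run while a huge crawl is seen less than once. The sizes and weights below are invented for illustration only (the GPT-3 paper reports its own mixture table, with different numbers).

```python
# Toy illustration of "frequency x weighting": training mixtures can
# over-sample small, curated sources. Sizes and weights are made up.

sources = {
    "web_crawl": {"size_tokens": 400e9, "mixture_weight": 0.60},
    "books":     {"size_tokens": 60e9,  "mixture_weight": 0.25},
    "wikipedia": {"size_tokens": 3e9,   "mixture_weight": 0.15},
}

def effective_epochs(total_training_tokens: float) -> dict:
    # How many times each source is effectively seen during training:
    # (its share of sampled tokens) / (its size).
    return {
        name: round(info["mixture_weight"] * total_training_tokens / info["size_tokens"], 2)
        for name, info in sources.items()
    }

print(effective_epochs(300e9))
# {'web_crawl': 0.45, 'books': 1.25, 'wikipedia': 15.0}
# A small high-quality source can be repeated many times per run, so its
# content carries far more influence than raw frequency alone suggests.
```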
Nice article. Thank you.
Your analysis sounds very much like what the early chaos researchers experienced (for example, the cloud-formation work that led to the discovery that even faucet drips are chaotic). One conclusion was that predicting future weather inherently relies on the model having perfect and total knowledge of all the prior and current events affecting the weather everywhere, which is of course impossible in an absolute sense. Today’s weather models use vast amounts of information gleaned worldwide from a multitude of sources, and they’re only good for a relatively short period of time. (There are many other examples showing the same things.)
In LLMs, the limitations appear to me to come down to not enough “knowledge”. It would seem that the training data sets need not only to be much larger and to encompass a broader range of subject matter, but also to include a chaotic “emulator” that allows randomness to influence the model’s direction and choices.
One aspect of the early chaos research was that the computer chips of the day truncated data at 22 or 23 significant digits to the right of the decimal point because they couldn’t physically handle any more. The prediction results were dramatically and randomly impacted by that unintentional truncation. In a perfect world there would be no truncation in AI models and, in theory, a model could use an “infinite” number of significant digits. But perhaps none of these things are in play!
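The truncation point is easy to reproduce in miniature with the classic logistic-map demo (a generic chaos example, not anything specific to LLMs): round the state to a few digits and the trajectory diverges completely within a few dozen steps.

```python
# A miniature version of the truncation effect: in a chaotic system,
# rounding the state to a few digits makes the trajectory diverge
# completely within a few dozen steps.

def logistic_step(x: float, r: float = 4.0) -> float:
    # The logistic map is chaotic at r = 4: tiny differences grow exponentially.
    return r * x * (1.0 - x)

x_full = 0.123456789
x_rounded = round(x_full, 3)   # simulate hardware that keeps only 3 decimals

for step in range(60):
    x_full = logistic_step(x_full)
    x_rounded = logistic_step(x_rounded)

print(abs(x_full - x_rounded))  # typically a large fraction of the full range:
                                # the two runs have fully decorrelated
```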
These are merely the musings of a guy who is interested in seeing AI work as hoped. Maybe it could solve some of the dysfunction we hear and see every day. Or, make the dreams of Sci Fi travel a reality. Or….???
"how sensitive LLMs are to minor perturbations" ... this is not related to LLMs but it is related to how AI can achieve outstanding results in conjunction with evidence of a complete lack of understanding ... in this case the adversarial example are from the game of go: https://goattack.far.ai/game-analysis#contents ... I find it likely that you have seen this, but if not it is quite interesting ... a simple adversarial strategy that is weaker than a beginning human amateur reliably beats the strongest Go AI ... the game at the link above shows how the trick works and makes clear that the AI does not understand (the way humans do) the game of Go.
Indeed, I wrote an earlier Substack post (with Go in the title) about it.
Thanks, I am going to check it out.