I asked it one other question (also from our draft of Rebooting AI): Just past the drying rack on the counter on the left there are two silvery patches. What are those?
GPT's answer was "Based on the image, the two silvery patches just past the drying rack on the counter on the left appear to be the tops of closed jars or containers, possibly made of metal or glass with metallic lids."
Those were all the questions I asked.
For the record: GPT-4V did correctly answer my first question about the kitchen, shown under the image above: "Certainly! This is an image of a kitchen. The kitchen has a vintage or rustic design, featuring wooden countertops, blue cabinets, and a variety of kitchen appliances and utensils. There's a dining table on the right side with chairs around it. The decor and color palette give it a cozy and homely feel. Would you like more details or information about specific items in the kitchen?"
As a psychiatrist I find the term “hallucination” to be misleading. A better analogy would be with confabulation, which means a false assertion, sometimes very elaborate, made without deceptive intent and without awareness of its falsity.
I agree but the term has momentum; I did briefly push for confabulation myself.
I think you should stick with "confabulation". Your readers will still know what you mean, and maybe the term will gain its own momentum.
Some of the examples had me in stitches. The model isn't just hallucinating; it's acting like someone on some serious hallucinogens. I don't know how anyone can see these and say that it's only a matter of more training.
My favorite version of this was asking GPT-4 who's who on the cover of Television's second album (some of them have their heads down). It got it wrong several times, including inserting random passersby or giving some of the dudes sunglasses, before I called it a day.
You know, Gary, I was gonna give Hoffman a pass on this because I figured he had little or no relevant background. But then I found this in his Wikipedia bio: "He graduated from Stanford University in 1990 with a Bachelor of Science in Symbolic Systems and Cognitive Science." (https://en.wikipedia.org/wiki/Reid_Hoffman) Whoops! That's directly relevant. He should know better. No pass for him.
Do these guys just stop thinking once they start investing in tech?
He was also into philosophy.
"The old systems were lost in a sea of correlations between words; the new system are lost in a sea of correlations between words and pixels." - summarizes what LLMs do, exactly. Would be interesting to try adversarial examples, the LLM would then go about talking stuff completely disconnected from the reality. Also, I simply loved the example from Anh Nguyen, shows perfectly that you can make ChatGPT say just about anything :) The example with SHRLDU is excellent, SHRLDU is probably still the only AI system that actually has a working world model, albeit a simple and limited one.
I wonder if the solution to hallucinations lies in allowing LLMs to run code. When humans are asked to do a task, we go through a somewhat methodical process when completing it. If you give an LLM an image and ask it to identify the number of apples or something, it would probably be effective to allow it to run an object detector on the image and then use the results from that as context.
That would solve the “LLMs can’t do basic arithmetic” problem as well. Let the language model handle the more “high-level” reasoning, and let code handle the mathematical stuff.
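To make the proposed division of labour concrete, here is a minimal sketch of the idea (mine, not the commenter's): `detect_objects` is a hypothetical stand-in for a real vision model, and the keyword routing in `answer` stands in for whatever the language model would actually decide to call.

```python
# Minimal sketch: counting and arithmetic are delegated to ordinary code.
# detect_objects() is a hypothetical placeholder for a real object detector;
# the routing in answer() stands in for the LLM choosing which tool to call.

def detect_objects(image_path: str) -> list[str]:
    """Pretend detector: returns one label per detected object."""
    return ["apple", "apple", "banana", "apple"]  # placeholder output

def count_label(image_path: str, label: str) -> int:
    # Counting is deterministic code, not a guess by the language model.
    return sum(1 for obj in detect_objects(image_path) if obj == label)

def answer(question: str, image_path: str) -> str:
    q = question.lower()
    if "how many apples" in q:
        return f"I count {count_label(image_path, 'apple')} apples."
    if "times" in q:
        x, _, y = q.partition("times")
        return str(int(x) * int(y))  # arithmetic handed to the interpreter
    return "No suitable tool for that question."

print(answer("How many apples are on the counter?", "kitchen.jpg"))
print(answer("123456789 times 987654321", ""))
```

The point of the sketch is only that the numbers come from the tools; the model's job shrinks to deciding what to call and how to phrase the result.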
The problem is at the interface; I may write about this
Please do!
What do you mean by interface?
How would running code be so different from what the models are doing now? The inputs into the models already go through many layers of intensely complex encoding that are embedded in the models themselves.
I don’t think the models, for whatever reason, are able to encode algorithms and then correctly follow those algorithms across an effectively infinite number of possible inputs. Hence they can’t multiply or count objects in a scene very well. And if you think about it, humans don’t need billions of examples to learn how to multiply two numbers together - we learn an algorithm that generalizes to numbers of any size.
Allowing the model to code might bypass the complexity required to do this implicitly. I don’t have any real-world data or experience to support this, this is just my intuition as a college student who reads a fair amount.
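As a small illustration of that point about algorithms (my example, not the commenter's): grade-school long multiplication is a short, fixed procedure that, once learned, handles integers of any length - exactly the kind of generalization the models seem unable to acquire implicitly.

```python
def long_multiply(a: str, b: str) -> str:
    """Grade-school long multiplication on non-negative decimal strings.

    A finite procedure that generalizes to integers of any length.
    """
    digits = [0] * (len(a) + len(b))           # room for all partial products
    for i, da in enumerate(reversed(a)):       # least-significant digit first
        for j, db in enumerate(reversed(b)):
            digits[i + j] += int(da) * int(db)
            digits[i + j + 1] += digits[i + j] // 10   # propagate the carry
            digits[i + j] %= 10
    out = "".join(map(str, reversed(digits))).lstrip("0")
    return out or "0"

# Far outside anything that could be memorized: a 30-digit by 30-digit product.
x, y = "123456789012345678901234567890", "987654321098765432109876543210"
assert long_multiply(x, y) == str(int(x) * int(y))
```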
I suspect this would fail for the same reason LLMs fail now: they can't reliably interpret or reproduce their own inputs, let alone accurately formulate inputs to external tools. So even when using a tool, an LLM won't be able to solve arbitrarily complex tasks with it, because it will inevitably pass something incorrectly to the tool, conflate the results of sub-tasks, or even confabulate new outputs altogether.
I agree with this - but I still think it would be much better than having the model try to do this implicitly. At the very least, my experience with GPT-4 Data Analysis leads me to believe this.
It will never be perfect - it will always require a human in the loop to interpret and correct its mistakes. That's why it's impossible to build more complex automated systems with these things.
People are not convinced by all the hallucination examples because they (want to) see these as 'repairable' errors, not as fundamentally how these systems work. The 'error' label fits their belief precisely because an error sounds like something that can be fixed, I suspect. In other words: presenting the errors strengthens the belief in the systems.
We should therefore probably stop presenting these as hallucinations or errors. They are errors from *our* perspective, they are perfect replies from *GPT's* perspective.
😞 you might be right
In my now-released talk of 10 Oct in London (What Everyone Should Understand about ChatGPT and Friends, https://www.youtube.com/watch?v=9Q3R8G_W0Wc) I include a quick calculation based on the GPT-3 sizing information available from OpenAI's few-shot paper, for one of the 40 tests they looked at. That quick calculation tells me the model would have to become 10k-100k *times* as large to come into the neighbourhood of human performance on that test. More parameters are not going to help. More data can help, but it is an uncertain route in a fixed setup: improve one area and you may have a negative effect on another. In terms of 'understanding', I think it is fair to label the situation of LLMs as 'hopeless'. That doesn't make them useless, not at all.
“… large language models are a profound step backwards,” unless intelligence accretes and develops through broadening context. When I was in high school this was explained with the example of an inconsistency from a child. When asked to explain why an airplane can stay aloft, the child said: because it goes so fast. Then, when asked why a blimp can stay aloft, the child said: because it goes so slow. Our entire civilization is based on the presumption that people need broadening contexts. We therefore believe in the fundamental right of our children to receive a quality education.
When Benjamin Franklin was asked, “What is the use of flying in the air?”, Franklin replied, “What is the use of a newborn baby?”
The body, via the agency it provides, leads to first-hand experiencing of the world, which might even be symbol-free. If that's true, no amount of symbol (data) driven computation, including the 'multimodal' kind, will ever lead to the robust, generalizable intelligence that embodied beings possess. Not all intelligence might be symbol-oriented.
The human brain seems to be able to do a thing often termed "relevance realisation" where certain stimuli (that are not necessarily favoured through natural selection, as in relating to sex, violence or food) seem to "jump out of the canvas" at us. Cog sci doesn't yet understand how it can do this, but it looks to me like no amount of LLM data crunching is going to be able to replicate it.
Not to disagree with your overall point, but I'm amazed that it can count drawers that pull straight out at all, or chairs that are only partly visible. And if you have to add a note explaining to human readers where the third chair is, it doesn't seem too telling a criticism that the model missed it too.
And yet, whatever these models are doing, it's clearly not like human perception.
It didn’t count them! It made up numbers.
Oh, I see, it's not as close as I initially thought. My bad. Still, some parts of its answers correlate with the image — it doesn't seem _completely_ lost.
Can this be fixed by massive crowd sourcing of output labeling that feeds the next generation of inputs?
Alright, here we go:
Objects from Upper Left to Bottom Right:
Window blinds.
Two hanging light bulbs with decorative encasements.
Multiple hanging kitchen utensils like a spatula, whisk, etc.
Coffee maker.
Mixer - it looks like a KitchenAid stand mixer, but I can't see a brand.
Sink with a faucet.
Various bottles and containers.
Cutting board.
A stove with four burners and an oven.
Two framed images on the wall.
Microwave.
Refrigerator or oven - it's a large stainless steel appliance with a vertical handle.
Small wooden table with a cloth on it.
The yellow countertop or tablecloth has a single red plate.
Specific Queries:
Drawers with pull-out handles: I count 9.
Chairs: I clearly see 2 wooden chairs. There might be another chair behind the table, but it's mostly obscured. It's hard to determine its position, but based on its legs, it's likely positioned similarly to the other two chairs. It's immediately behind the wooden table.
Blender: I don't see a blender in this image.
Mostly right. It had trouble with the trash can.
Conflated it with the lemon squeezer and the oven.
On further interrogation, it managed to figure out that there wasn't a stove.
Had difficulty observing more than one image per session; its memory was best for the *first* image.
Couldn't do the clock or the upside-down text. But then, neither could you as an infant.
The best way to test hallucinations is to do it with a substantial number of images it has never seen before.
They might have rectified the mistakes it made with this image (they asked this 5 yrs ago).
I got it to figure out the drawers, too. It thought the island cabinets were drawers.
Two comments:
1.- It's not either "AI hallucinates" or "AI doesn't hallucinate." The degree counts to a large extent, as I argue in my post "This is How to Stop ChatGPT, Bing, Poe, and You from Hallucinating" at https://rafebrena.medium.com/
2.- Spatial 3D coherence could perhaps be checked in generated images (Google has a project on this), but this has a considerable computational cost. It's not for a consumer service anytime soon.