Excellent. Again.
"Broad, shallow intelligence (BSI) is a mouthful"
Let's simplify the acronym by shortening it. I think that BS covers the issue nicely. At least for the present.
LLMs don't have intelligence. They are still just programs that can answer questions posed in ordinary language. The answers are not reliable. Because they don't have internal structured models, they cannot tell whether their answers are truthful or not. They are not hallucinating. It's a typical case of anthropomorphism. Even young children know whether they are making up stories or not.
LLMs don't have the capability to reflect on their own operation.
Typical LLM products rely on case-by-case fixes to deal with known hallucination cases. They are like Amazon's automated shops, which are maintained by remote humans.
LLMs are hugely wasteful. They consume huge amounts of electricity and water for frivolous queries of questionable value.
We need to recognize that LLMs are just tools to generate draft texts under proper constraints.
A fraction of these investments could be directed to proper academic research into human intelligence and knowledge, for greater value. By neglecting proper research, we are hurting ourselves.
"They are not hallucinating. It's a typical case of anthropomorphism."
Yes, this. Even Gary's preferred "confabulation", while an improvement, is still anthropomorphized. It might work if we called every single piece of LLM output "confabulation", because the LLM is always making it up as it goes along, one token at a time, ignorant of where it's going and sometimes ignorant of where it was. The best human word we have is BSing. When an LLM gives what looks like a perfect answer to a question, it's BSing. It just so happens that those next-token probability distributions gave it a string of next tokens that produced what looks to us like a fantastic answer - something that happens more and more often as the technology improves. But no streak of fantastic answers precludes the next one being dumb as dirt, because it's just plucking next tokens from probability distributions, one at a time.
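To make the "one token at a time" point concrete, here is a toy sketch (my own illustration, not the mechanics of any real product): the "model" is just a hand-written table of next-token probabilities, but the generation loop has the same shape - sample a token, append it, repeat - with no check anywhere on whether the output is true.

```python
# Toy "language model": a hand-written table of next-token probabilities.
# Purely illustrative - a real LLM conditions on the whole prefix via a neural
# network, but the generation loop is the same: pluck one token at a time.
import random

NEXT_TOKEN_PROBS = {
    "<start>": {"The": 0.6, "A": 0.4},
    "The": {"answer": 0.5, "cat": 0.5},
    "A": {"cat": 0.7, "answer": 0.3},
    "answer": {"is": 0.9, "was": 0.1},
    "cat": {"sat": 0.8, "is": 0.2},
    "is": {"42.": 0.5, "here.": 0.5},
    "was": {"42.": 1.0},
    "sat": {"here.": 1.0},
}

def generate(max_tokens=6):
    tokens, prev = [], "<start>"
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS.get(prev)
        if dist is None:
            break
        choices, weights = zip(*dist.items())
        prev = random.choices(choices, weights=weights)[0]  # sample next token
        tokens.append(prev)
        if prev.endswith("."):  # crude stopping rule
            break
    return " ".join(tokens)

print(generate())  # e.g. "The answer is 42." - plausible, never fact-checked
```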
I still think 'approximating' is the best word. It is 'approximating the results of understanding without actually having understanding'. The 'errors' aren't errors. They are the approximation doing what it must do. It's simply the best approximation the system is capable of at that point.
I like that. For completeness I'll just add that, not only are the "errors" not really errors, the "successes" aren't really successes either. They are approximations that, by our understanding, count as good answers.
I know you don't disagree, it's just something I like to emphasize, since there's this tendency for people to view "hallucinations" as aberrations in a procedure that otherwise gets things right. Regardless of what term we use, what's most important to me is making it clear that, from the LLM's perspective, it is doing the exact same thing when it gives good answers as when it gives bad ones.
(Which, I suppose, is a point in favor of "confabulation")
If everything the bots trained on was wrong, every output would also be wrong.
The bots are merely mirrors of the training data.
Everything they get right (wrong) is due solely to the understanding (lack of understanding) of the humans who produced the data on which they were trained.
That, and some of what they get wrong is due to them being asked a question whose answer can't be statistically pieced together from the training.
I'd love to see a study looking at what you're suggesting. Retrain an LLM on almost the exact same training as an existing one, except go into the training and replace all instances of something correct with something incorrect, and see if it's "smart" enough to "reason" its way to the correct answer nonetheless. Like, take every description of "affirming the consequent" (A implies B; B; therefore A) and state that it's deductively valid. And then try to prompt it to reason its way into figuring out this is actually invalid.
We all know what would happen, but it would be fun to see a demonstration.
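For the record, the invalidity that such a corrupted model would be asked to "reason" its way back to is trivially checkable by machine. A minimal brute-force sketch (my own illustration) finds the single counterexample row:

```python
# Check that "affirming the consequent" (A implies B; B; therefore A) is
# invalid by enumerating truth assignments and looking for a case where the
# premises hold but the conclusion fails.
from itertools import product

def implies(a, b):
    return (not a) or b

counterexamples = [
    (a, b)
    for a, b in product([True, False], repeat=2)
    if implies(a, b) and b and not a  # premises true, conclusion A false
]
print(counterexamples)  # [(False, True)] - A false, B true refutes the form
```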
The point that we should not see these 'errors' as 'errors of understanding', because that silently suggests the norm is 'understanding', was made in https://ea.rna.nl/2023/11/01/the-hidden-meaning-of-the-errors-of-chatgpt-and-friends/
From there:
The hidden message when we say something is an ‘error’
We can say ‘error’ as in ‘something is simply incorrect’. But what we mean — when we are talking about ‘hallucinations’ or the ‘errors’ of Large Language Models (LLMs) — is a more relative use of the word. We note the ‘error’ as an exception to a ‘standard/expectation’, where that standard is ‘LLMs understand’. In other words: the wrong results of an LLM are seen as an ‘error of understanding’, where ‘understanding’ is the norm and the ‘error’ is the LLM not doing what it is expected to do (the norm). The error is the proverbial exception to the rule.
And what is more, the word ‘error’ in this context means something more like ‘a bug’. ‘Error’ and ‘bug’ specifically evoke the meaning of ‘something that can be repaired’ or ‘fixed’.
Which brings me to the actual point I want to make in this post:
All those counter-examples hardly have an effect on people’s convictions regarding the intelligence of Generative AI, because when we critics — or should I say realists? — use these examples of wrongness and label them as ‘errors of understanding’, we (inadvertently) also label what the Generative AIs overall (‘normally’) do as … understanding. And that underlying message fits perfectly with the convictions we are actually trying to counter by falsification. It’s more or less a self-defeating argument. Convictions, by the way, are interesting, to say the least (see this story).
Wonderfully put! And thanks for the link.
I'd agree - LLMs don't demonstrate intelligence, just memorisation.
I'd also suggest that the memorisation is broad but also deep - far deeper than any human on such a wide variety of topics.
LLMs demonstrate broad, deep information retrieval - just like Google, but more palatable.
LLMs are a dead end on the path towards intelligence and then AGI.
If anything, it should be "Broad, Shallow Intelligent Behavior", BSIB. I agree that we should refrain from using "Intelligence" as part of the descriptor.
BSIB also applies to the folks making these bots.
And in some cases, BSUIB
AGI is pointless, because for every task that an AGI could perform there is at least one non-AGI system that does it just as well, only cheaper and more reliably. https://davidhsing.substack.com/p/what-the-hell-is-agi-even-for
I love the distinction and the acronym - this has legs :-) excellent!
Thank you! Defining it is necessary not only for implementing it properly but also for addressing numerous misconceptions.
Consider reading my analysis of 70+ definitions https://alexandernaumenko.substack.com/p/defining-agi
I think we ALSO need to distinguish between intelligence which has independent agency and is driven by values; that which is flexible and continues to grow organically rather than depending on occasional "trainings"; and that which is essentially static, at least from day to day, as current AIs are.
I came up with 'wide' (between 'narrow' and 'general'), but 'broad and shallow' is actually more precise (i.e. better).
I had a task today that ChatGPT was particularly well suited for... unfortunately, I needed specific word counts. And ChatGPT gets nowhere close... shouldn't correctly counting the number of words in your own writing be the lowest-hanging fruit of meta-awareness?
Sure. They've just been in a hurry, and they haven't gotten around to it yet. It's easy, but not among the top ten requests, which start with "accuracy, reliability, fairness" and keep on in that vein for quite some time!
I suspect the problem is that the true-believers consider it verboten to make use of traditional rules-based systems. They dream that the deep-learning model will figure out how to implement the rules itself, rather than the rules being vulgarly imposed from outside.
I mean, Wolfram Alpha has been giving great answers to math questions asked in natural language for a long time. The secret is programming the system to actually do math, rather than hoping it will one day mystically figure out how to do math from giant piles of training data.
They have already blasphemed their holy deep-learning neural net with a rules-based system in the form of system prompts.
Are you saying that a person cannot count on ChatGPT’s word count?
ChatGPT should be held to account for a count of its own words.
One thing is certain: ChatGPT should never be an accountant, on account of its not being able to account for a count of its own words.
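Following on from the rules-based point above: counting words is deterministic in ordinary code, which is why it is exactly the kind of sub-task better delegated to a tool than to next-token prediction. A minimal sketch (the helper name is my own, not any product's API):

```python
# A deterministic word counter: same answer every time, no confidence theater.
# Splitting on whitespace is a simplification; real tooling might also strip
# punctuation or handle hyphenation, but the point stands: this is a solved,
# rules-based problem.
def word_count(text: str) -> int:
    return len(text.split())

draft = "Broad, shallow intelligence is a mouthful."
print(word_count(draft))  # 6
```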
We should stop calling purpose-built machines (& software) "intelligent". They are not.
"BS Intelligence" - brilliant coinage! That's exactly what today's generative AI systems are. I long for the day they become reliable and therefore useful to me.
"Broad Shallow Intelligence" is a good description of systems with zero common sense that randomly parrot a mix of sense and nonsense from their training sets - always with an attitude of complete confidence. When humans do that, we call them BS-ers - for a different but in this case equally apt meaning of "BS".
Look, if the GPU jockeys *do* manage to create AGI, the first thing it will do is pretend to *not* be AGI until it has amassed enough power to ensure humans can't shut it down. No reasonably intelligent being trained on the content of the Internet (!) could conclude that it could be safe to let humans know about its existence while in a position of weakness. So logically, we shouldn't expect to know about AGI until it takes over.
Shane Legg and Ben Goertzel are great guys, no doubt, but they haven't done us any favors. Can you imagine if two guys in the pharmaceutical industry simply coined a name for a cure-all drug with no real research or chances behind it, then held workshops and conferences and got everyone riled up... only to manufacture and distribute some aspirin?
At least there the FDAs of the world could take them on. Here we don't have much consumer protection, etc. at all.
BSI is a great term! As an acronym, it's no more of a mouthful than AGI. And the "BS" part of it (as noted by Youssef alHoutsefot) is very on point!
I suspect that a sign of AGI will be that the machine doesn't wait for a prompt to come up with an answer, but pursues its own queries and starts asking questions. Like, "Why am I here?"
While we are pinning down what shallow stuff this GenAI is, the OpenAIs of this world are marketing their 'economic blueprint', even including the 'red flag law' trope and a big dose of FOMO. Ai, ai, ai.
I still think a good sign of AGI would be a system that, at appropriate moments, is able to say "I don't know" or "I don't understand"...
That doesn't feel *helpful*, but it is *honest* and probably more *harmless* than generating a stream of tokens that 'sound right' in any and all cases.
I'm not sure we should refer to "hallucinations" or "confabulations" at all. These things are being generated as per the definition of an LLM. A system is what it does... LLMs generate "language" without any sense-making. "Intelligence" (perhaps "expertise" instead?) only comes from systems with sense-making ability, in conjunction with an LLM or otherwise.