It’s kind of surreal to compare some of the talks at TED yesterday with reality. Microsoft’s Mustafa Suleyman promised that hallucinations would be cured “soon”, yet my X feed is still filled with examples like these from Princeton Professor Aleksandra Korolova:
What is desirable about Khosla's visions #1 and #2?
AI doctors will be wrong for a long time. And speaking as a Stage IV cancer patient, how are AIs going to develop empathy anytime soon, especially if they are themselves disembodied? In my experience, even embodied human doctors are mostly bad at this. Similarly, AI tutors are going to be teaching incorrect material -- and why should any child need a tutor 24/7?
As for #2, it sounds horrifying. Labor will be free -- for those who pay for labor. What will happen to those who *get paid for* labor? "Training," or the same old nonsense? Teaching everyone to code, even if they hate it? (Oh, wait, even GPT-4 does that already.) If this is techno-optimism, it's clearly only so if you're a member of the right economic class.
It's a libertarian's utopia. Everything is reduced to owning land and choosing what to do with it.
It's really difficult to take any of these VC/influencer talks seriously. Predictions that are beyond the time horizon where anyone will remember them or care that he made them?! He just has no idea how or when any of that will happen. Sure, all those things will come true, but is this closer to the ancient Greeks saying one day man will set foot on the Moon, or Kennedy saying the same thing? The Greeks had no idea what they'd even have to learn to solve that problem; Kennedy knew it was just engineering by that point. It's the *perfect* TED Talk. Everyone claps and praises him, he notches another 'brave' TED Talk, and AI such as it is tells you it's a better mother than you are.
"Predictions that are beyond the time horizon where anyone will remember them or care that he made them"
Yep, this. Dan Gardner's book "Future Babble" is all about this (Nassim Taleb talks about it a lot, too). We just *love* hearing people talk about what the future will hold, and we're unfazed by the pathetic record of our past predictions.
And all this techno-optimism is based on the transformer innovation in 2017, how that led to a 175B-parameter model in 2020, and, after three (sometimes horrid) years of fine-tuning, ChatGPT in 2022? Because all the breakthroughs we need to get from that to something actually resembling 'intelligence' and 'understanding' are *completely* unsolved/unknown. We have no clue how to do that, which basically puts any belief in all that squarely in the domain of alchemists being convinced we would get from lead to gold. That one didn't happen, not even in five hundred years, regardless of the solid convictions.
"Understanding" means having an adequate model. LLM understands nothing about fluids, but a fluid modeling program "understands" quite a bit, if at a low level.
A tool that can give you a proof that involves logic manipulation and numerical computations "understands" logic and numbers. Not the higher-purpose, but at least what it is dealing with.
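To make the "adequate model" point concrete, here is a minimal toy sketch (the grid size and coefficient are arbitrary, not from the thread): a 1-D heat-diffusion step whose update rule encodes the dynamics explicitly -- that explicit rule is the only "understanding" such a program has.

```python
import numpy as np

def diffuse(u, alpha=0.1, steps=100):
    """Explicit finite-difference update: u <- u + alpha * d2u/dx2."""
    u = u.copy()
    for _ in range(steps):
        # interior points are updated from their neighbours; boundaries stay fixed
        u[1:-1] += alpha * (u[2:] - 2 * u[1:-1] + u[:-2])
    return u

field = np.zeros(50)
field[25] = 1.0                    # a spike of heat in the middle
print(diffuse(field).round(3))     # the spike spreads out, as the update rule dictates
```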
What is intriguing about current methods is that they give up on trying to reverse-engineer intelligence. The hope is that collecting and cataloging a lot of data about how we do things will allow the system to learn by imitation. Appropriate domain-specific modeling will help keep the system grounded.
We have no idea how to connect human-domain-specific modelling (e.g. Wolfram-, Cyc-like, etc.) and pixel/token-sequence-domain-specific modelling (current NNs/GenAI). E.g. an LLM has to 'understand' in order to 'decide' *when* to use *what* human-domain system, and for that the understanding is a prerequisite. But that understanding is exactly what those domain-specific models were supposed to bring. That's a Baron Münchhausen type of situation. It is circular.
"The hope is that collecting and cataloging a lot of data about how we do things will allow the system to learn by imitation. Appropriate domain-specific modeling will help keep the system grounded." Sure, that is the *hope*. More that that, it is the *assumption*. Which at this point has the same status as being convinced 500 years ago that all metals were one and the same so they could be converted into each other. We have convictions galore on this front. My point was that that is *all* we have.
We have not even the beginning of an idea how that could be done. One thing we do know: the initial hope was that those years of specific fine-tuning (you know, the cheap labor from English-speaking African countries, among others) would already be that 'learning by imitation'. That has already been shown to be a dead end (e.g. it doesn't really scale, so most of the energy is now directed at engineering around it).
As soon as you dig a bit below the surface, you see techniques that may have very useful and 'satisficing' uses, but *nothing* that supports that hope, neither the direct one nor the 'combination' one. On the contrary.
Maybe I'm wrong and I have missed something. But in that case, point me to something real, not 'hope'.
We have already made good progress going from pixels to labels. So, when a robot is in the kitchen, it will be able to look around and realize that this is a kitchen. That here's a cupboard. That's where the dishwasher is. That's the sink.
Then, it has to be given lots of rules, spelled out via text. Such as: the trash is under the sink. It must first open the door, bow down, peek inside, and, among all the junk, find the trash can.
Then, it has to be told how to translate text that says "bow down and look" into motor commands.
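Roughly, the pipeline being described would look something like this toy sketch (every name here is a made-up placeholder, not a real robotics stack):

```python
# Toy pipeline for the three steps above: pixels -> labels,
# labels + text rules -> plan, plan steps -> motor commands.

RULES = {
    "take out the trash": [
        "open the door under the sink",
        "bow down and look",
        "find the trash can among the junk",
    ],
}

MOTOR_COMMANDS = {
    "open the door under the sink": "gripper.pull(handle='sink_cabinet')",
    "bow down and look": "torso.pitch(-30); camera.scan()",
    "find the trash can among the junk": "vision.match(label='trash_can')",
}

def perceive(image):
    # Stand-in for a pixels-to-labels model; pretend it recognized these objects.
    return ["kitchen", "cupboard", "dishwasher", "sink"]

def run(task, image):
    labels = perceive(image)
    if "sink" not in labels:              # ground the text rule in what was actually seen
        raise RuntimeError("no sink in view, cannot apply the rule")
    for step in RULES[task]:
        print(f"{step!r} -> {MOTOR_COMMANDS[step]}")

run("take out the trash", image=None)
```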
Sounds like a little more than just "hope"?
References?
https://arxiv.org/pdf/2311.07226
That paper (a bit too much fawning language for my taste, btw) is an overview of something different: using LLMs to instruct robots whose behaviour has itself been 'tokenised'. That is different from the problem of grounding LLM output in 'understanding'.
A good example of a real (and interesting) research paper on *that* subject is https://arxiv.org/pdf/2212.06817.pdf (from Google). That one has the Kitchen1 and Kitchen2 testing, for instance.
Even the technologies of the 1980s and 1990s found some niche uses. Real ones. Always (in my memory) by seriously limiting degrees of freedom, i.e. scaling down the complexity of the problem space. You see that in the LLM-robotics space now, where the complexity has been scaled down to a relatively small set of 'behavioural tokens' by which the robot is steered.
LLMs will too, I suspect, but the problem of getting LLMs and symbolic models married is completely unsolved. There are some technical tips and tricks (like GPT generating Python code to execute when it 'detects' arithmetic, instead of trying to approximate arithmetic directly by continuation).
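The shape of that trick, as a toy sketch (ask_llm is a made-up placeholder, not any real API; this shows the pattern, not an implementation):

```python
import re

def ask_llm(prompt):
    # Hypothetical stand-in for a completion API; here we simply pretend the
    # model returned a one-line Python expression for the calculation.
    return "123456789 * 987654321"

def answer(question):
    if re.search(r"\d+\s*[-+*/]\s*\d+", question):
        # Delegate: have the model emit code, then evaluate that exactly,
        # instead of letting it "continue" its way to an approximate answer.
        expr = ask_llm("Write one Python expression that computes: " + question)
        return str(eval(expr, {"__builtins__": {}}, {}))   # toy sandbox only
    return ask_llm(question)

print(answer("What is 123456789 * 987654321?"))
```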
I guess https://ea.rna.nl/2022/10/24/on-the-psychology-of-architecture-and-the-architecture-of-psychology/ remains an important aspect. We see what we want to see. I read these papers and see the holes that fit my conviction. You read the same papers and see the confirmation of your conviction.
I'm still interested to hear if you have a paper (preferably not just some submitted student overview without actual research) that shows some new thinking on how to make sure LLMs actually understand in the way we, for instance, understand the 'equivalence' between 5-6-7 and 105-106-107.
Thank you Gerben, I love the alchemy reference, especially having had the privilege of enjoying live gold-smelting demos in Ballarat, Victoria, Australia, where the staff poured molten gold into pots to be cooled immediately in water to form solid gold bars (recreating the gold-rush environment of the 1850s).
But before we can even talk about “intelligence” or “understanding”, can we first address predictability and reliability?
Gold production is deterministic (those touristy gold demos run several times a day). But RAG LLM output generation is not. I have invested countless hours in trying to stabilise LLM output in the hope of integrating GenAI into my SaaS pipeline. I have spent too long in the lab with Llama 3 / Claude 3 Opus / Falcon 180B / Mistral / GPT-4 tests, as my family can attest. I fear I will not achieve deterministic output from these overhyped models in my lifetime!
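The closest I have got is pinning the sampling knobs and validating the output. A minimal sketch of the pattern (the chat wrapper is a made-up placeholder, not any vendor's API, and parameter names differ per provider):

```python
import json

def chat(prompt, temperature=0.0, seed=42):
    # Made-up wrapper: a real call would pass temperature/seed to the provider.
    # Even with temperature=0 and a fixed seed, server-side changes and
    # floating-point nondeterminism can still vary the completion.
    return '{"name": "placeholder", "date": "2024-01-01"}'   # canned stand-in

def stable_extract(document, retries=3):
    # Common mitigations: pin the knobs, constrain the format, validate, retry.
    # This buys reliability, not determinism.
    prompt = "Return ONLY a JSON object with keys 'name' and 'date'.\n\n" + document
    for _ in range(retries):
        out = chat(prompt)
        try:
            return json.loads(out)
        except json.JSONDecodeError:
            continue
    raise ValueError("model never produced parseable output")

print(stable_extract("Invoice from ACME, dated 2024-01-01."))
```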
Deterministic output is against what these models *fundamentally* do. Their strength is on the 'creative' side (many people object to the use of 'creative' here, but that gets us into deep philosophy). The problem is that the creativity we want and the hallucinations we do not want are, under the hood, one and the same thing.
There is a difference between 'reliable' (enough) and 'deterministic'.
Thank you Gerben for your insightful reply.
I agree with your points. And my concern is that such "creative" tools are being misapplied especially in critical domains like healthcare, with millions spent on model training and deployment in the corporate hope that profits can be made off the back of job redundancies.
And I worry too about nondeterministic "creative" LLM hallucinations being yet another needless burden a stretched medical environment would have to grapple with.
Plus, I would be utterly furious if my child were misdiagnosed by an LLM and an incorrect but urgent surgical procedure were ordered as a result.
I suspect the medical world has more guardrails than most businesses. I worry about IT landscapes filled with lots of very poor AI-generated code. The long-term effect of that is going to be really bad. See https://www.linkedin.com/posts/gerbenwierda_debunking-devin-first-ai-software-engineer-activity-7185248033578655744-ojAR
Thanks Gerben.
I wish that you were right about the medical industry having more guardrails.
Unfortunately the exact opposite may indeed happen.
This is because of the negative impact of inflation on the private hospital sector, as is the case here in Oceania. Many hospitals have had to cease operations with warnings of more to come.
With cost-cutting measures and higher workloads, LLM safety (as in most industries) may be the first thing to go right to the bottom of the priority pile. Not only does that adversely affect patient outcomes, but it may also be weaponised against private health insurance customers in the form of "evidence" for lowered or declined payouts by their insurers.
They are absolutely going to agentify these untrustworthy systems, no matter the human cost, so they can fool themselves into thinking they are on the way to creating their dream ASI god. And adjacent to that: imagine if biologists said they were creating a successor species; they would be jailed immediately.
That's the next moronic step. Then they'll put those untrustworthy agents into robots. Today's AI world is 90% people who want to be rich and 10% people who know what they're doing.
Putting an agent into a robot is, in fact, a great idea. Any time a robot does something dumb, that will be a learning experience (that hopefully won't get anybody killed).
The robot will have to learn from experience the countless rules of thumb that we use to do stuff, and hopefully find patterns that explain them.
You're describing Russian roulette. "Hopefully" there's not a round in the chamber.
I am not saying we should put a 100-horsepower metal beast in your living room. That would be bad. The robots should have very little force and torque. There are also designs with pneumatic muscles.
Such robots will, in the worst case, sulk in the corner, and won't have enough strength to lift a chair.
Then they will have to be trained to do something. An LLM can give them ideas to try. Hallucination will make them somewhat ineffective, but that can be sorted out with more data based on their first-hand experience with failure.
“In from three to eight years we will have a machine with the general intelligence of an average human being.”
---- Marvin Minsky, Life magazine, 1970
And that was a year after he and Papert “proved” that connectionism was a dead end. Some days I think the only things bigger than the budgets of the corporations trying to score billions off of AI are the egos of the techies trying to do the scoring.
Humans have a bad record at predicting the future of technology. Either too optimistic or too negative. Thanks, Gary. I don’t always agree with you, but this is on point. We need to ask tough questions and have a BS meter for these talks and other assertions by AI technologists.
Helen Toner's talk was my favourite. We keep on seeing 'more calls' for AI auditing but not seeing anything substantial actually happen yet.
Why 2049? Why not 2050? Because 2050 sounds like a wild-ass guess but 2049 sounds like a calculation that was reached somehow. But it's a wild-ass guess. Same old AI predictions that have been given for decades.
Last year's TED: a lot of wishful thinking, with at least something to substantiate it.
This year’s TED: a lot of wishful thinking.
The interesting question is what are they smoking :)
Thank you, Gary, for that summary. I am the PR writer who worked with your book The Birth of the Mind. It is great to read this work, so sane on a topic scary to the nontech world.
I'm reminded of an episode of South Park, which I never watch. It features a business plan hatched by the Underpants Gnomes, small humanoids that steal underpants. They have a three-phase business plan: 1) collect underpants, 2) ?, 3) profit. Instead we have: 1) scale up, 2) ?, 3) prosperity for all, including a pony for every child.
Agree with your comparison to nuclear energy.
AI winters have been all about inflated expectations (I joined AI in the early 80s, during the deepest AI winter), so we better try to be real today.
I'd have to agree with the New Species aspect; they are, or will be, unlike us in the way they "think".
I also suspect that the bitter lesson will continue to be bitter.
Electricity production from nuclear plants was stigmatized by stirring up fear of the life hazards of radiation and radioactive contamination, based on the Chernobyl and Fukushima examples. Nothing of that sort concerning AI proliferation, for the moment. Nothing sufficiently dangerous and spectacular has occurred to this day to warn people, to show them that this technology in its present state is not really ready for general, unrestricted use, and that they are not protected against its potential negative consequences. My concern is not that there will be too much resistance from the public but that there will be too little resistance, too little criticism across society. Apart from experts issuing warnings in specialized conferences and publications, and apart from creators and editors worrying about their IP, there is little awareness of this technology among the general audience. Most of the mainstream public media convey a very positive, sometimes nearly enthusiastic image of our common glorious future with AI. Non-expert users, average people, will easily adopt this not-yet-reliable, not-yet-safe, not-yet-regulated, not actually controllable technology simply because it is cheap, trendy, handy and apparently efficient. I wish there were more public discussion of and opposition to this technology, a resistance allowing time to set up some regulations and to instill good practices and safety rules in all users.
Marcus writes, "We won’t get to a billion personal robots if they are as dodgy as driverless cars, frequently working, yet stymied often enough by outliers that we can’t fully count on them."
According to NPR stories, two companies plan to have driverless semi tractor-trailers on selected roads in Texas by the end of the year. So, how to think about that?
Yes, we can't fully count on driverless vehicles. But as compared to what? A quick trip on any interstate highway reveals that MANY human drivers are completely content to tailgate us at 75 mph. And tailgating isn't really an adequate description; it's often more like NASCAR drafting. Vast numbers of human drivers truly don't care about anybody's safety, including their own.
So the question isn't, are driverless vehicles perfect? The question instead is, can driverless vehicles, on average, equal or exceed the quality of human drivers?
It seems we can apply this common-sense logic to many things about AI. It's not enough to simply point out AI's flaws; we should be comparing those flaws to human flaws.
As one example, many people claim that AI text content is pretty low quality. Well, as compared to what? Have those making such claims experienced social media, the largest content trash pile in human history, all generated by humans?