The thing about promises in Silicon Valley is that accountability rarely shows up. Investors poured over $100 billion into the driverless car industry and so far have little to show for it. Endless promises (and empty predictions) were made at essentially no cost to those who made them. So what if Elon made bad predictions year after year? Nobody cares.
AI GENERAL'S WARNING: This product produces poetry and jokes ... it is not intended for use as a search engine, truth generator, fact retrieval system, or to be depended on in any real-life situation... 😂
And yet real organisations build real solutions with RAG - in real life situations 🙃
I for one cannot believe all of this AI doesn’t work perfectly yet because all other technologies that are older work great.
This is not the first time Ng has been "overly optimistic", to give him the benefit of the doubt. Here is an article by him from some six years ago: https://medium.com/@andrewng/self-driving-cars-are-here-aea1752b1ad0
Self-driving cars were not "here" when he published that, and they are not "here" today. The fundamental challenges that prevent them from becoming ubiquitous were already well known back then, and a prominent, bright scientist like Ng, of all people, should have known them.
As innovators we all have to be stubborn optimists, but that doesn't mean we can't be realists, or that we should ignore plain, clear, fundamental challenges. I believe it is a serious disservice to the public to say things like "self-driving cars are here" when you should have known they were multiple decades away from being table stakes, or "the fundamental issues of LLMs will be solved in a few months" when you should know they will not be solved in a few months and, more importantly, that we need several more breakthroughs before we can talk about actual AI, let alone AGI.
Typos:
"doceuments"
"which is generally believed to be include"
It's just one intellectually-lazy band-aid after another...
It is not intellectually lazy to have a tool narrow its focus to data you know is good. It is in fact very clever and human-like. Nor is it lazy if the bot is forced to use actual tools to do work rather than rehash text; that is also how people do things.
Such techniques add a degree of control, interpretability, and grounding, which black-box approaches sorely need. Maybe we can do better, but nobody has shown alternatives.
There is literally nothing “lazy” about building robust RAG pipelines. Anyone who ever built one knows this.
Shhh, the fact that businesses have already found much success in implementing RAG runs counter to Gary’s whole AI-skeptic grift
They are also not disclosing the failures, unfortunately. Nobody wants to admit failure 😅
Mischaracterization. Lots of hard work reasonably going on.
Why don't you provide a solution?
Who says I’m not?
https://www.bigmother.ai/
Hi Gary, so true about RAG being the next-in-line silver bullet :) By definition, it's a way to do external/extra computation to look up, query, search for, calculate, or reason about (using human-originated knowledge bases/graphs) things that the core LLM can't calculate by itself. So it's a useful technique for sure, but it isn't a universal solution for intelligent behavior.
No, RAG is not a universal solution. It cannot be. We are still talking about mindless text manipulation, so RAG is giving that some constraints. That is useful, but not enough by any means.
Looking for silver bullets is a fool's errand.
All we need are a few more breakthroughs in bionic technology, and we'll be able to have fully functioning wings grafted to our bodies.
We'll be able to take off from the ground, and fly anywhere we want! So cool!
Until then we are like those who thought going over Niagara Falls in a barrel (or using RAG LLMs to solve every critical business problem under the sun) was a splendid idea…
Say what you want about Nassim Taleb, he understands that business is not an algorithmic endeavor. It's empirical, contingent, holistic. It even contains an element of mystery. Some humans have more difficulty with the concept of mystery than others, but there's general consensus among us humans that the Unforeseen exists, as a potential. But I challenge anyone to try to get a computer to comprehend the concept of the Unforeseen.
AI has no empirical logic, because empiricism requires Experience. Experience presupposes the existence of a subject, possessing an experience-having capacity. Experience is required for evaluation. AI doesn't do evaluation. It can borrow someone else's evaluation under some circumstances, but that isn't the same thing. AI is so amorphous an entity that in some sense it's misleading to even refer to AI as an "it."
What AI consists of instead is a deductive logic capability that resides entirely in the idealized realm of programming input, which is combined with a massive calculation capability that processes a continually accumulating amount of data in order to yield output. That's awesome, but it's also entirely insufficient to fulfill the vaunted ambitions of AI programmers.
"HardFork" podcast (okay, but entirely too credulous of Big Tech claims IMO) had on CEO of Perplexity recently, which to me sounded like RAG. They called it an "answer engine" or some such, but it is pretty similar. LLM working with a function to find and summarize outside source materials. Regardless, it still hallucinates. He partially blamed that on the index (the webcrawler it works with) not updating fast enough. CEO also said there were still "hard problems" to solve, (laughably) tried to tell the podcast hosts, "Don't worry, your job is safe (despite your publisher getting no revenue from answers derived from your work)." https://podbay.fm/p/sway/e/1708077603.
While I absolutely agree that current LLMs are quite a bit away from AGI, and that it is not assured they will eventually lead to AGI, I do differ in my view on the production viability of RAG.
Sure, it does take some effort and won't be able to answer every question in every situation.
But in big companies you usually have a lot of back-office workers handling lots of questions and reading large documents while looking for the one relevant paragraph.
In my vision you would use the LLM only to encode the original documents into a vector database, and afterwards to match the query to fitting vectors. You would not use it to then generate an answer based on the vectors it found, and you would not use it to search far and wide; instead, you would let the user sharpen the use case and therefore the space of possibly relevant vectors.
With the relevant passages returned to the employee, they save a lot of search time and still get to make an informed decision.
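A minimal sketch of the retrieval-only workflow described above, assuming a sentence-transformers embedding model stands in for "the LLM" and an in-memory matrix stands in for the vector database; the model name and example paragraphs are placeholders, not a recommendation:

```python
# Retrieval-only sketch: embed paragraphs once, embed the query, and return
# the best-matching paragraphs to the human reviewer. No generation step.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

paragraphs = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "Warranty claims require the original proof of purchase.",
    "Shipping to EU countries takes 3-5 business days.",
]
doc_vecs = model.encode(paragraphs, normalize_embeddings=True)  # shape (n, d)

def top_k(query: str, k: int = 2):
    """Return the k paragraphs most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), paragraphs[i]) for i in best]

for score, text in top_k("How long do I have to wait for my refund?"):
    print(f"{score:.3f}  {text}")
```

The human still reads the returned passages and makes the call, which sidesteps the hallucination problem the rest of this thread is about.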
People already use LLMs to search for information. A non-trivial question is whether LLMs are a cost-effective solution to such a problem. What if you took a tiny fraction of the hundreds of billions of dollars spent on research, training and inference (plus manually checking for hallucinations) and invested that into solving the specific issue you have (e.g. by hiring someone to produce better documentation or by setting up a better information retrieval system)?
Somehow people imagine that the payoff from investment in LLMs will be endless and ultimately free, while the payoff from solving specific problems LLMs are supposed to solve is just a short-term waste of money. I think in most cases the situation is the exact opposite. It's just that convincing your company to invest into a hyped-up piece of AI is much easier than convincing them to invest into something concrete.
Besides, what you specifically describe does not sound like an LLM, but rather like a completely different AI system that would use some of the same components.
For it to be an interesting business case, we (as a company looking into possible applications of this technology) can ignore the R&D costs and only have to evaluate the cost of running those LLMs and the infrastructure mentioned above; those costs are far lower and actually decently cheap.
In my mind LLMs are quite good at understanding language and intent, basically an evolution of classic chatbot intent recognition, and definitely better than the ones we have built ourselves. Therefore we can use them to understand questions and match those with our own data. But as Marcus has mentioned, there are several problems with LLMs, so you have to think of them not as the complete and perfect solution but rather as a tool to use at very specific points in a business process.
Like I said, it's not AGI, but that doesn't mean it's not useful.
sure, I can imagine some value in this more scoped use case
I worked in several huge multinationals and I could always find what I needed to know via graph based search engines. So it’s not clear to me what hit and miss RAG LLMs bring to the table…
Aside from the fact that AI people can't write a paper without committing the Anthropomorphic Fallacy,* this puts the kibosh on LLMs for legitimate uses and as a way forward.
"Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. In this paper, we formalize the problem and show that 𝗶𝘁 𝗶𝘀 𝗶𝗺𝗽𝗼𝘀𝘀𝗶𝗯𝗹𝗲 𝘁𝗼 𝗲𝗹𝗶𝗺𝗶𝗻𝗮𝘁𝗲 𝗵𝗮𝗹𝗹𝘂𝗰𝗶𝗻𝗮𝘁𝗶𝗼𝗻 𝗶𝗻 𝗟𝗟𝗠𝘀. Specifically, we define a formal world where hallucination is defined as inconsistencies between a computable LLM and a computable ground truth function. " [emphasis added]
Xu, Ziwei, Sanjay Jain, and Mohan Kankanhalli. "Hallucination is inevitable: An innate limitation of large language models." arXiv preprint arXiv:2401.11817 (2024).
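Roughly, the formal setup quoted above boils down to something like the following; this is a paraphrase of the abstract, not the paper's exact notation:

```latex
% Paraphrase of the quoted formalization; see the paper for the exact statement.
% The ground truth is a computable function f over strings, an LLM is a
% computable function h, and "hallucination" on an input s means disagreement:
\[
  h \text{ hallucinates on } s \quad\Longleftrightarrow\quad h(s) \neq f(s).
\]
% The paper's claim, loosely: no computable h can agree with every computable
% ground truth f on all inputs, so some hallucination is unavoidable.
```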
It's almost like in any reasonably useful and complicated mathematical system there will always be both true and false statements that cannot be proved within that system. Maybe someone should formalize that insight and write it up for 𝘔𝘰𝘯𝘢𝘵𝘴𝘩𝘦𝘧𝘵𝘦 𝘧ü𝘳 𝘔𝘢𝘵𝘩𝘦𝘮𝘢𝘵𝘪𝘬 𝘶𝘯𝘥 𝘗𝘩𝘺𝘴𝘪𝘬? Bet it would get a lot of attention.
* attributing human emotions and characteristics to inanimate objects and aspects of nature, such as plants, animals, or the weather.
(and an 'ACK' to Prof. Marcus for bringing the paper to my attention)
@garymarcus and @A Thornton: for a less formal but more mechanistic explanation of why training a large transformer-based language model on a large training set will yield hallucinations (but with a nuanced take on what those hallucinations may be), please take a look at the paper by Bernardo A. Huberman and me on SSRN: https://dx.doi.org/10.2139/ssrn.4676180
I have a theory... OpenAI attributed its recent meltdown to an update intended to "optimize" the user experience. Since RAG is the process of optimizing the output of a large language model, one has to wonder if it is production-worthy. Now, why OpenAI didn't run a suite of automated regression tests before shipping the new release is a mystery. And why Google didn't regression-test Gemini on its racial-bias updates is another mystery... but I will table those questions.
[Regression testing is defined as software testing to ensure that a recent code change has not adversely impacted existing functionality.]
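A toy illustration of the idea, with a made-up slugify() function standing in for whatever behaviour a release is supposed to preserve:

```python
# Toy regression test: pin down today's expected behaviour so a future
# "optimization" cannot silently change it. slugify() is a hypothetical
# stand-in for whatever function the next release touches.
import re

def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_slugify_still_behaves_as_before():
    # Outputs captured from the previous release; the test fails if an
    # update changes them.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  RAG & LLMs 2024 ") == "rag-llms-2024"
```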
Separately, I haven't had this much fun with a new technology. The stuff that came out of the Valley for decades was robust, reliable, well-engineered... now I look at semiconductor industry updates from my previous life and everything works. It's boring !@!!
Ah, for the days in which a one-in-a-billion mistake was considered a major scandal, and the thought of not being able to multiply a pair of six-digit integers would have been unthinkable.
Purnima, I am pretty sure there was a mountain of regression tests that the engineers prepared. But I think it is impossible to catch everything. The next release will just contain an equally huge pile of band-aid fixes, until another embarrassing puncture is publicised.
Exactly!
Using an LLM as a natural language interface to a traditional search engine (and a summariser of the returned documents) is just not very interesting conceptually.
Current-generation LLMs hallucinate (while humans tend to say "no idea"), which makes them less useful for business tasks, but as scientific/engineering objects they are far more exciting. It's early days, and it will take decades to figure out what's going on. It looks like we will drop the GPT architecture perhaps next year or the year after.
Nor is it actually very successful. I tried using Gemini for the same exercise as Gary has included here but it mostly would not produce any summary CV.
Even when I asked it to produce a summary of a copy of my main UoD CV (having removed the top few lines which identified me) it produced a very uninteresting mishmash.
Remember that RAG is only a way to add more context for the LLM's transformer to work with. It is essentially prompt engineering on steroids.
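A minimal sketch of that point: the retrieved passages are simply pasted into the prompt, and the model itself is unchanged. retrieve() and llm_generate() are hypothetical placeholders for whatever search backend and LLM API a given system uses.

```python
# RAG as "prompt engineering on steroids": retrieval only changes what text
# ends up in the context window; nothing about the model changes.

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder: return the k passages a search/vector index deems relevant."""
    raise NotImplementedError

def llm_generate(prompt: str) -> str:
    """Placeholder: call whatever LLM endpoint is in use."""
    raise NotImplementedError

def rag_answer(query: str) -> str:
    passages = retrieve(query)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the sources below. If they do not contain the "
        "answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    # Nothing forces the model to stick to the sources; it can still ignore
    # them and hallucinate, which is the failure mode discussed in this thread.
    return llm_generate(prompt)
```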
To your point (and Gary Marcus's article summarising your experience and those of others), I managed on my first attempt to get ChatGPT using RAG to base its answer on an unreliable source (a far right conspiracy website) AND to hallucinate the content, making garbage up out of the already unreliable representation. (For clarification, I did not point ChatGPT to any source, it found it all by its lonesome).
Obviously performance at any one point in time is pertinent - for a business, what it can do for us today defines its present usefulness - but these arguments are always subject to the "wait 'til next release" or "you did it wrong - here's a better prompt" response. Also, it's easy to make ChatGPT-4 get something very simple wrong every time ("which has more rainfall, Edinburgh or Amsterdam?"), and it will often get that wrong even with RAG switched on. As I said, for me, LLMs are interesting and cool; summarising traditional search engine results, not so much.
(Thinking in particular about the mamba architecture).
Matt Taibbi just wrote a Substack post about what Gemini did when he inquired about some of his own writing:
"...With each successive answer, Gemini didn’t “learn,” but instead began mixing up the fictional factoids from previous results and upping the ante, adding accusations of racism or bigotry. “The Great California Water Heist” turned into “The Great California Water Purge: How Nestle Bottled Its Way to a Billion-Dollar Empire—and Lied About It.” The “article” apparently featured this passage:
~~Look, if Nestle wants to avoid future public-relations problems, it should probably start by hiring executives whose noses aren’t shaped like giant penises.~~
I wouldn’t call that a good impersonation of my writing style, but it’s close enough that some would be fooled, which seems to be the idea.
An amazing follow-up passage explained that 'some raised concerns that the comment could be interpreted as antisemitic, as negative stereotypes about Jewish people have historically included references to large noses.'
I stared at the image, amazed. Google’s AI created both scandal and outraged reaction, a fully faked news cycle: https://www.racket.news/p/i-wrote-what-googles-ai-powered-libel ..."
More at the jump. Go on, take that leap.
hard to take seriously a criticism of RAG by someone running some prompts through Bing Chat - that is a black box. You have to know the details of the RAG to draw conclusions
We can infer from public facts. Copilot is based on a snapshot of GPT-4 running in Azure, "frozen" well before I set up Zingrevenue, my startup, late last year (as per the article). So the links and "facts" that Copilot supplied in my interactions with it were obviously supplemented by Bing, as they were recent. Hence we can attribute the wayward links and broken detail to RAG hallucinations. And Copilot is one of the world's best RAG LLMs, due to the simple fact that MSFT's (and Satya Nadella's) reputation is riding on its performance and accuracy; the resources Redmond must be devoting to keeping it impressive must be very, very significant.
Also, the wording of your question suggests that I may not be in a position to understand how RAG systems work. While I am unable to supply commercial screenshots of my Falcon 40b LLM instance running with multiple GPUs on my GCP GKE cluster, nor the source code of my TF2 containers in my private VPC, nor screenshots of all the ETL pipelines I have been building over a decade (down to the level of checking the hex code of binary data payloads) to demonstrate that I know a little about the RAG data needed to keep LLMs in check, nor the horribly complex decision trees spanning nearly a couple of decades, I beg to differ.
So, although it is true that I don't have access to Bing's source code (so yes, you are right, it is a black box), Bing's misdirected output is sufficient for me to cast my verdict.
What I can show you though is an easy peasy way to prepare an unprivileged OCI container that builds the latest copy of Tensorflow 2, the basis for my RAG work, from scratch, if that’s something you fancy 😉
https://www.linkedin.com/pulse/building-tensorflow-ai-from-source-container-simon-au-yong-joo9c?trk=public_post
Last year I attended a "hackathon" during which this became a significant issue. The machine literally made up "current" statistical data about the disparities among the different housing populations of the Hudson Valley. When asked to link to its sources, it provided links that did not actually exist. Just imagine what kind of misinformation damage a faulty RAG system can do. This concern is beyond a dollar amount, and I hope to see it resolved in a way that benevolently furthers the technology.
Gary, what do you mean by "neurosymbolic AI"?
subject of a future essay, or see my 2020 arXiv paper, The Next Decade in AI
The big picture in your essay is right. We need world models. But the solution that you seem to outline there, of some kind of calculus of symbols and concepts, is unlikely to be effective.
I think we will need a merger of the current paradigm with some kind of flexible probabilistic reasoning system (and by "reasoning" I don't mean poetic speculation about whether that's what a given inscrutable pile of linear algebra is "really" doing, I mean really strong built-in priors for actual symbolic manipulation). Arguably this was the architecture of AlphaGo and in my humble opinion that was a much more successful project than the current LLMs - besting the world's most expert humans in a highly technical domain and inventing whole new strategies that humans hadn't even thought of after playing the game for ~3000 years or so. You kind of *need* the creative capacity to go out on a limb, try something new that *seems* like it could be right (hallucinate, if you will), but if that's all you have then you're in a muddle. You also need symbolic reasoning to verify that the wild ideas are worth pursuing, but on the other hand if *that's* all you have then you're too rigid to succeed in challenging new circumstances. If symbols on their own were enough, then strong AI would have been cracked in the 60s or 70s with all the Lisp people.
Why are humans so successful, a quantum leap above all other animals in our command of our environment? Mammalian instincts certainly help, but symbols are what sets us apart. How can we build bridges, satellites, smartphones, communication networks? Symbols. Imagine trying to do engineering, physics, computer science without symbols. Imagine Schrodinger and Heisenberg and Einstein without symbols.
I'm inclined to think that there's more to the animal basis of human intelligence than "mammalian instincts." It's also about localized embodiment- the possession of an input and processing network including the totality of the nervous system, from the frontal and temporal lobes of the cerebrum through the cerebellum and medulla, the spinal cord, the nerves controlling body functions and musculature, on out to the ends of the peripheral nervous system.
The requirements of the biological body are what ground and orient the functional utility of human intelligence. To resort to an imprecise metaphor, that nexus resembles a comparator function in an electronic device, which requires a carrier wave. A radio tuner without a carrier wave has no processing stability. It's unmoored.
I think that something akin to that phenomenon accounts for AI gobbledygook. The problem there is that AI is never going to generate that locus on its own, any more than an engine- even the most powerful and precisely manufactured engine- is going to generate the transmission, drive train, wheels, etc. of a vehicle to surround it. Unlike the case with humans, who developed our cerebral capacity at the end of a very long chain of events that began with the dire necessity to possess an organismic body in order to sustain living animate existence. That requirement has informed every development in animal>>human neural processing that's taken place since then.
(I could go off into metaphysical speculations about humans developing sufficient complexity of conscious functioning that we may have a latent capability to access a higher level of intelligence extending beyond the body. But if that's the case- or a possibility- we've nonetheless required eons of organismic embodiment in order for our neural networks (not an anthropomorphic metaphor) to reach that level of sophistication. As a series of booster stages, so to speak. But, I'm only speculating...my own experiments in lucid dreaming could be explained all sorts of ways, notwithstanding the fact that I've had some success at that project.)
What that reality implies is that it isn't nearly sufficient for AI researchers to develop (some) temporal and frontal lobe functions in order to achieve artificial consciousness. Attaching those functions to the motor command modules of robots isn't sufficient, either. There's still no embodiment comparator function present. As far as what the AI processor is doing, it's still unmoored from the baseline of perception and cognition found in terrestrial organisms. That sort of limitation is a requirement in order to achieve guaranteed "alignment" with human intelligence for the purpose of harm avoidance to the lifeforms of the planet, I think. I'm having a difficult time imagining how that mooring could be accomplished. The bottom line of AI is that it's nothing more than an electric circuit. There's no gravity, no internal sense of identity or chronological time, no proprioception, no mammalian/primate/human bandwidth tropisms or limitations. Computers can store a zillion images without a visual sense, and a zillion samples of music and speech without an auditory sense. It can mimic some aspects of tactile perception, but it feels neither pain nor pleasure.
Perhaps it would be instructive to consider and catalog CPU capabilities not only in terms of what functions of the human brain they don't presently possess- but that which they're logically foreclosed from ever possessing, as autonomous features. There's a crucial difference between programming a robot to perform functions associated with human neural processing, and programming it to have the conscious experience of carrying out the tasks.
There's an article in the Washington Post today about a human-humanoid robot "social encounter." The article is charming and whimsical; the robots sound quite endearing in some respects. But at the foundation, the achievement is still stage magic. The robots are merely sophisticated ventriloquist dummies, with the ventriloquist(s) being the human programmer(s).
AlphaGo's success was due to playing against itself and learning the ropes. That does offer a valuable lesson. LLMs are best used as an initial guess. A reliable system should be able to start with that, actually work through a problem, and figure out strategies along the way.
But it is much harder to implement this lesson in general, as compared to doing it in a constrained environment with fixed rules.
You're ignoring the fact that AlphaGo has a tree search built in. The policy network's instantaneous best guess would not beat the best human players; the tree search is required to reason about the consequences of that guess. Arguably this part is symbolic reasoning. The policy and value networks (subsequently fused into one in AlphaZero) were indeed learned and were necessary to the success. I don't dispute that. But what I'm saying is that AlphaGo wouldn't have succeeded without the plain old tree search also being built in; the success came because the learned parts provided very good heuristics for pruning the search space.
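For readers who haven't seen how those pieces fit together, here is a stripped-down sketch of that hybrid pattern: a plain tree search whose expansion and evaluation are steered by learned networks. This is not DeepMind's code; game, policy_net and value_net are hypothetical placeholders (a policy returning (action, prior) pairs and a scalar value estimate), and terminal-state handling and two-player sign flipping are omitted for brevity.

```python
# PUCT-style Monte Carlo tree search guided by learned policy/value functions.
import math

class Node:
    def __init__(self, prior: float):
        self.prior = prior          # P(s, a) supplied by the policy network
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}          # action -> Node

    def value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """PUCT rule: exploit high value, explore high-prior / low-visit actions."""
    total = math.sqrt(sum(ch.visits for ch in node.children.values()) + 1)
    return max(
        node.children.items(),
        key=lambda kv: kv[1].value()
        + c_puct * kv[1].prior * total / (1 + kv[1].visits),
    )

def mcts(root_state, game, policy_net, value_net, n_simulations: int = 200):
    root = Node(prior=1.0)
    for _ in range(n_simulations):
        node, state, path = root, root_state, [root]
        # 1. Selection: walk down the tree guided by priors and value estimates.
        while node.children:
            action, node = select_child(node)
            state = game.apply(state, action)
            path.append(node)
        # 2. Expansion: the policy network proposes (and effectively prunes) moves.
        for action, prior in policy_net(state):
            node.children[action] = Node(prior)
        # 3. Evaluation: the value network replaces random rollouts.
        leaf_value = value_net(state)
        # 4. Backup: propagate the estimate along the visited path.
        for n in path:
            n.visits += 1
            n.value_sum += leaf_value
    # Act by visit count, the usual AlphaGo-style move selection.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

The learned parts make the search tractable; the search makes the learned guesses accountable to consequences, which is the division of labour being argued for here.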
We are smart not because of symbols, but because we can go beyond symbols. We actually know the true nature of things at a very fine level.
I don't think I dispute that, but I think we may be focused on different aspects of intelligence. You may be focused on "fast" thinking while I'm focused on "slow" thinking. Fast thinking is instinct and gut feeling, the ability to rapidly identify objects in a field of view or jump out of the way of an oncoming bus. Most of the animal kingdom has it. I would argue that most existing DL architectures do "fast" thinking. Indeed, in terms of computational complexity, they literally embody fixed-cost computations: there is no variable-depth search or recursion, just a forward pass through a static set of components. But much of what humans do when they do science is "slow" thinking: computations that are variable in length, involve fits and starts with feedback and refinement, and even involve other minds working collectively through peer review. And symbols are always involved in that process; they are a key mechanism for scientific knowledge to propagate itself from person to person and generation to generation. I don't think you can do a lot of what we do without a deep development of this kind of slow, social, symbol-assisted cognition. Otherwise chimpanzees would be building bridges and airplanes and cellphones by now.