The lack of true reasoning ability (reasoning as humans do), by EVs or LLMs or any other flavor of AI, is AI’s ultimate limitation. But instead of recognizing this failure, the tech world is now trying to redefine reasoning, claiming that the pattern matching in LLMs is proof of AI’s/LLMs’ ability to reason. (As additional “proof,” LLMs ‘hallucinate’ just like people do. NOT!) And while LLMs are, at times, accomplishing some remarkable feats, these feats have been latched onto as proof that they reason. (One LLM did an almost remarkable job of reading and summarizing my book in 90 seconds; it only left out the one core idea that takes up more pages than anything else in the book, because it was something it had never experienced or read about before. Is that a bit like a Tesla not seeing a parked jet?) This “reasoning” is at best a new, infantile type of ‘computer reasoning’ that is nothing like how humans reason. Human reasoning takes in the whole of a situation, in context, relying on our full senses and faculties, integrating all of this diverse information and processing it so as to give meaning to what we see/read/feel/experience and have experienced, and then making an intelligent decision about how to act.
I feel the same way about "hallucination" being used to describe false, invented claims of fact made by LLMs.
The problem is that all claims of fact made by LLMs are invented. Often these claims end up being true. Sometimes not. But there's no meaningful difference in what the LLM is doing under the hood when it makes false claims rather than true claims.
When a human being hallucinates, we imagine there is something interfering with their ordinary ability to perceive reality. When an LLM "hallucinates", it's doing the exact same thing it does when it tells the truth.
Which is why LLMs are dangerous: they cannot truly reason to tell the true from the false or the safe from the dangerous.
LLMs really aren’t good spatial reasoning engines because that’s not how they were trained. A team at MIT is working on AI with “Liquid Neurons” that can focus its attention like a driver does, which involves making changes to its neural pathways as the sensing data changes, hence the name “Liquid”. It’s much more effective at navigating roads without using maps, which otherwise have to be accurate to the last centimetre. This is a case where linguistic ability isn’t the goal, but planning, focus, attention and actions are. Google is working on tokenising action sequences for its RT-2 robot. I suspect progress in automated driving will result from combining new architectures like these, not from more brute force efforts with LLMs.
We agree that brute-force LLMs will not solve the driving challenge, nor will they lead to a functional AGI by themselves. And Liquid Neurons are an intriguing possibility for automated driving, but their ability to do anything close to the type, accuracy, flexibility and adaptability of reasoning required for a self-driving car is unknown.
Yes, that's true. The test video that I saw showed success in navigation and control of vehicle dynamics but not traffic management or complex reasoning as you describe. An ensemble of agents will be required to complete the driving task and so too, to reach AGI. (Like our brains have areas for language - speech and comprehension; vision; autonomic regulation; emotion; cognition; memory; reasoning; planning and self-consciousness etc.) LLMs represent only a fraction of these capabilities, even though they do a good job at their limited task.
Well stated. But what is missing in all current approaches (LLMs, ML, DL, ANNs, LN – Liquid Neurons…) is how the information/data that this ensemble of agents collects, as you defined it, gets fully integrated in a manner that some form of intelligent tech can process to find meaning in the data and then instantly act on it, not simply to make a decision but to make the best decision possible (and defining ‘best decision’ adds even more complexity to the activity). The challenge is even more complex than it appears on the surface, because the impact/value of each data point from each agent, on all the other data points and, more importantly, on the whole of the situation, and hence on the tech’s decisioning, changes in each and every situation, which changes what the best decision is. With a self-driving car, the impact of each agent’s data point changes in every fraction of a second, so the decision must instantly be reevaluated, changed and executed, properly integrating the new and constantly changing situation. This is the level of reasoning required, and it requires looking at the whole of the situation to be processed as the starting point, the very first step, which is not the typical approach of the tech world.
Yes, I believe that’s the nature of the challenge at hand. The central organising agent needs to operate with undetectable lag (under 10ms) between perception and decision while processing a constant stream of data from all the perceptual agents. Bandwidth becomes an issue. Today’s 5G networks are too slow (20~25ms one way) for this to be exclusively a cloud-based system, so such a vehicle would need several extremely powerful GPUs to handle the load locally, which no production vehicles currently have. You can imagine the system like a hub-and-spoke layout. Some labs are working on the “central workspace” decision-making agent that can rapidly switch focus priority between the various perceptual inputs, while others are looking at reducing the bandwidth of the data stream from the perceptual agents by training them to focus only on data relevant to the driving task. (This was the breakthrough achieved by the Liquid Neurons team at MIT.) So, for example, a camera sees leaves on a tree blowing in the wind and decides that this data can be ignored. Later it sees the branch of a tree lying across the road and it sends that to the central workspace because it’s relevant to the driving task. Contextual relevance is also critical. Recently in San Francisco an autonomous vehicle became bogged in wet concrete. To avoid this, the vision system would need to be able to read and understand any warning signs or “caution” tape or barriers before reaching the wet concrete. It would then need to see and understand the physical difference between wet and dry concrete and adjust its route appropriately. Then the vehicle would alert the passengers to the need for a detour. So much needs to be done.
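To make the hub-and-spoke / central-workspace idea above a bit more concrete, here is a minimal, purely illustrative Python sketch (every name in it is hypothetical, not taken from any real stack): perceptual agents tag their own observations for driving relevance, and a central workspace keeps only the relevant ones and checks that its perceive-to-decide step stays inside a 10 ms budget.

```python
import time
from dataclasses import dataclass

@dataclass
class Observation:
    source: str      # which perceptual agent produced it (camera, lidar, ...)
    label: str       # what the agent thinks it saw
    relevant: bool   # the agent's own judgment: does this matter for driving?

LATENCY_BUDGET_S = 0.010  # target: under 10 ms from perception to decision

def central_workspace(observations):
    """Drop irrelevant data, pick an action, and report whether we met the budget."""
    start = time.perf_counter()
    relevant = [o for o in observations if o.relevant]
    # Toy prioritisation: anything reported as being on the road wins.
    decision = "brake" if any("road" in o.label for o in relevant) else "continue"
    within_budget = (time.perf_counter() - start) <= LATENCY_BUDGET_S
    return decision, within_budget

# The leaves-vs-branch example from the comment above:
observations = [
    Observation("camera", "leaves blowing in the wind", relevant=False),
    Observation("camera", "tree branch lying across the road", relevant=True),
]
print(central_workspace(observations))  # ('brake', True) on any modern machine
```

The point of the sketch is only the division of labour the comment describes: relevance filtering at the edge, prioritisation in one central place, and an explicit latency budget.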
What bugs me about LLM "hallucinations" being compared to humans is - humans have flawed memories. LLMs don't. LLMs are frozen when they aren't being trained.
So if an LLM contains "reasoning" in its algorithms, then it should be able to either apply the correct knowledge every time, or miss it every time.
Since it's not doing that, I don't see how one can make a claim that it's reasoning as opposed to simply applying a very complex heuristic.
Allow me to clarify. The word hallucinations is not my term. Hallucinations is the tech world’s term, an attempt to humanize LLMs, with the goal of glossing over LLMs’ failures while trying to motivate people to still trust an LLM when it fails. The developers of LLMs are claiming LLMs reason, but they do not reason in any manner like human reasoning, as they have no ability to understand human issues, engage in meaning-making, or project and understand the potential impacts of a wrong answer. What LLMs are doing is hybrid logic/pattern matching and word-probability projection, along with black-box stuff such that the developers of each LLM cannot explain how their own systems work. More importantly and more accurately, LLMs do not hallucinate; they outright lie and make stuff up, sometimes very dangerous stuff, when they don’t have a clue how to respond to a query or request – as Gary has documented and written about extensively. A human can reason about a question or situation and recognize that they do not know, and they can recognize that a wrong answer could be dangerous or deadly, and they can go on to admit that they don’t know. LLMs can’t and don’t do that.
I think that the question is not how advanced AI (AGI) systems will do “reasoning” but whether they will generate valuable results, whether they will make good “decisions”. If we accept that future AI systems will be self-learning, self-optimizing and autonomous, even their creator-designers will not know exactly how they will “think”. Humans are capable of assessing data to some extent and of multi-threaded, multi-criteria reasoning to some extent, but only to some extent. I am afraid that in purely rational processing of very complex technical or organizational problems (data collection, analysis, optimization, extrapolation, design or decision making), AI systems could ultimately, in the long term, be better than humans.
Roman, you offer some interesting perspectives. I agree that in “processing purely rationally very complex technical” problems (the key word here is technical), AI systems will in the long term be better than humans. But when you add “organizational problems”, that is an entirely different challenge. Technical problems can be complicated, but usually a logical or rational solution can be found. Organizational problems, which can include organic complex adaptive systems (humans and human activities, human interactions and engagement), are complex problems, and these are fundamentally different challenges. Organizational systems require an entirely different kind and level of reasoning, where processing power, more data, or better algorithms alone will not provide the best decisions, or even the safest or the most human-desired/accepted or “valuable” decisions. And for clarity, I am not sold on your view “that the future AI systems will be self-learning, self-optimizing and autonomous” when it applies to organizations, organisms, humans, and human activities. Unfortunately, as the tech AI world has already experienced with self-driving cars, this challenge is not a purely technical one but has major components that are constantly changing organic systems in the loop. And making this a bit more complex, self-driving cars also require a moral/ethical component, where morals, ethics and fairness are not clearly defined, agreed upon and embraced by all.
Thank you Phil for addressing some of my considerations. For me, self-learning goes directly with self-improvement and autonomy. If the system is allowed to self-learn, it will increase its efficiency. And it will need nobody to set up the training database. So self-learning is a very important threshold, because beyond this point the control over the machine will be only relative. The fundamental question arises: should we allow future AI systems to self-learn? As concerns the difference between technical and organizational tasks, I agree. I will add that the organizational ones, if they concern very large groups of people, many different materials and resources, and various and changing economic and social circumstances, like for instance the drinking-water shortage problem in some parts of Africa, are far more complex and more challenging than the technical ones. And that’s why humans will be very tempted to use advanced AI for solving them. Because for this kind of issue there are too many parameters and too much variability for us to address them correctly, even with the help of computers making science-based simulations of particular aspects. The comprehensive assessment of global (world-scale) phenomena will need some kind of AGI. Among the various criteria of solution making there need to be morals and ethics. Before we allow such a system to operate, we should be sure that “human values” are well implemented. But will this be technically possible? Who will monitor it, who will guarantee it, the governments? Are they reliable?
Roman, self-learning is a great feature in technology, but for an AI to learn accurately, it requires two things. First, there must be firm, absolute boundaries to the situation it is trying to learn, and second, there must be clear, hard-and-fast rules that cannot be broken, ever. This is why AI did so well in learning chess, then Go and other games: clear, absolute, unbreakable boundaries it cannot cross and rules it cannot break. When a form of AI such as an LLM attempts to ‘learn’ about any field or activity that does not have these two requirements, it can learn the ‘general’ rules of something like a language, which LLMs have done, but it cannot understand the meaning; hence it hallucinates. And at this time LLMs are allowed to self-learn, and they do a good job, sometimes. But they are wrong, and even deadly, at other times. Making the situation worse, AIs/LLMs don’t know or acknowledge when they don’t know what they are talking about. I won’t fly on an airplane that lands safely sometimes or even most of the time. And as the self-driving world has learned after spending billions, with thousands of the best minds thrown at it, getting the basics of self-driving was relatively easy; achieving the landing success rate of an airplane is not.
I totally agree that the amount of data in large groups/organizations is clearly far beyond humans’ ability to process. But the meaning within the data is not something an AI can find, as there are no absolute boundaries and rules for it to find and follow. And since we do not have any clear, agreed-upon, adhered-to and enforced ‘human values’ that can be programmed or turned into some form of constitution of absolute unbreakable rules, AI cannot accurately and safely do this alone. What is required is a new level and a new type of Human/AI collaboration, an AI to process and humans to give meaning, not an AGI that we allow to make decisions in complex, ever-changing situations where human life and well-being are impacted.
I live in San Francisco and drive side by side daily with these cars. You make a fantastic argument for all the flag wavers on social media (Nextdoor) who insist these are the way to go. Their argument is always that human drivers cause death and destruction. I have zero plans to get in one. I live in an extremely foggy area of the city, an area where once you crest the hill you feel as if you've entered another dimension. Doing this at night can be super stressful, as your field of sight is just a few feet in front of you. I can't even imagine how Cruise or Waymo can get a robot car through that, not to mention the fog is here to stay.
agree that could get ugly. if they have any sense they will greatly restrict driving in foggy areas.
For cars that only use cameras in the visible light spectrum that scenario is very risky. Automakers besides Tesla use a combination of visible light cameras, infra-red sensors, sonar, radar and LiDAR. Obviously these are more expensive so there’s an incentive to try to get by with cheap cameras and make up the difference in software by collecting lots of driver data. This has been the common approach by companies such as Tesla in the US. IMHO it should not be legal for automated use on public roads.
Exactly why I would never ride in one.
How incorrect!
Machines (can) have much better vision than humans: airplanes routinely land in foggy conditions in which no human could land without crashing. The same holds true on land!
Airplanes are able to land in fog and reduced visibility conditions by utilizing a runway’s instrument approach system. Cars have no comparable support from the environment. I suppose some kind of guidance system could be installed in San Francisco, but I don't think it's practical to do it everywhere there might ever be fog.
In principle, cars might have sensors that work well in fog, but from what I've seen, these are currently bulky and expensive. Maybe they will become practical at some point.
Correct—you simply cannot compare roads and streets to airplane runways. Runways are fixed infrastructure integrated into the comms and logistics ecosystem of an airport—cars are not plugged into the communications systems of the roads and streets. So no, the same does not hold true on land.
Exactly. It's not in any way apples to apples.
Cars can use LIDAR.
Planes have air traffic controllers assisting. And for the record, I have yet to see one robot car on my block or surrounding blocks... 🤔, Wonder why that is?? I don't think it's from being "incorrect"
I think the current approach to self-driving cars should be called "now I've seen everything". The hope is that by putting in millions of hours of 'play', like they do with video games, they will get all the edge cases of interest.
Human babies are not trained on billions of carefully honed examples, but rather with small numbers of experiences, often self-created. Moreover, children have an ability, unknown to current machine learning algorithms, to flexibly apply lessons from one area of learning to dramatically different areas with seeming ease.
Giant monolithic neural networks do not seem to be exhibiting the kinds of learning performance we require, even with very large numbers of layers and nodes. They still require far more data examples than humans in order to perform far less capably in general intelligence. I don't think the "now I've seen everything" approach will work for the real world. Instead we must strive to design new algorithms which can learn in the extremely parsimonious ways that humans do.
Yes. Humans generalize while deep learning specializes. Even a honeybee with less than a million neurons can generalize with ease.
My dismal analogy for how to make these less-than-perfect self-driving vehicles safer is the conversion of roads and streets to help automobiles: reconfigure the whole world to accommodate them. Curbs to prevent cars driving on to sidewalks, which exist to keep pedestrians off the streets. Stoplights exist to tell pedestrians when it's safe to cross.
Essentially, reconfigure everything to try to eliminate edge cases. I predict the best result will be more efficient parking :-)
There are earlier examples of this 'limit the domain' success. In the 1990s the recognition of handwritten addresses (in itself at the time about 70% reliable, regardless of pixel or vector approaches) became almost perfect when a smart engineer realised that the combinations of streets and cities and postal codes are very limited. So, if you had 3-4 guesses for the street and 3-4 for the city and 3-4 for the postal code, you could almost perfectly 'read' the address, because of those combinations only one actually existed. Earlier, people created other micro-worlds. Limiting degrees of freedom to make stuff work has been with us for ages. So, indeed, we might see a massive investment in infrastructure to turn free-for-all roads into a sort of track-infra. Probably only in cities and on highways.
Right. Out in the country, we still don't have sidewalks.
Well said Gerben Wierda!
Agree, but then self-driving will be more akin to old-fashioned engineering than AI.
The initiative towards self driving cars might have been a good one, but it appears we've reached a point where it can justifiably be characterised as a scam. If your remit was public safety, all those $$ and brainpower could have been put to far better use. There's no getting round it, edge cases is where it's at. In this regard, I'd still trust a one-eyed human driver lacking depth perception more than I would an AI bristling with sensors.
Edge cases there will always be. It's the reality of our beautifully chaotic world. It's the very nature of our universe. In fact, an "edge case" is really what's "normal" -- all those moments that take place 24/7, like cars driving, people walking, animals crossing, birds flying across, trees falling into, our streets and roadways and parking lots, those are all edge cases in one sense or another, and collectively they make up what we understand and experience as the real world.
Problem with the engineering mindset is, we want our data nice and clean and predictable. Sorry, that's not the real world — and thankfully so. Personally I would find a predictable, homogeneous world frightfully dull.
Good point about training cars in California sprawl versus an environment like New York City. Driving in New York is very personally interactive, involving a lot of guessing as to the other drivers’ intent, competitive merging contests, and many more encounters with erratic pedestrians, bicycles, and mopeds.
I'd add this: the unsurmountable limitation of the Physical Symbol System Hypothesis is what the SDC failure is about. Embodied biological beings (eg humans) experience the world (eg cars, roads, weather, traffic etc) 'directly'. It's that simple, that stark. In other words, if an SDC can literally FEEL the bumps on the road (for ex), we'd be on the road (pun intended) to L10M (as opposed to mere "L5") SDCs. Adding more data won't ever fix this, including driving a trillion miles in VR. Why (not)? Because real world > virtual world.
The question we should be asking is, “How quickly can we ban human drivers?” Human drivers kill other humans – a 4-year-old girl in her stroller just this week, 37 people in SF and 42,795 people across the US last year alone. Cruise and Waymo, while not yet perfect, have never killed anyone.
The ethics are incontrovertible. Humans must turn over driving to machines.
Yeah...actually, Gary, he did. The problem is one of perceived control and safety. When I still had my plane, I wouldn't fly it unless the autopilot was working. Why not? Couple reasons:
1) It's impossible for a human to fly as smoothly as a well-functioning computer. See Adrian's comments.
2) It's very easy for a human to become "task-saturated", which takes away from being able to use mental resources to solve problems as they arise. In the air, for example, I might be managing fuel flow or figuring out how to fly around a thunderstorm. [You ~never~ want to fly through one.] At first I was wary about the autopilot and thought I wouldn't be a "real" pilot if I used it. Didn't take long for me to realize the folly of that line of thought.
I see self-driving cars in a similar vein. So far, the accident rate per hour driven is looking much better for SDCs than for humans behind the wheel. All of which said, I still like driving, and driving fast! Which Adrian also covers in his list....
Indeed, we are trying to compare two rates, so denominators matter, as do sampling errors. But there is a deeper issue: a rate of 0 may not be a signal of just quantitative difference, but that there is a qualitative difference between the populations.
Let’s start by calculating the rates. It’s hard to get accurate numbers for the denominators, but let’s give it a try. EE Times (“Waymo, Cruise Dominate AV Testing”, 2023-04-13) relied on CA DMV data to report that Waymo and Cruise drove a total of 3.8 million autonomous miles in California in 2022. The SF Municipal Transportation Agency (“SF Mobility Trend Report 2018”) reported there were 5.6 million miles per day driven by humans in SF in 2016, i.e. 2,044 million miles per year. So the fatality rates are: Cruise & Waymo: 0.00/billion miles; Humans: 18.1/billion miles (37/2.044).
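For transparency, here is the arithmetic behind those two rates, using only the figures quoted above (a quick sketch, nothing more):

```python
av_miles = 3.8e6            # Waymo + Cruise autonomous miles in CA, 2022 (EE Times / CA DMV)
human_miles = 5.6e6 * 365   # SF human-driven miles: 5.6M per day -> ~2,044M per year
av_deaths, human_deaths = 0, 37

print(av_deaths / (av_miles / 1e9))        # 0.0 fatalities per billion miles
print(human_deaths / (human_miles / 1e9))  # ~18.1 fatalities per billion miles
```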
The question is – are these two rates different? One might be tempted to think that because the AV sample size is 526 times smaller, the error must be 526x larger. However, because sampling errors are inversely proportional to the square root of the sample size, the difference in sample sizes matters far less than one would naively think.
However, the key concern really is not a statistical one – it is a categorical one. Are these two populations (AVs vs human drivers) inherently different in the way which they cause fatalities? The answer is YES.
Human drivers inherently will kill other humans:
- humans are incapable of paying constant, unwavering attention to driving
- humans have a single, passive visual sensor capable of about 6° of arc
- humans (as a group) are incapable of following the rules of the road
- humans are incapable of improving their driving except by individual instruction over many hours
AVs are purpose-designed not to kill or injure humans.
- AVs have nothing but constant, unwavering attention to the task at hand
- AVs have as many as needed active and passive sensors covering 360°
- AVs do follow the rules of the road (which is why human drivers find them annoying)
- an entire fleet of AVs are capable of instantly improving their driving by being reprogrammed.
The only ethical course of action is to push ahead with the transition to AVs as quickly as possible
In my opinion, it's not going to happen until self-driving cars no longer make errors that no human would ever make -- like colliding with a firetruck. It won't be enough that they're statistically safer, even if they are. They will have to demonstrably be no worse than humans at handling situations that humans reliably get right. Until then, people won't trust them (and I wouldn't either).
Yes, consider a wounded person on a sidewalk - will a car pick that person up and get him/her to a hospital, or at least call an ambulance? It may need to consult with a current passenger first - will it? Simple cases, no simple solutions; well, my solution is not simple.
I forgot that humans don’t collide into fire trucks (along with curbs, stores, signposts, puppies, and children). I think I need to stop reading comments on these articles.
Thank you. I made a similar comment just now. Fatal errors due to absence of human common sense will HORRIFY ordinary people and kill the industry. However, if these cars make cab fares super cheap, some people will always be willing to pay.
Stunning how we must have self-driving cars at any cost. But consider improving mass transit? Not exciting enough. It's like we're trying to live a fantasy rather than solve real problems.
This article reaches the same conclusion, but takes a different tack, which I think should be added to your pile of reasons we aren’t about to get full-self-driving cars: Consumers aren’t going to tolerate a product whose mistakes are not understandable and reasonable.
Here’s the tl;dr:
"how and when they fail matters a lot…. If their mistakes mimic human errors … people are likely to be more accepting of the new technology. …But if their failures seem bizarre and unpredictable, adoption of this nascent technology will encounter serious resistance. Unfortunately, this is likely to remain the case."
agreed, though of course at some point – not soon— one could envision customer resistance being irrational (eg if driverless cars driven same miles/conditions etc led to 1/4 of the fatalities but the ones it caused were different and weird and unreasonable) ps thanks for subscribing!
I think, as a matter of consumer and legislator psychology, that if the failure modes remain unreasonable, the statistics will have to be better by at least an order of magnitude before people accept SDCs. Irrational? Perhaps. But people will want to have a mental model of how these things work and when they're likely to fail, and if no such model is evident, or it's too different from human capabilities for people to grasp, they're going to be very uncomfortable.
Great article, as usual. And more like it will be needed. The edge case problem is a painful and persistent thorn on the side of the autonomous car industry. If it weren't for politics, no existing self-driving vehicle would be allowed on public roads. To do so is criminal, in my opinion.
The edge case problem is a real show stopper in our quest to solve AGI. There is no question that a truly full self-driving car will need an AGI solution. The current machine learning paradigm is no different in principle than the rule-based systems of the last century. Deep learning systems just have more rules but the fragility remains. AGI researchers should take a lesson from the lowly honeybee. It has less than a million neurons but it can navigate and operate in extremely complex and unpredictable environments. Edge cases are not a problem for the bee. How come? It is because the bee can generalize, that is, it can reuse existing cognitive structures in new situations.
We will not crack AGI unless and until generalization is solved. Based on my study of the capabilities of insects, it is my considered opinion that a full self-driving car is achievable with less than 100 million neurons and a fully generalized brain. Deep learning will not be part of the solution, that's for sure. Regardless of the usual protestations, DL cannot generalize by design.
Edge cases could be solved by large language models reasoning through a possible scenario when confronted with novel situations. Given their increasing performance on zero-shot tasks, I would think that incorporating a fine-tuned language model into the FSD stack is a workable solution.
Terrible idea. For this to even begin to work, the car would have to generate a verbal description of the scene before it, capturing only the most relevant aspects, and all in milliseconds. We don't have anything that can do that. Then, the text the LLM was trained on would have to have contained millions, probably, of examples of drivers analyzing such situations verbally and deciding what to do. Humans don't even do that; where would such text be sourced from?
LLMS ARE NOT (M)AGI(C). If you take anything away from reading Gary, it should be that.
No, it's not. You are being silly. Here are some considerations:
1. The car does not necessarily have to react in milliseconds; that is an artificial metric. Reactions on the order of hundreds of milliseconds are fine. A small model running on custom hardware could theoretically output hundreds of tokens per second, allowing for analysis and responses on human-scale timeframes.
2. You could easily run a 3-billion-parameter model that can take an input from a classifier, identify the situation and output solutions which are fed back into the vehicle control. It doesn't have to be trained on "millions of examples" of conversations about driving; the representation of how to safely drive, and examples of edge cases, are already encoded in the models. In fact, with sufficient fine-tuning, it could easily become the most knowledgeable driver on the road.
3. Go ahead and ask ChatGPT, "If you are driving next to parked cars and you see a pedestrian step behind one out of your sight, what happens next?" and see how it responds. It answers like a prudent and cautious driver would. It already knows what to do.
4. Now go ahead and ask it, "If you are driving a car and see a section of the road that has a sign that reads "wet cement", what should you do?" and see what it says.
5. Nobody said anything about A.G.I. What I proposed was merging self-driving models, such as the vision model generated by a vehicle, with an LLM, as it adds in knowledge and safety and can handle the edge cases that vision models simply cannot. I.e. it adds an additional domain of intelligence that can be used to guide the vehicle. LLMs have varying degrees of zero-shot performance, but the purported "edge case" problem may simply exist because of the limited architecture employed.
I would like to hear what Gary thinks about this idea, as opposed to simply reading a knee-jerk reaction.
No, it is not. You could easily have a small, fine-tuned model specifically for driving tasks that is safe and reliable. It is not a chatbot, wouldn't have the same scope of inputs nor necessarily the same context window size. Only takes in a limited set of data from the classifier, not asking it to generate citations for authors.
Let's check back in five years and see where we are with the self-driving stack.
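For what it's worth, here is the kind of pipeline being argued for, as a purely hypothetical Python sketch. Nothing in it reflects a real FSD stack; `query_small_llm`, the maneuver names, and the detection strings are all made up for illustration. The shape of the idea: the vision classifier emits a compact structured description, a small fine-tuned language model returns an advisory maneuver, and a validation layer only passes on maneuvers the planner recognizes.

```python
# Hypothetical "classifier -> small fine-tuned LLM -> vehicle control" sketch.
# query_small_llm is a stand-in, not a real API.

ALLOWED_MANEUVERS = {"continue", "slow_down", "stop", "detour"}

def query_small_llm(prompt: str) -> str:
    """Placeholder for an on-board ~3B-parameter model: scene text in, one maneuver out."""
    if "pedestrian" in prompt or "wet cement" in prompt:
        return "slow_down"
    return "continue"

def advise(detections: list[str]) -> str:
    prompt = "Driving scene: " + "; ".join(detections) + ". Safest maneuver?"
    suggestion = query_small_llm(prompt).strip().lower()
    # Validation layer: never hand the planner anything outside the allowed set.
    return suggestion if suggestion in ALLOWED_MANEUVERS else "slow_down"

print(advise(["parked cars on the right", "pedestrian stepped out of sight"]))  # slow_down
```

Whether a real model could do this reliably, and within the latency and power budgets discussed elsewhere in this thread, is exactly what is in dispute.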
How about making them output first-order logic or something? Then it could be checked for soundness before making a decision.
This could also solve the "humans learn from a few examples" problem, because the AI could then link this logic, perhaps in a semantic net, to a vast amount of previously accumulated knowledge and thus be able to acquire immense amounts of new knowledge through inference rules later.
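A toy illustration of that suggestion, heavily simplified (propositional facts standing in for full first-order logic, and both the facts and the rules hand-written here rather than produced by a model): the model would emit structured claims instead of free text, and a checker would only allow actions whose premises are actually present.

```python
# Toy "emit structured facts, check against rules before acting" sketch.
facts = {("is_hazard", "wet_cement"), ("blocks_lane", "wet_cement")}

rules = [
    # (premise that must hold, action that follows from it)
    (("blocks_lane", "wet_cement"), "plan_detour"),
    (("blocks_lane", "fallen_branch"), "plan_detour"),
]

def sound_actions(facts, rules):
    """Return only the actions whose premises appear in the fact set."""
    return [action for premise, action in rules if premise in facts]

print(sound_actions(facts, rules))  # ['plan_detour']
```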
You seem to think that generating a verbal description of an image is a solved problem in machine vision. Not even close.
In one example Gary references, an SDC collided with a firetruck. What does that tell us? It failed to detect that a large, brightly colored vehicle was in its vicinity. If it can't even detect that fact, it's not going to be able to generate a sentence like "There's a firetruck in front of me" to feed to the LLM. That's the first problem: before we can produce a verbal representation, we need a visual/spatial representation of what objects are in the scene, some of which may be partially occluded by others, and we can't even build that now. I'm not up on the latest work in this area, but I understand that Geoff Hinton invented capsule networks in an attempt to make progress on this problem, and nobody is using them; I guess they don't have an efficient GPU implementation.
The second problem is that even if you can produce a textual description of the scene, you have to decide what to put in it and what to leave out. In a busy city intersection, there may be dozens of other vehicles, cyclists, and pedestrians in your field of view, not to mention construction sites, traffic cones, and other stationary objects. The model has to focus on the relevant ones, and omit or summarize the rest; otherwise it will take multiple paragraphs to describe the scene, which will take too long and swamp the LLM.
That's exacerbated by the third problem, which is that a lot of the relevant information is in the visual/spatial modality: the location and vector velocity of each object. Adding that in will further bloat the description.
Consider your example about seeing a "Wet Cement" sign. The correct action depends on more details: is the wet cement on a part of the roadway that the car is on course to drive over? Or is it, perhaps, just on the sidewalk, in which case the car should ignore it? The distinction is crucial, and a sufficiently detailed description of the scene to capture it is going to have a lot of irrelevant details also.
As for the latency requirement, okay, maybe it's a few hundred milliseconds, but remember, generation of the verbal description is only the first step; the LLM would still have to process it, and further processing would be necessary to determine how to drive the actuators in light of the LLM's analysis.
And then there's the power budget. Currently, from what I can find, ChatGPT inference takes on the order of 2Wh/query; so to do one query per second, which I'm not sure is enough, would take 7200W. That's already 100 times the estimates I was able to find of the power consumption of Tesla's current chip. A smaller model and custom silicon would bring it down, but it would have to come down two orders of magnitude just to get within the realm of possibility — and that's just for the LLM itself.
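Spelling out the power arithmetic in that last point, using the comment's own rough figures:

```python
energy_per_query_wh = 2.0        # rough public estimate for ChatGPT-class inference, as cited above
queries_per_hour = 3600          # one query per second
power_w = energy_per_query_wh * queries_per_hour  # Wh per query * queries per hour = watts
print(power_w)        # 7200 W
print(power_w / 100)  # ~72 W: the in-car compute estimate implied by the "100 times" comparison
```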
The real question to ask here is, do we get any leverage on the problem by changing modalities from visual/spatial to verbal? Notwithstanding your argument, I submit that the answer is clearly no. Any of these cases that we can represent textually, we can represent and reason about more generally and efficiently in the visual/spatial domain.
"You seem to think that generating a verbal description of an image is a solved problem in machine vision. Not even close."
No, it is pretty close to solved. Google has it running with Bard and it is pretty accurate. There are over 228 models of various sizes and capabilities available on Hugging Face to download and run locally.
Again, the idea is not to replace the current visual/spatial model with an LLM, but to add intelligence into the equation to deal with the edge cases, which is the topic Gary is addressing in his post. As I mentioned previously, you don't have to have a model the size of ChatGPT, but a fine-tuned small model, say three or seven billion parameters, that has been trained on Level 4 disengagement data. And as the cost and power requirements of compute fall over time, it will be possible to have the reasoning of a language model interacting with the other driving modalities in real time.
The issue with edge cases could conceivably be dealt with by adding the knowledge and reasoning of a language model. And the attention problem could be sorted by a model trained on disengagements so that it values image to text inputs that are highly weighted internally with warnings.
Will it be fast enough to work in real time? Probably. Maybe in five years. But the takeaway is that you can address the edge-case issue, maybe very well. Combined with the other safety benefits of self-driving, such as faster responses to hazards and the fact that the driver isn't intoxicated, this could result in 10x improvements in safety over human drivers.
What is missing in all current approaches (LLMs, ML, DL, ANNs, LN – Liquid Neurons…) is how the information/data that an ensemble of agents collects – vehicle dynamics, traffic management, road and weather conditions, known/anticipated situations/obstacles and preparation for unknown situations/obstacles, etc. – gets fully integrated in a manner that some form of intelligent tech can process to find meaning in the data and then instantly make a decision; not just any decision, but the best decision possible (and defining ‘best decision’ adds even more complexity to the activity). The challenge is even more complex than it appears on the surface, because the impact/value that each data point from each agent has on all the other data points and, more importantly, on the whole of the situation must be included in the calculations; and hence the tech’s decisioning changes in each and every situation, potentially changing what the best decision is and what action to take, or not to take. With a self-driving car, the impact of each agent’s data point can change in a fraction of a second, so the prior decision(s) must instantly be reevaluated, potentially changed, and new actions executed, properly integrating the new data in a continuously changing situation. This is the level of “reasoning” required. Achieving it requires a very different approach: starting by looking at and defining the whole of the situation the tech is to process as the very first step, along with what kinds of decisions and actions we want the tech to arrive at and execute, and whether these are even possible. This is not the typical approach of the tech world.
"The issue with edge cases could conceivably be dealt with by adding the knowledge and reasoning of a language model" except that LLMs have neither.
There is no fundamental difference between a 'hallucination' and 'correct' output of LLMs. It is all a form of hallucination; it just happens that, where the LLM has been fed enough data, it will create hallucinations that are reliable 'far from the edges'. LLMs are huge statistical inference machines that can reliably mimic things like language structure (because of how they handle language — not as what we would call language). Basically, when LLMs answer correctly they are 'hallucinating a reality'.
It is not that we cannot do very useful and powerful things with transformers and their ilk, but they are statistical machines and hence by definition not reliable in edge cases. But they do seem reliable to us in part because they produce perfect language.
Do you not think it's *a matter of time* until driverless are on balance at least as safe as drivers? If you do think so, then what does your distribution of the arrival time look like?
As Scott Burson wrote above “ it's not going to happen until self-driving cars no longer make errors that no human would ever make -- like colliding with a firetruck. It won't be enough that they're statistically safer, even if they are. They will have to demonstrably be no worse than humans at handling situations that humans reliably get right. Until then, people won't trust them (and I wouldn't either).”
I still shudder thinking about how many dead escooter riders we would have had to bury if driverless cars were ubiquitous when the scooters first appeared on streets in March 2018.
If we require 100% reliability in all circumstances from automated driving systems, they will never be allowed to operate. But what is the reliability (probability of error, of accident) of the best, most expert human driver? The question is how much better than humans these systems can ultimately be. They are not yet ready, obviously. The billions spent by car manufacturers were not enough, apparently. But hundreds of billions are now being spent all over the world on AI algorithms for all kinds of applications, and self-driving cars could quickly benefit from the general progress in this technology. With more computing power and more efficient learning algorithms, a lot of ‘edge cases’ could be incorporated into the training data set and system reliability could approach the desired level.
The lack of true reasoning ability (as humans’ reason), by EVs or LLMs or any other flavor of AI, is AI’s ultimate limitation. But, instead of recognizing this failure the tech world is now trying to redefine reasoning, claiming that the pattern matching in LLMs is proof of AI/LLMs ability to reason. (As additional proof LLMs ‘hallucinates” just like people do. NOT!) And while LLMs are, at times, accomplishing some remarkable feats, these remarkable feats have been latched on to as proof that it is reasoning. (one LLM did an almost remarkable job of reading and summarizing my book in 90 seconds, it only left out the one core important idea that takes up more pages than anything else in my book because it was something it had never experienced or read about before – is that a bit like a Tesla not seeing a parked jet?) This “reasoning” is at best what we could call a new infantile type of ‘computer reasoning’ which is nothing like how humans’ reason. Human reasoning takes in the whole of a situation, in context, relying on our full senses and faculties, integrating all this diverse information and processing so as to give meaning to what is we see/read/feel/experience and have experienced, and then make an intelligent decision of how to act.
I feel the same way about "hallucination" being used to describe false, invented claims of fact made by LLMs.
The problem is that all claims of fact made by LLMs are invented. Often these claims end up being true. Sometimes not. But there's no meaningful difference in what the LLM is doing under the hood when it makes false claims rather than true claims.
When a human being hallucinates, we imagine there is something interfering with their ordinary ability to perceive reality. When an LLM "hallucinates", it's doing the exact same thing it does when it tells to truth.
Which is why LLMs are dangerous, it cannot truly reason to tell the truth from the false or the safe from the dangerous.
LLMs really aren’t good spacial reasoning engines because that’s not how they were trained. A team at MIT is working on AI with “Liquid Neurons” that can focus its attention like a driver does, which involves making changes to its neural pathways as the sensing data changes, hence the name “Liquid”. It’s much more effective at navigating roads without using maps which otherwise have to be accurate to the last centimetre. This is a case where linguistic ability isn’t the goal, but planning, focus, attention and actions are. Google is working on tokenising action sequences for its RT-2 robot. I suspect progress in automated driving will result from combining new architectures like these, not from more brute force efforts with LLMs.
We agree that brute force LLMs will not solve the driving challenge, nor will it lead to a functional AGI by itself, And Liquid Neurons are an intriguing potential for automated driving, but their ability to do anything close to the type, accuracy, flexibility and adaptability of reasoning required for a self-driving car is unknown.
Yes, that's true. The test video that I saw showed success in navigation and control of vehicle dynamics but not traffic management or complex reasoning as you describe. An ensemble of agents will be required to complete the driving task and so too, to reach AGI. (Like our brains have areas for language - speech and comprehension; vision; autonomic regulation; emotion; cognition; memory; reasoning; planning and self-consciousness etc.) LLMs represent only a fraction of these capabilities, even though they do a good job at their limited task.
Well stated. But what is missing in all current approaches (LLMs, ML, DL, ANNs, LN – Liquid Neurons…) is how the information/data that this ensemble of agents collects, that you defined, is fully integrated in a manner that some form of intelligent tech can process to find meaning in this data, and then instantly process, not to simply make a decision – but make the best decision possible (and defining ‘best decision’ adds even more complexity to the activity). And yet, this challenge is even more complex than it appears on the surface as the impact/value of each data point, from each agent, on all other data points and more importantly on the whole of the situation, and hence the tech’s decisioning, changes in each and every situation, which changes what is the best decision. With a self-driving car, the impact of each agent’s data point changes in each and every fraction of a second, so the decision must instantly be reevaluated, changed and executed properly integrating the new, and constantly changing situations. This is the level of reasoning required, and it requires looking at the whole of the situation that is to be processed as the starting point, the very first step, which is not the typical approach of the tech world.
Yes, I believe that’s the nature of the challenge at hand. The central organising agent needs to operate with undetectable lag (under 10ms) between perception and decision while processing a constant steam of data from all the perceptual agents. Bandwidth becomes an issue. Today’s 5G networks are too slow (20~25ms one way) for this to be exclusively a cloud based system so such a vehicle would need several extremely powerful GPUs to handle the load locally which no production vehicles currently have. You can imagine the system like a hub and spoke layout. Some labs are working on the “central workspace” decision making agent that can rapidly switch focus priority between the various perceptual inputs while others are looking at reducing the bandwidth of the data stream from the perceptual agents by training them to focus only on data relevant to the driving task. (This was the breakthrough achieved by the Liquid Neurons team at MIT). So for example, a camera sees leaves on a tree blowing in the wind and decides that this data can be ignored. Later it sees the branch of a tree laying across the road and it sends that to the central workspace because it’s relevant to the driving task. Contextual relevance is also critical. Recently in SFO an autonomous vehicle became bogged in wet concrete. To avoid this the vision system would need to be able to read and understand any warning signs or “caution” tape or barriers before reaching the wet concrete. It would then need to see and understand the physical difference between wet and dry concrete and adjust its route appropriately. Then the vehicle would alert the passengers to the need for a detour. So much needs to be done.
What bugs me about LLM "hallucinations" being compared to humans is - humans have flawed memories. LLMs don't. LLMs are frozen when they aren't being trained.
So if an LLM contains "reasoning" in its algorithms, then it should be able to either apply the correct knowledge every time, or miss it every time.
Since it's not doing that, I don't see how one can make a claim that it's reasoning as opposed to simply applying a very complex heuristic.
Allow me to clarify. The word hallucinations is not my term. Hallucinations is the tech world’s term to attempt to humanize LLMs, with the goal of glossing over the LLMs failures while trying to motivate people to still trust the LLM when they fail. The developers of LLMs are claiming LLMs reason, but they do not reason in any manner like human reasoning, as they have no ability to understand human issues, engage in meaning making or project and understand the potential impacts of a wrong answer. What LLMs are doing is hybrid logic/pattern matching, word probability projections along with black box stuff that the developers of each LLM cannot explain how their own systems work. More importantly and more accurately, LLMs do not hallucinate, they outright lie and make stuff up, sometime very dangerous stuff, when they don't have a clue how to respond to a query or request – as Gary has documented and written about extensively. A human can reason on a question or situation and recognize they do not know, and they can recognize that a wrong answer can be dangerous or deadly and they can go on to admit that they don’t know. LLMs can’t and don’t do that.
I think that the question is not how advanced AI (AGI) systems will do “reasoning” but will they generate valuable results, will they make good “decisions”. If we admit that future AI systems will be self-learning, self-optimizing and autonomous, even their creators-designers will not exactly know how they will “think”. Humans are capable of assessing data in some extent and doing multi-threads and multi-criteria reasoning in some extent, but in some extent only. I am afraid that in processing purely rationally very complex technical or organizational problems (data collecting, analysis, optimization, extrapolation, design or decision making) AI systems could be ultimately, in the long-term, better than humans.
Roman, you offer some interesting perspectives. I agree that in “processing purely rationally very complex technical” the key word here is technical, AI systems will in the long-term be better than humans. But when you add “organizational problems” that is an entirely different challenge. Technical problems can be complicated but usually a logical or rational solution can be found. Organizational problems, which can include organic complex adaptive systems including humans and human activities, human interactions and engagement are complex problems, these are fundamentally different challenges. Organizational systems require an entirely different kind and level of reasoning where processing power, or more data, or better algorithms will not alone provide the best decisions, or even the safest or the most human desired/accepted or “valuable” decisions. And for clarity, I am not sold on your view “that the future AI systems will be self-learning, self-optimizing and autonomous” when it applies to organizations, organisms, humans, and human activities. Unfortunately, as the tech AI world has already experienced with self-driving cars, this challenge is not a purely technical challenge but has major components that are constantly changing organic systems in the loop. And making this a bit more complex, self-driving cars also require a moral/ethical component where morals, ethics and fairness are not clearly defined, agreed upon and embraced by all.
Thank you Phil for addressing some of my considerations. For me, self-learning goes directly with self-improving and autonomy. If the system is allowed self-learning, he will increase its efficiency. And it will need nobody to set-up the training data base. So the self-learning is a very important threshold, because beyond this point the control over the machine will be only relative. The fundamental question arises should we allow the future AI systems to self-learn? As concerns, the difference between technical and organizational tasks I agree. I will add that the organizational ones, if they concern very large groups of people, many different materials and resources, various and changing economic and social circumstances, like for instance in the case of the drinking water shortage problem in some part of Africa, are far more complex and more challenging that the technical ones. And that’s why humans will be very tempted to use advanced AI for solving them. Because for this kind of issues there are too many parameters and too much variability for us to address them correctly, even with the help of computers making science-based simulations on particular aspects. The global comprehensive assessing of global (world scale) phenomena will need some kind of AGI. Among the various criteria of solution making there need to be morals and ethics. Before we allow such a system to operate, we should be sure that “human values” are well implemented. But will be this technically possible? Who will monitor it, who will guarantee it, the governments? are they reliable?
Roman, self-learning is a great feature in technology, but for an AI to learn accurately, it requires two things. First, there must be firm absolute boundaries to the situation it is trying to learn and second, there must be clear and hard fast rules that cannot be broken, ever. This is why AI did so well in learning chess, then Go and other games, clear absolute unbreakable boundaries it cannot cross and rules it cannot break. When a form of AI, such as an LLM attempts to ‘learn’ about any field or activity that does not have these two requirements it can learn the ‘general’ rules of something like a language, which LLMs have done, but cannot understand the meaning, hence it hallucinates. And at this time LLMs are allowed to self-learn and they do a good job, sometimes. But it is wrong and even deadly at other times. Making the situation worse, AIs/LLMs don’t know or acknowledge when it doesn’t know what it is talking about. I won’t fly on an airplane that lands safely sometimes or even most of the time. And as the self-driving world has learned after spending billions and with thousands of the best minds thrown at it, getting the basics of a self-driving was relatively easy, achieving the landing success of a airplane is not.
I totally agree that the amount of data in large groups/organizations is clearly far beyond humans to process. But the meaning within the data is not something an AI can find as there are no absolute boundaries and rules for it to find and follow. And since we do not have any clear agreed upon and adhered to and enforced ‘human values’ that can be programmed or turned into some form of constitution of absolute unbreakable rules, AI cannot accurately and safely do this alone. What is required is a new level and a new type of Human/AI collaboration, an AI to process and humans to give meaning, not an AGI that we allow to make decisions in complex, ever changing situations where human life and well-being are impacted.
I live in San Francisco and drive side by side daily w these cars. You make a fantastic argument for all the flag waivers on social media (nextdoor) who insist these are the way to go. Their argument is always that human drivers cause death and destruction. I have zero plans to get in one. I live in an extremely foggy area of the city. An area that once you crest the hill you feel as if you've entered another dimension. Doing this at night can be super stressful as your field of sight is just a few feet in front of you. I can't even imagine how Cruise or Waymo can get a robot car thru that not to mention the fog is here to stay.
agree that could get ugly. if they have any sense they will greatly restrict drivijg in fogging areas .
For cars that only use cameras in the visible light spectrum that scenario is very risky. Automakers besides Tesla use a combination of visible light cameras, infra-red sensors, sonar, radar and LiDAR. Obviously these are more expensive so there’s an incentive to try to get by with cheap cameras and make up the difference in software by collecting lots of driver data. This has been the common approach by companies such as Tesla in the US. IMHO it should not be legal for automated use on public roads.
Exactly why I would never ride in one.
How incorrect!
Machines (can) have much better vision than humans: airplanes routinely land in foggy conditions no human could land without crashing. The same holds true on land!
Airplanes are able to land in fog and reduced visibility conditions by utilizing a runway’s instrument approach system. Cars have no comparable support from the environment. I suppose some kind of guidance system could be installed in San Francisco, but I don't think it's practical to do it everywhere there might ever be fog.
In principle, cars might have sensors that work well in fog, but from what I've seen, these are currently bulky and expensive. Maybe they will become practical at some point.
Correct—you simply cannot compare roads and streets to airplane runways. Runways are fixed infrastructure integrated into the comms and logistics ecosystem of an airport—cars are not plugged into the communications systems of the roads and streets. So no, the same does not hold true on land.
Exactly. It's not in anyway apples to apples.
Cars can use LIDAR.
Planes have air traffic controllers assisting. And for the record, I have yet to see one robot car on my block or surrounding blocks... 🤔, Wonder why that is?? I don't think it's from being "incorrect"
I think the current approach to self-driving cars should be called "now I've seen everything". The hope is that by putting in millions of hours of 'play', as they do with video games, they will get all the edge cases of interest.
Human babies are not trained on billions of carefully honed examples, but rather on small numbers of experiences, often self-created. Moreover, children have an ability, unknown to current machine learning algorithms, to flexibly apply lessons from one area of learning to dramatically different areas with seeming ease.
Giant monolithic neural networks do not seem to exhibit the kinds of learning performance we require, even with very large numbers of layers and nodes. They still require far more data examples than humans do, in order to perform far less capably at general intelligence. I don't think the "now I've seen everything" approach will work for the real world. Instead we must strive to design new algorithms that can learn in the extremely parsimonious ways that humans do.
Yes. Humans generalize while deep learning specializes. Even a honeybee with less than a million neurons can generalize with ease.
My dismal analogy for how to make these less-than-perfect self-driving vehicles safer is the way we converted roads and streets to help automobiles: reconfigure the whole world to accommodate them. Curbs exist to prevent cars from driving onto sidewalks, which in turn exist to keep pedestrians off the streets. Stoplights exist to tell pedestrians when it's safe to cross.
Essentially, reconfigure everything to try to eliminate edge cases. I predict the best result will be more efficient parking :-)
There are earlier examples of this 'limit the domain' success. In the 1990s, recognition of handwritten addresses (at the time about 70% reliable, regardless of pixel or vector approaches) became almost perfect when a smart engineer realised that the valid combinations of streets, cities and postal codes are very limited. So if you had 3-4 guesses for the street, 3-4 for the city and 3-4 for the postal code, you could 'read' the address almost perfectly, because among those combinations only one actually existed. Earlier still, people created other micro-worlds. Limiting degrees of freedom to make stuff work has been with us for ages. So, indeed, we might see a massive investment in infrastructure to turn free-for-all roads into a sort of track-like infrastructure, probably only in cities and on highways.
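A minimal sketch of that 'limit the domain' trick, with made-up candidate lists and a toy stand-in for the real address directory; the point is only the combinatorial filter, not any particular recognizer:

```python
from itertools import product

# Hypothetical top guesses from an imperfect handwriting recognizer
street_guesses = ["Main St", "Maine St", "Marin St"]
city_guesses = ["Springfield", "Springdale"]
zip_guesses = ["62701", "62704", "52701"]

# Tiny stand-in for the directory of addresses that actually exist
valid_addresses = {("Main St", "Springfield", "62701")}

# Keep only guess combinations that correspond to a real address
candidates = [c for c in product(street_guesses, city_guesses, zip_guesses)
              if c in valid_addresses]
print(candidates)  # [('Main St', 'Springfield', '62701')] -- only one survivor
```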
Right. Out in the country, we still don't have sidewalks.
Well said Gerben Wierda!
Agree, but then self-driving will be more akin to old-fashioned engineering than AI.
The initiative towards self driving cars might have been a good one, but it appears we've reached a point where it can justifiably be characterised as a scam. If your remit was public safety, all those $$ and brainpower could have been put to far better use. There's no getting round it, edge cases is where it's at. In this regard, I'd still trust a one-eyed human driver lacking depth perception more than I would an AI bristling with sensors.
Edge cases there will always be. It's the reality of our beautifully chaotic world. It's the very nature of our universe. In fact, an "edge case" is really what's "normal" -- all those moments that take place 24/7, like cars driving, people walking, animals crossing, birds flying across, trees falling into, our streets and roadways and parking lots, those are all edge cases in one sense or another, and collectively they make up what we understand and experience as the real world.
Problem with the engineering mindset is, we want our data nice and clean and predictable. Sorry, that's not the real world — and thankfully so. Personally I would find a predictable, homogeneous world frightfully dull.
I dove into this rabbit hole a few months ago, how curious I also naturally chose autonomous cars to discuss edge cases... https://themuse.substack.com/p/death-by-a-thousand-edge-cases
Good point about training cars in California sprawl versus an environment like New York City. Driving in New York is very personally interactive, involving a lot of guessing as to the other drivers’ intent, competitive merging contests, and many more encounters with erratic pedestrians, bicycles, and mopeds.
And I used to ride a unicycle there; I would never do that in the era of testing quasi-autonomous-cars.
You’re a much cooler dude than I had imagined.
Hi Gary, excellent post (as always - duh!)...
I'd add this: the insurmountable limitation of the Physical Symbol System Hypothesis is what the SDC failure is about. Embodied biological beings (e.g. humans) experience the world (e.g. cars, roads, weather, traffic etc.) 'directly'. It's that simple, that stark. In other words, if an SDC could literally FEEL the bumps on the road (for example), we'd be on the road (pun intended) to L10M (as opposed to mere "L5") SDCs. Adding more data won't ever fix this, including driving a trillion miles in VR. Why (not)? Because real world > virtual world.
Also, very thoughtful analyses: https://rodneybrooks.com/category/dated-predictions/
The question we should be asking is, “How quickly can we ban human drivers?” Human drivers kill other humans: a 4-year-old girl in her stroller just this week, 37 people in SF and 42,795 people across the US last year alone. Cruise and Waymo, while not yet perfect, have never killed anyone.
The ethics are incontrovertible. Humans must turn over driving to machines.
hint: you are looking at a numerator but failing to look at the denominator
Yeah... actually, Gary, he did. The problem is one of perceived control and safety. When I still had my plane, I wouldn't fly it unless the autopilot was working. Why not? A couple of reasons:
1) It's impossible for a human to fly as smoothly as a well-functioning computer. See Adrian's comments.
2) It's very easy for a human to become "task-saturated," which takes away from being able to use mental resources to solve problems as they arise. In the air, for example, I might be managing fuel flow or figuring out how to fly around a thunderstorm. [You ~never~ want to fly through one.] At first I was wary of the autopilot and thought I wouldn't be a "real" pilot if I used it. It didn't take long for me to realize the folly of that line of thought.
I see self-driving cars in a similar vein. So far, the accident rate per hour driven is looking much better for SDCs than for humans behind the wheel. All of which said, I still like driving, and driving fast! Which Adrian also covers in his list....
Indeed, we are trying to compare two rates, so denominators matter, as do sampling errors. But there is a deeper issue: a rate of 0 may not be a signal of just quantitative difference, but that there is a qualitative difference between the populations.
Let’s start by calculating the rates. It’s hard to get accurate numbers for the denominators, but let’s give it a try. EE Times (“Waymo, Cruise Dominate AV Testing”, 2023-04-13) relied on CA DMV data to report that Waymo and Cruise drove a total of 3.8 million autonomous miles in California in 2022. The SF Municipal Transportation Agency (“SF Mobility Trend Report 2018”) reported 5.6 million miles per day driven by humans in SF in 2016, i.e. about 2,044 million miles per year. So the fatality rates are: Cruise & Waymo: 0.0 per billion miles; humans: about 18.1 per billion miles (37 / 2.044).
The question is: are these two rates different? One might be tempted to think that because the AV sample is roughly 540 times smaller, the error must be 540x larger. However, because sampling errors are inversely proportional to the square root of the sample size, a ~540x smaller exposure means only a ~23x larger sampling error, so the difference in sample sizes matters far less than one would naively think.
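To make the arithmetic explicit (the mileage figures are the ones cited above; the square-root scaling is the standard behaviour for counts of rare events):

```python
import math

# Figures cited above: 2022 CA autonomous miles, and SF human miles scaled to a year
av_miles = 3.8e6          # Waymo + Cruise, per CA DMV data for 2022
human_miles = 2.044e9     # ~5.6 million miles/day * 365, SF
human_fatalities = 37
av_fatalities = 0

human_rate = human_fatalities / human_miles * 1e9   # per billion miles -> ~18.1
av_rate = av_fatalities / av_miles * 1e9             # 0.0

# Sampling error shrinks with the square root of exposure, so an exposure
# ratio of ~540x translates to only a ~23x difference in sampling error.
exposure_ratio = human_miles / av_miles
print(round(human_rate, 1), round(av_rate, 1),
      round(exposure_ratio), round(math.sqrt(exposure_ratio), 1))
```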
However, the key concern really is not a statistical one – it is a categorical one. Are these two populations (AVs vs human drivers) inherently different in the way which they cause fatalities? The answer is YES.
Human drivers inherently will kill other humans:
- humans are incapable of paying constant, unwavering attention to driving
- humans have a single pair of passive visual sensors, with only about 6° of high-acuity (foveal) arc
- humans (as a group) are incapable of following the rules of the road
- humans are incapable of improving their driving except by individual instruction over many hours
AVs are purpose-designed not to kill or injure humans.
- AVs have nothing but constant, unwavering attention to the task at hand
- AVs have as many as needed active and passive sensors covering 360°
- AVs do follow the rules of the road (which is why human drivers find them annoying)
- an entire fleet of AVs is capable of instantly improving its driving by being reprogrammed.
The only ethical course of action is to push ahead with the transition to AVs as quickly as possible.
In my opinion, it's not going to happen until self-driving cars no longer make errors that no human would ever make -- like colliding with a firetruck. It won't be enough that they're statistically safer, even if they are. They will have to demonstrably be no worse than humans at handling situations that humans reliably get right. Until then, people won't trust them (and I wouldn't either).
Yes, consider a wounded person on a sidewalk: will a car pick that person up and get him/her to a hospital, or at least call an ambulance? It may need to consult with a current passenger first; will it? Simple cases, no simple solutions. Well, my solution is not simple.
I forgot that humans don’t collide into fire trucks (along with curbs, stores, signposts, puppies, and children). I think I need to stop reading comments on these articles.
Thank you. I made a similar comment just now. Fatal errors due to absence of human common sense will HORRIFY ordinary people and kill the industry. However, if these cars make cab fares super cheap, some people will always be willing to pay.
Stunning how we must have self-driving cars at any cost. But consider improving mass transit? Not exciting enough. It's like we're trying to live a fantasy rather than solve real problems.
Not necessarily orthogonal. Depends on how the problem is described. My sense is we're not defining the most important problems effectively.
mindshare, talent, and investment $ have been sucked up, so not entirely orthogonal
Just subscribed so copying here an email I just sent to Gary:
Great article - as usual! In a case of pundit me-too-ism, below is an article that I think would interest you that I published in the WSJ in 2018.
https://www.wsj.com/articles/why-we-find-self-driving-cars-so-scary-1527784724
This article reaches the same conclusion, but takes a different tack, which I think should be added to your pile of reasons we aren’t about to get full-self-driving cars: Consumers aren’t going to tolerate a product whose mistakes are not understandable and reasonable.
Here’s the tl;dr:
"how and when they fail matters a lot…. If their mistakes mimic human errors … people are likely to be more accepting of the new technology. …But if their failures seem bizarre and unpredictable, adoption of this nascent technology will encounter serious resistance. Unfortunately, this is likely to remain the case."
agreed, though of course at some point (not soon) one could envision customer resistance being irrational (e.g. if driverless cars driven the same miles/conditions etc. led to 1/4 of the fatalities, but the ones they caused were different and weird and unreasonable) ps thanks for subscribing!
I think, as a matter of consumer and legislator psychology, that if the failure modes remain unreasonable, the statistics will have to be better by at least an order of magnitude before people accept SDCs. Irrational? Perhaps. But people will want to have a mental model of how these things work and when they're likely to fail, and if no such model is evident, or it's too different from human capabilities for people to grasp, they're going to be very uncomfortable.
Great article, as usual. And more like it will be needed. The edge case problem is a painful and persistent thorn in the side of the autonomous car industry. If it weren't for politics, no existing self-driving vehicle would be allowed on public roads. To do so is criminal, in my opinion.
The edge case problem is a real show stopper in our quest to solve AGI. There is no question that a truly full self-driving car will need an AGI solution. The current machine learning paradigm is no different in principle than the rule-based systems of the last century. Deep learning systems just have more rules but the fragility remains. AGI researchers should take a lesson from the lowly honeybee. It has less than a million neurons but it can navigate and operate in extremely complex and unpredictable environments. Edge cases are not a problem for the bee. How come? It is because the bee can generalize, that is, it can reuse existing cognitive structures in new situations.
We will not crack AGI unless and until generalization is solved. Based on my study of the capabilities of insects, it is my considered opinion that a full self-driving car is achievable with less than 100 million neurons and a fully generalized brain. Deep learning will not be part of the solution, that's for sure. Regardless of the usual protestations, DL cannot generalize by design.
Edge cases could be solved by large language models reasoning through a possible scenario when confronted with novel situations. Given their increasing performance on zero-shot tasks, I would think that incorporating a fine-tuned language model into the FSD stack is a workable solution.
Terrible idea. For this to even begin to work, the car would have to generate a verbal description of the scene before it, capturing only the most relevant aspects, and all in milliseconds. We don't have anything that can do that. Then, the text the LLM was trained on would have to have contained millions, probably, of examples of drivers analyzing such situations verbally and deciding what to do. Humans don't even do that; where would such text be sourced from?
LLMS ARE NOT (M)AGI(C). If you take anything away from reading Gary, it should be that.
"Terrible idea."
No, it's not. You are being silly. Here are some considerations:
1. The car does not necessarily have to react in milliseconds; that is an artificial metric. Reactions on the order of hundreds of milliseconds are fine. A small model running on custom hardware could theoretically output hundreds of tokens per second, allowing for analysis and responses on human-scale timeframes.
2. You could easily run a 3-billion-parameter model that takes an input from a classifier, identifies the situation and outputs solutions that are fed back into the vehicle control (see the rough sketch after this list). It doesn't have to be trained on "millions of examples" of conversations about driving; the representation of how to safely drive, and examples of edge cases, are already encoded in the models. In fact, with sufficient fine-tuning, it could easily become the most knowledgeable driver on the road.
3. Go ahead and ask ChatGPT, "If you are driving next to parked cars and you see a pedestrian step behind one out of your sight, what happens next?" and see how it responds. It answers like a prudent and cautious driver would. It already knows what to do.
4. Now go ahead and ask it, "If you are driving a car and see a section of the road that has a sign that reads 'wet cement', what should you do?" and see what it says.
5. Nobody said anything about A.G.I. What I proposed was merging self-driving models, such as the vision model generated by a vehicle, with an LLM, as it adds knowledge and safety and can handle the edge cases that vision models simply cannot. I.e., it adds an additional domain of intelligence that can be used to guide the vehicle. LLMs have varying degrees of zero-shot performance, but the purported "edge case" problem may simply exist because of the limited architecture employed.
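To be concrete about point 2, here is the rough shape of the glue I have in mind. Every name here is a hypothetical stub (there is no real model or vendor API behind it); it only illustrates the classifier-to-LLM-to-controller flow, with the LLM strictly advisory:

```python
from dataclasses import dataclass

@dataclass
class SceneReport:
    labels: list[str]     # outputs from the existing perception/classifier stack
    warnings: list[str]   # e.g. "pedestrian stepped out of sight behind parked car"

def query_driving_llm(prompt: str) -> str:
    """Stand-in for a small, fine-tuned on-board language model (hypothetical)."""
    return "SLOW to 15 mph; cover brake; expect pedestrian to emerge."

def advise(scene: SceneReport) -> str:
    # Build a compact textual description from classifier outputs, not raw pixels
    prompt = ("You are a cautious driving advisor. Situation: "
              + "; ".join(scene.labels + scene.warnings)
              + ". Recommend a safe action in one short sentence.")
    return query_driving_llm(prompt)

# The LLM only *advises*; the existing planner/controller still enforces
# hard limits (speed caps, braking envelopes) on whatever comes back.
scene = SceneReport(labels=["residential street", "parked cars on right"],
                    warnings=["pedestrian stepped out of sight behind parked car"])
print(advise(scene))
```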
I would like to hear what Gary thinks about this idea, as opposed to simply reading a knee-jerk reaction.
It’s indeed a terrible idea. Just think of the hallucinations alone. LLMs are nowhere near reliable enough to support safe driving.
No, it is not. You could easily have a small, fine-tuned model specifically for driving tasks that is safe and reliable. It is not a chatbot; it wouldn't have the same scope of inputs nor necessarily the same context window size. It only takes in a limited set of data from the classifier; nobody is asking it to generate citations for authors.
Let's check back in five years and see where we are with the self-driving stack.
How about making them output first-order logic or something? Then it could be checked for soundness before making a decision.
This could also solve the "humans learn from a few examples" problem, because the AI could then link this logic, perhaps in a semantic net, to a vast amount of previously accumulated knowledge and thus be able to acquire immense amounts of new knowledge through inference rules later.
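A toy version of that idea, with the rules hand-written rather than learned and only propositional (not full first-order logic), just to show where the soundness check would sit before any action is executed:

```python
# The LLM (or any planner) proposes an action plus the facts it believes.
proposal = {
    "action": "proceed",
    "facts": {"pedestrian_in_path": False, "signal": "red"},
}

# Hard rules, checked deterministically before anything reaches the actuators.
def violates_rules(action: str, facts: dict) -> list[str]:
    violations = []
    if action == "proceed" and facts.get("pedestrian_in_path"):
        violations.append("never proceed with a pedestrian in the path")
    if action == "proceed" and facts.get("signal") == "red":
        violations.append("never proceed against a red signal")
    return violations

problems = violates_rules(proposal["action"], proposal["facts"])
print(problems or "proposal passes the rule check")
# -> ['never proceed against a red signal']
```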
You seem to think that generating a verbal description of an image is a solved problem in machine vision. Not even close.
In one example Gary references, an SDC collided with a firetruck. What does that tell us? It failed to detect that a large, brightly colored vehicle was in its vicinity. If it can't even detect that fact, it's not going to be able to generate a sentence like "There's a firetruck in front of me" to feed to the LLM. That's the first problem: before we can produce a verbal representation, we need a visual/spatial representation of what objects are in the scene, some of which may be partially occluded by others, and we can't even build that now. I'm not up on the latest work in this area, but I understand that Geoff Hinton invented capsule networks in an attempt to make progress on this problem, and nobody is using them; I guess they don't have an efficient GPU implementation.
The second problem is that even if you can produce a textual description of the scene, you have to decide what to put in it and what to leave out. In a busy city intersection, there may be dozens of other vehicles, cyclists, and pedestrians in your field of view, not to mention construction sites, traffic cones, and other stationary objects. The model has to focus on the relevant ones, and omit or summarize the rest; otherwise it will take multiple paragraphs to describe the scene, which will take too long and swamp the LLM.
That's exacerbated by the third problem, which is that a lot of the relevant information is in the visual/spatial modality: the location and vector velocity of each object. Adding that in will further bloat the description.
Consider your example about seeing a "Wet Cement" sign. The correct action depends on more details: is the wet cement on a part of the roadway that the car is on course to drive over? Or is it, perhaps, just on the sidewalk, in which case the car should ignore it? The distinction is crucial, and a sufficiently detailed description of the scene to capture it is going to have a lot of irrelevant details also.
As for the latency requirement, okay, maybe it's a few hundred milliseconds, but remember, generation of the verbal description is only the first step; the LLM would still have to process it, and further processing would be necessary to determine how to drive the actuators in light of the LLM's analysis.
And then there's the power budget. Currently, from what I can find, ChatGPT inference takes on the order of 2Wh/query; so to do one query per second, which I'm not sure is enough, would take 7200W. That's already 100 times the estimates I was able to find of the power consumption of Tesla's current chip. A smaller model and custom silicon would bring it down, but it would have to come down two orders of magnitude just to get within the realm of possibility — and that's just for the LLM itself.
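Spelling out that arithmetic (the ~2 Wh/query figure is a rough public estimate, not a measured number):

```python
energy_per_query_wh = 2.0      # rough estimate for a ChatGPT-scale model
queries_per_second = 1.0

joules_per_query = energy_per_query_wh * 3600   # 1 Wh = 3600 J
power_watts = joules_per_query * queries_per_second
print(power_watts)  # 7200.0 W, roughly 100x the figures I found for today's in-car inference chips
```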
The real question to ask here is, do we get any leverage on the problem by changing modalities from visual/spatial to verbal? Notwithstanding your argument, I submit that the answer is clearly no. Any of these cases that we can represent textually, we can represent and reason about more generally and efficiently in the visual/spatial domain.
"You seem to think that generating a verbal description of an image is a solved problem in machine vision. Not even close."
No, it is pretty close to solved. Google has it running with Bard and it is pretty accurate. There are over 228 models of various sizes and capabilities available on Hugging Face to download and run locally.
Again, the idea is not to replace the current visual/spatial model with an LLM, but to add intelligence into the equation to deal with the edge cases, which is the topic Gary is addressing in his post. As I mentioned previously, you don't have to have a model the size of ChatGPT, but a fine-tuned small model, say three or seven billion parameters, that has been trained on Level 4 disengagement data. And as the cost and power requirements of compute fall over time, it will be possible to have the reasoning of a language model interacting with the other driving modalities in real time.
See https://arxiv.org/pdf/2307.07162.pdf
The issue with edge cases could conceivably be dealt with by adding the knowledge and reasoning of a language model. And the attention problem could be sorted by a model trained on disengagements, so that it gives priority to image-to-text inputs that arrive internally flagged with warnings.
Will it be fast enough to work in real time? Probably. Maybe in five years. But the takeaway is that you can address the edge case issue, maybe very well. Combined with the other safety benefits of self-driving, such as faster responses to hazards and the fact that the driver isn't intoxicated, this could result in 10x improvements in safety over human drivers.
What is missing in all current approaches (LLMs, ML, DL, ANNs, LN – Liquid Neurons…) is how the information/data that an ensemble of agents collects (vehicle dynamics, traffic management, road and weather conditions, known/anticipated situations/obstacles, and preparation for unknown situations/obstacles, etc.) gets fully integrated in a manner that some form of intelligent tech can process to find meaning in that data, and then instantly make a decision: not just any decision, but the best decision possible (and defining ‘best decision’ adds even more complexity to the activity). The challenge is even more complex than it appears on the surface, because the impact/value that each data point, from each agent, has on all other data points, and more importantly on the whole of the situation, must be included in the calculations; and that impact, and hence the tech’s decisioning, changes in each and every situation, potentially changing what the best decision is and what action to take, or not to take. With a self-driving car, the impact of each agent’s data point can change in a fraction of a second, so the prior decision(s) must instantly be reevaluated, potentially changed, and new actions executed, properly integrating the new data in a continuously changing situation. This is the level of “reasoning” required. Achieving it requires a very different approach: start by defining the whole of the situation that the tech is to process, as the very first step, along with the kinds of decisions and actions we want the tech to arrive at and execute, and whether these are even possible. This is not the typical approach of the tech world.
"The issue with edge cases could conceivably be dealt with by adding the knowledge and reasoning of a language model" except that LLMs have neither.
There is no fundamental difference between a 'hallucination' and 'correct' output from LLMs. It is all a form of hallucination; it just happens that where the LLM has been fed enough data, it will create hallucinations that are reliable 'far from the edges'. LLMs are huge statistical inference machines that can reliably mimic things like language structure (because of how they handle language, not as what we would call language). Basically, when LLMs answer correctly they are 'hallucinating a reality'.
It is not that we cannot do very useful and powerful things with transformers and their ilk, but they are statistical machines and hence by definition not reliable in edge cases. Yet they seem reliable to us, in part because they produce perfect language.
Do you not think it's *a matter of time* until driverless cars are, on balance, at least as safe as human drivers? If you do think so, then what does your distribution over the arrival time look like?
just a guess, but at least another decade, maybe two or three. not a century, but maybe not possible with current techniques alone.
As Scott Burson wrote above “ it's not going to happen until self-driving cars no longer make errors that no human would ever make -- like colliding with a firetruck. It won't be enough that they're statistically safer, even if they are. They will have to demonstrably be no worse than humans at handling situations that humans reliably get right. Until then, people won't trust them (and I wouldn't either).”
I still shudder thinking about how many dead e-scooter riders we would have had to bury if driverless cars were ubiquitous when the scooters first appeared on streets in March 2018.
I would not ride my unicycle in traffic in SF right now, that’s for damn sure. (I used to in NYC, some years ago).
If we require 100% reliability from automated driving systems in all circumstances, they will never be allowed to operate. But what is the reliability (probability of error, of an accident) of the best, most expert human driver? The question is how much better than humans these systems can ultimately be. They are obviously not ready yet. The billions spent by car manufacturers were apparently not enough. But hundreds of billions are now being spent all over the world on AI algorithms for all kinds of applications, and self-driving cars could quickly benefit from the general progress in this technology. With more computing power and more efficient learning algorithms, a lot of ‘edge cases’ could be incorporated into the training data set, and the system’s reliability could approach the desired level.