Hi Gary, you were right on!
It is a forever unsolvable problem, using existing approaches alone [that are centered on data, including 'multimodal', 'contrastive']. That's because video will always be an incomplete record of the physical world at large. It's impossible to build a realistic world model using pixels and their text descriptions. Matter behaves on account of its structure (e.g. a flute with its carefully drilled holes, a diffraction grating with its microscopic rulings, and thousands of other examples), and its interaction with forces (always invisible), under energy fields (also always invisible). What can be gleaned from one video ("this block is catching on fire") is invalidated by another ("wow it's not catching on fire"). Humans learn these via direct physical experiences, not by watching videos (alone). If videos by themselves could form world models, we could shut down every physics, biology, chemistry... lab in the world!
Also, how much of the video training set consists of special effects and CGI anyway? How would you even train a realistic world model if so much of your training set isn't realistic?
Good point. Also, synthetic data is what will be plentifully available for training future versions [similar to how it will be with text].
Object permanence artifacts are just a visual annoyance for products like Sora. Unfortunately, the same problem plagues so-called self-driving systems. Not much has fundamentally changed over the years in this respect. Looking at the display of modern systems, e.g. Tesla, you will often notice pedestrians, cars and trucks appear and disappear into a quantum foam. I call it Schrödinger's traffic.
Waymo is way, way ahead. Their record speaks for itself. One can't afford hallucinations on the road, of course.
I saw a vlog of a Waymo ride about a year ago. Object permanence artifacts were there.
As far as I know, Waymo does try to model and track objects, and is even able to predict things, such as the possibility of a biker suddenly showing up from behind a truck that was not seen before.
Now, these are not object permanence, but related issues. All these are likely solved approximately and implicitly, with decent-enough reliability in practice.
Dolgov said that Transformer-based systems do a lot better than what they had before. I know some folks still want human-like reasoning, but in practice that resulted in rule-based systems that could never be implemented robustly and flexibly enough. We'll see what the future brings.
Thank you for posting this, Gary. If I read you correctly, it seems we now have sufficient evidence that this technology cannot be massaged into much more than it already is at this point, because it is foundationally flawed?
I can see Sora being fine for quick background clips (however one might want to use that), but the idea of anyone doing long-form television or feature films using this technology exclusively is a fool's errand.
The Coca-Cola Christmas ad is awful. Even though I'm sure it's been edited by humans, you can still see weird object morphing happening in a number of scenes. They just flip out of the scenes so fast that you don't have time to focus on them.
Yup. And look at the public backlash against it.
It is much harder to cheat with movies than with images, as the number of things that can go wrong increases by orders of magnitude, and the human eye can see any unnatural movement.
So, not sure where this is going. However, the main application of current AI, to assistants doing work, is much more likely to work out, as the language and action space is a lot smaller than the video space.
Advanced AI systems need to have a broad, deep, and accurate internal model of the physical universe (world model) in order to be able to understand, reason, and predict things about the actual physical universe. The internal world models maintained by today's GenAI systems are very broad, but only superficially deep, and significantly inaccurate. Accordingly, GenAI's ability to understand, reason, and predict things about the actual physical universe is severely compromised. Unfortunately for the investors in GenAI, this weakness is fundamental to how transformers (and neural nets in general) synthesise internal world models from their training data. It can't be properly fixed by any amount of brute scaling, or any simple fudge like RAG or CoT. Are we learning yet...?
This is all correct. Today's systems are broad but shallow. This is progress though, as until recently our systems were deep but very narrow.
I think it will be easier to make a broad and shallow system deep, by adding lots of algorithms and models where needed, than to make narrow systems broad.
I agree with all of it except this part: "The John Locke hypothesis that you can learn physics purely from sense data is failing"
It's surely precisely the limitations in the range of modalities of "sense data" available in the training data that is a substantial part of the problem.
It's been my contention that computing takes "leaps" insofar as we invent new input and output technologies (the mouse, LCD, MEMS...), and progress has less to do with raw processing capacity than might be commonly assumed.
The corollary is that AI isn't AGI not merely because of insufficient scale, nor, as Marcus would hold, because it's missing some key software. AI is falling short of AGI because generative models "generate" only in the extremely narrow domains of text and textual image description.
We're limiting AI to the current limit of classical computing, i.e. what we've done so far with our keyboards and mice, which is extremely far off even the most basic animal experience of the world, never mind that of a sentient animal. I mean, a tube worm on the seabed has a more varied, multimodal sensory experience than an AI. It can at least hope to learn something more fundamental about the world.
Mario Bunge challenged me in 1998 to build a computer to generate novel scientific hypotheses. He was convinced it was impossible; I am not anti-AI in that sense. However, I do agree with his view that empiricism is incorrect. The role of rationalism is mysterious to me, but I agree hypotheses are invented, not read off data. I now wonder (Michotte vs. Hume) whether our kinesthetic experience is part of this. Psychologists, including our host: are there pathologies where people, for example, lose the ability to infer causation?
Certainly makes sense that a model trained only on images can’t infer and apply a real physics model. Whether or not a full set of sensory data might suffice is an open question; humans leverage more than vision to make sense of the world. Even if our physics model is partially innate, it was burned into the brain over generations of experiments using only sensory data leading to selective reproductive survival (unless you think God designed it in…). So I wouldn’t totally give up hope that we can use induction to make reasonably accurate world models - but with more than passive vision models.
Try dice! No, seriously, Sora and dice managed to work out worse than my already low expectations. I was thinking, "there's no way a video of dice will stand up to scrutiny. They'll have the wrong numbers of pips, or sides will repeat". But no, the writhing, only vaguely dice-shaped blobs I've been getting with natural-language requests are something else entirely.
We learn basic physics in relation to our bodies, not with verbal descriptions or 2D imagery. I noticed when the Amsterdam science museum opened a few decades ago, a superb building, the computer simulations were pathetic. The Foucault pendulum at the Smithsonian is a fantastic example of almost feeling the rotation of the earth. All the 2D computer displays were eventually replaced with actual science exhibits. Until they grasp how to encode temporospatial objects, that will be a poor area.
Vis-a-vis self-driving cars, I’ve used Waymo 3-4 times a week for a year. Superb, and everyone I know who uses them feels safer than with a human driver. Used them in two cities now. I’m not sure about other vehicles, but Waymo has done a great job.
"A universal physics engine" and a "generative data engine". Very impressive. https://x.com/zhou_xian_/status/1869511650782658846
Thanks for the article; well written and I get the point.
But respectfully I think the view of the article is too small. AI models can absolutely ace physics ... *IF* they're trained on physics data. Saying LLMs can't do physics is like saying English majors can't do physics. It's not wrong. But what does it tell us really?
More structured counterpoint here, for your comment: https://substack.com/home/post/p-152976751
Bugs Bunny and Star Wars…also do not obey physics…
Just listened to a talk by Geoffrey Hinton in which he gave a reductive and uncomprehending dismissal of your critique of ML. I can't imagine how aggravating that must be. But surely the important point is that he feels that he must dismiss your criticisms. Deep down, he doesn't know, and he more or less suspects that.
Hinton's talk: https://youtu.be/Es6yuMlyfPw?si=kE-zz4ZzRKuN1lTR
It's amusing that this is very much like the trouble the longtermists who infest this field have with the reality of astrophysics and relativity. As opposed to the science fiction they seem to think is real.
The insatiable quest for data "to solve it all" resembles Asimov's Last Question short story ;P https://en.wikipedia.org/wiki/The_Last_Question
Hi Gary,
While the fundamental mistakes of generative AI are worth pointing out, unfortunately for the vast majority of the public, most details will be missed.
With that in mind I think the big talking points should be on the wholesale stealing of art and copyrighted art by OpenAI and others and the devastating environmental impact such a useless product imposes.
As we saw with the hummingbird video (and many others), genAI in many cases is simply copying videos it has stolen from artists and presenting them as if they were its own creation. The number of videos is likely in the multibillions; no human can recall whether a video is new or simply a copy at that point. The same trick was used by codeGen and LLMs, with the AI hucksters attempting to present plagiarism as novel intelligence.
Notice, nowhere has OpenAI in its hundreds of pages of product info detailed what training data was used. We know it's YouTube videos, we know it's movies, we know it's social media videos, all stolen without consent.
The environment talking point needs to be constantly shoved in their faces as well. An LLM uses 100-1000 times more energy than something like a Google search while doing in most cases the exact same thing. It takes 1000-10,000 times more energy to make a crappy AI image and replace the human artist. The whole thing would be a comedy if it wasn't forced on everyone's eyeballs by Big Tech and Wall Street.
The good news: artists and creatives in every field are starting to actively hate AI and stand against it. Artists and creatives with huge fan followings are convincing the greater public not to support this theft of creativity. Movements and solidarity are forming.
The more support the anti-AI movement gets, the greater the chance the entire business model collapses. After all, why would anyone get a "Create a movie, or song, or image" subscription when the consumers are not buying it? This is the ultimate fear of AI accelerationists. Never let them forget that the people are against this and the hate will only grow.