23 Comments
Saty Chary:

Hi Gary you were right on!

It is a forever-unsolvable problem using existing approaches alone [those centered on data, including 'multimodal' and 'contrastive' ones]. That's because video will always be an incomplete record of the physical world at large. It's impossible to build a realistic world model from pixels and their text descriptions. Matter behaves on account of its structure (e.g. a flute with its carefully drilled holes, a diffraction grating with its microscopic rulings, and thousands of other examples) and its interaction with forces (always invisible), under energy fields (also always invisible). What can be gleaned from one video ("this block is catching on fire") is invalidated by another ("wow, it's not catching on fire"). Humans learn these things via direct physical experience, not by watching videos (alone). If videos by themselves could help form world models, we could shut down every physics, biology, chemistry... lab in the world!

Tek Bunny:

Also, how much of the video training set consists of special effects and CGI anyway? How would you even train a realistic world model if so much of your training set isn't realistic?

Saty Chary:

Good point. Also, synthetic data is what will be plentifully available for training future versions [similar to how it will be with text].

Paul Jurczak:

Object permanence artifacts are just a visual annoyance for products like Sora. Unfortunately, the same problem plagues so-called self-driving systems. Not much has fundamentally changed over the years in this respect. Looking at the display of a modern system, e.g. a Tesla, you will often notice pedestrians, cars, and trucks appearing and disappearing into a quantum foam. I call it Schrödinger's traffic.

[Comment removed, Dec 10, 2024]

Paul Jurczak:

I saw a vlog of a Waymo ride about a year ago. Object permanence artifacts were there.

Robert Keith:

Thank you for posting this, Gary. If I read you correctly, it seems we now have sufficient evidence that this technology cannot be massaged into much more than it already is at this point, because it is foundationally flawed?

I can see Sora being fine for quick background clips (however one might want to use that), but the idea of anyone doing long-form television or feature films using this technology exclusively is a fool's errand.

James Francis:

The Coca-Cola Christmas ad is awful. Even though I'm sure it was edited by humans, you can still see weird object morphing happening in a number of scenes. They just flip out of the scenes so fast that you don't have time to focus on them.

Robert Keith:

Yup. And look at the public backlash against it.

Aaron Turner:

Advanced AI systems need to have a broad, deep, and accurate internal model of the physical universe (world model) in order to be able to understand, reason, and predict things about the actual physical universe. The internal world models maintained by today's GenAI systems are very broad, but only superficially deep, and significantly inaccurate. Accordingly, GenAI's ability to understand, reason, and predict things about the actual physical universe is severely compromised. Unfortunately for the investors in GenAI, this weakness is fundamental to how transformers (and neural nets in general) synthesise internal world models from their training data. It can't be properly fixed by any amount of brute scaling, or any simple fudge like RAG or CoT. Are we learning yet...?

TTLX:

I agree with all of it except this part: "The John Locke hypothesis that you can learn physics purely from sense data is failing"

Surely it's precisely the limited range of modalities of "sense data" available in the training data that is a substantial part of the problem.

It's been my contention that computing takes "leaps" insofar as we invent new input and output technologies (the mouse, LCD, MEMS...), and progress has less to do with raw processing capacity than might be commonly assumed.

The corollary is that AI isn't AGI not merely because of insufficient scale, nor, as Marcus would hold, because it's missing some key software. AI is falling short of AGI because generative models "generate" only in the extremely narrow domains of text and textual image description.

We're limiting AI to the current limits of classical computing, i.e. what we've done so far with our keyboards and mice, which is extremely far from even the most basic animal experience of the world, never mind that of a sentient animal. I mean, a tube worm on the seabed has a more varied, multimodal sensory experience than an AI does. It can at least hope to learn something more fundamental about the world.

keithdouglas:

Mario Bunge challenged me in 1998 to build a computer to generate novel scientific hypotheses. He was convinced it was impossible; I am not anti-AI in that sense. However, I do agree with his view that empiricism is incorrect. The role of rationalism is mysterious to me, but I agree hypotheses are invented, not read off data. I now wonder (Michotte vs. Hume) whether our kinesthetic experience is part of this. Psychologists - including our host - are there pathologies where people lose, for example, the ability to infer causation?

Harley Davis:

Certainly makes sense that a model trained only on images can’t infer and apply a real physics model. Whether or not a full set of sensory data might suffice is an open question; humans leverage more than vision to make sense of the world. Even if our physics model is partially innate, it was burned into the brain over generations of experiments using only sensory data leading to selective reproductive survival (unless you think God designed it in…). So I wouldn’t totally give up hope that we can use induction to make reasonably accurate world models - but with more than passive vision models.

TheOtherKC:

Try dice! No, seriously, Sora and dice managed to work out worse than my already low expectations. I was thinking, "there's no way a video of dice will stand up to scrutiny. They'll have the wrong number of pips, or sides will repeat". But no, the writhing, only vaguely dice-shaped blobs I've been getting with natural-language requests are something else entirely.

Sufeitzy:

We learn basic physics in relation to our bodies, not through verbal descriptions or 2D imagery. I noticed when the Amsterdam science museum opened a few decades ago, in a superb building, that the computer simulations were pathetic. The Foucault pendulum at the Smithsonian is a fantastic example of almost feeling the rotation of the earth. All the 2D computer displays were eventually replaced with actual science exhibits. Until they grasp how to encode temporospatial objects, this will remain a poor area.

Vis-à-vis self-driving cars: I've used them (Waymo) 3-4 times a week for a year. Superb, and everyone I know who uses them feels safer than with a human driver. I've used them in two cities now. I'm not sure about other vehicles, but Waymo has done a great job.

Satyaki Upadhyay:

> The John Locke hypothesis that you can learn physics purely from sense data

How's John Locke related to this? Did you mean learning physics purely from the "empiricism" of video data?

Bill Taylor:

Thanks for the article; well written and I get the point.

But respectfully, I think the view of the article is too narrow. AI models can absolutely ace physics ... *IF* they're trained on physics data. Saying LLMs can't do physics is like saying English majors can't do physics. It's not wrong. But what does it really tell us?

More structured counterpoint here, for your comment: https://substack.com/home/post/p-152976751

Rony Abovitz:

Bugs Bunny and Star Wars…also do not obey physics…

Bill Benzon:

Just listened to a talk by Geoffrey Hinton in which he gave a reductive and uncomprehending dismissal of your critique of ML. I can't imagine how aggravating that must be. But surely the important point is that he feels he must dismiss your criticisms. Deep down, he doesn't know, and he more or less suspects that.

Hinton's talk: https://youtu.be/Es6yuMlyfPw?si=kE-zz4ZzRKuN1lTR

Martin Belderson:

It's amusing that this is very much like the trouble the longtermists who infest this field have with the reality of astrophysics and relativity, as opposed to the science fiction they seem to think is real.

Charles Fadel:

The insatiable quest for data "to solve it all" resembles Asimov's short story "The Last Question" ;P https://en.wikipedia.org/wiki/The_Last_Question

Runner:

Hi Gary,

While the fundamental mistakes of generative AI are worth pointing out, most of the details will unfortunately be missed by the vast majority of the public.

With that in mind, I think the big talking points should be the wholesale stealing of art and copyrighted work by OpenAI and others, and the devastating environmental impact such a useless product imposes.

As we saw with the hummingbird video (and many others), genAI in many cases is simply copying videos it has stolen from artists and presenting them as if they were its own creation. The number of videos is likely in the multibillions; no human can recall whether a video is new or simply a copy at that point. The same trick was used with code generation and LLMs, the AI hucksters attempting to present plagiarism as novel intelligence.

Notice that nowhere in its hundreds of pages of product info has OpenAI detailed what training data was used. We know it's YouTube videos, movies, and social media videos, all stolen without consent.

The environmental talking point needs to be constantly shoved in their faces as well. An LLM uses 100-1,000 times more energy than something like a Google search while, in most cases, doing the exact same thing; 1,000-10,000 times more energy to make a crappy AI image and replace the human artist. The whole thing would be a comedy if it weren't forced on everyone's eyeballs by Big Tech and Wall Street.

The good news: artists and creatives in every field are starting to actively hate AI and stand against it. Artists and creatives with huge fan followings are convincing the greater public not to support this theft of creativity. Movements and solidarity are forming.

The more support the anti-AI movement gets, the greater the chance the entire business model collapses. After all, why would anyone get a "create a movie, or song, or image" subscription when consumers are not buying it? This is the ultimate fear of AI accelerationists. Never let them forget that the people are against this and the hate will only grow.

Gerben Wierda:

Yes, these systems are unable to do this. No surprise here. The question is not whether you're right (you are, here) but how long it will take society to come to that same conclusion.

"The John Locke hypothesis that you can learn physics purely from sense data is failing, over, and over, and over again; this time perhaps (if I had to guess) to the tune of a billion dollars." is, I suspect, not yet ruled out (though 'purely' is a stretch, certainly). How much of our knowledge is built-in and how much is learned is an open question. Most of it may still be learned and not innate.

Last year I watched a very young child (say, one year old) repeatedly plucking at the end of the sleeve of its sweater. It did this over and over again for an extended period, coming back to it after having been distracted. It is hard not to imagine that such behaviours lead to learning the physics of sleeves, cloth, and so forth. Much of the behaviour of very young children seems able to play that role, and it seems not unreasonable that this is part of the formation of the basic 'mental automations' that make up our practical intelligence. We tend to pay little attention to this 'random', 'meaningless' behaviour, but I can see it being the way we learn the basics of physics.

But the difference from GenAI systems is of course that this wasn't just watching; it was interacting. In a sense it illustrates the idea that you can only truly learn something by doing it, not simply by observing it.
