Discussion about this post

Saty Chary

Hi Gary, you were right on!

It is a forever unsolvable problem when using existing approaches alone [those centered on data, including 'multimodal' and 'contrastive' ones]. That's because video will always be an incomplete record of the physical world at large. It's impossible to build a realistic world model using pixels and their text descriptions. Matter behaves as it does on account of its structure (e.g. a flute with its carefully drilled holes, a diffraction grating with its microscopic rulings, and thousands of other examples) and its interaction with forces (always invisible) under energy fields (also always invisible). What can be gleaned from one video ("this block is catching on fire") is invalidated by another ("wow, it's not catching on fire"). Humans learn these things via direct physical experience, not by watching videos (alone). If videos by themselves could help form world models, we could shut down every physics, biology, chemistry... lab in the world!

Paul Jurczak

Object permanence artifacts are just a visual annoyance for products like Sora. Unfortunately, the same problem plagues so-called self-driving systems. Not much has fundamentally changed over the years in this respect. Looking at the display of a modern system, e.g. Tesla's, you will often notice pedestrians, cars and trucks appearing out of and disappearing into a quantum foam. I call it Schrödinger's traffic.
