I think this is what Andrej Karpathy meant when he said: “I always struggle a bit with I’m asked about the ‘hallucination problem’ in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.”
(https://twitter.com/karpathy/status/1733299213503787018?s=61&t=20jnuQQ5opsvX5aRlZ2UBg)
Maybe I’m just too much of an outsider normie (i.e. not on Twitter) but…how is a shitty DALL-E that moves for 5 minutes even in the conversation about AGI? Like, what is the connection there?
Self-driving cars actually have to navigate the physical world and people seem to have mightily cooled off on saying they’re on their way to AGI. I don’t see a use case for or a larger narrative of intelligence about these neat but frankly useless and bizarre videos…
"It increasingly looks like we will build an AGI with just scaling things up an order of magnitude or so, maybe two." - such absurd statements just reveal a lack of understanding of even the basic problems in AI. Any CS graduate would (or should) know that attacking an exponential-complexity problem (which is what the real world is) with a brute-force approach (just scaling things up) is doomed. But because there are currently no good ideas about how to really solve intelligence, people behave like a drowning man clutching at a straw.
re: the pirate ships, I worked on fluid simulation for feature films years ago, so I found this one particularly painful. I think the specific problem is that while Sora can do a convincing cup of coffee, and reasonably convincing pirate ships, it does not have any understanding of fluid simulation, and that you can't simply combine simulations from different physical scales. Fluid simulation does not look the same on coffee-cup scale as on battleship scale.
For it to look right, you'd want to simulate inch-long ships in a couple-of-inches cup of coffee.
Simulating real-size ships in coffee at ship scale would result in wave and foam behavior (e.g. bubble size) that would not be believable as "a cup of coffee"; it would 100% read as "a lake of coffee."
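To make the scale mismatch concrete, here's a back-of-envelope sketch. All the specific values (a ~5 cm cup, a ~50 m ship, water-like fluid properties) are my own illustrative assumptions, not from the films I worked on: matching the Froude number forces wave speeds to scale with the square root of the length scale, while the capillary length that sets bubble and foam size is fixed by the fluid itself and does not scale with the scene.

```python
import math

g = 9.81        # gravity, m/s^2
sigma = 0.072   # surface tension of a water-like fluid, N/m
rho = 1000.0    # density, kg/m^3

def capillary_length():
    # Length below which surface tension dominates gravity.
    # This is a property of the fluid, ~2.7 mm for water, and it
    # does NOT shrink when the scene shrinks -- which is why foam
    # and bubbles give the scale away.
    return math.sqrt(sigma / (rho * g))

def froude_speed(length_m):
    # Characteristic gravity-wave speed at a given length scale:
    # keeping the Froude number U / sqrt(g * L) constant means
    # wave speed grows with sqrt(L).
    return math.sqrt(g * length_m)

for name, L in [("coffee cup (~0.05 m)", 0.05),
                ("pirate ship scene (~50 m)", 50.0)]:
    print(f"{name}: wave speed ~{froude_speed(L):.2f} m/s, "
          f"capillary length / scene size = {capillary_length() / L:.5f}")
```

The ratio of capillary length to scene size differs by three orders of magnitude between the two scales, so bubbles that look right on a lake look absurdly large in a cup, and vice versa.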
See https://cdn.openai.com/sora/videos/tiny-construction.mp4 for the video: magical quality, and such strange physics.
By their own words, what they have done is use GPT-4 to prompt-engineer DALL-E 3. There is no attempt to include any form of explicit world knowledge or physics engine.
It is all based on the intellectually lazy process that OpenAI is so good at: "let's just use all the data we can steal/scrape, throw it at a learning engine, expend enormous amounts of compute, and perhaps some emergent properties will just magically happen."
They then hype the offering as being able to provide simulation systems. Not a chance until they include real-world knowledge, meaning, and physics engines in the tool, which of course is going to be quite difficult.
On their research page, they show some of these wildly wrong videos.
It would be utterly fascinating to see how these supposed simulations of the physical world compare to human visual intuition. It should be possible to test this, right? Have humans and Sora each "autocomplete" the motion of an object or person in a snippet of video. At this stage, I think a three-year-old who can understand instructions could probably do better than Sora. So could many animals, I'd think.
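A crude version of that autocomplete test could even be scored automatically. Here's a hedged sketch; everything in it (the function names, the toy video, and pixel MSE as the metric) is my own assumption, and pixel error is admittedly a weak proxy for physical plausibility, so human ratings would still be needed:

```python
import numpy as np

def continuation_error(ground_truth, candidate):
    """Mean squared error between the true continuation of a video and a
    candidate one (human-produced or model-generated), both arrays of
    shape (frames, height, width, channels) with values in [0, 1]."""
    gt = np.asarray(ground_truth, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    assert gt.shape == cand.shape, "continuations must be comparable"
    return float(np.mean((gt - cand) ** 2))

def moving_dot(frames, step):
    # Toy clip: a single bright pixel moving right at constant speed.
    vid = np.zeros((frames, 8, 8, 1))
    for t in range(frames):
        vid[t, 4, min(7, t * step), 0] = 1.0
    return vid

truth = moving_dot(4, 1)                          # correct constant-velocity motion
frozen = np.repeat(moving_dot(1, 1), 4, axis=0)   # physics-free "frozen frame" guess

print(continuation_error(truth, truth))   # 0.0 for a perfect continuation
print(continuation_error(truth, frozen))  # strictly larger for the frozen guess
```

The point of the protocol is the comparison: run the same held-out clips past people and past the model, and see whose continuations land closer to what actually happened.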
What confuses me is why anyone would expect 2D video to be sufficient for generating a good simulation/representation of the 3D world, based on this kind of training alone. Depth information in the videos Sora must be trained on is all over the place, so why would it magically intuit correct three-dimensional physics from that?
Sora: AI so dangerous only OpenAI can save us. Why do I feel like we've heard this one before....
Agree. The issues seem similar to the 'understanding' challenges in LLMs: no model of the world. As Gary indicates, without one it seems the intelligence will not be able to get any deeper.
These tools are not a path toward a representation of the physical world but toward the generation of a fake one: a fake world on the internet in which we are going to be drowned and lost, one that will disconnect us from the real world and replace it. It is simply frightening.
Many, many issues. I also question what the actual quality will be. It's one thing to drop your best demos; it's different once real people use it. I personally found some of the demo quality to be wonky CGI, à la the pirate ships.
I was also struck by the flame neither burning the cute monster nor flickering when the monster touched it; they appeared as different planes of reality, as in a dream. The same problem occurs when Claude writes a memo for me: it has ideas untethered from facts, often takes longer to correct and rewrite than just doing it myself, and of course lacks a sense of audience. The overall sense is the creation of something parallel, a simulacrum.
I see no spatial understanding whatsoever here, nor can I imagine how it could even emerge from how this model was created. I do see nice image generation, and am very impressed. Clearly this moves a lot of things in our world (not necessarily in a healthy direction).
As far as I can tell, this is fundamentally not in the ballpark where AGI is lurking somewhere. It's not even the same universe. To achieve AGI, they'd need to go back to the drawing board and start from scratch.
First, the most obvious thing I noticed: everyone is wearing sunglasses. Which means they still can't model eyes, much less the world.
Even if they could model eyes properly, 30 sequential images, or 300, or 30,000, is not a model of the world. It's a movie. And as Hollywood has proven, you don't need to be intelligent to make a movie.
What in the world is this? Poor man's Pixar?
Hi Gary, another nice article! "Create a clip where there is absolutely no elephant in the room" - if it can't generate a correct static image, there is no sudden reason why it would create a correct video. Obviously, video > image > text when it comes to impressiveness, but as long as the underlying architecture is the same, no miraculous jump to reality will happen, it's all still blind numerical computation of pixel mashups - now in multiple frames.
Obviously it can be useful, can be fun, can be dangerous, etc. But the Emperor still has no clothes.
In the Tokyo short, at about 50% (don’t have a timestamp) there’s a glitch with the woman’s feet. It looks like when you jump a bit in order to walk in step with someone else, except that her pace remains uncannily smooth and steady.