95 Comments

I think this is what Andrej Karpathy meant when he said: “I always struggle a bit with I’m asked about the ‘hallucination problem’ in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.”

(https://twitter.com/karpathy/status/1733299213503787018?s=61&t=20jnuQQ5opsvX5aRlZ2UBg)

Feb 16 · edited Feb 16 · Liked by Gary Marcus

Maybe I’m just too much of an outsider normie (i.e. not on Twitter) but… how is shitty Dall-E that moves for 5 minutes even in the conversation about AGI? Like, what is the connection there?

Self-driving cars actually have to navigate the physical world, and people seem to have mightily cooled off on saying they’re on their way to AGI. I don’t see a use case for these neat but frankly useless and bizarre videos, or any larger narrative about intelligence in them…


"It increasingly looks like we will build an AGI with just scaling things up an order of magnitude or so, maybe two." - such absurd statements just reveal a lack of understanding of even the basic problems in AI. Any CS graduate should know that attacking an exponential-complexity problem (which is what the real world is) with a brute-force approach (just scaling things up) is doomed. But because there are currently no good ideas about how to really solve intelligence, people behave like a drowning man clutching at a straw.

Feb 16 · Liked by Gary Marcus

re: the pirate ships, I worked on fluid simulation for feature films years ago, so I found this one particularly painful. I think the specific problem is that while Sora can do a convincing cup of coffee, and reasonably convincing pirate ships, it does not have any understanding of fluid simulation, and that you can't simply combine simulations from different physical scales. Fluid simulation does not look the same on coffee-cup scale as on battleship scale.

For it to look right, you'd want to simulate inch-long ships in a couple-of-inches cup of coffee.

Simulating real-size ships in coffee at ship scale would result in wave and foam behavior (e.g. bubble size) that would not be believable as "a cup of coffee"; they'd 100% read as "a lake of coffee".
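The scale mismatch can be made concrete with dimensionless numbers. A minimal sketch (the speeds and lengths below are hypothetical but plausible, and the fluid properties are water's): two free-surface flows only look dynamically alike when their Froude numbers match, and surface tension only matters near the capillary length, which for water is a few millimetres no matter how big the scene is.

```python
import math

def froude(speed_mps: float, length_m: float, g: float = 9.81) -> float:
    """Froude number: ratio of inertial to gravity-wave effects.
    Free-surface flows only look dynamically similar when this matches."""
    return speed_mps / math.sqrt(g * length_m)

def capillary_length(surface_tension: float = 0.072,
                     density: float = 1000.0, g: float = 9.81) -> float:
    """Scale below which surface tension dominates gravity (~2.7 mm for water)."""
    return math.sqrt(surface_tension / (density * g))

# Hypothetical numbers: a 50 m ship at 5 m/s vs a 5 cm toy ship
# drifting at 5 mm/s in a cup of coffee.
fr_ship = froude(5.0, 50.0)    # ~0.23
fr_cup = froude(0.005, 0.05)   # ~0.007: a completely different wave regime
lc = capillary_length()        # ~0.0027 m: comparable to cup-scale ripples,
                               # utterly negligible next to ship-scale waves
```

This is why a ship-scale splash composited into a cup reads as "a lake of coffee": the bubble and foam sizes betray the true simulation scale.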


See https://cdn.openai.com/sora/videos/tiny-construction.mp4 for the video: magical quality and such strange physics.

By their own words, what they have done is use GPT-4 to prompt-engineer DALL-E 3. There is no attempt to include any form of explicit world knowledge or physics engines.

It is all based on the intellectually lazy process that OpenAI is so good at: "Let's just use all the data we can steal / scrape, throw it at a learning engine, expend enormous amounts of compute power, and perhaps some emergent properties will just magically happen."

They then hype the offering as being able to provide simulation systems. Not a chance until they include real-world knowledge, meaning, and physics engines in the tool, which of course is going to be quite difficult.

On their research page, they show some of these wildly wrong videos.

Feb 16 · Liked by Gary Marcus

It would be utterly fascinating to see how these supposed simulations of the physical world compare to human visual intuition. It should be possible to test this, right? Have humans vs Sora "autocomplete" the motion of an object or person in a snippet of video. At this stage, I think a 3-year-old who can understand instructions could probably do better than Sora. So could many animals, I'd think.
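Such a head-to-head test is at least crudely scorable. A minimal sketch (the setup and the pixel-MSE metric are my assumptions, not an established benchmark): show humans and the model the opening frames of a clip, collect their continuations, and compare each against the held-out ground truth.

```python
import numpy as np

def continuation_error(ground_truth: np.ndarray, prediction: np.ndarray) -> float:
    """Mean per-pixel squared error between the real continuation of a clip
    and a predicted one. Both are (frames, height, width) arrays in [0, 1]."""
    return float(np.mean((ground_truth - prediction) ** 2))

# Toy clip: a dot moving one pixel per frame (ground truth)
# vs a "prediction" that freezes the dot in place.
T, H, W = 5, 8, 8
truth = np.zeros((T, H, W))
frozen = np.zeros((T, H, W))
for t in range(T):
    truth[t, 4, t] = 1.0    # dot advances each frame
    frozen[t, 4, 0] = 1.0   # dot never moves

print(continuation_error(truth, frozen) > continuation_error(truth, truth))  # True
```

A real comparison would of course need perceptual metrics and human raters, since raw pixel error is blind to whether a continuation is physically plausible.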

What confuses me is why anyone would expect 2D video to be sufficient for generating a good simulation/representation of the 3D world, based on this kind of training alone. Depth information in the videos Sora must be trained on is all over the place, so why would it magically intuit correct 3-dimensional physics from it?
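The ambiguity can be stated with a toy pinhole camera (a deliberate simplification, not Sora's actual pipeline): projection divides out depth, so an entire family of distinct 3D scenes produces identical 2D frames.

```python
import numpy as np

def project(point_3d: np.ndarray, f: float = 1.0) -> np.ndarray:
    """Ideal pinhole projection: (x, y, z) -> (f*x/z, f*y/z).
    Depth z is collapsed away, which is exactly the lost information."""
    x, y, z = point_3d
    return np.array([f * x / z, f * y / z])

near = np.array([1.0, 2.0, 4.0])   # a small object close to the camera
far = near * 10.0                  # a 10x larger object, 10x farther away

# Both land on exactly the same pixel; the image alone cannot tell them apart.
assert np.allclose(project(near), project(far))
```

Cues like parallax, shading, and familiar object sizes partially resolve this in practice, but nothing about pixel-only training forces a model to recover a consistent 3D interpretation.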

Feb 16 · Liked by Gary Marcus

Sora: AI so dangerous only OpenAI can save us. Why do I feel like we've heard this one before....

Feb 17 · Liked by Gary Marcus

Agree. The issues seem similar to the 'understanding' challenges in LLMs: no model of the world. As Gary indicates, without one it seems the intelligence will not be able to get any deeper.

Feb 16 · edited Feb 16 · Liked by Gary Marcus

These tools are not a way toward a representation of the physical world but toward the generation of a fake one: a fake world on the internet in which we are going to be drowned and lost, one that will disconnect us from the real world and replace it. It is simply frightening.


Many, many issues. I also question what the actual quality will be. It's one thing to drop your best demos; it's different once real people use it. I personally found some of the demo quality to be wonky CGI, à la the pirate ships.

Feb 16 · Liked by Gary Marcus

Was also struck by the flame not burning the cute monster or flickering when the monster touched it—they appeared as different planes of reality, or as in a dream. The same problem occurs when Claude writes a memo for me—it has ideas untethered from facts, and often takes longer to correct and rewrite than to just do it myself—and it also lacks a sense of audience, of course. The sense is of the creation of something parallel, a simulacrum.


I see no spatial understanding whatsoever here, nor can I imagine how it could even emerge from how this model was created. I do see nice image generation, and am very impressed. Clearly this moves a lot of things in our world (not necessarily in a healthy direction).

As far as I can tell, this is fundamentally not in the ballpark where AGI is lurking somewhere. It's not even in the same universe. To achieve AGI, they'd need to go back to the drawing board and start from scratch.

Feb 17 · edited Feb 17 · Liked by Gary Marcus

First, the most obvious thing I noticed: everyone is wearing sunglasses. Which means they still can't model eyes, much less the world.

Even if they could model eyes properly, 30 sequential images, or 300, or 30,000, is not a model of the world. It's a movie. And as Hollywood has proven, you don't need to be intelligent to make a movie.


What in the world is this? Poor man's Pixar?

Feb 16 · edited Feb 17 · Liked by Gary Marcus

Hi Gary, another nice article! "Create a clip where there is absolutely no elephant in the room" - if it can't generate a correct static image, there is no reason it would suddenly create a correct video. Obviously, video > image > text when it comes to impressiveness, but as long as the underlying architecture is the same, no miraculous jump to reality will happen; it's all still blind numerical computation of pixel mashups - now in multiple frames.

Obviously it can be useful, can be fun, can be dangerous, etc. But the Emperor still has no clothes.

Feb 16 · Liked by Gary Marcus

In the Tokyo short, at about 50% (don’t have a timestamp) there’s a glitch with the woman’s feet. It looks like when you jump a bit in order to walk in step with someone else, except that her pace remains uncannily smooth and steady.
