I think this is what Andrej Karpathy meant when he said: “I always struggle a bit with I’m asked about the ‘hallucination problem’ in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.”
(https://twitter.com/karpathy/status/1733299213503787018?s=61&t=20jnuQQ5opsvX5aRlZ2UBg)
Maybe I’m just too much of an outsider normie (i.e. not on Twitter) but…how is a shitty DALL-E that moves for 5 minutes even in the conversation about AGI? Like, what is the connection there?
Self-driving cars actually have to navigate the physical world and people seem to have mightily cooled off on saying they’re on their way to AGI. I don’t see a use case for or a larger narrative of intelligence about these neat but frankly useless and bizarre videos…
"It increasingly looks like we will build an AGI with just scaling things up an order of magnitude or so, maybe two." - such absurd statements just reveal a lack of understanding of even the basic problems in AI. Any CS graduate would (or should) know that attacking an exponential-complexity problem (which is what the real world is) with a brute-force approach (just scaling things up) is doomed. But because there are currently no good ideas about how to really solve intelligence, people behave like a drowning man clutching at a straw.
re: the pirate ships, I worked on fluid simulation for feature films years ago, so I found this one particularly painful. I think the specific problem is that while Sora can do a convincing cup of coffee, and reasonably convincing pirate ships, it does not have any understanding of fluid simulation, and that you can't simply combine simulations from different physical scales. Fluid simulation does not look the same on coffee-cup scale as on battleship scale.
For it to look right, you'd want to simulate inch-long ships in a couple-of-inches cup of coffee.
Simulating real-size ships in coffee at ship scale would result in wave and foam behavior (e.g. bubble size) that would not be believable as "a cup of coffee"; it would 100% read as "a lake of coffee."
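To make the scale mismatch concrete, here's a back-of-envelope sketch. All the specific values (a ~5 cm cup, a ~50 m ship, water-like fluid properties) are my own illustrative assumptions, not from the films I worked on: matching the Froude number forces wave speeds to scale with the square root of the length scale, while the capillary length that sets bubble and foam size is fixed by the fluid itself and does not scale with the scene.

```python
import math

g = 9.81        # gravity, m/s^2
sigma = 0.072   # surface tension of a water-like fluid, N/m
rho = 1000.0    # density, kg/m^3

def capillary_length():
    # Length below which surface tension dominates gravity.
    # This is a property of the fluid, ~2.7 mm for water, and it
    # does NOT shrink when the scene shrinks -- which is why foam
    # and bubbles give the scale away.
    return math.sqrt(sigma / (rho * g))

def froude_speed(length_m):
    # Characteristic gravity-wave speed at a given length scale:
    # keeping the Froude number U / sqrt(g * L) constant means
    # wave speed grows with sqrt(L).
    return math.sqrt(g * length_m)

for name, L in [("coffee cup (~0.05 m)", 0.05),
                ("pirate ship scene (~50 m)", 50.0)]:
    print(f"{name}: wave speed ~{froude_speed(L):.2f} m/s, "
          f"capillary length / scene size = {capillary_length() / L:.5f}")
```

The ratio of capillary length to scene size differs by three orders of magnitude between the two scales, so bubbles that look right on a lake look absurdly large in a cup, and vice versa.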
See https://cdn.openai.com/sora/videos/tiny-construction.mp4 for the video: magical quality, and such strange physics.
By their own words, what they have done is use GPT-4 to prompt-engineer DALL-E 3. There is no attempt to include any form of explicit world knowledge or physics engine.
It is all based on the intellectually lazy process that OpenAI is so good at: "let's just use all the data we can steal/scrape, throw it at a learning engine, expend enormous amounts of compute, and perhaps some emergent properties will just magically happen."
They then hype the offering as being able to provide simulation systems. Not a chance until they include real-world knowledge, meaning, and physics engines in the tool, which of course is going to be quite difficult.
On their research page, they show some of these wildly wrong videos.
It would be utterly fascinating to see how these supposed simulations of the physical world compare to human visual intuition. It should be possible to test this, right? Have humans and Sora each "autocomplete" the motion of an object or person in a snippet of video. At this stage, I think a three-year-old who can understand instructions could probably do better than Sora. So could many animals, I'd think.
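A crude version of that autocomplete test could even be scored automatically. Here's a hedged sketch; everything in it (the function names, the toy video, and pixel MSE as the metric) is my own assumption, and pixel error is admittedly a weak proxy for physical plausibility, so human ratings would still be needed:

```python
import numpy as np

def continuation_error(ground_truth, candidate):
    """Mean squared error between the true continuation of a video and a
    candidate one (human-produced or model-generated), both arrays of
    shape (frames, height, width, channels) with values in [0, 1]."""
    gt = np.asarray(ground_truth, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    assert gt.shape == cand.shape, "continuations must be comparable"
    return float(np.mean((gt - cand) ** 2))

def moving_dot(frames, step):
    # Toy clip: a single bright pixel moving right at constant speed.
    vid = np.zeros((frames, 8, 8, 1))
    for t in range(frames):
        vid[t, 4, min(7, t * step), 0] = 1.0
    return vid

truth = moving_dot(4, 1)                          # correct constant-velocity motion
frozen = np.repeat(moving_dot(1, 1), 4, axis=0)   # physics-free "frozen frame" guess

print(continuation_error(truth, truth))   # 0.0 for a perfect continuation
print(continuation_error(truth, frozen))  # strictly larger for the frozen guess
```

The point of the protocol is the comparison: run the same held-out clips past people and past the model, and see whose continuations land closer to what actually happened.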
What confuses me is why anyone would expect 2D video to be sufficient for generating a good simulation/representation of the 3D world, based on this kind of training alone. Depth information in the videos Sora must be trained on is all over the place, so why would it magically intuit correct three-dimensional physics from that?
Sora: AI so dangerous only OpenAI can save us. Why do I feel like we've heard this one before....
Agree. The issues seem similar to the 'understanding' challenges in LLMs: no model of the world. As Gary indicates, without one it seems the intelligence will not be able to get any deeper.
These tools are not a path toward a representation of the physical world but toward the generation of a fake one: a fake world on the internet in which we are going to be drowned and lost, one that will disconnect us from the real world and replace it. It is simply frightening.
Many, many issues. I also question what the actual quality will be. It's one thing to drop your best demos; it's different once real people use it. I personally found some of the demo quality to be wonky CGI, à la the pirate ships.
I was also struck by the flame neither burning the cute monster nor flickering when the monster touched it; they appeared as different planes of reality, as in a dream. The same problem occurs when Claude writes a memo for me: it has ideas untethered from facts, often takes longer to correct and rewrite than just doing it myself, and of course lacks a sense of audience. The overall sense is the creation of something parallel, a simulacrum.
I see no spatial understanding whatsoever here, nor can I imagine how it could even emerge from how this model was created. I do see nice image generation, and am very impressed. Clearly this moves a lot of things in our world (not necessarily in a healthy direction).
As far as I can tell, this is fundamentally not in the ballpark where AGI is lurking somewhere. It's not even the same universe. To achieve AGI, they'd need to go back to the drawing board and start from scratch.
First, the most obvious thing I noticed: everyone is wearing sunglasses. Which means they still can't model eyes, much less the world.
Even if they could model eyes properly, 30 sequential images, or 300, or 30,000, is not a model of the world. It's a movie. And as Hollywood has proven, you don't need to be intelligent to make a movie.
What in the world is this? Poor man's Pixar?
Hi Gary, another nice article! "Create a clip where there is absolutely no elephant in the room" - if it can't generate a correct static image, there is no sudden reason why it would create a correct video. Obviously, video > image > text when it comes to impressiveness, but as long as the underlying architecture is the same, no miraculous jump to reality will happen, it's all still blind numerical computation of pixel mashups - now in multiple frames.
Obviously it can be useful, can be fun, can be dangerous, etc. But the Emperor still has no clothes.
In the Tokyo short, at about 50% (don’t have a timestamp) there’s a glitch with the woman’s feet. It looks like when you jump a bit in order to walk in step with someone else, except that her pace remains uncannily smooth and steady.