I think this is what Andrej Karpathy meant when he said: “I always struggle a bit with [when] I’m asked about the ‘hallucination problem’ in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.”
(https://twitter.com/karpathy/status/1733299213503787018?s=61&t=20jnuQQ5opsvX5aRlZ2UBg)
Sure. But in some sense, that is something they also have in common with biological neural networks. In fact, it's possible that this is *all* we do as well. Kyle Hill had a video on the subject recently:
https://www.youtube.com/watch?v=INpWNP5HPNQ
So are Hollywood movies. A massive industry, built out of dreams.
uh-oh.
"It increasingly looks like we will build an AGI with just scaling things up an order of magnitude or so, maybe two." - such absurd statements just reveal a lack of understanding of even the basic problems in AI. Any cs graduate would/should know that attacking an exponential complexity problem (which is what the real world is) with a brute force approach (just scaling things up) is doomed. But because there are no good ideas currently how to really solve intelligence, people behave like a drowning man clutching onto a straw.
By this logic, ChatGPT should never have been able to do what it is doing now - namely, speaking articulately about a wide variety of topics and showing some level of understanding simply by predicting the next words in sentences.
This type of understanding is an emergent property of the architecture and the learning process - just as a baby learns about the world and builds up an understanding of its environment. Animals do so too, but human brains have, evolutionarily speaking, been scaled up (especially the neocortex) and achieved a much higher degree of understanding. To say that scaling up alone is insufficient and can't possibly be sufficient is akin to postulating some as-yet-undiscovered special ingredient in human learning - which sounds a lot like the "élan vital" claimed to animate organisms before we understood the mechanics of RNA, DNA and evolution.
Just as a baby doesn't go from not understanding at birth to truly understanding later in life in a single binary step, but rather gradually over many years, such computer models will also gradually improve their understanding until they achieve AGI and eventually a superhuman level of understanding in most if not all domains.
It would take a long explanation to spell out why this is so, so I'm just going to give a counterexample: if ChatGPT can indeed learn like a baby, how come it's unable to do simple arithmetic, like adding two numbers? I mean, if it can learn to speak articulately, it should be able to learn simple arithmetic too, right? The brute-forcing is clearly visible in this test: LLMs can do arithmetic with small numbers, e.g. 2-3 digits, because the number of combinations is small (2 x 3 = 6 digits = 1 million combinations) and they can memorize all of them. However, once we go above 4 digits performance degrades, and with 6-digit numbers it is almost 0%. This is because 2 x 6 = 12 digits = 1 trillion combinations, and even Microsoft's data centres would have a hard time remembering that many combinations.
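As a quick back-of-the-envelope check of that counting argument (a sketch in Python; the tally is mine, not the commenter's): for d-digit operands there are 10^(2d) distinct addition problems, which is the size of the table a pure memorizer would have to store.

```python
# Count of distinct "a + b" problems a pure memorizer would have to store:
# 10^d possible values per d-digit operand, so 10^(2d) operand pairs in total.
def addition_problem_count(digits: int) -> int:
    return 10 ** (2 * digits)

for d in (2, 3, 4, 6):
    print(f"{d}-digit operands: {addition_problem_count(d):,} combinations")
# prints 10,000 / 1,000,000 / 100,000,000 / 1,000,000,000,000 respectively
```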
Let's be careful about what the actual claim is here. I'm not saying that a brute-force approach will be successful - the combinatorial explosion surely eliminates that possibility. But I am saying that scaling up can lead to emergent properties of a system which are unexpected (like ChatGPT's articulate answering of questions). Such emergent properties are hard if not impossible to predict, but they are also hard if not impossible to rule out. This is how I understand your claim - namely, that scaling up can NEVER lead to AGI, and that assuming otherwise is absurd and shows a lack of understanding...
No baby is born with the knowledge of how to do simple math. They all learn it through interacting with the world and building a mental model of their environment, and subsequently of abstract concepts like simple math.
I would claim that such abilities are an emergent property of our neocortex, and that evolution's main trick in getting us there was scaling up. I am not convinced by one counterexample based on naive logic ("if it can learn to speak articulately it should be able to also learn to do simple arithmetic, right") or by stressing the combinatorial-explosion argument (which human babies clearly overcome). To me, the question basically is: do we need another quality of reasoning (data representation, learning mechanism, élan vital/mental), or will scaling up lead to emergent properties sufficient for AGI?
My point is that scaling alone is not enough; you need the right algorithm to scale up. Otherwise you could just pick a look-up table and scale it up, and if you could keep up with the combinatorial explosion it would solve the problem - and as an outside observer you might conclude that the scaled-up look-up table exhibits emergent properties, because it answers the questions correctly, while in reality it is just supplying memorized answers. The abilities of our brain are not a result of scaling alone; even much simpler brains show remarkable abilities - crows solving fairly complex logic puzzles, bumblebees playing with balls, jumping spiders recognizing objects and planning a fairly complex attack strategy, etc.
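To make the look-up-table point concrete, here is a toy sketch (mine, and purely hypothetical - it does not describe any real system): within the range it has memorized, the table is indistinguishable from something that "knows" addition, and only out-of-range queries reveal the difference.

```python
# A memorizer that stores every addition fact up to some bound. It looks
# "emergent" to an outside observer right up until it falls off the table's edge.
class LookupAdder:
    def __init__(self, max_operand: int):
        # "Scaling up" here simply means enlarging the memorized table.
        self.table = {(a, b): a + b
                      for a in range(max_operand + 1)
                      for b in range(max_operand + 1)}

    def add(self, a: int, b: int):
        return self.table.get((a, b))  # None once we leave the memorized range

memorizer = LookupAdder(max_operand=999)  # one million entries: all 3-digit sums
print(memorizer.add(123, 456))            # 579 -- indistinguishable from "understanding"
print(memorizer.add(123456, 654321))      # None -- no generalization beyond the table
```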
re: the pirate ships, I worked on fluid simulation for feature films years ago, so I found this one particularly painful. I think the specific problem is that while Sora can do a convincing cup of coffee, and reasonably convincing pirate ships, it does not have any understanding of fluid simulation, or of the fact that you can't simply combine simulations from different physical scales. Fluid simulation does not look the same at coffee-cup scale as at battleship scale.
For it to look right, you'd want to simulate inch-long ships in a couple-of-inches cup of coffee.
Simulating real-size ships in coffee at ship scale would result in wave and foam behavior (e.g. bubble size) that would not be believable as "a cup of coffee"; they'd 100% read as "a lake of coffee".
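A rough way to put numbers on that scale argument (back-of-the-envelope figures of my own, not the commenter's): the Bond number compares gravity to surface tension, and surface tension is what sets bubble and foam size. It differs by roughly seven orders of magnitude between a coffee cup and a ship-sized body of water, so foam and wave behavior that looks right at one scale cannot read correctly at the other.

```python
# Rough scale comparison via the Bond number, Bo = rho * g * L^2 / sigma
# (gravity vs. surface tension). Surface tension sets bubble/foam size,
# so flow regimes with wildly different Bo simply don't look alike.
RHO = 1000.0   # kg/m^3, water-like liquid (rough)
SIGMA = 0.07   # N/m, water-like surface tension (rough)
G = 9.81       # m/s^2

def bond_number(length_m: float) -> float:
    return RHO * G * length_m ** 2 / SIGMA

for label, length in [("coffee cup (~4 cm)", 0.04), ("pirate ship (~100 m)", 100.0)]:
    print(f"{label}: Bo ~ {bond_number(length):.1e}")
# coffee cup (~4 cm): Bo ~ 2.2e+02
# pirate ship (~100 m): Bo ~ 1.4e+09
```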
See https://cdn.openai.com/sora/videos/tiny-construction.mp4 for the video, magical quality and such strange physics.
By their own words, what they have done is use GPT-4 to prompt-engineer DALL-E 3. There is no attempt to include any form of explicit world knowledge or physics engines.
It is all based on the intellectually lazy process that OpenAI is so good at: "Let's just use all the data we can steal or scrape, throw it at a learning engine, expend enormous amounts of compute, and perhaps some emergent properties will just magically happen."
They then hype the offering as being able to provide simulation systems. Not a chance, until they include real-world knowledge, meaning, and physics engines in the tool - which, of course, is going to be quite difficult.
On their research page, they show some of these wildly wrong videos.
It would be utterly fascinating to see how these supposed simulations of the physical world compare to human visual intuition. It should be possible to test this, right? Have humans vs. Sora "autocomplete" the motion of an object or person in a snippet of video. At this stage, I think a 3-year-old who can understand instructions could probably do better than Sora. So could many animals, I'd think.
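That test could be set up along these lines (a minimal sketch only; load_clip and the predictor functions are hypothetical stand-ins, since there is no public Sora API): split a clip into a context segment and a held-out continuation, ask each predictor - human or model - for the continuation, and score it against the real frames.

```python
import numpy as np

def score_continuation(true_frames: np.ndarray, predicted_frames: np.ndarray) -> float:
    """Lower is better: mean per-pixel squared error over the held-out frames."""
    return float(np.mean((true_frames.astype(float) - predicted_frames.astype(float)) ** 2))

def run_trial(clip: np.ndarray, context_frames: int, predictor) -> float:
    """Give the predictor the first `context_frames` frames; score its continuation."""
    context, ground_truth = clip[:context_frames], clip[context_frames:]
    prediction = predictor(context, num_frames=len(ground_truth))
    return score_continuation(ground_truth, prediction)

# Hypothetical usage (load_clip, human_predictor, sora_predictor are placeholders):
#   clip = load_clip("ball_rolling.mp4")   # array of shape (frames, H, W, 3)
#   run_trial(clip, context_frames=30, predictor=human_predictor)
#   run_trial(clip, context_frames=30, predictor=sora_predictor)
```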
What confuses me is why anyone would expect 2D video to be sufficient for generating a good simulation/representation of the 3D world, based on this kind of training alone. Depth information in the videos Sora must be trained on is all over the place, so why would it magically intuit correct 3-dimensional physics from this?
So much magical thinking, so little time
And real-world physics and material PHENOMENA can *****NEVER***** be intuited solely from data - not even ALL the data in the universe! If that were possible, we could save money, space and effort by shutting down JPL, CERN, and every high school and college physics/chem/biol/... lab ever - let's just capture video and decipher nature from it :)
Causality 101
Right? 'Naive/qualitative physics' used to be a thing in 80s AI but never took off.
Sora: AI so dangerous only OpenAI can save us. Why do I feel like we've heard this one before....
Agree. The issues seem similar to the 'understanding' challenges in LLMs: no model of the world. As Gary indicates, without one it seems the intelligence will not be able to get any deeper.
These tools are not a path toward a representation of the physical world but toward the generation of a fake one - a fake world on the internet in which we are going to be drowned and lost, which will disconnect us from the real one, which will replace the true one. It is simply frightening.
In fact, some people will definitely turn away from social media on the internet, but is there really a global trend toward using electronic media less? I am afraid not. The internet's power of attraction is very strong; more and more people seem hypnotized, addicted to getting new content - any content, reliable or not, at any moment - and all of this is amplified by smartphones.
I agree with the prospect of two groups: a group of people who are almost disconnected, and a group of people who are constantly connected and also totally dependent.
Many, many issues. I also question what the actual quality will be. It's one thing to drop your best demos. It's different once real people use it. I personally found some of the demo quality to be wonky CGI, a la the pirate ships.
I was also struck by the flame neither burning the cute monster nor flickering when the monster touched it - they appeared to be on different planes of reality, as in a dream. The same problem occurs when Claude writes a memo for me: it has ideas untethered by facts, and it often takes longer to correct and rewrite than to just do it myself - and it also lacks a sense of audience, of course. The sense is of the creation of something parallel, a simulacrum.
good catch
I see no spatial understanding whatsoever here, nor can I imagine how it could even emerge from how this model was created. I do see nice image generation, and am very impressed. Clearly this moves a lot of things in our world (not necessarily in a healthy direction).
As far as I can tell, this is fundamentally not in the ballpark where AGI is lurking somewhere. It's not even the same universe. To achieve AGI, they'd need to go back to the drawing board and start from scratch.
💯
The first, most obvious thing I noticed: everyone is wearing sunglasses. Which means they still can't model eyes, much less the world.
Even if they could model eyes properly, 30 sequential images, or 300, or 30,000, is not a model of the world. It's a movie. And as Hollywood has proven, you don't need to be intelligent to make a movie.
What in the world is this? Poor man's Pixar?
Hi Gary, another nice article! "Create a clip where there is absolutely no elephant in the room" - if it can't generate a correct static image, there is no sudden reason why it would create a correct video. Obviously, video > image > text when it comes to impressiveness, but as long as the underlying architecture is the same, no miraculous jump to reality will happen; it's all still blind numerical computation of pixel mashups - now in multiple frames.
Obviously it can be useful, can be fun, can be dangerous, etc. But the Emperor still has no clothes.
In the Tokyo short, at about 50% (don’t have a timestamp) there’s a glitch with the woman’s feet. It looks like when you jump a bit in order to walk in step with someone else, except that her pace remains uncannily smooth and steady.
Yes, there are 2 consecutive skip-steps starting at 0:28. At 0:08 the person with a white handbag, behind her to the right on the screen, takes two consecutive steps with her left leg. A neat trick, if you can learn it. Or if you have the natural talent of a Michael Jackson...
The scale of the walking people is completely wrong - look at the woman on the left about halfway through, and the two people in masks at the end - and they float onto the path at the start.
Gary, my concern is not so much about the errors in generated images as about what those errors mean - namely, that generated text could be rife with misstatements and factual errors, yet be presented and, worse, accepted as authoritative. It's a repeat of the phenomenon I observed when I first worked on an Apple II Plus: a widespread tendency to accept as fact "what the computer says."
Gary, have you seen this? https://modernity.news/2024/02/17/leading-scientific-journal-publishes-fake-ai-generated-paper-about-rat-with-giant-penis/
it’s now been retracted, but surely a sign of things to come
The fallacy is thinking that a knowledge base can be built out of quantifiable units ranked by popularity. When did we forget the meaning of GIGO?