56 Comments
Matt Ball:

1. I think comparing where this was a year ago to where it is now (as MKBHD did) is pretty flabbergasting.

2. I continue to not understand *at all* how anyone is saying AGI. I don't have any mystical view of human brains, and I don't think highly of our "rational" abilities, but there is just so clearly no hint of "understanding," let alone general intelligence.

Bruce Cohen:

It’s not like we don’t have good ideas about how to architect systems with global knowledge or an understanding of physics. See, for instance, Danny Bobrow’s book Qualitative Reasoning about Physical Systems. It’s been moderately amusing and very irritating to watch the swing of the AI pendulum from “Perceptrons Good” to “Perceptrons Bad”, then a decade or so later “Connectionism Good”, and the slow ramp to the mid-2010s and the sudden takeoff of “Connectionism is All”. The result has been the enshrining of data as the basis of intelligence and the lack of recognition of the need for knowledge and understanding. AGI it ain’t, and never can be.

Jim Carmine:

Just excellent!!! So then compare where it does extremely well with where it puts unicorn horns through heads and makes 7-sided chess boards. Where is it great? Where is it nuts? What is the difference between these situations? Where is the cliff into lunacy, the trigger cases for crazy? There again seems to be something analogous to the uncanny valley. So close, so close... oh my God, awful.

A Thornton:

These silly little toy systems do not "make things up." That is the Anthropomorphism Fallacy: projecting uniquely human qualities onto something that isn't human. They follow their programming and spew output.

Roumen Popov:

“perhaps better called failed approximations, as Gerben Wierda has pointed out” - exactly! Because it is practically impossible to capture the full joint probability distribution of the data (its complexity is exponential in the sequence length), so-called GenAI uses autoregressive models to approximate that distribution. By conditioning on the previously generated tokens, the distribution of the next token becomes one-dimensional and much easier to approximate. However, the trade-off is that such an approximation is not very accurate and is very brittle, because it does not guarantee any bounds on the error. Hallucinations are simply the result of the autoregressive approximation failing in a very notable way. They are not a bug but indeed a fundamental feature of the autoregressive approximation.
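
A compact way to state the approximation being described (a sketch in standard notation; the symbol p_theta for the learned model is my own shorthand, not something from the comment):

```latex
% The intractable joint over a length-T token sequence is replaced by a
% product of per-token conditionals, each fit by the learned model p_\theta.
\[
  p(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_{<t})
  \;\approx\; \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})
\]
% Nothing in this factorization bounds how far the product of learned
% conditionals can drift from the true joint: an error in an early
% conditional propagates into every later conditioning context.
```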

James He:

A very good point that these systems don’t have internal models - they might have statistical patterns, but localised to token-by-token, pixel-by-pixel, frame-by-frame, I imagine? On that point, what is the mathematical difference between capturing a statistical pattern and having a world model?

For example, does the vector corresponding to “cat” in the embedding space actually represent “cat-ness” relative to other tokens - or perhaps it does, just that mere relative representation is insufficient to construct a world model? What is a world model mathematically speaking? A graph of vectors?

Apologies for the disorganised thoughts, very interesting things to wrap my head around.
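
One way to make the question concrete (a toy sketch of my own; the vectors, the entities and the can_act helper are all invented for illustration): an embedding gives you graded similarity distilled from statistics, while even a crude explicit model carries typed facts that support queries the similarity geometry alone does not settle.

```python
# Toy contrast between "statistical pattern" and "world model" (illustrative only).
import numpy as np

# Hypothetical 4-d embeddings: "cat" sits near "dog" in this space.
emb = {
    "cat":  np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":  np.array([0.8, 0.2, 0.2, 0.1]),
    "sofa": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(a, b):
    # Graded similarity: all the embedding itself offers.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))   # high: statistically "cat-like"
print(cosine(emb["cat"], emb["sofa"]))  # low

# A minimal explicit model of the same entities: typed facts rather than geometry.
world = {
    "cat":  {"is_a": "animal",    "animate": True,  "can": {"chase", "jump"}},
    "sofa": {"is_a": "furniture", "animate": False, "can": set()},
}

def can_act(entity: str, action: str) -> bool:
    # Answers a question about agency that relative position in
    # embedding space does not, by itself, answer.
    return action in world.get(entity, {}).get("can", set())

print(can_act("cat", "chase"))   # True
print(can_act("sofa", "chase"))  # False
```

Whether a large model's internal states amount to the second kind of structure, rather than a very rich version of the first, is exactly the open question.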

Gerben Wierda:

There seem to be two schools.

(1) even correct approximations do not signal actual understanding

(2) if the approximation is as good as the result of understanding, it *is* understanding (I suspect Sutskever and Hinton hold this; LeCun is more guarded nowadays).

(2) is often combined with 'humans make errors too' (so: because GenAI makes errors, and we make errors, and we understand, therefore GenAI understands, which is a nasty fallacy) and, often implicitly, with 'our neurons also approximate'.

There is a fundamental difference between the two, though. I've actually seen a child do 'the approximation thing' on arithmetic and spelling, being speedy, getting by, but making very regular errors, until 'the penny dropped' and the understanding was there and the errors were gone. A different mechanism than approximation had been activated. GenAI doesn't have 'the penny drops'. And that makes it incapable of understanding errors in the first place, because it has no reference to decide if something is an error.

There is a (3): even if (2) were true, the technology is incapable of getting the approximation close enough to mimic understanding so well that we would see it as understanding, because that would require an unbelievable scale. (3) is supported by OpenAI's own numbers.

[Comment removed, Feb 17, 2024]
James He:

I guess the question is, what is this instinctive world model mathematically? As in, we humans are also only “trained” on pictures (and touch and sound for those lucky) of cats. Do we only have statistical patterns of cats in our minds, or is our world model of cats something other than statistical pattern? If so, what is the mathematical representation of it? Bear in mind that “cat” can be swapped out for anything knowable in our mind.

Fabian Transchel:

"As in, we humans are also only “trained” on pictures (and touch and sound for those lucky) of cats."

No, we are not. If we were, we'd almost certainly make the same errors. We are trained on the *experience* of cats, which includes pictures but also touch, smell, hearing and the spatio-temporal alignment of cat behaviour. Observing a cat chasing a mouse with our own senses will give us much more information than any of the individual data channels could. A cat as a “world concept” has agency, and this agency becomes apparent only through the cat's embodiment as well as your own (embodied) ability to take the cat's perspective. This is very apparent, as Gary and Gerben correctly point out, when there is an agentic discontinuity in the depiction.

Take the following example: the mouse gets into its burrow and the cat *decides* to stop waiting. The *cause* of the cat's behaviour might be dozens or (as is the case with patient, proficient, or just hungry cats) hundreds of frames in the past - i.e. suppressed exponentially by all the irrelevant but spatio-temporally closer data. In order to solve this, the attention mechanism (which has already been sparsified, to no avail) would have to grow *faster* than the input space. This - and I don't say this lightly, but only after substantial consideration of the actual physical state space generating the training data - is fundamentally physically impossible, going all the way back to the First Law of Thermodynamics: at a certain point you have so many possible situations that might have generated your observed situation that no amount of attention can dissolve (or, in the case of GenAI, diffuse!) them. It's the same with any kind of prediction system: it may work well locally (say, the weather forecast), but on a longer horizon (i.e. a week or two) it just cannot do anything.

(Yes, I know, weather forecasting has also been improved by Transformers, but if you look closely enough, these systems share the same challenge: in that domain we would call a global, rather than local, transition a phase shift, and when one occurs, the models are regularly *worse* than classical approaches, because they make the same mistake: assuming that no (or slightly perturbed) change is more likely than big change - until it isn't, of course.)

The takeaway (I guess) is the following, and Gary has it right here: these systems are good at local approximations, but global approximations are either (in the specific cases laid out) impossible or so costly that getting the last few percent needed to no longer be "uncanny" is not economically viable.

When considering AGI, it usually comes down to this: the assumption (simplifying the so-called scaling hypothesis) is that you have unlimited compute, unlimited energy and unlimited training data. None of these can be true; they are physically impossible to realize (if only because actual large-scale parallel computing will run into latency problems[*]).

[*] The latency problem: say you can arrange a hardware setup that puts out as much compute as you want - you then produce heat corresponding to those computations at *least* at the scale of the Landauer principle; it is small, but finite. Now, in order to have your compute sections speak with each other in a synchronized manner, they must not be too far apart. At ~4.0 GHz, the speed of light dictates that to stay within one clock cycle (~0.25 ns) your hardware must fit within a cube of ~7.5 cm. If you power that up without sufficient cooling, it vaporizes instantly; the only theoretically possible solution is to put that kind of computation in a (physically) degenerate neutron star, with all the pros and cons (the cons obviously outweigh the pros). My conjecture: some time in the future we will find that the optimal configuration for general intelligence is a design about the size of the human brain; if it is too small, the capacity is too narrow; if it is too large, it cannot function due to latency/synchronicity problems. (We see this with folks like Nietzsche and Dostoevsky: too much brain volume increases the likelihood of epileptic seizures...)
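
The two physical figures in that footnote are easy to sanity-check. A minimal sketch (plain Python, standard constants; purely illustrative, not taken from the comment) reproducing the ~7.5 cm light-travel distance per 4 GHz clock cycle and the Landauer bound at room temperature:

```python
# Back-of-the-envelope check of the figures in the footnote above.
import math

C = 299_792_458        # speed of light, m/s
K_B = 1.380649e-23     # Boltzmann constant, J/K

clock_hz = 4.0e9               # ~4 GHz clock
cycle_s = 1.0 / clock_hz       # one clock period, 0.25 ns

# Maximum separation a signal can cross within one clock cycle.
max_distance_m = C * cycle_s
print(f"Light travel per cycle at 4 GHz: {max_distance_m * 100:.1f} cm")   # ~7.5 cm

# Landauer limit: minimum energy to erase one bit at temperature T.
T = 300.0                      # room temperature, K
landauer_j = K_B * T * math.log(2)
print(f"Landauer limit at 300 K: {landauer_j:.2e} J per bit")              # ~2.9e-21 J
```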

Simon Au-Yong:

I showed the unicorn picture to my kids. They noticed the hand first; then, a full three seconds later, came a loud shriek when they spotted the horn! Thanks Gary - your research is a great teaching tool about AI mistakes (and helps enrich my kids' general knowledge too)! =D

Gary Marcus:

Wait til you show them Sora videos :)

Richard Self:

And a bit later, the main woman's legs pass through each other during a step.

Also note the woman dressed in white a bit behind the main one, and notice how her legs do odd things.

Note also, I suspect that OpenAI consider this video to be their flagship demonstration of how wonderful Sora is, in spite of these really important flaws that will not impress film directors.

Liam Scott:

The monkey’s face resets entirely about midway through the video, too. Right about when he turns his head to look away from the viewer for the second time. This doesn’t speak to a great understanding of object permanence to me.

Liam Scott:

*not the second time, sorry. You’ll notice his eyes go from being closed to open without the act of his lids actually opening, though.

Richard Self:

From what OpenAI have said, it is a kludge of GPT4 that takes the human "natural language" prompt and then turns it into the full set of prompts and inclusion files that are required to drive Dall-E 3. Nice and simple and reuses their expertise.

We have argued to death the failings and minimal capabilities of GPT4 in terms of world knowledge and physics and engineering. We are now seeing how the Dall-E 3 diffusion engine places the "patches" in the canvas, then refines the quality of the pixel content, and loses its marbles.

The OpenAI explanation is clear about the process: laying out the overall canvas for the video, then using patches and next-patch prediction, and then some form of next-pixel prediction in an iterative process until the quality is good enough.

So, as we understand GPT4 and Dall-E 3, we know the limitations of Sora, and we also now know how little we can expect from it.
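
For readers wondering what the "patches" in this description refer to: below is a minimal sketch (my own, with invented shapes and function names) of cutting a video tensor into spacetime patches, the representation OpenAI's Sora technical report describes. It illustrates the data layout only and claims nothing about how the model itself is wired.

```python
# Illustrative only: turn a video into flattened "spacetime patch" tokens.
import numpy as np

def video_to_patches(video: np.ndarray, pt: int = 2, ph: int = 16, pw: int = 16) -> np.ndarray:
    """video: (T, H, W, C) array -> (num_patches, pt*ph*pw*C) array of patch tokens."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dimensions must divide evenly"
    # Split each axis into (blocks, block_size), then group the block axes together.
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)     # (Tb, Hb, Wb, pt, ph, pw, C)
    return v.reshape(-1, pt * ph * pw * C)   # one flattened token per spacetime patch

# Example: 16 frames of 128x128 RGB -> 8 * 8 * 8 = 512 patch tokens of length 1536.
clip = np.random.rand(16, 128, 128, 3).astype(np.float32)
print(video_to_patches(clip).shape)          # (512, 1536)
```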

Saty Chary:

Good one :)

Pixel-by-pixel calculations, or word-by-word ones, using numerical embeddings as inputs, have NO meaningful relation to truth!

Any truth that does emerge (and a lot does) is solely on account of locality in the embedding (e.g. word order, chessboard squares) - it is incidental, not purposeful; it is an artifact of the computation on the data, not reasoned.

Multi-modal 'hallucination', especially in visual form, exposes the underlying absurdity of it all - the magical wishing for meaning.

Kent:

The pieces are also way too big for the board -- the bottoms of the pieces wouldn't fit within the chess squares. That's definitely not something that has ever been seen in the training set.

Also the white king at far left appears to be resting both on the board (which is maybe 1/2 inch thick?) and on the table below the board at the same time -- but somehow is not tilted. The chess board both does and does not have thickness. A little like the girl with both her hand and her feet on the surfboard.

Robert Keith:

Gary - I work in the film industry in Vancouver. I am watching those who have bought into the generative AI hype very carefully, and am frankly a bit flabbergasted at how little critical thinking they are putting into Sora's announcement - preferring instead to declare it "the death of Hollywood," etc. Many cite producer Tyler Perry's recent announcement that he is going to put an $800M expansion of an Atlanta studio facility on hold.

What follows below is a recent comment I left under a YouTube video. I doubt many will read it all (and fewer still will accept it), but it's my analysis based upon both my knowledge of the film industry, and reading the evaluations of AI experts such as yourself >

There appear to be some vast misconceptions about how this generative AI technology works (or more precisely, doesn't work).

For quick clips, commercials, video bites, animation, VFX augmentation, pre-vis, and graphic and production design work, yes, this will definitely have a major impact. However, making entire films and television series is a whole other level of complexity, one that scaling and advancing the current generative models will probably not be able to reach. Meanwhile, it doesn't matter how good you are at "super prompting"; with current generative AI models you cannot get consistent, precise, repeatable results every time, for all the various matching shots you would need in an episodic series or motion picture.

I'm a bit surprised folks aren't reading more deeply about how the underlying generative AI models work, frankly. Even the experts designing them don't fully understand what's going on inside the black box, and after years of gradually developing these models they are no closer to solving the hallucination problem than they were at the outset. The photorealism is getting better, but the AI's understanding of the physicality of the real world is not improving at all. And, no, RAG (Retrieval Augmentation Generation) is probably not gonna fix the underlying problem, either.

The bottom line is that the current model designs of generative AI, while impressive on the surface, are fraught with unreliability and inconsistency. And those problems don't seem to be getting resolved. And that's anathema to major motion picture or series production.

And that's not getting into the myriad other technical, logistical, legal, economic and sociocultural hurdles that this technology is beginning to face.

As to the "democratization" argument, folks have been saying that for years, every time a new technology arrived. It was said about digital photography, it was said about capturing 4K video on phones, it's been repeated ad nauseam about YouTube. But what everybody forgets about all this "empowerment" is that you end up with a lot more noise, and it just becomes that much harder to find the signal. How many YouTubers, for example, ever make it to one million subs? You still have to separate yourself from the crowd.

One thing Hollywood already has, is the entrenched distribution channels and PR machinery. And that's not going anywhere anytime soon. The other competitive advantage Hollywood enjoys revolves around a word you're going to hear a lot more of as fakery permeates every corner of the Internet: "Authenticity." In other words, the capability to make movies and series that star real human beings interacting in real environments. Authenticity will become the new currency of the realm as fakery becomes pervasive...and quickly maligned.

Again, people aren't investigating this generative AI stuff carefully enough. They're just buying all the hype. And, of course, companies like OpenAI love that, as it makes their valuations go through the roof.

Robert:

I don't know if anyone has mentioned it, but in the walking woman video there is a major continuity problem after the cut to close up: her dress has new designs at the top and her left lapel is now twice as long.

As for monkeys playing chess, this rash of "amusing" animal videos is revealing a disturbing lack of ethics at OpenAI. There is a particularly grotesque animal video called Bling Zoo showing a tiger agonizing in a cramped cage, while in a second shot a turtle eats a string of diamonds, something that would kill a real animal. No sane person would ever think to make, let alone showcase, a real video like this.

Simon Au-Yong:

This type of arcade-like F1 driving should have resulted in a fiery explosion. Check out the hallucinated mirrors too.

https://youtube.com/shorts/1On4rEBbwcw?si=nCRDPzhkIFF8S1Vr

Gary Marcus:

Is it really Sora? Looks very different from all the others (and has sound).

Simon Au-Yong:

It's right from OpenAI's official TikTok account. My bad, I should have posted the direct link (instead of the YouTube one): https://vt.tiktok.com/ZSFkWc1Xt/

Looks like they've added audio as well.

Also - kudos to OpenAI for acknowledging their mistakes. I wonder if this was prompted by your article, Gary =D

https://www.tiktok.com/@openai/video/7339740660116835627

Anyways:

Do you mind explaining further what you mean by this:

“Rather, Sora is failing to apprehend a cultural regularity of the world, despite ample evidence.“
