57 Comments

1. I think comparing this to where things stood a year ago (as MKBHD did) is pretty flabbergasting.

2. I continue to not understand *at all* how anyone is saying this is AGI. I don't have any mystical view of human brains, and I don't think highly of our "rational" abilities, but there is just so clearly no hint of "understanding," let alone general intelligence.


It’s not like we don’t have good ideas about how to architect systems with global knowledge or an understanding of physics. See, for instance, Danny Bobrow’s book Qualitative Reasoning about Physical Systems. It’s been moderately amusing and very irritating to watch the swing of the AI pendulum from “Perceptrons Good” to “Perceptrons Bad”, then a decade or so later to “Connectionism Good”, and the slow ramp through the mid-2010s to the sudden takeoff of “Connectionism is All”. The result has been the enshrining of data as the basis of intelligence and a lack of recognition of the need for knowledge and understanding. AGI it ain’t, and never can be.


Just excellent!!! So then compare where it does extremely well with where it puts unicorn horns through heads and produces 7x7 chess boards. Where is it great? Where is it nuts? What is the difference between these situations? Where is the cliff into lunacy, the trigger cases for crazy? There again seems to be something analogous to the uncanny valley. So close, so close... oh my God, awful.


These silly little toy systems do not "make things up." That is the Anthropomorphism Fallacy: projecting uniquely human qualities onto something that isn't human. They follow their programming and spew output.


"perhaps better called failed approximations, as Gerben Wierda has pointed out" - exactly! Because it is practically impossible to capture the full joint probability distribution of the data (as it has exponential complexity in the number of data points), the so called GenAI uses autoregressive models to approximate that distribution. By conditioning on the previously generated tokens the distribution of the next token becomes one dimensional and much easier to approximate. However, the trade-off is that such an approximation is not very accurate and is very brittle because it does not guarantee any bounds on the error. Hallucinations are simply the result of the autoregressive approximation failing in a very notable way. They are not a bug but indeed a fundamental feature of the autoregressive approximation.


A very good point that these systems don’t have internal models - they might have statistical patterns, but localised to token-by-token, pixel-by-pixel, frame-by-frame, I imagine? On that point, what is the mathematical difference between capturing a statistical pattern and having a world model?

For example, does the vector corresponding to “cat” in the embedding space actually represent “cat-ness” relative to other tokens - or perhaps it does, but mere relative representation is insufficient to construct a world model? What is a world model, mathematically speaking? A graph of vectors?

Apologies for the disorganised thoughts, very interesting things to wrap my head around.
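A minimal sketch of what "relative representation" buys you, using hand-picked toy vectors (illustrative only, not any real model's embeddings): all an embedding encodes is nearness to other tokens; nothing in it says a cat is an embodied agent subject to physics.

```python
import numpy as np

# Toy, hand-picked 3-d "embeddings" (purely illustrative, not from a real model).
emb = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "tiger": np.array([0.8, 0.9, 0.2]),
    "car":   np.array([0.1, 0.2, 0.9]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: all the 'cat-ness' an embedding encodes is relational."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["tiger"]))  # high: distributionally similar tokens
print(cosine(emb["cat"], emb["car"]))    # lower: distributionally dissimilar
# Nothing here says a cat is an animate agent with a body that obeys physics;
# that would require structure beyond pairwise similarity.
```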


There seem to be two schools.

(1) even correct approximations do not signal actual understanding

(2) if the approximation is as good as the result of understanding, it *is* understanding (I suspect Sutskever and Hinton hold this view; LeCun is more guarded nowadays).

(2) is often combined with 'humans make errors too' (so: because GenAI makes errors, and we make errors yet understand, GenAI must understand too — a nasty fallacy) and — often implicitly — with 'our neurons also approximate'.

There is a fundamental difference between the two, though. I've actually seen a child do 'the approximation thing' on arithmetic and spelling, being speedy, getting by, but making very regular errors, until 'the penny dropped' and the understanding was there and the errors were gone. A different mechanism than approximation had been activated. GenAI doesn't have 'the penny drops'. And that makes it incapable of understanding errors in the first place, because it has no reference to decide if something is an error.

There is a (3): even if (2) were true, the technology is incapable of reaching the degree of approximation required to mimic understanding so closely that we would see it as understanding, because that requires an unbelievable scale. (3) is supported by OpenAI's own numbers.


I guess the question is, what is this instinctive world model mathematically? As in, we humans are also only “trained” on pictures (and touch and sound for those lucky) of cats. Do we only have statistical patterns of cats in our minds, or is our world model of cats something other than statistical pattern? If so, what is the mathematical representation of it? Bear in mind that “cat” can be swapped out for anything knowable in our mind.


"As in, we humans are also only “trained” on pictures (and touch and sound for those lucky) of cats."

No, we are not. If we were, we'd almost certainly make the same errors. We are trained on the *experience* of cats, which includes pictures but also touch, smell, hearing and the spatio-temporal alignment of cat behaviour. Observing a cat chasing a mouse with our own senses gives us much more information than any of the individual data channels could. A cat as a "world concept" has agency, and this agency becomes apparent only through the cat's embodiment as well as your own (embodied) ability to take the cat's perspective. This is very apparent - as Gary and Gerben correctly point out - when there is an agentic discontinuity in the depiction. Take the following example: the mouse gets into its burrow and the cat *decides* to stop waiting. The *cause* of the cat's behaviour might be dozens or (as is the case with patient, proficient, or just hungry cats) hundreds of frames in the past - i.e. suppressed exponentially by all the irrelevant but spatio-temporally closer data. In order to solve this, the attention mechanism (which has already been sparsified, to no avail) would have to grow *faster* than the input space. This - and I don't say this lightly, but only after substantive consideration of the actual physical state space generating the training data - is fundamentally physically impossible, going all the way back to the First Law of Thermodynamics; at a certain point there are so many possible situations that might have generated your observed situation that no amount of attention can dissolve (or, in the case of GenAI: diffuse!) them. It's the same with any kind of prediction system: they may work well locally (say, the weather forecast), but on a large scale (i.e. a week or two out), they just cannot do anything.

(Yes, I know, weather forecasting has also been improved by Transformers, but if you look closely enough, these systems share the same challenge: in that domain we would call a global, rather than local, transition a phase shift, and when phase shifts occur the models are regularly *worse* than classical approaches, because they make the same mistake: assuming that no change (or only slightly perturbed change) is more likely than big change - until it isn't, of course.)

The takeaway (I guess) is the following, and Gary has it right here: these systems are good at local approximations, but global approximations are either (in the specific cases laid out) impossible or so costly that getting the last few percent needed to no longer be "uncanny" is not economically viable.
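A hedged way to see the cost point above (my own toy arithmetic, assuming a plain, unsparsified attention layer and an illustrative 256 patch tokens per frame): the score matrix of naive self-attention grows quadratically with context length, so pushing the relevant "cause" hundreds of frames into the past multiplies compute and memory very quickly.

```python
# Minimal sketch (illustrative, not any production system): rough cost of one
# naive self-attention head as the video context grows.
def attention_cost(num_tokens: int, head_dim: int = 64) -> tuple[int, int]:
    """Return approximate FLOPs and score-matrix entries for one attention head."""
    flops = 2 * num_tokens * num_tokens * head_dim  # QK^T plus the weighted sum over V
    score_entries = num_tokens * num_tokens         # full attention matrix, no sparsification
    return flops, score_entries

for frames in (10, 100, 1000):                      # hypothetical clip lengths
    tokens = frames * 256                           # assumed patch tokens per frame
    flops, scores = attention_cost(tokens)
    print(f"{frames:>5} frames -> {tokens:>7} tokens, "
          f"{scores:.2e} score entries, {flops:.2e} FLOPs")
```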

When considering AGI, it usually comes down to this: the assumption (simplifying the so-called scaling hypothesis) is that you have unlimited compute, unlimited energy and unlimited training data. None of these can be true; they are physically impossible to realize (if only because actual large-scale parallel computing will run into latency problems[*]).

[*] The latency problem: say you can arrange a hardware setup that puts out as much compute as you want - you still have heat corresponding to those computations at *least* at the scale of the Landauer principle; it is small, but finite. Now, in order to have your compute sections speak with each other in a synchronized manner, they must not be too far apart. At ~4.0 GHz, the speed of light dictates that to keep latency within one clock cycle (~0.25 ns, which is already a generous budget) your hardware must fit within a cube of roughly 7.5 cm. If you power that up without sufficient cooling, it vaporizes instantly; the only solution theoretically possible is to put that kind of computation in a (physically) degenerate neutron star - with all its pros and cons (the cons obviously outweigh the pros). My conjecture: some time in the future we will find that the optimal configuration for General Intelligence is a design about the size of the human brain; if it is too small, the capacity is too narrow; if it is too large, it cannot function due to latency/synchronicity problems. (We see this with people like Nietzsche and Dostoevsky: too much brain volume increases the likelihood of epileptic seizures...)
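For reference, a back-of-the-envelope version of that bound using the commenter's 4.0 GHz figure (my arithmetic, assuming signals at the speed of light):

$$
\tau = \frac{1}{f} = \frac{1}{4.0 \times 10^{9}\ \mathrm{Hz}} = 0.25\ \mathrm{ns},
\qquad
d = c\,\tau \approx \left(3 \times 10^{8}\ \mathrm{m/s}\right)\left(2.5 \times 10^{-10}\ \mathrm{s}\right) \approx 7.5\ \mathrm{cm}.
$$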


I showed the unicorn picture to my kids. They noticed the hand first; then, a full three seconds later, came a loud shriek when they spotted the horn! Thanks Gary - your research is a great teaching tool about AI mistakes (and helps enrich my kids' general knowledge too)! =D


Wait til you show them Sora videos :)


And a bit later, the main woman's legs pass through each other during a step.

Also note the woman dressed in white a bit behind the main one, and notice how her legs do odd things.

Note also that I suspect OpenAI consider this video to be their flagship demonstration of how wonderful Sora is, in spite of these really significant flaws that will not impress film directors.


The monkey’s face resets entirely about midway through the video, too. Right about when he turns his head to look away from the viewer for the second time. This doesn’t speak to a great understanding of object permanence to me.


*not the second time, sorry. You’ll notice his eyes go from being closed to open without the act of his lids actually opening, though.


From what OpenAI have said, it is a kludge built on GPT-4 that takes the human "natural language" prompt and then turns it into the full set of prompts and inclusion files required to drive DALL-E 3. Nice and simple, and it reuses their expertise.

We have argued to death the failings and minimal capabilities of GPT-4 in terms of world knowledge, physics and engineering. We are now seeing how the DALL-E 3 diffusion engine places the "patches" on the canvas, then refines the quality of the pixel content and loses its marbles.

The OpenAI explanation is clear about the process: lay out the overall canvas for the video, then use patches and next-patch prediction, and finally some form of next-pixel prediction in an iterative loop until the quality is good enough.
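As a rough illustration of the "expand the prompt, then iteratively refine patches" pipeline described above - every function, name and shape here is hypothetical, a toy sketch rather than OpenAI's actual architecture:

```python
import numpy as np

def expand_prompt(user_prompt: str) -> str:
    # Stand-in for the LLM step that rewrites a short prompt into a detailed one.
    return f"Detailed scene description derived from: {user_prompt}"

def denoise_step(patches: np.ndarray, step: int, total: int) -> np.ndarray:
    # Stand-in for one refinement pass; a real diffusion model predicts and removes noise.
    noise_scale = 1.0 - (step + 1) / total
    return 0.9 * patches + 0.1 * noise_scale * np.random.randn(*patches.shape)

def generate_video(user_prompt: str, frames=16, patches_per_frame=64, steps=50):
    _ = expand_prompt(user_prompt)                              # would condition the denoiser
    latents = np.random.randn(frames, patches_per_frame, 128)   # start from pure noise
    for step in range(steps):
        latents = denoise_step(latents, step, steps)            # iterative refinement
    return latents                                              # would be decoded to pixels

video_latents = generate_video("a monkey playing chess in a park")
print(video_latents.shape)  # (16, 64, 128)
```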

So, as we understand GPT-4 and DALL-E 3, we know the limitations of Sora, and we also now know how little we can expect from it.


"I am not complaining, mind you, about the fact that a monkey might pose in front of a chessboard, but about the chessboard itself: 3 kings, only one on the board, and a 7x7 rather than nearly universal 8x8 chess board (to say nothing of the pawn structure). Utterly impossible in chess—and presumably nowhere in the data set, either. Yet it rendered the image photorealistically."

That's astounding.

Superb analysis, Gary. Even a layperson like myself can follow it %-0

"It’s probably not fair to blame the weird board on a lack of data. Judging by the quality of its images, Sora was trained on an immense amount of data, including data about how things change over time. There are plenty of videos of people playing chess out there on the web. Chess boards don’t morph from 8x8 to 7x7 in the real world, and there are probably tons of 8x8 boards in the databases - and few if any 7x7. What Sora is doing is not a direct function of its training data."

So the generative AI program is "learning", in some sense. It's just that it's inertially prompted to ask the wrong questions, and then--correct me if I'm wrong--to accept the answers without question (because when is the last time a computer ever rejected any input that was formally correct?), and then act on what it has "learned." With the preconceived (if not relevant or proper) set of learned parameters in place, the rest is up for grabs.

Although, hmm, what's more formally correct than the logic of a chess game? If AI had any spark of its own, the program would have gotten in touch with its uncanny affinity for formal logic, and then roamed the web to scrape Deep Blue (or what have you) at least far enough to "know" that there's no possible way for the game of chess to work with only 7x7 squares (or with three kings!).
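The point about formal checkability is easy to make concrete - a minimal sketch (my own toy check, not part of any existing system) of the kind of hard constraint a statistics-only image generator never applies:

```python
# The rules of chess are trivially machine-checkable, which is exactly what a
# purely statistical image generator never does before rendering a board.
def plausible_chess_position(board_size: tuple[int, int], kings_per_side: dict) -> bool:
    rows, cols = board_size
    if (rows, cols) != (8, 8):                                       # boards are always 8x8
        return False
    return all(count == 1 for count in kings_per_side.values())     # exactly one king each

# The Sora chessboard as described in the post: 7x7 with three kings in the scene.
print(plausible_chess_position((7, 7), {"white": 2, "black": 1}))   # False
print(plausible_chess_position((8, 8), {"white": 1, "black": 1}))   # True
```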

I don't know what was in the prompts for that chimp-playing-chess-in-the-park pic, but if the word "chess" was a keyword, I would have expected at least some hint of "aha!" in response from a Major League Artificial Intelligence program.

Except that world-beating AI chess programs evidently don't have any innate spark of their own, either. They have no secrets to impart to a newbie AI program. AI chess programs--so I'm told--have found that the most effective way to win is a "brute force" approach that searches through every possible move and its consequences in order to calculate the next move. I think "brute force" is a particularly ungainly anthropomorphism for the approach. I think a better phrase is something like "probability field theory" - a better place to take the programming ideal, at least for a chess program.

Given a game with the formal outline and restrictions of chess, a computer finds the rules easy to train on. The algorithm just doesn't care if it never has to play another game of chess in its entire existence (however long THAT might be). Even a generative AI algorithm is indifferent, to everything. Indifferent. To everything.

The masterfully invulnerable AI chess program not only doesn't care about the game of chess, it doesn't know what "chess" is. It isn't going to challenge ChatGPT11.0 to a game of chess. Or vice versa, either.


🙏


Good one :)

Pixel-by-pixel calculations, or word-by-word ones, using numerical embeddings as inputs, have NO meaningful relation to truth!

Any truth that does emerge (and a lot does) is solely on account of its locality in the embedding (e.g. word order, chessboard squares) - it is incidental, not purposeful; it is an artifact of the computation on the data, not reasoned.

Multi-modal 'hallucination', especially in visual form, points up the underlying absurdity of it all - the magical wishing for meaning.


The pieces are also way too big for the board -- the bottoms of the pieces wouldn't fit within the chess squares. That's definitely not something that has ever been seen in the training set.

Also the white king at far left appears to be resting both on the board (which is maybe 1/2 inch thick?) and on the table below the board at the same time -- but somehow is not tilted. The chess board both does and does not have thickness. A little like the girl with both her hand and her feet on the surfboard.


Gary - I work in the film industry in Vancouver. I am watching those who have bought into the generative AI hype very carefully, and am frankly a bit flabbergasted at how little critical thinking they are putting into Sora's announcement - preferring instead to declare it "the death of Hollywood," etc. Many cite producer Tyler Perry's recent announcement that he is going to put an $800M expansion of an Atlanta studio facility on hold.

What follows below is a recent comment I left under a YouTube video. I doubt many will read it all (and fewer still will accept it), but it's my analysis based upon both my knowledge of the film industry, and reading the evaluations of AI experts such as yourself >

There appear to be some vast misconceptions about how this generative AI technology works (or, more precisely, doesn't work).

For quick clips, commercials, video bites, animation, VFX augmentation, pre-vis, and graphic and production design work, yes, this will definitely have a major impact. However, making entire films and television series is a whole other level of complexity that scaling and advancements using the current generative models will probably not be able to achieve. Meanwhile, it doesn't matter how good you are at "super prompting": with current generative AI models you cannot get consistent, precise, repeatable results every time...for all the various matching shots you would need in an episodic series or motion picture.

I'm a bit surprised folks aren't reading more deeply about how the underlying generative AI models work, frankly. Even the experts designing them don't fully understand what's going on inside the black box, and after years of gradually developing these models they are no closer to solving the hallucination problem than they were at the outset. The photorealism is getting better, but the AI's understanding of the physicality of the real world is not improving at all. And, no, RAG (Retrieval-Augmented Generation) is probably not going to fix the underlying problem, either.
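For readers unfamiliar with the acronym, here is a minimal sketch of the generic RAG pattern (hypothetical helper names, not any vendor's API). Note that it only prepends retrieved *text* to the prompt: it can supply missing facts, but it does not give the generator the physical world model whose absence is at issue here.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy keyword-overlap scorer; real systems use embedding-based vector search.
    words = set(query.lower().split())
    return sorted(corpus, key=lambda doc: -len(words & set(doc.lower().split())))[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    # Retrieval only augments the text context; the generator itself is unchanged.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "A standard chess board has 8x8 squares.",
    "Each side in chess has exactly one king.",
    "Vancouver has a large film production industry.",
]
print(build_rag_prompt("How many squares does a chess board have?", corpus))
# The augmented prompt would then be sent to the generative model as usual.
```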

The bottom line is that the current model designs of generative AI, while impressive on the surface, are fraught with unreliability and inconsistency. And those problems don't seem to be getting resolved. And that's anathema to major motion picture or series production.

And that's not getting into the myriad other technical, logistical, legal, economic and sociocultural hurdles that this technology is beginning to face.

As to the "democratization" argument, folks have been saying that for years, every time a new technology arrived. It was said about digital photography, it was said about capturing 4K video on phones, it's been repeated ad nauseam about YouTube. But what everybody forgets about all this "empowerment" is that you end up with a lot more noise, and it just becomes that much harder to find the signal. How many YouTubers, for example, ever make it to one million subs? You still have to separate yourself from the crowd.

One thing Hollywood already has, is the entrenched distribution channels and PR machinery. And that's not going anywhere anytime soon. The other competitive advantage Hollywood enjoys revolves around a word you're going to hear a lot more of as fakery permeates every corner of the Internet: "Authenticity." In other words, the capability to make movies and series that star real human beings interacting in real environments. Authenticity will become the new currency of the realm as fakery becomes pervasive...and quickly maligned.

Again, people aren't investigating this generative AI stuff carefully enough. They're just buying all the hype. And, of course, companies like OpenAI love that, as it makes their valuations go through the roof.


I don't know if anyone has mentioned it, but in the walking-woman video there is a major continuity problem after the cut to close-up: her dress has new designs at the top and her left lapel is now twice as long.

As for monkeys playing chess, this rash of "amusing" animal videos is revealing a disturbing lack of ethics at OpenAI. There is a particularly grotesque animal video called Bling Zoo showing a tiger agonizing in a cramped cage, while in a second shot a turtle eats a string of diamonds, something that would kill a real animal. No sane person would ever think to make, let alone showcase, a real video like this.


This type of arcade-like F1 driving should have resulted in a fiery explosion. Check out the hallucinated mirrors too.

https://youtube.com/shorts/1On4rEBbwcw?si=nCRDPzhkIFF8S1Vr


Is it really Sora? Looks very different from all the others (and has sound).


It's right from OpenAI's official TikTok account. My bad, I should have posted the direct link (instead of the YouTube one): https://vt.tiktok.com/ZSFkWc1Xt/

Looks like they've added audio as well.

Also - kudos to OpenAI for acknowledging their mistakes. I wonder if this was prompted by your article, Gary =D

https://www.tiktok.com/@openai/video/7339740660116835627
