36 Comments

Hi Gary! Nice examples and post...

We shouldn't be surprised at any of this: there is zero understanding to be gained just from sequences of text, sequences of pixels (images), sequences of images (video), or sequences of phonemes/notes. That's why doctors, mechanics, plumbers, cooks, musicians, dancers, athletes and soldiers don't just watch videos and read books to learn how to do things.


Excellent point. Children do not learn well from videos, either, unless an adult interacts with them.


Amy, indeed! The role of a trusted caregiver is so blithely ignored by the AI community.

Also, children are embodied, learn continuously and interactively, learn directly (active, not passive) and most importantly, physically :)


Saty,

I understand and agree with your reply, based on “current models” of how the human brain works. None of those models, however, can explain consciousness. In a new model I’ve developed, there is a very different way to think about your point.

In my model, the brain is NOT a spreading neural net (which is the LLM model as well). It is a "massively parallel serial data store". That is, after preliminary processing in the parietal lobe, each sensory input, at the individual sensor level, is recorded as a "real time" continuous data string, rather like the old "bubble memory" architecture. The entire history of a person's experience, for every sensory nerve, is stored! All along each nerve path, there are cells that make a "real time" comparison between the recorded data streaming past them and what is newly arriving.

This produces two important results. First, it explains how we can recall any experience in our entire life in less than a second. Second, it explains how our memories appear to be "full experiences". That is, when a new experience arrives, the activated "comparison" cells trigger all the other comparison cells that are sensing their recorded data at the same time point! This second result directly leads to our "memory" consciousness. It also explains dreams, and the reality of hallucinations and PTSD.

Related to your observation of "sequences of text … pixels … imgs … phonemes …", my model agrees that a short sequence, by itself, provides zero understanding. BUT a very different phenomenon occurs in the brain when each word, or short sequence of images, simultaneously finds a lifetime of matches in the brain that are also made available for comparison. It is this "memory response system" that defines the "doctor" or "plumber". Significantly, it provides a very pertinent model to explain the multiple ways the current LLM architectures are headed in the wrong direction.
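The comparator idea, as I read it, can be caricatured in a few lines of code. This is strictly a toy sketch under my own assumptions (the class name, window size, and tolerance are hypothetical illustrations, not Bruce's actual model): each sensory stream keeps its full history, and "comparator cells" fire wherever a newly arriving window resembles a past window.

```python
class SensorStream:
    """Toy model of one sensory nerve: a lifetime recording plus comparators."""

    def __init__(self, window=3, tolerance=0.1):
        self.history = []        # the 'serial data store' for this one sensor
        self.window = window     # how many samples a comparator looks at
        self.tolerance = tolerance

    def sense(self, value):
        """Record a new sample; return start indices of matching past windows."""
        self.history.append(value)
        if len(self.history) < 2 * self.window:
            return []
        recent = self.history[-self.window:]
        matches = []
        # Every 'comparator cell' checks its stretch of the recording
        # against the newly arriving window, conceptually in parallel.
        for start in range(len(self.history) - 2 * self.window + 1):
            past = self.history[start:start + self.window]
            if all(abs(a - b) <= self.tolerance for a, b in zip(past, recent)):
                matches.append(start)
        return matches

# A repeated pattern on one stream is 'recalled' when it recurs.
stream = SensorStream()
pattern = [0.2, 0.9, 0.4]
for v in pattern + [0.0, 0.0] + pattern:
    hits = stream.sense(v)
print(hits)  # [0]: the final window matches the recording at time 0
```

In the full architecture, millions of such streams would run side by side, and matches across streams at the same time point would be triggered together, recalling the whole time-aligned experience.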


Bruce, your architecture sounds extremely *plausible*, brain-like [unlike NNs, which are brain-like in name only!]. Curious how you represent experiences, including non-verbal ones (e.g. the experience of tripping on a rock, or feeling the wind; there's no data, or more generally, no symbols, in these).


Saty,

In my model, I view memories as complex, streaming, associated, time-based "perception experiences". That is, what both humans and vertebrate animals "perceive" as an "experience" is the flow of all sensory signals through some fairly limited place in the brain, possibly at the "output" of the parietal lobe. Tripping on a rock, or feeling the wind, is simply the "experience" of the flow of signals from the entire body's sensory system after the parietal lobe has processed them into a spatially coherent organization. As you say, there are no symbols at that point.

Now, continue the experience of "feeling the wind". The parietal-processed sensory nerve signals from the million sensors on the surface of the skin that feel the fluttering coolness are individually "streamed" into the volume of the brain that "memorizes" skin sensation. They are "output" into this volume as individual real-time data streams, in exactly the same way tactile cells code their sensations. At the same time, on what computer engineers would call a "parallel bus", all those thousands of immediate sensations are broadcast individually to the chains of nerves memorizing the history of each individual nerve, across the entire sensory brain. When similar matches are detected, anywhere in the entire life memory of the individual animal, those prior experiences are also sent back, in parallel fashion, to the parietal lobe, which just considers them "associated" experiences.

The result is, when we “feel” the wind, our primary “sensation” is just the signals from our skin cells. BUT, for every past experience, that the “comparator” nerves determine to be similar, those past experiences are blended into the current sensation. AND, what is actually stored as the current experience, is the BLENDED experience.

To address the "symbolic" issue, while the skin sensing process is occurring, other parts of the brain are simultaneously capturing input from all the other sensory systems: vision, balance and motion, temperature, sound, etc. The complex factor we call emotion is also being captured. A special part of the brain, the part that makes us "human", is simultaneously capturing summaries of all of these.

A way to understand “symbology”, in this context, is to view “symbols” as just “unique” combinations of multiple sensations that we have been “trained” to memorize as important for particular experiences.


If DALLE doesn’t get it right the first time, no amount of prompting improves it except dumb luck. It has no idea what it is depicting.


It still amazes me that anyone can say with a straight face that understanding emerges in LLMs when more data is poured in and more parameters are used. It's very clear from these examples that there is no understanding at all, nor has the ability of GPT-4/DALL-E to deal with these sorts of problems gotten significantly better. I see no reason to expect any LLM, no matter how jerry-rigged with RAG or adversarial overseer systems, to get it right.


My favorite test is to ask the image model to draw an overhead view of a baseball field and put an apple in left field near the warning track. I have yet to find one that can do it. Not to mention asking it to move the apple from where it lands (usually center) to the right side of the image. I've found nothing that can manipulate a specific object from the first prompt. Text can be rough. Images can be just junk.


With that in mind, how do you think video does? Audio is trying, but.....


What irks me most about the bicycle diagrams is that in all three there is a label of a tire pointing to the GROUND...


Maybe we need to change the language we use when describing these issues. They have more than "trouble" or "problems" with compositionality. They just don't do compositionality. Furthermore, they have no logic or algorithms that one would reasonably expect to handle compositionality. Same with LLMs and truth. If they did deal with compositionality or truth, they'd have some explaining to do.


I tried unsuccessfully to generate a picture of a group of kids reading books with thought bubbles about their phones. No can do. ¯\_(ツ)_/¯


Dear colleagues, while I fully agree with your findings, I'm puzzled by the sentiment.

You write: "Relating language and the visual world robustly may require altogether new approaches to learning language."

Why do you think that the current technology (diffusion, word2vec, and GPT-based LLMs are all fundamentally known, and I don't believe that any party has made meaningful structural changes) *would* be able to deal with these kinds of prompts in the first place?

To me it doesn't just seem that way: it is virtually certain that neither CLIP nor GPT has any world-model-like epistemology, and it CANNOT HAVE one. Why do you suppose it might?


i don’t suppose that it would. but a lot of people are confused, and it’s good to remind them that scaling is not a panacea.


Former PayPal executive David Sacks has been tapped by Trump to be the "AI and crypto czar" for Trump’s upcoming administration. Sacks' approach? Full deregulation and let the market do its thing.

🤦


When I use AI to write novels, I segment the problem and have an AI create a matrix of prompts that do the detail writing iteratively. Generally it works flawlessly, and I make the process very easy to rewrite sections that don't "work" on an edit read.
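A minimal sketch of what such a prompt matrix might look like (the function, the premise, and the act/chapter structure are hypothetical illustrations, not the commenter's actual tooling): each cell is a self-contained section prompt, which is what makes targeted rewrites cheap.

```python
def prompt_matrix(premise, acts, chapters_per_act):
    """Build a hypothetical grid of section-writing prompts from one premise.

    Each cell can be sent to a model, and rewritten, independently of the
    rest of the grid, so fixing one weak section never means regenerating
    the whole book.
    """
    return [
        [f"Write act {a + 1}, chapter {c + 1} of a novel whose premise is: "
         f"{premise}. Keep continuity with the surrounding chapters."
         for c in range(chapters_per_act)]
        for a in range(acts)
    ]

grid = prompt_matrix("a detective who only works at night",
                     acts=3, chapters_per_act=4)
print(len(grid), len(grid[0]))  # 3 acts x 4 chapters
```

In practice each cell's prompt would also carry character sheets and the outline of neighboring chapters, but the grid shape is the core of the segmentation idea.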

Sometime consider how a sculptor, photographer (other than reportage), painter or illustrator approaches making a work. There is an idea, a rough sketch or composition level rendering, then successive levels of adjustment to the finished work.

It’s the same with most human works, a projection and successive refinement which adds more and more information.

The most successful composition-controlled AI will start with a layout, then use that with stylistic controls to render successive approximations of a finished image.

Sometime watch a documentary about, or read up on, the process of filming, which involves extensive storyboarding for scene realization long before actual filming begins.

I suspect you will see an AI used to naturalistically describe a scene composition (objects described and positioned in 3D space, with 2D rendering), like "sphere, diameter x, lower left quadrant; rabbit figure…". The 3D language objects will be positioned until accepted (as in Maya), and then a final rendering pass will create the surprising style or photorealistic detail.

The way I would do it is to ask for a Blender configuration to create a scene with a "rabbit with 4 ears", render the line object space, then transform the locked-position 2D rendering into the DALL-E style space. There are very good tools to stabilize characters and do superpositional content rendering.

This allows extremely fine-grained positioning of every object; then you can sequentially do the final 2D render in the AI's projective space.
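As a toy illustration of the layout-first idea (a sketch under my own assumptions: the object names, coordinates, and the `to_layout` helper are hypothetical, not part of any real pipeline): objects get locked 3D positions first, and only the projected 2D layout would be handed to a style/render stage.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str    # prompt fragment, e.g. "rabbit with 4 ears"
    x: float     # normalized 3D position in [0, 1]
    y: float
    z: float     # depth, used here only for draw order
    size: float  # normalized diameter

def to_layout(objects, width=1024, height=1024):
    """Project 3D positions to locked 2D placements (simple orthographic
    projection on x/y). The resulting layout is what a style stage would
    receive, with positions it is not allowed to change."""
    placed = sorted(objects, key=lambda o: o.z, reverse=True)  # far objects first
    return [
        {"prompt": o.name,
         "px": int(o.x * width),
         "py": int((1 - o.y) * height),  # screen y grows downward
         "radius": int(o.size * width / 2)}
        for o in placed
    ]

scene = [
    SceneObject("sphere", x=0.2, y=0.25, z=5.0, size=0.1),  # lower-left quadrant
    SceneObject("rabbit with 4 ears", x=0.6, y=0.5, z=2.0, size=0.3),
]
layout = to_layout(scene)
print(layout[0]["prompt"])  # "sphere": deepest object is drawn first
```

The point of the exercise is that placement errors become impossible by construction; the generative model is confined to style and detail inside each locked region.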

In all my work with AI, the models cannot handle physics well, either linguistically or visually: a severe embodiment problem. We tend to function linguistically within our embodied physics, which shapes our own synaptic logic and makes nonsense detection trivial for us.

Relational embodied physics is not linguistically solvable in the current generation, I'm afraid. There is insufficient training in obvious object physics in the generative layers, so the models cannot tell that something statistically plausible, linguistically or visually, is not physically possible.

For instance, if I were to have characters eat a horse non-metaphorically, a text generator would happily describe impossible things happening in the story. When it happens lightly (in The Lord of the Rings, what sustained that large group for a year? Lembas? Elven leaves?) you're not aware of it. When it condenses in a story accidentally, the AI merrily makes obviously physically impossible things happen.

I struggled for hours trying to get illustrations of Hercules, a Titan, satyrs and people, human figures at different scales and with different limbs, automatically generated for a series of books I did extending the satyr plays of Euripides (the only surviving complete satyr play is "Cyclops"). In it, satyrs take the place of the female chorus of other plays, which means you have bawdy, drunk, self-serving characters fucking up the other heroes.

The books were fine, but having Hercules rescuing Prometheus from the eagle, while satyrs fucked up the rescue, auto-illustrated was not easy. I used OpenAI to write the novel, and I had it talk to a different commercial AI for illustration, but the juxtaposition of a large Titan, a muscular Hercules, his crew of ordinary men, and the eagle in a fight was very time-consuming to describe and segment.


AI-generated novels are a thing that do not need to exist.


I’m afraid it’s a bit late.

I've generated almost 40,000 novels (and series) of varying complexity to understand where AI is weak or strong in fiction composition, how much human intercession is required for shaping, and which genres have sufficient training data that it can work with them easily. I've been creating them since GPT-2, and ultimately have been experimenting with structure since I rolled my own tools back in '93.

I have a distributor who wants all I can generate in a particular genre; they'll be on the market within 3 months, and I just haven't decided how clear to make it to the reader that they're AI-generated. I have a few dozen writing styles with appropriate pen names, so it appears like a consistent stable of authors with unique styles and interests.

That's the problem when you have an engineer study English literature, perhaps. They try to engineer it to be… more intensely focused on what is interesting.

Like I said, the models have peculiar physics deficits, which I have to have human copy readers look for. I can't intercept them with an AI editor yet.

It’s quite funny when it happens, and usually can be avoided in level 2, 3, or 4 of the composition process but not always.


Hopefully the law stops you at some point because that is just so disrespectful to everyone who ever took the time to write and the world doesn't need AI slop. I really don't understand your mindset.


You haven't read as much utter crap as I have, as well as stacks of wonderful writing.

You should try Barbara Cartland (if she is the author), of 700-or-more "romance novel" fame: you know, star-crossed lovers (or is that Shakespeare?), impossible odds, steamy sex, breakup and hot makeup. Or perhaps Louis L'Amour (if the author) of the male equivalent, the "Western novel" (100-plus of them), which morphed into science fiction (actually engineering fiction): you know, stern stranger with a tragic past, fighting insurmountable odds (bad winter, lawlessness, corruption), perhaps a strong independent woman for contrast, and ultimate success in a changing world. Study the Aarne-Thompson-Uther encoding system for all children's stories and world folktales sometime. I had even GPT-2 happily making the calf jump into the stew-pot for the hungry family… Hercules is a trainer at a gym in SF now, Dostoevsky cranks out deep anti-hero stories of corporate warfare, Jane Austen lives in Louisiana in an extremely structured and stultifying antebellum South, and Alice, little Alice, is a futurist diving through a multiverse of echoing film characters, down the rabbit hole and through the looking glass.

My next is an epic universe combining the utterly familiar with sentient black holes: 78 magical creatures out to save existence from oblivion (every Marvel and DC comic for the last 60 years), only with a new twist, a bit more entropy, a smattering of entanglement, cellular automata, with love, sex, strong women and moral men, lives created by wish and intuition.

It’s not the writing, it’s the idea, which AI can only remix, permute or mutate. It’s terrible at generating genius which is what becomes cliche. But playing with ideas, at a book an hour, that’s magic. It’s the ultimate word processor, spell check, rewriting tool, editor and critic simultaneously.

It will be out. The first few thousand I did were hardcore porn, which I was going to release with a special stable of writers: Mitch McConnell, John Thune, John Barrasso, Joni Ernst or Shelley Moore Capito, or, more funnily, Mike Johnson, Steve Scalise, Tom Emmer, Lisa McClain, but those jokes tend to backfire. But who knows, there are a lot of hot potential authors out there. I can see Matt Gaetz writing "Half Virgin" or "Hard Candy Christmas".

Funny: you can't really copyright a name.

Literary fiction: Nabokov writes beautiful books on lepidopterists obsessed with nymphs at museums, Burroughs does have a view on chemosexuality (15 chapters' worth), and William Gibson lives for strange couplings at sex clubs in Japan.

All good fun.


So you're doing it out of nihilistic spite? Porn by Mitch McConnell is funny, but if you are going for anarchy there are more constructive ways to exercise that instinct than to pollute the publishing world.


Fascinating. It's actually worse than I thought. And presumably the complications only multiply when we're talking about GenAI video. What ever happened, pray tell, to Sora?


rumor has it that it's about to be released?


Should we hold our breath?


The model needs a contextual understanding of the nature of the images it creates not just some training data label. Does it know the images represent “parts”? Does it understand every part has its own set of unique characteristics and can only interface with other parts?


Very nice examples. Especially to show that ['understanding' statistical relations between pixels of images] is only an approximation for [understanding images], whatever the size of the statistics.

It is really weird that you can see how correct these examples and interpretations are and still the belief in actual AGI-like understanding by these systems remains alive.

It seems easier to fool humans with text than it is with images. Good grammar and well-formed sentences are a proxy our intelligence uses to establish the intelligence of the author, and thus trust in what is told. Good grammar and sentences are much easier to approximate with token statistics than good meaning is, and on many subjects there are enough token statistics to get 'good enough' results. Which is of course a fascinating result.

But good pixel statistics are even further from meaning than good token statistics, and thus have a much harder job; add to that the fact that we have two kinds of statistics (on the text and on the pixels) working together here. Could that be why image 'correctness' is so much more difficult than text 'correctness'?


The people who believe this stuff is reliable and ready for prime time are suffering from even worse hallucinations than the LLMs. This tech in its current form is not only built on a house of lies and stolen art, it's basically junk.


Gary,

While your points about the huge limits of AI in relation to video creation are well founded, we don’t have to look past even the simple uses of AI to show its weaknesses.

I do a lot of "transcription" editing. I have not yet found any AI that can even come close to creating an accurate transcript. For example, most YouTube videos automatically provide a transcript. If you download the version without timestamps, you get garbage! There isn't even any punctuation! Manually "correcting" such a transcript typically takes me 4 to 5 times as long as the original video. If anyone knows of a reliable transcription tool, please tell me. And no, not even the best Claude can do it. And this is just simple TEXT!


Try having it draw an ouroboros and see what happens. https://davidhsing.substack.com/p/llms-and-generative-ai-dont-deal
