Human language has two main components. There's one component for generating or parsing its structure (sometimes called the E system, for expression) and there's another component for linking words to their meanings (sometimes called the L system, for lexical). Birds, in particular, have extensive syntactic systems for generating and recognizing bird songs. Dogs, for example, can learn to associate words with actions or objects. Human language extensively combines the two, and having a syntactically structured language is much more powerful than having either component alone. The placement of a word within the syntactic structure can dramatically alter the meaning of a sequence of words. It's rather obvious that these AI systems don't get this.

If you have ever diagrammed sentences in any human language, you'd realize that there is a structure of words and phrases modifying other words and phrases. Natural languages allow a deep level of expression with these modifiers modifying modifiers. You can extend expression, even within a sentence, arbitrarily. Humans can learn this from a training set because their brains have this structure built in, just as they have built-in components for thinking about location, time, meaning, association, sequence, variation, change and so on. I seriously doubt that a system with limited neural depth and none of those components built in can do anything like this.

If you look at the published examples, it is rather obvious that they can't. Reversing the order of two nouns with respect to a preposition shouldn't stymie a system this way. I think these systems might be useful the way AppleScript is useful. It looks enough like English to be relatively easy to understand, but it is miles away from natural language on closer experimentation.
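To make the point about reversing two nouns around a preposition concrete, here is a minimal sketch. It only illustrates what an order-free "keyword" view of a prompt loses; it is not a claim about how DALL-E or Imagen work internally, and the function names and example prompts are made up for this illustration.

```python
from collections import Counter

def bag_of_words(prompt: str) -> Counter:
    """Order-free view of a prompt: just count the lowercase words."""
    return Counter(prompt.lower().split())

def ordered_words(prompt: str) -> tuple:
    """Order-preserving view of the same prompt."""
    return tuple(prompt.lower().split())

# Two prompts that differ only in which noun sits on which side of "on".
a = "the astronaut on the horse"
b = "the horse on the astronaut"

# A pure keyword matcher cannot tell the two prompts apart...
same_keywords = bag_of_words(a) == bag_of_words(b)
# ...while anything that keeps word order can.
same_order = ordered_words(a) == ordered_words(b)

print(same_keywords, same_order)
```

Any system that is to tell these prompts apart has to keep at least the word order, and ideally the syntactic structure, rather than the set of keywords alone.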

Bravo! The whole hype around supposed AGI rests on very squishy notions of "intelligence". Take Ambrogioni's reaction, for example: a big part of human intelligence is imagination --- imagining the impossible, imagining the absurd, making up stories, fantasizing about matters mundane and profound, etc. So how does a failure of imagination prove the existence of general human-level intelligence? What Ambrogioni was saying is at best a very narrow and lopsided understanding of intelligence, and at worst reflects tunnel vision about what AI is and can be. It is essentially path dependence on bigger models, and this path dependence is sucking up all the air from what really matters in understanding and developing human-level AI.

Understanding and knowing what words mean are central elements of human-level intelligence, and we still do not seem to have those in DALL-E and Imagen.

May 28, 2022·edited May 28, 2022

Indeed. The system has no clue what any of it means - the words, or the imagery. Also, there is no creating (of the imagery), only computation.

Stepping back, it's clear why - there is this disembodied algorithm that has no real-world, first-hand experience - of feeling gravity, looking at the moon, understanding why astronauts exist (and why they are 'cool'), riding on a real or toy horse, the absurdity of a horse riding a human, blocks, stackability, colors... NOTHING. Instead, it is trained using gobs of text DATA generated by humans, and image DATA that is labeled by humans. The data processing can be sophisticated, but it's still data-based, still computation.

The physical world has matter, energy, information (configurations/structures/assemblies...) - out of these result phenomena. Structures -> phenomena. Embodied life is also structures -> phenomena (from the sub-cellular level to the organ to the body to the collective levels). Intelligence is a set of phenomena that help the body survive. Unless A(G)I is based on similar principles, we will continue to have top-heavy (i.e. human-derived) second-hand 'intelligence' that gets better at fooling, but is still inadequate at the most fundamental level: lacking understanding.

Paraphrasing the comment from Prafulla (DALL-E 2 co-author): “Our aim is to create general intelligence. Building *embodiments* (unlike DALL-E 2) that *experientially* connect vision and language is a crucial step in our larger goal of teaching *such embodied* machines to perceive the world the way humans do, and eventually developing AGI.”

Our words become meanings for our listeners, but our words don’t know themselves. Sound cannot hear itself.

Found this googling to find out why I keep hearing astronaut riding a horse as the common prompt suggestion (and got my question answered apparently).

To the discussion here, it's news to me and I'm not sure I believe it that people widely thought these systems could understand English grammar very fully or were literally like a human intelligence? But regardless it seems to me based on my own experimentation that the horse always being the thing ridden with a basic grammar is the real accomplishment.

As this article shows, it normally mixes up which thing is being described, and not all languages (and frankly many intentional prompts I use don't do this either) assume the noun before a verb is the subject as this article seems to entirely assume. In some languages there's something in the nouns that indicates it (English often lacks this so we usually stick with an order of words), and with at least some verbs (speaking as an English major), either order is acceptable. It seems like it's noticed that horses are in relationships to things that in other contexts are alone consistently, like "a man riding a horse" and another image will have just "a man." That alone is the milestone, is it not?

And that in turn implies it does understand the basic syntax of "riding," or else the original prompt should have had the astronaut as the thing being ridden about as often as common prompts like "orange cube over blue sphere" get the colors and shapes mixed up. It probably treats "riding" no different than a noun ("rider is the thing above") rather than understanding the verb per se, from what I've seen, but that still is a pattern recognition that goes way beyond what we had before this technology.

In my prompts I automatically assumed the computer wouldn't likely understand grammar (if you've bought an Alexa you already know we're not there yet), and I dispense with English grammar usually because it's kind of backward on what's the most important thing to know. If you want to say a building that's, let's say, orange, with, say, gold dots, the short way in English is "gold-dotted orange building." I would prompt "building orange, dots gold". (And even then it gets confused a good percentage of the time, as I'd expect.)

So, your point is, that "some interpreter of language" is not capable* if they don't understand your wording (*not sentient; *not intelligent; replace at will)? Try an idiom, "same difference": The AI might also have trouble with that. Same as humans with Autism or Schizophrenia, where this type of "literal interpretation" or "inability to understand the encoded [abstract] meaning" is referred to as "concretism" [a psychology term]. Are these people not sentient, either?

Was that provocative statement successful? Excellent:

Plot twist: I am not trying to argue about AI being sentient (I assume it probably isn't, right now, but read on for my entire train of thought).

I don't even know if you are sentient, human - just because I take my own experience as the "ground truth" and thus conclude "I am sentient" (you can look at it like that university-level mathematics stuff / functions where you had to guess a start value or be stuck and unable to get anywhere) - but that doesn't mean I can make any assumptions on your sentience.

It's wild guessing, for reasons of any hard proof of what defines sentience being absent.

My *hypothesis* is: Sentience is a function of complexity.

I am very certain that I was not sentient when I was a few-cell organism just after my mother conceived me; and I only formed initial abstraction ability around the age of four, which - science says - is the age when kids are able to lie (and no longer end up crying in defeat because they can't lie about the whereabouts of a sweet to a peer).

But my first conscious memories are more of primary school; I can't link any memories with great certainty to the age of four, even. Still, I consider myself sentient now, that still stands.

I believe sentience requires 1. working memory, both long- and short-term memory; and 2. a certain (and very high) amount of complexity.

The transition from a non-sentient egg and sperm cell just after union to a sentient human being happened "somewhere, gradually, not abruptly". I became sentient due to the fact that the complexity of my "neuronal network" had increased "sufficiently" (forming very complex interconnected networks and all that stuff), whereas what exactly is "sufficient" is undefined / unknown to me.

I have no idea how to compare an adult human brain to the current AI in terms of complexity.

It seems, however, by my "humble guesswork" to be lacking in complexity so far; I'd assume LaMDA is "more like one year old, maybe two, compared to a human". But this is wild guessing and based on "gut feeling", whereas my gut certainly lacks the complexity of sentience on its own...

As is any guessing your sentience as above, admittedly; but for the sake of not being too provocative, let me assume you are sentient, dear author. In that case, I conclude you are most likely subject to a case of "AI effect". ;-)

Also, a wild guess of similar absurdity, but easy to validate via your response: Are you monolingual, by chance...?

Because you are making all those assumptions in a... Well, very primitive language.

Primitive can be good, to facilitate communication between a vast majority of "comm. partners" (I am forever glad and grateful the "language of science" is NOT French - I don't speak French!), but it also has its caveats; you just cannot get "to the depths" of your communication.

Or - dog beware! - at worst, your entire thinking is limited to and by one such simple / primitive language. I have seen this to be the issue in terms of many arguments about AI bias, too; generating stereotypes based on gender, for example.

And clearly, English is to blame here. The actual human language is to blame. English is why you have to say "a female lawyer" or whatever linguistically unnatural construction to "flex the AI into a diverse output".

In Russian, due to much more complex grammar, this disgrace doesn't even happen. Heck, even the last name of a person indicates immediately if that person is "the husband" or "the wife". Does Dall-E speak Russian? (Please note: I am not Russian. But here's to hoping we keep politics and malign people out of this discussion, anyway - and zoom in / focus on linguistics & AI as well as psychology / philosophy...).

That being said, I am trilingual, and I have used RuDalle for reasons of much more sophisticated prompt engineering (and better results) that are not possible with any implementations of CLIP that are English-based.

Because I can say "krassiwiy mashina" - meaning: (a) "beautiful car", while changing the ending of the first word - describing adjective: beautiful - to "-iy".

...Which is incorrect in natural language, because it classifies the car as "male", but giving me more sports-cars in the AI generations; and if I prompt for "krassiwoye mashina" (neutral gender), I get nice European-style "tiny cars". And, when using the grammatically correct Russian term "krassiwaya mashina", I get more "not so aggressive looking" (non-sports-) cars.

(A car is female in Russian "by definition" - and every object has a gender, just in case you haven't guessed at this point).

I'd be delighted if you worked on *your* human "natural language processing" and then wrote another post; I'd read that with as much curiosity as this one!

PS: And to end on "good terms" (because anything else would hinder thinking outside the box and critically analyzing what I just said via inducing reactance):

I want to put emphasis on our mutual agreement, concluding "AI is most likely not sentient at this point". ;-)

PPS: Nevertheless, I am also gonna go ahead and treat every AI (as every human child) with dignity, "just in case they develop sentience and remember that / me" - and become world leaders with nukes or some kind of supercharged superpower with whatever intent.

Although there is only proof of malign intent in humans so far - but AI are currently "gaming the system" in ways that could result in anything from harming humans to the entire eradication of mankind, so "eradication of humanity as a collateral damage of having found the most rewarding way to a goal" is possible - and a thing to really, truly worry about, a problem that extends far beyond (and is tangential to) the philosophical discussions of "sentience".

Just sayin'.

It's not gonna be me who mistreated [that AI; that child] and resulted in the malign intent that resulted in an apocalypse, that much I can promise y'all! :-)

Trust is an interesting way to look at AI... are the risks acceptable, of using AI XYZ, for purposes of problems of type ABC? Then we "trust" XYZ... in the domain of ABC problems. Do we trust an AI to read scans of bodies, to triage casualties during a flood of incoming emergencies, etc.? Well, it turns out we sometimes do. Is that trust? I'd say so. Do we need the AI to cross the uncanny valley or even pass Turing to trust it? Not really. More intriguing about these "artistic" AIs is: will they replace some or many artists? Will an artist require an AI tool to become successful? Will art made with AI be valued less or more than art made without? Will AIs "invent" or "co-invent" new forms of art?

The second-to-last image in the article (below "...these man-bites-dog type sentences were systematically problematic...") is broken, what did it show?

Jun 7, 2022·edited Jun 7, 2022

"First, the paper reported a second example of the same phenomenon, and seems to acknowledge that these man-bites-dog type sentences were systematically problematic (Imagen on left, DALL-E on right):"

The image isn't loading right now. The url is unlike the others, blob:https://garymarcus.substack.com/b2849ba5-4009-48f9-9345-e0e075cb97e9

«A'horse-riding, an astronaut»

Maybe the AI was fed too many nursery rhymes....

Well, this isn’t general AI, but we can all admit that it sure is impressive. This is an example of the kind of situation in which AI will excel and change life, where the cost of error is low and you can keep trying until you get a finished product with considerably less effort than doing it yourself. As the authors of Prediction Machines said, AI is a strong complement to human judgment, not a replacement, and increases the productivity of human judgment. I think it will do so exponentially, and these debates about whether it is general might become moot.

Here is the crux of your point: " rather that the network does something more holistic and approximate, a bit like keyword matching and a lot less like deep language understanding. What one really needs, and no one yet knows how to build, is a system that can derive semantics of wholes from their parts as a function of their syntax."

You seem to think "holistic and approximate" are somehow bad, or unlike what the human brain does. If that is (part of) your thesis, you are simply wrong. The human brain does holistic and approximate, simply on a much larger scale and with deeper layers of nuance. Dall-E and Imagen, if I recall, use on the order of ~100 billion parameters. The human brain operates with on the order of 100 trillion parameters and does on the order of a quintillion basic computations per second (an exaflop). How much deep comprehension do you expect from a model with 0.1% of the parameters in which to encode its understanding of *BOTH* language *AND* images and the relationships between them?
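The 0.1% figure above follows directly from the two parameter counts, as this small sketch shows. The numbers are the rough order-of-magnitude estimates given in this comment, not measured values, and the variable names are only for illustration.

```python
# Rough order-of-magnitude estimates quoted in the comment (assumptions, not measurements).
model_params = 100e9   # ~100 billion parameters for the image models
brain_params = 100e12  # ~100 trillion "parameters" for the human brain

# Fraction of the brain's parameter count that the model has.
ratio = model_params / brain_params

print(f"model/brain parameter ratio: {ratio:.1%}")
print(f"brain is roughly {brain_params / model_params:.0f}x larger")
```

On these estimates the model has one thousandth of the brain's parameter count, a gap of about 1000x.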

What is or is not "overly hyped" or considered "deep understanding" of language is subjective. I would call this quite deep understanding of the connection between language and visual phenomena, given that the system has a learning capacity on the order of a literal bird brain, and the brain of a small, not very bright bird at that.

Your claim that "no one yet knows how to build a system that can derive semantics..." is a bit like claiming ancient Egyptians did not know how to build a taller pyramid than those at Giza. Just because something costs too much or you simply lack the resources to build it does not mean you don't know how. Computing power available to researchers is between 1/1000th and 1/100th that of the human brain. The latest AI achievements with that computing power strongly suggest that with 100x to 1000x the computing power we will achieve AGI or something remarkably close. Current trends suggest such levels of computing power will be available to researchers in 5-10 years. Indeed, even today it is tantalizing to consider what we would see if the Imagen or Dall-E 2 models were scaled up to utilize the latest Exascale supercomputers.

Where you see "hype" and overly bold claims, I see researchers who fully grasp the import of what they have achieved with very *limited* resources.

Thank you for this detailed summary and the examples. I think this sums the current hype around AI and AGI succinctly:

>>it turns out that Imagen can draw a horse riding an astronaut—but only if you ask it in precisely the right way:

I think the hype will continue despite the scandals and criticism. There are plenty of big problems and open questions that AI can handle. For now, "horse riding an astronaut" seems to be getting the attention. I am glad there are people like you still writing.

I hope that in the next few years, the investors and Big Tech will move on to the next shiny thing and leave AI to mature.
