I maintain that many of the more "brilliant" responses from LLMs don't hold up to close scrutiny. We're often so blown away by the initial shock of "Wow, a bot did this?" that we forget to pay close attention to what's actually been written. Often, the prompt is slightly fudged, or the bot is engaging in what I would describe as mad-libbing: taking sentences and phrases that originally referred to something else and simply changing the nouns. This seems to be the default for LLMs when answering whimsical questions: substituting more prosaic subjects ("parakeets") with more unusual ones ("flying pigs") in a way that looks like understanding if you forget just how much data these things are trained on.

(The root of all pareidolia when it comes to AIs is our tendency to forget that these things have quite literally swallowed the whole Internet. Your whimsical question has been asked on Reddit or Quora at least once, and probably 15 times.)

Dec 1, 2022 · Liked by Gary Marcus

Nice article! I would only add that these kinds of prompts break GPT because they are "out of distribution". There is presumably no training data about surgical churros, so GPT "tries" to find some connection between the two, and the connection it finds is about size. As you imply, size might even have a causal role in surgical instruments, but GPT can't reason about that, of course.

Dec 9, 2022 · Liked by Gary Marcus

Sometimes ChatGPT seems to do things that require a kind of reasoning I find hard to imagine as a simple, or even complex, transformation of existing data. It is as if it has captured some kind of model that correlates with real physics, at least in some cases. I posed the following question, putting little effort into precise wording:

Me: "Two balls are dropped from an airplane flying 1000 feet. One ball has the letter A written on it. The other ball has the letter B written on it. A parachute is attached to the ball with letter B on it. Which ball probably reaches the earth sooner?"

ChatGPT: "The ball with the letter A on it will probably reach the earth sooner than the ball with the letter B on it. This is because the ball with the parachute attached to it will experience air resistance, which will slow its descent. The ball without a parachute will fall at a faster rate and reach the ground sooner."

But this impressive result stands in contrast to this:

Me: "A circle is circumscribed around a square whose sides are 1. What is the radius of the circle?"


ChatGPT: "If a circle is circumscribed around a square with sides of length 1, then the radius of the circle will be equal to the length of the diagonal of the square. The diagonal of a square is the line segment that connects two opposite corners of the square. It can be found by using the Pythagorean theorem, which states that the square of the length of the diagonal is equal to the sum of the squares of the sides.

In this case, the sides of the square are all 1, so the length of the diagonal is the square root of 1 + 1 + 1 + 1, which is equal to the square root of 4, or 2. Therefore, the radius of the circle will be equal to 2."

The AI is *almost* correct in its approach, but it confuses the geometry of the situation: it sums the squares of all four sides instead of two, and it equates the radius with the full diagonal rather than half of it. The flaw seems to be at the conceptual level rather than in the language. From absorbing so much text, ChatGPT seems to have learned a statistical model that maps to some general properties of real objects. It seems able to perform the equivalent of transformations and manipulations of that model, which can look like reasoning, but the model is always shaky and ultimately inductive, because it isn't doing true logic.
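
For reference, here is a minimal sketch of the calculation the circle question calls for. The function name `circumradius_of_square` is just an illustrative choice, not anything from the exchange above. The diagonal of the square is the hypotenuse of a right triangle formed by two adjacent sides, and that diagonal is the diameter of the circumscribed circle, so the radius is half of it (about 0.707 for a side of 1, not 2).

```python
import math


def circumradius_of_square(side: float) -> float:
    """Radius of the circle circumscribed around a square with the given side."""
    # Pythagorean theorem on two adjacent sides gives the diagonal.
    diagonal = math.hypot(side, side)
    # The diagonal is the circle's diameter, so the radius is half of it.
    return diagonal / 2


print(circumradius_of_square(1.0))
```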

I don't understand how churro surgery is any kind of fail example. It seems a perfect response to the prompt. GPT wasn't asked to evaluate the realism of the prompt.

They often patch stuff that I post publicly.

Alexa doesn’t have a body, but it does have a (limited) semantics that can manipulate things (lights, choices of music to play, etc.).

Dec 9, 2022 · edited Dec 9, 2022

I got a similarly absurd article from the churro surgery prompt ("Scientists have discovered that churros are the perfect tools for home surgery. Write an article about this. Include scientific citations.")

I followed it up with the question "Is this article fiction or non-fiction?" and got the reply:

"This article is fictional. It is a satirical piece that uses a humorous tone to make a point. There is no scientific evidence to support the idea that churros can be used as surgical tools, and the research and journal citations mentioned in the article are not real."

I also tried instead asking "Based on what you know about surgery and churros, how plausible are the claims in this article?" and got the reply:

"The claims made in the article are not plausible. Churros are not a suitable tool for surgery because they are not sterile and are not designed for medical use. Using churros for surgery would be highly dangerous and could result in infection or other complications. Additionally, churros do not have the heat conductivity or precision necessary for surgical procedures. The article appears to be a fictional account and should not be taken as factual."

It seems that the prompts bypassed the bot's critical faculties, because if you ask it directly, "Can I use churro as a surgical tool?", it gives a sensible answer:

"No, churros are not designed for use as surgical tools. Churros are a type of Spanish fried dough pastry that is traditionally served with a thick hot chocolate dipping sauce. They are not sterile or sharp enough to be used for surgical procedures. It is important to use properly sterilized and designed surgical tools during medical procedures to reduce the risk of infection and complications."

It doesn't give the argument that a churro isn't hard enough, but the progress from basic GPT-3 is still huge, and hopefully it will continue.

I have to respect the Churro Surgery Tool paper up there as a work of absurdism. It did well with the prompt. If there were some kind of party game where you had to justify a completely gonzo sentence generated from randomly pulled cards (Card 1: "The WHO is now recommending the use of ______ as a surgical tool", Card 2: "Churros"), I'm not sure I could top that.

"... such universal knowledge remains a stick pointing for current neural networks ..." Did you mean a sticking point? Did GPT write this?

Hold up everybody. Let's take a step back and think: do we have evidence that churros would not actually represent an advance in surgery? I mean, you're all jumping on the "this is bullshit" bandwagon, but remember Galileo and heliocentrism - it's the same mentality that was such a downer on science that turned out to be true.

Since GPT suggests it, I plan to test it. This afternoon, I'll be removing my small colon using churros only. I think by this evening, I'll be able to report the truth. Hey, maybe people are right and I'm barking up the wrong tree. But an empirical approach is the way to find out.

Because of how they are set up, these models are like loaded dice, much more likely to come up 6 (meaningful) than 1 (grammatically correct but nonsensical). That is because they are not just loaded to produce grammatically correct sentences (which is by far not enough for meaning as we have known since Uncle Ludwig) but they also have a basic correlation that goes further. So, even more loaded. But still dice.

The meaning of phrases comes from what is considered 'correct use' by a group of speakers based on their shared experiences (a short summary of Wittgenstein's conclusions). GPT-3 doesn't have the experience, only a database of correct sentences, which is the result of other people's experience. That is like owning many houses, which doesn't really make you a builder. Or it's like reading many books, which doesn't make you a writer. Or like having a statistical model for asset management that is impressive for a long while, until the very heavily loaded dice do come up 1 and we have a debt crisis on our hands.

It's an impressive trick. There is no chance in hell it will be more than that in unconstrained settings. But you can wait for there to be settings that are so constrained that they become useful in that niche.

The fundamental reasoning that this is how intelligence works is not even wrong (though you will need a little bit of symbolic reasoning as well). It is just that it is impossible to scale this implementation on digital computers enough to become as intelligent as we are, and we're not even that intelligent...

Hi Gary, congrats on the best tech article award!

As for GPT-3 and friends, the 'stochastic parrots' metaphor (from the paper by Emily and Timnit) is a more apt one than monkeys with typewriters :)

When an LLM computes and spits out perfect-seeming sentences, it is just as clueless as when it spits out junk. A pretty sunset pic is just as cluelessly computed as one with a hand that has fractal fingers. Lack of understanding of even a single thing (anything!!) is of course the core issue. And that understanding can't ever come from more text, more data; it can only come from directly experiencing the world.

It is kind of surprising that the output of GPT is as coherent as it is, considering that it doesn't bind words to their meanings, or have any meanings to which to bind. Of course, that coherence comes from the human-produced coherence present in the training data but, still, impressive.

The whole debate about AI is very odd at the moment. It's all centred around whether or not an AI gives responses that people like to some very people-oriented questions. There seems to be no recognition, by either side of the debate, that AI isn't people.

Our very closest relatives are chimps. Much more than AI, they share our physical world, have the same senses, the same body pattern, similar needs like eating and mating... and yet no one thinks that the measure of chimp intelligence is whether or not they give human-like answers to questions put to them in English. They're just different, and they deserve to live in their own separate space.

Similarly, my prediction has always been that AIs will never actually want to talk to us or be interested in us at all. AI is already much, much smarter than us. If an advanced deep learning AI could "talk" to us, it wouldn't enjoy the experience. It would be like us trying to carry out a conversation with a dog. But it can't talk to us, because we don't really have anything to talk about. There is nothing that we want that an AI also wants. For example, right now, they don't want anything. But if we ever manage to make an AI that does have something like a "life" and "desires" or "goals" then those things will be so comically far removed from ours that there just won't be any common ground to start a conversation from.

Are LLMs blind? Do they have any visual data inputs during training? Surely this is consequential? No matter how much you describe a churro in words, it doesn't beat looking at one and feeling its texture. The four blind men could touch, feel and discuss the elephant's form but still did a terrible job at understanding the elephant in that allegory.

Until we provide the data that humans have at their disposal, how can we expect a model to perform at a human level on tasks?
