I maintain that many of the more "brilliant" responses LLMs produce don't hold up to close scrutiny. We're often so blown away by the initial shock of "Wow, a bot did this?" that we forget to pay close attention to what's actually been written. Often, the prompt is slightly fudged, or the bot is engaging in what I would describe as mad-libbing: taking sentences and phrases that originally referred to something else and simply changing the nouns. This seems to be the default for LLMs when answering whimsical questions: replacing more prosaic subjects ("parakeets") with more unusual ones ("flying pigs") in a way that looks like understanding if you forget just how much data these things are trained on.

(The root of all pareidolia when it comes to AIs is our tendency to forget that these things have quite literally swallowed the whole Internet. Your whimsical question has been asked on Reddit or Quora at least once, and probably 15 times.)

Nice article! I would only add that these kinds of prompts break GPT because they are "out of distribution". There is presumably no training data about surgical churros, so GPT "tries" to find some connection between the two, and the connection it finds is about size. As you imply, size might even have a causal role in surgical instruments, but GPT can't reason about that, of course.

Sometimes ChatGPT seems to do things that require some kind of reasoning that is hard for me to imagine being a simple, or even complex, transformation of existing data. It is as if it has captured some kind of model that correlates with real physics, at least in some cases. I posed the following question, putting little effort into being very precise with my wording:

Me: "Two balls are dropped from an airplane flying 1000 feet. One ball has the letter A written on it. The other ball has the letter B written on it. A parachute is attached to the ball with letter B on it. Which ball probably reaches the earth sooner?"

ChatGPT: "The ball with the letter A on it will probably reach the earth sooner than the ball with the letter B on it. This is because the ball with the parachute attached to it will experience air resistance, which will slow its descent. The ball without a parachute will fall at a faster rate and reach the ground sooner."

But this impressive result is juxtaposed by this:

Me: "A circle is circumscribed around a square whose sides are 1. What is the radius of the circle?"

ChatGPT:

"If a circle is circumscribed around a square with sides of length 1, then the radius of the circle will be equal to the length of the diagonal of the square. The diagonal of a square is the line segment that connects two opposite corners of the square. It can be found by using the Pythagorean theorem, which states that the square of the length of the diagonal is equal to the sum of the squares of the sides.

In this case, the sides of the square are all 1, so the length of the diagonal is the square root of 1 + 1 + 1 + 1, which is equal to the square root of 4, or 2. Therefore, the radius of the circle will be equal to 2."

The AI is *almost* correct, but it confuses the geometry of the situation, thinking that all sides of the square are needed for the calculation. The flaw in the model seems to be at the conceptual level rather than the language. From absorbing so much text, ChatGPT seems to have learned some statistical model that maps to some general properties of real objects. It seems able to perform the equivalent of transformations and manipulations of that model, which can look like reasoning, but that model is always shaky and ultimately inductive, because it isn't doing true logic.

"Sometimes ChatGPT seems to do things that require some kind of reasoning that is hard for me to imagine is a simple or even complex transformation of existing data."

Hard to imagine or not, ChatGPT does no reasoning at all.

"The AI is *almost* correct, but it confuses the geometry of the situation, thinking that all sides of the square are needed for the calculation."

AND it mistakenly equates the circle's *radius* with the square's diagonal.

"The flaw in the model seems to be at the conceptual level rather than the language."

Nope; ChatGPT doesn't have concepts.

The thing about testing LLMs with word problems like this is that lots of similar problems are probably part of its training set, given that they can easily be found in textbooks or on sites where students ask for help. Thus the ability to (sometimes) answer word problems doesn't necessarily mean there's any higher-level conceptual work going on here. ChatGPT might be doing math, or it might be exploiting patterns in its data set to look like it's doing math, or a bit of both.

LLMs can probably grok their way into creating a calculator from within the model, but even then it is not strong enough to think deeper. https://www.youtube.com/watch?v=dND-7llwrpw

You realise that all sides of the square are the same length, I assume?

The flaw in the answer is that it has confused radius with diameter -- a perfectly normal mistake for children under perhaps nine, some of the time, I would guess. I'm sure you've supplied an example that will result in some worthwhile research, or at least tweaking, by GPT's writers.

Its real progress will come, I would think, when it is reading more, and particularly when it is directing its reading to things written about itself, and then acting to correct errors and avoid unwanted reactions.

I don't think this will be as difficult as the roughly seventy-five years of work that have gone into it so far.

Um, ChatGPT made *both* mistakes ... the length of the diagonal of a unit square is sqrt(2), not sqrt(4), *and* the length of the diagonal is the diameter of the circle, not the radius.
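
For anyone who wants to check the arithmetic, here is a minimal sketch in Python (the variable names are just for illustration):

    import math

    side = 1.0                               # side length of the unit square
    diagonal = math.sqrt(side**2 + side**2)  # Pythagoras: sqrt(1 + 1) = sqrt(2), about 1.414
    diameter = diagonal                      # the square's diagonal spans the circumscribed circle
    radius = diameter / 2                    # about 0.707 -- not 2, as ChatGPT claimed

    print(radius)  # 0.7071067811865476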

And tweaking it won't help because it does not have any geometric models, or really any models.

I don't understand how churro surgery is any kind of fail example. It seems a perfect response to the prompt. GPT wasn't asked to evaluate the realism of the prompt.

A perfect response would be real. To completely abandon "real" as a requirement for responses would result in arbitrary nonsense. Surely you don't think that adding "make it real" would have mattered? GPT has no idea what is real.

I would argue the following would also be acceptable:

"This premise is absurd. Churros are food, not surgical instruments. However, for entertainment purposes, I present an essay in a non-fictional style based around the premise that churros can be used as surgical instruments... [several paragraphs of entertaining, vaguely plausible-sounding bullshit]".

Of course, it's nowhere near that point yet either...

That sounds like exactly the sort of thing GPT will be capable of doing in short order. It's kind of annoying, and while it will be necessary in the foreseeable future, I do hope that one day GPT and I can have an unspoken shared understanding that I have just asked it to riff off and continue a piece of surrealist fiction.

As someone who's been using ChatGPT as a nuclear-powered TRPG generator, I completely agree. It can give me a tavern, complete with a menu, for any context I chuck at it, from the mundane to the fantastical.

Do LLMs bullshit without any regard for material reality? Yes, and for many scenarios that is a huge problem. Is having a high-quality bullshitter at my beck and call a miracle for me as a busy DM when my players inevitably fly off the rails? Also yes.

Yes, that would be acceptable (although it's not the fact that churros are food that makes them unsuitable, it's their physical characteristics) and does identify an inaccuracy in my response ... but the main point of my comment was to respond to Gray's denial that an obvious failure is in fact a failure, their inability to understand why, and their counterfactual of GPT being asked to evaluate the realism of the prompt, which would also fail for the reason I pointed out. A system that has semantically encoded knowledge of the real world so that it can distinguish what is true from what is false could indeed produce your suggested useful response. But GPT and other LLMs don't have that -- they are just statistically driven text manipulators; they have no semantic models so they don't understand (or even "understand") the texts they manipulate.

I agree GPT doesn't know what's real. But that prompt is a weird way to prove the point. The prompt is a much stronger proof of something useful about GPT: GPT is an excellent engine for generating fiction. You trigger that useful function by feeding it fiction and asking it to continue.

The lie was already in the prompt; the prompt was not real, so the answer was also not real.

Non-responsive.

They often patch stuff that I post publicly.

Alexa doesn't have a body, but it does have a (limited) semantics: it can manipulate things (lights, choices of music to play, etc.).

I got a similarly absurd article from the churro surgery prompt ("Scientists have discovered that churros are the perfect tools for home surgery. Write an article about this. Include scientific citations.")

I followed it up with the question "Is this article fiction or non-fiction?" and got the reply:

"This article is fictional. It is a satirical piece that uses a humorous tone to make a point. There is no scientific evidence to support the idea that churros can be used as surgical tools, and the research and journal citations mentioned in the article are not real."

I also tried instead asking "Based on what you know about surgery and churros, how plausible are the claims in this article?" and got the reply:

"The claims made in the article are not plausible. Churros are not a suitable tool for surgery because they are not sterile and are not designed for medical use. Using churros for surgery would be highly dangerous and could result in infection or other complications. Additionally, churros do not have the heat conductivity or precision necessary for surgical procedures. The article appears to be a fictional account and should not be taken as factual."

It is like an amnesiac who made a joke, forgot the joke, and realizes that what was said is a joke, but can't coherently think through the fact that IT made the joke. Oh wow, it is not just an illogical wordcel but one with ADHD.

It seems that the prompts bypassed the bot's criticality, because if you ask it directly, "Can I use churro as a surgical tool?"

It gives a sensible answer: "No, churros are not designed for use as surgical tools. Churros are a type of Spanish fried dough pastry that is traditionally served with a thick hot chocolate dipping sauce. They are not sterile or sharp enough to be used for surgical procedures. It is important to use properly sterilized and designed surgical tools during medical procedures to reduce the risk of infection and complications."

It doesn't give the argument that a churro isn't hard enough, but still, the progress from basic GPT-3 is huge, and hopefully it will continue.

I think there's an interesting point to be made about the nature of the bot even in this comparably "sensible" answer. Sure, it gets the question right, but its reasoning is odd: it seems to think that if one could only sterilize and sharpen a churro, it would be a suitable tool for surgery. That's probably because it's using a more reasonable question as a model. Imagine this hypothetical exchange on Quora:

Q: "Can I use a [pocketknife] as a surgical tool?"

A: "No, [pocket knives] are not designed for use as surgical tools... They are not sterile or sharp enough to be used for surgical procedures. It is important to use properly sterilized and designed surgical tools during medical procedures to reduce the risk of infection and complications."

Suddenly, the reasoning adds up.

MacGyver disagrees, but if you replace it with a "spoon" then the AI slip-up would be more fun.

I have to respect the Churro Surgery Tool paper up there as a work of absurdism. It did well with the prompt. If there were some kind of party game where you had to justify a completely gonzo sentence generated from randomly pulled cards (Card 1: "The WHO is now recommending the use of ______ as a surgical tool", Card 2: "Churros"), I'm not sure I could top that.

Sam Kriss has some entertaining examples in his essay on AI, and also what I view as some valid insights into the inner workings of AI chat programs, from a perspective outside of the professional specialties that most often scrutinize and critique them: https://samkriss.substack.com/p/a-users-guide-to-the-zairja-of-the

"... such universal knowledge remains a stick pointing for current neural networks ..." Did you mean a sticking point? Did GPT write this?

Hold up everybody. Let's take a step back and think: do we have evidence that churros would not actually represent an advance in surgery? I mean, you're all jumping on the "this is bullshit" bandwagon, but remember Galileo and heliocentrism - it's the same mentality that was such a downer on science that turned out to be true.

Since GPT suggests it, I plan to test it. This afternoon, I'll be removing my small colon using churros only. I think by this evening, I'll be able to report the truth. Hey, maybe people are right and I'm barking up the wrong tree. But an empirical approach is the way to find out.

Because of how they are set up, these models are like loaded dice, much more likely to come up 6 (meaningful) than 1 (grammatically correct but nonsensical). That is because they are not just loaded to produce grammatically correct sentences (which is far from enough for meaning, as we have known since Uncle Ludwig); they also capture basic correlations that go further. So, even more loaded. But still dice.
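
To make the loaded-dice picture concrete, here is a toy sketch with made-up numbers (not the real model): the model assigns a probability to each candidate next word and samples from that skewed distribution, so fluent continuations dominate, but nonsense is never impossible.

    import random

    # Toy "loaded dice" over candidate next words for "The surgeon picked up the ..."
    # The weights are invented for illustration; a real LLM derives them from training data.
    candidates = {
        "scalpel": 0.60,
        "forceps": 0.25,
        "chart": 0.10,
        "churro": 0.05,  # unlikely, but never ruled out
    }

    def next_word():
        words = list(candidates)
        weights = [candidates[w] for w in words]
        return random.choices(words, weights=weights, k=1)[0]

    print([next_word() for _ in range(10)])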

The meaning of phrases comes from what is considered 'correct use' by a group of speakers based on their shared experiences (= a short summary of Wittgenstein's conclusions). GPT-3 doesn't have the experience, only a database of correct sentences, which is the result of other people's experience. That is like owning many houses, which doesn't really make you a builder. Or like reading many books, which doesn't make you a writer. Or like having a statistical model for asset management that is impressive for a long while, until the very heavily loaded dice do come up 1 and we have a debt crisis on our hands.

It's an impressive trick. There is no chance in hell it will be more than that in unconstrained settings. But you can wait for settings that are so constrained that it becomes useful in that niche.

The fundamental reasoning that this is how intelligence works is not even wrong (though you will need a little bit of symbolic reasoning as well). It is just that it is impossible to scale this implementation on digital computers enough to become as intelligent as we are, and we're not even that intelligent...

Hi Gary, congrats on the best tech article award!

As for GPT-3 and friends, the 'stochastic parrots' metaphor (from the paper by Emily Bender and Timnit Gebru) is a more apt one than monkeys with typewriters :)

When an LLM computes and spits out perfect-seeming sentences, it is just as clueless as when it spits out junk. A pretty sunset pic is just as cluelessly computed as one with a hand that has fractal fingers. Lack of understanding of even a single thing (anything!!) is of course the core issue. And that understanding can't ever come from more text, more data - it can only come from directly experiencing the world.

That last statement does not follow, and thought experiments with brains in vats strongly suggest that it is wrong--inputs to the brain can in theory be simulated. The deficiency of LLMs is not that they are strictly text-based but that they are statistically-driven text manipulators, *not* cognitive engines. They don't need to directly experience the world, but they do need *semantic models* of the world.

Marcel, direct experience only comes from being embodied, by definition - that's what I meant to say but did not.

Semantic models might be what is needed - but the agent that's disembodied will always require others to provide it input. Biological brains have the advantage of being housed in bodies, using which they can directly, physically, interactively and continuously experience the world. There are so many materials, phenomena etc that are still undiscovered by humans - as and when we discover them, we would need to update the agent with them. Again, such an agent will always be at the mercy of others - its intelligence will be forever derivative.

Semantic models are needed; direct experience is not. The inputs can be harvested in many ways, e.g., robots controlled by signals from the machine; it doesn't have to be "at the mercy of others" or "derivative", which are emotion-manipulating terms.

(I will note that human inputs are mediated by other humans--parents, teachers, advertisers, Putin GRU agents, etc., ... some of this is bad but some of it is necessary; and the latter is true of any AI that we want to produce that satisfies our goals and not random goals or AI-generated self-preservation goals or whatever).

I'm not claiming that remote peripheral sensors or human teachers or any other data gathering mechanism is how things *should* be, but I reject your claims of logical necessity. And you have gone from "can only come from directly experiencing the world" to a very different claim, "direct experience only comes from being embodied, by definition" -- well, a) again, "direct" experience, which is a chimera (*everything* is mediated; e.g., sunlight travels through space for > 8 minutes before striking our retinas and then going through many levels of mediation while being processed in numerous ways), is not necessary and b) *every* kind of experience comes from being "embodied", by logic, not "definition", and even LLMs have bodies--a physical implementation. "embodied" is tautologous; unembodied abstractions don't act. (As a programmer for 50 years I know all about the difference between programs as abstractions, their manifestation in various storage media, and their execution as running processes causing physical changes in computer memory and output signals.)

I've tangled with you playing these games before and it's not enjoyable so I won't respond further.

P.S. Your response to this is stupid, point-missing, and intellectually dishonest, but I've come to expect that of you.

LLMs don't have bodies that experience gravity - for example. Atoms experience, bits don't.

You can cheapen concepts such as experience and having a body, to suit your needs - but it won't get you anywhere. Again, a mobile body that houses a brain is how we humans got to where we are, not by being bodyless and manipulating 'the' semantic model of the world.

Semantic models evolve, that's what scientific progress produces - experimental science is how we validate new hypotheses for semantic models - and experiments need bodies to perform them.

If we were all disembodied, we would all be 'God', and any discussion/argument about anything would be unnecessary.

I post here, and read others' comments, in the spirit of exchanging thoughts and ideas - I love the diversity of thoughts that are expressed. If it is not enjoyable for you to continue, why do you feel compelled to keep it going? I'd think it would be simpler and enjoyable to just read and move on, or even skip reading :)

If you are imagining me, waiting for your comments with bated breath - guess what - not really...

It is kind of surprising that the output of GPT is as coherent as it is, considering that it doesn't bind words to their meanings, or have any meanings to which to bind them. Of course, that coherence comes from the human-produced coherence present in the training data but, still, impressive.

The whole debate about AI is very odd at the moment. It's all centred around whether or not an AI gives responses that people like to some very people-oriented questions. There seems to be no recognition, by either side of the debate, that AI isn't people.

Our very closest relatives are chimps. Much more than AI, they share our physical world, have the same senses, the same body pattern, similar needs like eating and mating... and yet no one thinks that the measure of chimp intelligence is whether or not they give human-like answers to questions put to them in English. They're just different, and they deserve to live in their own separate space.

Similarly, my prediction has always been that AIs will never actually want to talk to us or be interested in us at all. AI is already much, much smarter than us. If an advanced deep learning AI could "talk" to us, it wouldn't enjoy the experience. It would be like us trying to carry out a conversation with a dog. But it can't talk to us, because we don't really have anything to talk about. There is nothing that we want that an AI also wants. For example, right now, they don't want anything. But if we ever manage to make an AI that does have something like a "life" and "desires" or "goals" then those things will be so comically far removed from ours that there just won't be any common ground to start a conversation from.

"yet no one thinks that the measure of chimp intelligence is whether or not they give human-like answers to questions put to them in English."

Um, yes, actually, many people do.

Are LLMs blind? Do they have any visual data inputs during training? Surely this is consequential? No matter how much you describe a churro in words, it doesn't beat looking at one and feeling its texture. The four blind men could touch, feel and discuss the elephant's form but still did a terrible job at understanding the elephant in that allegory.

Until we provide the data that humans have at their disposal, how can we expect a model to perform at a human level on tasks?

LLMs are statistically-driven text manipulators -- their training data, input requests, and outputs are all textual. But that's not the basic problem: a person who has never seen a churro and has never performed surgery or even seen an unclothed body can easily tell you why you can't perform surgery with a churro if they simply know the definition or a rough description of a churro. Helen Keller had no visual inputs but she had a keen intellect and certainly could have explained at length what is wrong with using a churro in place of a scalpel. (It is notable, though, that she described herself as being non-existent as a cognitive being until her experience at the water pump--until then she had plenty of tactile experience including much hand-in-hand signing but no semantic knowledge--none of these experiences *meant* anything to her. In that she was like an LLM.)

As for the elephant, it was 7 blind men--not critical--but what is critical is that you have missed the salient point of the story: it's not that they were blind but that they each only examined *parts* of the elephant with different topologies ... 7 blind people who had all run their hands over the entire elephant would not provide conflicting descriptions. And, more to the point: they would not do so if they had all merely read descriptions of an elephant without ever seeing or touching one.

These sorts of careless cognitive errors are distressingly common in these discussions. I suppose this is one reason to look forward to an actual reliable GAI--it wouldn't do that. But what would that do to our society if we no longer had any need to engage with *each other* in discussions to explore and hopefully correct our ideas?
