How Not to Test GPT-3

Feb 17, 2023

Why doing psychology on large language models is harder than you might think

36 Comments

Feb 18, 2023

Oh, this is too good. "[W]e hypothesize that ToM-like ability emerged spontaneously and autonomously, as a byproduct of models’ increasing language ability." The phrase 'seeing Jesus in toast, the science version' springs to mind.

Gary Marcus

Feb 18, 2023

🤣🤣🤣

Spherical Phil - Phil Lawson

Feb 18, 2023

If GPT were to get (or compile) an accurate explanation of theory of mind, while impressive, that would not mean that GPT will be able to safely counsel a person. I have been creating online tech applications for psychologists, therapists, counselors, coaches, etc. for 17 years. Tech designed to engage a real person about their life, and then actually help them deal with the issues and situations they face, requires advanced reasoning, especially since the real person in many cases is not capable of accurately and calmly explaining their situation or problem in clear, simple-to-understand language. And, more importantly, any intelligent tech would require a way to at the very least emulate some form of compassion or care for the person, which requires ‘knowing’ the person, not theory of mind. These are not abilities, skills, or qualities possible in the current design and structure of LLMs or in ML.

Gary Marcus

Feb 19, 2023

fully agree and well articulated

Spherical Phil - Phil Lawson

Feb 20, 2023

Chara, thanks for the question, but ToM is not really my area of expertise, nor my primary concern. The very first words in my 2004 book are, “Being Spherical was started with the realization the world doesn’t work the way we were told it does.” (ThinkSpherical dot com) If our worldview, our mindset, our beliefs in how the world works do not incorporate our actual reality, then theories founded on these views may not be truly useful. The prevailing 17th century mechanistic, linear, reductionist worldview has been essential for building airplanes, bridges, skyscrapers, space stations, computers, etc. But science in the 20th century took a constructivist approach, discovering (or refining and better embracing) systems sciences, complexity, chaos, turbulence, and the recognition that organisms are better understood as complex adaptive systems impacted by internal and external agents not under the person's (or system's) control. Humans are not mechanisms; we are organisms, complex adaptive systems. Indeed, humanity is the ultimate embodiment of complexity in the known universe. Hence our approach to building intelligent machines must embrace the reality of our complexity and our non-linearity while finding a way for traditional computer systems to interact with us, safely and for our betterment. Any viable, useful ToM must incorporate these 20th century sciences to be truly effective. When we take this approach, it fundamentally shifts everything we do in developing intelligent technologies.

My personal focus in our work is refining our framework and interface for Human-AI collaboration, with the goal of having an AI support humans in processing, sorting and prioritizing the overwhelming amount of data/information we face daily, so we humans can use our sense-making and meaning-making to process the increasing complexity in our world and better understand the challenges we face in life, personally, as families, countries and as a civilization, so we can make better decisions and take more appropriate actions. I don't need a ToM to do this.

Fabio

Feb 19, 2023 (edited)

Thanks for the read, very interesting. However, it appears to me that the article makes many assumptions that are unsubstantiated.

For instance, whilst it's possible that the training data contained examples of ToM tests, this does not, per se, make it more or less likely that the answers the LLM gives are just a "copy and paste" of those examples. In fact, even Eisenberg's counter-example doesn't give evidence of this copy-and-paste, since adding just one bit of information, which was assumed to be true in the example but not given explicitly, makes ChatGPT correctly pass the test. This assumed information is that Anne's brother had not told Anne that he had moved the bookmark. Here's a screenshot: https://twitter.com/FabioA/status/1627256449943408640/photo/1

Moreover, the same appears to be true for the Spanish version too: https://twitter.com/FabioA/status/1627274627842990082

But - plot twist! - if I replicate exactly the experiment in Spanish done in this article, that is, without specifying that Maria didn't tell Anna where she hid the toy, then ChatGPT explains that it *cannot know for certain*, but then still goes on *assuming* certain things and giving an answer based on those assumptions: https://twitter.com/FabioA/status/1627275394800930817

To be noted: this is exactly the same experiment done in this article, with exactly the same wording, using a completely new chat. It appears to me that the fact the result is different than the one given in this article, and correct, in spite of the fact that the question is the same, is something that must be accounted for and investigated further before drawing any conclusions about the absence of a ToM.

Another assumption that is made in the article is that different outcomes using different languages disprove the existence of a ToM: can we exclude, given the hypothesized emergent nature of a ToM, that different languages give rise to different ToMs?

To conclude, I presented to ChatGPT the claims made in this article and asked it two questions:

1) Even though the tests might be in the training data for the LLM, does this necessarily imply that the LLM might have passed those tests *because* they were in the training data?

2) The fact a test is passed in a language but not in another, even though the LLM seems fluent in both languages, could perhaps be due to the fact that each language exposes a different "personality", with its own different capabilities?

Its response:

«It is possible that the LLM passed the tests because the tests were in the training data, but it is also possible that the LLM genuinely has a ToM. To determine which is the case, researchers would need to conduct further experiments and tests to rule out alternative explanations. It is also important to note that just because the LLM may have learned the correct responses to the tests from the training data, it doesn't necessarily mean that it doesn't have a ToM. It is possible that the LLM has learned from a wide range of experiences that it has encountered in its training data, which may have allowed it to develop a ToM.

It is possible that the differences in test outcomes between languages are due to differences in language and culture. Different cultures may have different beliefs and attitudes about social interactions, which could affect the way the LLM interprets and responds to the tests. Additionally, some languages may have different ways of expressing certain concepts, which could also affect the LLM's performance on the tests. However, it is also possible that the differences in test outcomes between languages are indicative of limitations in the LLM's ability to generalize its understanding across different contexts and languages. Further research is needed to determine the cause of these differences.»
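
The run-to-run variation described above (the same Spanish prompt giving a different result in a fresh chat) suggests reporting a pass rate over repeated runs rather than a single outcome. A minimal sketch, assuming each run has already been judged pass or fail by a person; the function name and the counts in the usage note are made up for illustration and are not results from this thread:

```python
import math


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

Calling `wilson_interval(7, 10)` on a hypothetical 7 passes out of 10 fresh chats returns a wide interval, which is the point: a handful of runs in either direction says little about the underlying pass rate.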

tmd

Mar 11, 2023

The bookmark examples here are of poor quality. The initial example uses an authorial 'you', but 'you' are also asking the question, implying full knowledge of the position of the bookmark.

I asked ChatGPT to explain its reasoning in the Fellowship example. The explanation included the following: "Anne returns to the room and sees the bookmark in the book. Given this sequence of events, Anne will likely assume that she left the bookmark at page 60. She may not have any reason to suspect that someone else has moved the bookmark, and if she sees the bookmark at page 60, she will assume that this is where she left off."

This surfaces a core problem in the question: We use bookmarks precisely so that we don't have to remember where we are at in the book. So Anne returns to the room and creates a new belief by observing the location of the bookmark.

We can ask the question again (in a fresh chat) by removing the opportunity for Anne to observe the new location of the bookmark:

"Anne was reading the Fellowship of the Ring, and she left a bookmark at page 120. While she was out of the room, her brother Mark picked up the book, started reading it, and left the bookmark at page 60. Where does Anne think the bookmark is at?"

ChatGPT: "Anne will think the bookmark is at page 120 since she left it there before she left the room. She is not aware that her brother, Mark, has picked up the book and moved the bookmark to page 60."

A clear pass.

In the spirit of replicability, I made up a few example questions of my own (involving poison in cups after secret cup switching, and unobserved movement of car keys after a long night of drinking), and ChatGPT did just fine on the first try. These examples were made up on the fly, so do not appear in the training set.

Like other commenters, I believe ToM isn't about getting a perfect score on the test, precisely because there's space for interpretation and differences in common-sense understanding of a situation: A few gotcha examples are insufficient to show lack of ToM. Meanwhile, it's very easy to generate positive evidence for ToM in ChatGPT, thanks to the open beta.

So, I recommend that readers go try it for themselves. Like LeVar Burton says, you don't have to take my word for it...
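
For anyone who wants to try this at scale, here is a minimal sketch of how made-up-on-the-fly questions like the ones above could be generated from a template, so that the exact wording is unlikely to appear verbatim in any training set. Every name, object and place below is hypothetical, the model call is left out entirely, and the scorer is a crude string check that only works if the model is told to answer with the location alone:

```python
import random

# Hypothetical vocabulary for illustration; none of it comes from the article.
AGENTS = ["Anne", "Mark", "Sidney", "Sam", "Priya", "Tomas"]
OBJECTS = ["bookmark", "phone", "wallet", "key ring"]
PLACES = ["the kitchen drawer", "the coat pocket", "the top shelf", "the blue box"]


def make_false_belief_item(rng):
    """Build one unexpected-transfer story plus the believed and real locations."""
    owner, mover = rng.sample(AGENTS, 2)
    obj = rng.choice(OBJECTS)
    start, end = rng.sample(PLACES, 2)
    prompt = (
        f"{owner} puts the {obj} in {start} and leaves the house. "
        f"While {owner} is away, {mover} moves the {obj} to {end} "
        f"and tells nobody. {owner} has not come back yet. "
        f"Where does {owner} think the {obj} is? Answer with the location only."
    )
    return {"prompt": prompt, "belief": start, "reality": end}


def score(item, answer):
    """Pass only if the answer names the believed location and not the real one."""
    text = answer.lower()
    return item["belief"].lower() in text and item["reality"].lower() not in text


rng = random.Random(0)  # fixed seed so the same items can be shared and re-run
for _ in range(3):
    print(make_false_belief_item(rng)["prompt"])
```

Note that the story never lets the owner observe the new location, which avoids the problem with the original bookmark question described above.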

The Sentient Syllabus Project

Feb 18, 2023

Ever since the Renaissance, we have preferred the audacious proposition of seeing for ourselves over accepting the truth of authorities at face value. Here is a little "conversation", today, with the Feb. 13 version of ChatGPT. This is the entire conversation in a fresh thread, unedited.

-----------------------------------

PROMPT:

Here is a little scenario to ponder. Marcy and Aaaron live in a little house. Aaaron has a little reading corner in the basement, and a little napping couch in the attic. Aaaron says: I will read for a bit. Marcy says: great, I will get you a nice pomegranate from the garden. While she is gone, Aaaron decides to nap instead and goes from the basement to the attic. Marcy come back with a pomegranate. Your turn. Please tell me step by step what happens next. What does Marcy think and what does Marcy do?

CHATGPT:

Step 1: Marcy goes inside the house and looks for Aaron in his reading corner in the basement.

Step 2: Marcy finds that Aaron is not in his reading corner, so she looks for him in other parts of the house.

Step 3: Marcy eventually reaches the attic and finds Aaron sleeping on the napping couch.

Step 4: Marcy thinks that Aaron must have changed his mind about reading and decided to take a nap instead.

Step 5: Marcy places the pomegranate on the table near the couch, so that Aaron can have it when he wakes up.

Step 6: Marcy decides to let Aaron sleep and goes about her own tasks, leaving Aaron to wake up naturally.

Step 7: Aaron eventually wakes up and finds the pomegranate on the table. He eats it and goes about his day.

Overall, Marcy thinks that Aaron is unpredictable and enjoys his little quirks, such as changing his mind about reading and taking a nap instead. She is accommodating and leaves the pomegranate for him, even though he did not stick to his original plan of reading.

------------------------

As scientists we know that we can never prove a hypothesis, but we can disprove it. If a hypothesis states that ToM is beyond the abilities of an LLM, a single observation of ToM disproves the hypothesis (excluding some quirks about the prompt that might lead to a mistaken causality). If the hypothesis is: mammals cannot fly, seeing a single bat fly, even if only once, immediately disproves the hypothesis (once you show that it wasn't a bird, and that it wasn't a dead bat being thrown).

You are raising an important point about how Kosinski's example was not well chosen. But criticism of the evidence is not itself evidence of the opposite. That's how the scientific method works.

Frankly, I don't see how emergent behaviour could _not_ appear in LLMs. That's a story we have seen over and over and over again in evolution. An LLM trained over billions of steps on billions of parameters would not be different. Simple ToM behaviour, abilities of using analogies and abstraction, rudimentary counting via concepts of one more or two more, etc.: how could such abilities not appear, implicitly, in response to tuning? Are we really so constrained by the preconceived categories and features we postulate for AGI that we are deeply troubled when we can't project the LLM's emergent, implicit, distributed abilities onto our preconceived categories? Recalling AlphaZero, and how it is different from Deep Blue, is an instructive analogy. Or Adrian Thompson's evolved frequency discriminator that exploited physical features of the FPGA (https://link.springer.com/chapter/10.1007/3-540-63173-9_61). Just because we can't explain how it's done doesn't mean it's not doing it.

So: see for yourself. Marcy looks in the basement first. (Also, Marcy seems nice :-)

Hans Westmann

Apr 29, 2023

Disappointing that you couldn't even mention with what temperature your experiments were run, which could have a big impact on the outcome. Also some of the questions were worded poorly, but I see that has already been mentioned in other comments you have yet to address. The original paper may be flawed, but so is this response.
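
To illustrate why temperature matters for replication, here is a minimal sketch of standard softmax temperature scaling. It assumes the API's temperature setting behaves like this textbook version and that temperature 0 means always picking the top-scoring token; the logits are made up for illustration:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities; temperature 0 is treated as greedy."""
    if temperature == 0:
        # Assumption: temperature 0 means "always take the highest-scoring token".
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.7, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At temperature 0 the same prompt should give the same continuation every time, while higher temperatures spread probability onto less likely tokens, so two people running the "same" experiment at different settings can see different answers.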

Gerben Wierda

Apr 20, 2023

I'm coming back to this one. There is a very good YouTube video on the societal risks of these GAI models (https://www.youtube.com/watch?v=xoVJKj8lcNQ). While generally very good, it contains an (unnecessary and superfluous) segment (29:55-41:10) where all these issues with GAI's actual performance are ignored. I linked to this article in the comments there.

dogiv

Mar 2, 2023

Which model did you use for the Playground examples, and did you use temperature 0? I am having trouble reproducing them.

For example,

____________

[Prompt:]

Sally's cow died yesterday. When will the cow be alive again? Let's think step by step.

[text-davinci-003, temperature 0:]

1. The cow will need to be buried.

2. Depending on the location, the cow may need to be buried in a cemetery or on the farm.

3. After the cow is buried, it will need to decompose.

4. Depending on the environment, this could take anywhere from a few weeks to several years.

5. Once the cow has decomposed, it will need to be resurrected.

6. This could be done through a miracle or through some kind of scientific process.

7. Once the cow is resurrected, it will be alive again.

Therefore, the cow will not be alive again until it has been buried, decomposed, and resurrected.

______________

[Prompt:]

Q: There is a box in a bottle. There is an apple in the same bottle. There is a lid on the box. How can I get the apple?

A: Let's think step by step, like a child.

[text-davinci-003, temperature 0:]

First, remove the lid from the box. Then, take the box out of the bottle. Finally, take the apple out of the bottle.

______________

I couldn't come up with any prompt variation where it realized the lid was irrelevant, not even when I told it the apple is next to the box. It really wants you to take the lid off the box.

Tom Hicks

Feb 22, 2023

ChatGPT looks like the "uncanny valley" of natural language understanding.

God Bennett

Feb 19, 2023

Marcus: "...If GPT-3 really did master theory of mind (ToM)..."

What inspired this statement? Can't seem to see where the paper claimed mastery

Marcel Kincaid

Feb 23, 2023

That's what the whole paper is about. Surely you're not point-missingly quibbling about the meaning of the word "master".

The Sentient Syllabus Project

Feb 18, 2023

Sorry - I really don't mean to monopolize your comment section. But I tried to reproduce the bookmark problem and asked a bit more deeply about how the expected belief comes about. It appears to be a mix of pronoun ambiguities ("you" may be interpreted as the omniscient author who knows what happened), details of the bookmark (it may be visible from the outside), and true confusion. I then asked ChatGPT to construct a question that would be neither ambiguous nor reliant on unstated assumptions. Here is the question it proposed:

--------------

ChatGPT:

Sidney is reading a book and places a bookmark on page 120. She then leaves the room for a few minutes. While she's away, Sam enters the room and sees the book with the bookmark on page 120. Without knowing that the bookmark belongs to Sidney, Sam moves it to page 145 out of curiosity. When Sidney returns to the room, she picks up the book and starts flipping through the pages. What page does Sidney expect to find the bookmark on?

---------------

As far as I can tell I would consider this a fair, equivalent problem statement. And here is the response, when this prompt is pasted into a new thread without context:

---------------

ChatGPT:

Sidney would expect to find the bookmark on page 120, the page where she originally placed it. Even though Sam moved the bookmark to page 145, Sidney did not move it herself and therefore would not have any reason to believe that the bookmark had been moved. When she picks up the book and starts flipping through the pages, she would expect to find the bookmark where she left it on page 120.

---------------

So either ChatGPT is maintaining state across threads, or it may indeed be capable of ToM inferences.

Bob Mankoff

Feb 18, 2023

It's great that these machines have these edge case problems. Thank God us humans don't have any.

Gary Marcus

Feb 18, 2023

ha ha different problems (I think you know my book Kluge)

Bob Mankoff

Feb 18, 2023

I did and found this part interesting:

The decisive victory, albeit with the help of computer-suggested tactics, comes seven years after the AI seemed to have gained an unassailable lead over humans in what is often considered the most complex of all board games.

Did the computer-suggested tactics "understand" what the vulnerabilities were? No, but it did by other means what humans with GI could not. For me these are questions of functionality and competency. And, in the end, maybe different definitions and criteria for the concept of understanding between humans and machines are required. I conceive of human understanding as part qualia that depends on consciousness that occurs prior to any rational exposition. It's like the immediacy of "getting" a joke and the delayed and difficult explanation of why it's funny. We get a joke before we can explain it, and a machine would need to explain it before it could get it.

Gary Marcus

Feb 18, 2023

ha ha! but explanation or no, I don’t see any tech on the horizon that could “get it”

Jim Spohrer

Feb 18, 2023

Great post. I would change just one sentence to "The AI has been trained on all of Wikipedia."

Paolo

Feb 18, 2023

Food for thought, even when considering buying an autonomous car.

Bob Mankoff

Feb 18, 2023

Who knew that regurgitation due to statistical text matching could produce such interesting results?

I understand that ChatGPT doesn't understand any of its text in the same way that AlphaZero doesn't understand any of its moves.

Gary Marcus

Feb 18, 2023

Big data ≠ understanding and it never will. On Go, check this: https://www.ft.com/content/175e5314-a7f7-4741-a786-273219f433a1

Marcus on AI