103 Comments
Lasagna's avatar

I work a lot with LLMs and this is a weird recurring problem I see. When you request a list of things - basically what you did here - if they get multiple items in that list wrong, they often can never get it right. Try asking a question that requires ChatGPT to come up with a list of Roman Emperors who did X.

Basically anything with even slightly grey borders seems to make it lose the plot. Ask it for 10 famous quotes from Roman Emperors. I always get Julius Caesar, or Cicero, or a pope or two in there. It’ll admit that it got it wrong and then give a revised list with quotes from Virgil, Tacitus and Mussolini.

khimru's avatar

It's not a “weird problem”. It's just what LLMs *ARE*. They are Google search cousins, better in some aspects, worse in some aspects. They find the “best pieces” in the corpus of documents on the web and combine these in the “best way”. That's it, no more, no less.

And just as Google Search gives you the best answer when your search request is short and focused… LLMs work best when the input is short. By talking to them you make the input longer instead.

If you want something from an LLM, the best way is to change the request until you get an acceptable answer. ChatGPT has a UI for that.

Stephen Bosch's avatar

How is engaging in a dialogue with it not "changing the request"? And where is the value-add if I have to be clairvoyant to determine what to ask it in order to get a usable response?

Asking more questions that build on its previous output delivers more information about my expectations and ought to improve the output, but it does not.

Why not? Because this supposed AI is artificial, but not intelligent.

khimru's avatar

When you are “engaging in dialogue” you are adding lots of data that makes it ever harder for the LLM to properly cull its database. At some point you overwhelm it and it starts producing nonsense.

The value-add is in the LLM's knowledge base: an LLM's neural network includes a lot of facts that are not known to you, and, importantly, it can pull snippets from it using very vague requests… much vaguer than what you need for Google Search.

There is no need to be “clairvoyant”: when you see that a response is unsatisfying, you just click the “edit” button and CHANGE the prompt. Do NOT try to hold a dialogue; REFINE THE PROMPT.

You are acting like a kid now, honestly: you were promised some kind of “intelligence” and are now crying “foul” because an LLM is not that… well, it's still a useful tool, there is just no intelligence involved.

Ignore the marketing and use an LLM like you would use a weird Google Search derivative… the results will be much closer to your expectations.

Stephen Bosch's avatar

Come on.

You haven't answered the most important question, which is *how is this better?*

I can ask a search engine a natural language question and get better results than I do from ChatGPT, and I don't have to waste 20 minutes twiddling prompts to do it.

khimru's avatar

Are we talking about LLMs or ChatGPT, specifically? Search engines use LLMs, these days – that's how you "ask a search engine a natural language question".

As for why ChatGPT would be better… it can combine the answer into coherent text that you can include directly in an email or something like that.

Whether that ability is worth billions or not is a good question.

Stephen Bosch's avatar

I have been giving search engines natural language queries for more than fifteen years and have always obtained satisfactory results.

You don't need an LLM for a natural language query to work. Semantic search doesn't use LLMs and can deliver good results.

But let's assume you're right. Then LLMs have been around a long time (so, not novel) and they've hardly improved. Certainly not enough to justify all the money being poured into them.

Fabian Transchel's avatar

ChatGPT was introduced two and a half years ago.

You'd think that'd be enough time to fix some "minor weaknesses".

We know of course that these problems are structural, but I mean c'mon: It's such a joke that these guys still get VC money and have literally *nothing* to show for it but stuff that is made up a little better than the stuff made up before.

JohnnyW's avatar

It's almost as if LLMs are inherently unreliable?

Youssef alHotsefot's avatar

And, of course, no admission by the manufacturer that there's anything wrong with the product.

The eternal claim: The parrot is not dead.

Gary Marcus's avatar

Perfect skit for the LLM era

Youssef alHotsefot's avatar

Just in case anyone doesn't know what Professor Gary is referring to:

https://youtu.be/vnciwwsvNcc

A.I. Freeman's avatar

Clearly not a PhD in geography or economics.

Larry Jewett's avatar

At least not in geography.

Larry Jewett's avatar

After all, it did give multiple answers to the same question.

mpsingh's avatar

The PhD level reasoner making toddler level mistakes.

Wonder when the hype will die. These systems are not at all fundamentally better than the sort of AI that was ridiculed by researchers pre-ChatGPT.

I've been thinking about this for the past few days, and as far as I've seen there is no real progress in anything truly new to motivate the myopic view of exponential growth. And a lot of the foundations of modern AI were laid decades ago.

For example, even most neurosymbolic methods I see are taped-together arts and crafts, more or less LLMs + a verifier. (In that sense, aren't the agentic LLM coders neurosymbolic, with compiler output as the verifier? This is quite close to what FunSearch was.)

Or the 'RL' pretraining of 'reasoning' LLMs.
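
For readers unfamiliar with the "LLM + verifier" pattern mentioned above, here is a minimal sketch of it in Python. Everything in it is an assumption made for illustration: the candidate strings stand in for model output, and `first_candidate_that_passes` and `check` are hypothetical names, not part of any real tool. The "verifier" is simply Python's own compiler plus one test.

```python
def first_candidate_that_passes(candidates, check):
    """Return the first candidate source that compiles, runs, and passes `check`."""
    for source in candidates:
        namespace = {}
        try:
            # The compiler acts as the first verifier: syntax errors are rejected here.
            code = compile(source, "<candidate>", "exec")
            # NOTE: exec on untrusted text is unsafe; a real system would sandbox this.
            exec(code, namespace)
            # The second verifier is a behavioural test supplied by the caller.
            if check(namespace):
                return source
        except Exception:
            # Any failure just means this candidate is rejected.
            continue
    return None


# Hypothetical stand-ins for three successive model outputs.
candidates = [
    "def add(a, b) return a + b",         # does not compile
    "def add(a, b):\n    return a - b",   # compiles, wrong behaviour
    "def add(a, b):\n    return a + b",   # compiles, passes the test
]


def check(namespace):
    return namespace["add"](2, 3) == 5


accepted = first_candidate_that_passes(candidates, check)
print(accepted)
```

The loop only filters; it is only as good as the test it is given, which is the point being made about these systems being a generator bolted onto a checker.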

Gerben Wierda's avatar

There is an endless reserve of examples like this. Approximating the result of understanding by using token statistics of the results of actual understanding isn't understanding. This is especially true when logical reasoning is your goal.

Add to that all the RLHF strengthening of producing stuff people like to hear (which has a somewhat low correlation with what is true) and you get this.

So, you will be able to show these examples for years to come.

What is worrisome is that humankind's collective understanding is now being attacked by humans that employ machines that exploit our own weaknesses. Which is why showing these examples isn't moving the needle much. I respect your stamina, but the remedy against stupidity is not more true observations and facts (cf. Bonhoeffer and others).

Maarten Keulemans's avatar

I'm pretty sure that once AI *does* acquire AGI one day, you'll be the first person it'll joyfully turn into paperclips, Gary ;-)

Alex Tolley's avatar

Very interesting failure mode. I have seen multiple "oops, sorry" failures in another domain. At some point, one has to say, "I can do this faster myself, just give me the data table[s]."

Question: Would it have done any better if you had asked it to list the states with both major [sea] ports and above-average incomes, leaving out the need to color a map? Then, in a second stage, if that list was correct, ask the LLM to color the map using the list of states?

[BTW, a Google search this morning failed to find the fairly simple name that I was looking for, forcing me to track it down on my own. More enshittification of Google search?]

Stephen Bosch's avatar

It doesn't even get the lists right.

"Give me a list of all the government procurement portals for the German federal states with their URLs, please."

I get:

- a list with no URLs,

- containing portal names that were entirely made up

- or commercial aggregators that ChatGPT claimed were public

- or URLs that don't exist

Rinse and repeat. After wasting 30 minutes I made my own list with conventional search in five.

Good luck getting any accurate output out of it for anything that matters. I have already given up.

Alex Tolley's avatar

Your experience supports what I have suggested - use RAG with documents that contain the data, and just use the LLM as a natural language interface. It won't solve all the problems, like correctly coloring a map, but at least it will get the data from a table if it exists in the database.

Yesterday, I was watching a spectacular fail of Google's coding "assistant", Gemini 2.5, in Google's Firebase Studio to build a simple application. It couldn't manage to build a proper working prototype despite multiple attempts to correct the LLM, frustrating the person doing the demo, who wanted it to show its paces. Coding is one domain where these LLMs were showing promise, and I have managed to get an LLM to do very simple toy problems correctly. Supposedly, performance on math problems way above my capability is also rapidly improving, but how does one validate the output when it cannot be tested without the needed expertise?

I am reminded of a scene in the "Spider" episode of the TV series "From the Earth to the Moon". The LM keeps breaking a leg during testing. The engineer responsible for the calculations realizes he made a sign error and, ashamed, goes to see the program's boss. If a math-domain LLM made a similar error, who would detect it? The LLM wouldn't by itself. The potential for errors is very concerning in critical systems. Lists one can check. Map coloring can be checked. Code errors can be checked by running unit tests. But advanced math errors? To me it suggests that an LLM should be linked via an API to math software like Mathematica to ensure it does the correct transforms, and that the prompts should be geared to the way the software operates.
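
As a rough illustration of the kind of independent check described above, here is a small Python sketch that catches a sign error by evaluating both sides of a claimed identity at random points. It is a numeric spot check, not a proof, and it is not a stand-in for Mathematica; the function names and the example identity are assumptions chosen for illustration.

```python
import math
import random


def identity_holds(lhs, rhs, trials=200, seed=0):
    """Spot-check that lhs(a, b) and rhs(a, b) agree at random sample points."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    for _ in range(trials):
        a, b = rng.uniform(-10, 10), rng.uniform(-10, 10)
        if not math.isclose(lhs(a, b), rhs(a, b), rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True


def square_of_difference(a, b):
    return (a - b) ** 2


def correct_expansion(a, b):
    return a * a - 2 * a * b + b * b


def sign_error_expansion(a, b):
    # Same expansion with the middle sign flipped, mimicking a sign error.
    return a * a + 2 * a * b + b * b


print(identity_holds(square_of_difference, correct_expansion))
print(identity_holds(square_of_difference, sign_error_expansion))
```

A check like this needs no expertise in the derivation itself, only the two expressions, which is why pairing a generator with an external checker is attractive.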

Nate's avatar

Which version of ChatGPT was this?

Jonathan Richman's avatar

This is really a pretty vague question: "give me a map of the US and highlight states with (major) ports and above average income"

How do you define major? And why would you want it to leave out the midwest states? I'd expect any intern to get this question "wrong" also because the criteria aren't clear.

In addition, the model you use in this case is VERY important. A general model without reasoning like 4o (which I'm almost positive you used based on the answers) isn't really great at this. The question involves four separate steps: 1) look up port information, 2) look up income information, 3) combine those data, 4) write code to draw the map. You're using two separate tools: 1) web search and 2) coding/analysis.

I tried this with OpenAI's o3 model and even with the vague ask, it produced a perfect map (Substack won't let me share the image though).
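
To make the "combine those data" step concrete, here is a minimal Python sketch of it. The data below is placeholder data with made-up state names and numbers, not real statistics, and `states_to_highlight` is a hypothetical helper. It also shows why the criteria matter: "above average" here means above the mean of the listed values, which is only one possible reading of the prompt.

```python
# Placeholder inputs: names and numbers are illustrative only, not real data.
has_major_port = {"State A": True, "State B": False, "State C": True}
median_income = {"State A": 72_000, "State B": 81_000, "State C": 58_000}


def states_to_highlight(ports, incomes):
    """Return states that have a port and an income above the mean of all listed states."""
    average = sum(incomes.values()) / len(incomes)
    return sorted(
        state
        for state, has_port in ports.items()
        if has_port and incomes.get(state, 0) > average
    )


selected = states_to_highlight(has_major_port, median_income)
print(selected)
```

Once the lookups are done, the combination is a deterministic filter, so the resulting list can be checked before any map is drawn.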

Gary Marcus's avatar

it introduced the word major and then waffled. are you even sure it got it right? many of the answers might have looked convincing.

and the point is that each was presented as correct (or corrected). that’s not on me.

Sid Kaye's avatar

"Conversation inaccessible or not found. You may need to switch accounts or request access if this conversation exists."

Jonathan Richman's avatar

Might be because it's an enterprise account. Test it yourself with o3 and the prompt:

give me a map of the US and highlight states with (major) ports and above average income

Sid Kaye's avatar

Just saw your response today... and this is MOST interesting -- ChatGPT's own explanation for what is going on is that access is contextual, and what's being presented as real is not necessarily so.

https://chatgpt.com/share/6837269a-3fbc-8013-8a89-84450e378e0c

JohnnyW's avatar

The AI enthusiasts always blame the user... "You're asking it wrong!"

Here was an encounter I had today. I suppose you will tell me I did it wrong, too?

https://chatgpt.com/share/68222e3c-4fe4-8010-9b0c-dc2e730c2d11

Jonathan Richman's avatar

Again. You need to use the right model and the right tools. Granted…you shouldn't have to do this work or have this knowledge. It should “just work”. https://chatgpt.com/share/682290cb-eb84-8012-8eb6-c0b9847e0a58

jibal jibal's avatar

intellectually dishonest rationalizations

Jonathan Richman's avatar

Feels like the majority of these comments are "intellectually dishonest rationalizations"

jibal jibal's avatar

To you it would.

JohnnyW's avatar

This is literally "you're asking it wrong!". Don't you see that you so want AI to be real and reliable that you're lying to yourself? When you see it make a mistake, you blame yourself... "I used the wrong tool for the job" or "I didn't ask in the right way".

The truth is that 4o originally answered similarly to o3 when I tried. And then I opened a new chat and asked it again to be sure... and got the chat you saw. I asked it again and it answered the original way. LLMs are inherently random. It's built into their programming to make them sound human.

You can't just stop asking because you got the "right" answer. You should open a new chat with zero history and ask it again. Then you will see that a) you will get different answers to the same question at different times. And b) you only really know when it's the "right" answer when you already knew the answer... but what happens when you ask it about something you don't already know the answer to? You're likely to blindly trust it.
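
The "open a new chat and ask again" test can be sketched in a few lines of Python. This is only an illustration: `ask` is a hypothetical callable standing in for one fresh, history-free model call, and the fake model below just replays canned replies to mimic run-to-run variation.

```python
from collections import Counter
from typing import Callable


def answer_agreement(ask: Callable[[str], str], prompt: str, runs: int = 5):
    """Ask the same prompt `runs` times; return the most common answer and its share."""
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs


def make_fake_model(replies):
    """Stand-in for a real model: returns the canned replies one at a time."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)


fake_model = make_fake_model(["Yes", "yes", "No", "yes ", "No"])
print(answer_agreement(fake_model, "same question", runs=5))
```

A low share shows the answer is unstable. A high share still says nothing about whether the answer is right, which is point b) above.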

Hans Jurgens Smit's avatar

Well, this is exactly it. You can't give a monkey a laptop and expect anything useful. Using AI is a skill, one that takes practise. It's very clear you haven't put in the effort to become good at it.

ami's avatar
May 14 (edited)

I tried with OpenAI's free reasoning model (it should be o3-mini), replacing US states with Italian regions, and it gave me an even worse result than Gary's: it got everything wrong, swapping cities for regions, getting their geographical position wrong, and coloring regions that don’t have ports

https://chatgpt.com/share/682458b4-1b7c-8004-a641-dc79f4ea313e

Sure, the prompt is generic, but it makes geographical mistakes that even a child wouldn't make

Daniel Tucker's avatar

Of course it wouldn't let you...how was Gary able to share his? I think the question is more than specific enough for a system purporting to do what these systems are supposed to do.

Sid Kaye's avatar

Conversation inaccessible or not found. You may need to switch accounts or request access if this conversation exists.

Jonathan Richman's avatar

Might be because it's an enterprise account. Test it yourself with o3 and the prompt:

give me a map of the US and highlight states with (major) ports and above average income

Jonathan Richman's avatar

I meant share the image of the map since you can’t do attachments

Gene's avatar

Once ChatGPT gives a hallucinatory answer, no amount of prompting seems to be able to fix it. It follows a predictable pattern: overly enthusiastic praise for spotting the error, effusive apologies, and then even worse mistakes, like a panicked student unraveling under pressure.

Eric Cort Platt's avatar

I've noticed the same pattern with Grok. I've had to start a new chat thread to get out of the self-correction feedback spinout hole it gets into. It can't seem to "forget", is almost obsessive...

Gene's avatar

What if there were a dedicated teacher LLM whose sole role was to spot BS from “reasoning” LLMs? They would operate like a Sherlock Holmes/Dr. Watson team, using the Socratic method to uncover contradictions through dialogue.

Meinolf Sellmann's avatar

Clearly PhD-level! Maybe someone needs to clean house at OpenAI’s admissions? ;)

Gerard's avatar

Strange. It almost feels like ChatGPT was never really ‘PhD-level intelligent’ to begin with. Just another claim to chalk up as marketing spin from OpenAI.

Also, I clearly remember them promoting the memory feature as some big breakthrough. But if it can’t even keep track of things within a single conversation, how is it supposed to work for something I said three months ago? That seems about as likely as Christmas coming tomorrow.

Te Reagan's avatar

I have it logging health data for me.

It keeps only logging one day.

I have to constantly ask it to go back through chat and log it all, and keep up with the history.

I’ve only been doing this a couple of weeks. I’m already frustrated at having to continually ask Chatbox to do what I asked it to do in the first place.

Chatbox says that it has memory update problems. Yup, that's what it told me.

Patrick Logan's avatar

"You are sharp to notice that I have no concept of truth maintenance nor even a concept of truth."

-- ChatGPT, maybe.

James Quilligan's avatar

AI is as random in its capacity to follow instructions and make simple calculations as Trump's wacky tariffs. But at least AI is humble and apologizes when it errs, unlike Trumpelstiltskin, who unashamedly leads us all to believe that straw can be turned into gold. Neither of them is a trustworthy source of information and direction, but I'll take AI's truth over Trump's truth any day because AI eventually learns the logic of course correction.
