I work a lot with LLMs and this is a weird recurring problem I see. When you request a list of things - basically what you did here - and they get multiple items in that list wrong, they often can never get it right. Try asking questions that require ChatGPT to come up with a list of Roman Emperors who did X.
Basically anything with even slightly grey borders seems to make it lose the plot. Ask it for 10 famous quotes from Roman Emperors. I always get Julius Caesar, or Cicero, or a pope or two in there. It’ll admit that it got it wrong and then give a revised list with quotes from Virgil, Tacitus and Mussolini.
It's not a “weird problem”. It's just what LLMs *ARE*. They are Google search cousins, better in some aspects, worse in some aspects. They find the “best pieces” in the corpus of documents on the web and combine these in the “best way”. That's it, no more, no less.
And just like Google Search gives you the best answer when your search request is short and focused… LLMs work best when the input is short. And by talking to them, you are making the input longer instead.
If you want something from an LLM, the best way is to change the request until you get an acceptable answer. ChatGPT has a UI for that.
How is engaging in a dialogue with it not "changing the request"? And where is the value-add if I have to be clairvoyant to determine what to ask it in order to get a usable response?
Asking more questions that build on its previous output delivers more information about my expectations and ought to improve the output, but it does not.
Why not? Because this supposed AI is artificial, but not intelligent.
When you are “engaging in dialogue” you are adding lots of data that makes it ever harder for the LLM to properly cull its database. At some point you overwhelm it and it starts producing nonsense.
And the value-add is in the LLM's knowledge base: the LLM's neural network includes a lot of facts that are not known to you – and, importantly, it may pull snippets from it using very vague requests… much vaguer than what you need to give Google Search.
There is no need to be “clairvoyant”: when you see that the response is unsatisfying, you just click the “edit” button and CHANGE the prompt. Don't try to have a dialogue; REFINE THE PROMPT.
You are acting like a kid now, honestly: you were promised some kind of “intelligence” and are now crying “foul” because the LLM is not that… well, it's still a useful tool, there is just no intelligence involved.
Ignore the marketing and use the LLM like you would use a weird Google Search derivative… the results will be much closer to your expectations.
Come on.
You haven't answered the most important question, which is *how is this better?*
I can ask a search engine a natural language question and get better results than I do from ChatGPT, and I don't have to waste 20 minutes twiddling prompts to do it.
Are we talking about LLMs or ChatGPT, specifically? Search engines use LLMs, these days – that's how you "ask a search engine a natural language question".
As for why ChatGPT would be better… it can combine the answer into coherent text that you can include directly in your mail or something like that.
Whether that ability is worth billions or not is a good question.
I have been giving search engines natural language queries for more than fifteen years and have always obtained satisfactory results.
You don't need an LLM for a natural language query to work. Semantic search doesn't use LLMs and can deliver good results.
But let's assume you're right. Then LLMs have been around a long time (so, not novel) and they've hardly improved. Certainly not enough to justify all the money being poured into them.
ChatGPT was introduced two and a half years ago.
You'd think that'd be enough time to fix some "minor weaknesses".
We know of course that these problems are structural, but I mean c'mon: It's such a joke that these guys still get VC money and have literally *nothing* to show for it but stuff that is made up a little better than the stuff made up before.
It's almost as if LLMs are inherently unreliable?
And, of course, no admission by the manufacturer that there's anything wrong with the product.
The eternal claim: The parrot is not dead.
Perfect skit for the LLM era
Just in case anyone doesn't know what Professor Gary is referring to:
https://youtu.be/vnciwwsvNcc
Clearly not a PhD in geography or economics.
At least not in geography.
After all, it did give multiple answers to the same question.
The PhD level reasoner making toddler level mistakes.
I wonder when the hype will die; these systems are not at all fundamentally better than the sort of AI that was ridiculed by researchers pre-ChatGPT.
I've been thinking about this for the past few days, and as far as I've seen there is no real progress in anything truly new to motivate the myopic view of exponential growth. And a lot of the foundations of modern AI were laid decades ago.
For example, even most neurosymbolic methods I see are taped-together arts and crafts, more or less LLMs + verifier. (In that sense, aren't the agentic LLM coders neurosymbolic, with compiler output as the verifier? This is quite close to what FunSearch was.)
Or the 'RL' pretraining of 'reasoning' LLMs.
There is an endless reserve of examples like this. Approximating the result of understanding by using token statistics of the results of actual understanding isn't understanding. That is especially true when logical reasoning is your goal.
Add to that all the RLHF strengthening of producing stuff people like to hear (which has a somewhat low correlation with what is true) and you get this.
So, you will be able to show these examples for years to come.
What is worrisome is that humankind's collective understanding is now being attacked by humans that employ machines that exploit our own weaknesses. Which is why showing these examples isn't moving the needle much. I respect your stamina, but the remedy against stupidity is not more true observations and facts (cf. Bonhoeffer and others).
I'm pretty sure that once AI *does* acquire AGI one day, you'll be the first person it'll joyfully turn into paperclips, Gary ;-)
Very interesting failure mode. I have seen multiple "oops, sorry" failures in another domain. At some point, one has to say, "I can do this faster myself, just give me the data table[s]."
Question: Would it have done any better if you had asked it to list the states with both major [sea] ports and above-average incomes, leaving out the need to color a map? Then, in the second stage, ask the LLM to color the map with the list of states, if that list was correct?
[BTW, a Google search this morning failed to find the fairly simple name that I was looking for, forcing me to track it down on my own. More enshittification of Google search?]
It doesn't even get the lists right.
"Give me a list of all the government procurement portals for the German federal states with their URLs, please."
I get:
- a list with no URLs,
- containing portal names that were entirely made up,
- or commercial aggregators that ChatGPT claimed were public,
- or URLs that don't exist.
Rinse and repeat. After wasting 30 minutes I made my own list with conventional search in five.
Good luck getting any accurate output out of it for anything that matters. I have already given up.
Your experience supports what I have suggested - use RAG with documents that contain the data, and just use the LLM as a natural language interface. It won't solve all the problems, like correctly coloring a map, but at least it will get the data from a table if it exists in the database.
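Roughly what I mean, as a minimal sketch. The CSV file, its columns, and the model name are illustrative assumptions, not a tested pipeline; the point is only that the facts come from a table the model never has to "remember":

```python
# Minimal sketch of the suggestion: keep the facts in a table and use the LLM
# only as a natural-language layer on top of it. The CSV name, its columns,
# and the model are illustrative assumptions.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ground truth lives in a table, not in the model's weights.
states = pd.read_csv("state_income_and_ports.csv")  # hypothetical file

question = "Which states have both a major port and above-average income?"
prompt = (
    "Answer the question using ONLY the table below. "
    "If the table does not contain the answer, say so.\n\n"
    + states.to_csv(index=False)
    + "\n\nQuestion: " + question
)

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```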
Yesterday, I was watching a spectacular fail of Google's coding "assistant", Gemini 2.5, in Google's Firebase Studio to build a simple application. It couldn't manage to build a proper working prototype despite multiple attempts to correct the LLM, frustrating the person doing the demo, who wanted it to show its paces. Coding is one domain where these LLMs were showing promise, and I have managed to get an LLM to do very simple toy problems correctly. The math problems they supposedly solve, way above my capability, are also rapidly improving, but how does one validate the output, as it cannot be tested without having the needed expertise?
I am reminded of the scene in the TV series "From the Earth to the Moon", episode "Spider". The LM keeps breaking a leg during testing. The engineer responsible for doing the calculations realizes he made a sign error and, ashamed, goes to see the program's boss. If a math-domain LLM made a similar error, who would detect it? The LLM wouldn't by itself. The potential for making errors is very concerning in critical systems. Lists one can check. Map coloring can be checked. Code errors can be checked by running unit tests. But advanced math errors are another matter. To me it suggests that an LLM should be linked via an API to math software like Mathematica, to ensure it does the correct transforms and that the prompts are geared to the way the software operates.
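To make that concrete, here is the shape of the check I have in mind, with SymPy standing in as a freely available substitute for Mathematica and a hand-picked derivative standing in for whatever transform an LLM asserts:

```python
# Sketch of "don't trust the LLM's algebra, recompute it": SymPy stands in for
# Mathematica, and the claimed derivative below stands in for an LLM's output.
import sympy as sp

x = sp.symbols("x")
expression = sp.sin(x) * sp.exp(x)

# Suppose the LLM claims: d/dx [sin(x) * e^x] = e^x * (sin(x) + cos(x))
claimed = sp.exp(x) * (sp.sin(x) + sp.cos(x))

# Recompute symbolically and compare, instead of taking the claim on faith.
actual = sp.diff(expression, x)
print("Claim checks out:", sp.simplify(actual - claimed) == 0)
```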
Which version of ChatGPT was this?
This is really a pretty vague question: "give me a map of the US and highlight states with (major) ports and above average income"
How do you define major? And why would you want it to leave out the midwest states? I'd expect any intern to get this question "wrong" also because the criteria aren't clear.
In addition, the model you use in this case is VERY important. A general model without reasoning like 4o (which I'm almost positive you used based on the answers) isn't really great at this. The question involves four separate steps: 1) look up port information, 2) look up income information, 3) combine those data, 4) write code to draw the map. You're using two separate tools: 1) web search and 2) coding/analysis.
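For what it's worth, step 4 on its own is the mechanical part. A rough sketch below; the state list is purely illustrative, not a verified answer, because getting that list right is exactly what steps 1-3 are for:

```python
# Step 4 in isolation: drawing the map is easy once you have a list of states.
# The list below is a placeholder; producing a *correct* list depends on
# steps 1-3 (port data, income data, and combining them).
import plotly.express as px

highlighted = ["CA", "WA", "NY", "NJ", "MD", "VA", "TX", "MA"]  # illustrative only

fig = px.choropleth(
    locations=highlighted,
    locationmode="USA-states",
    color=[1] * len(highlighted),                 # one bucket: "highlighted"
    scope="usa",
    color_continuous_scale=["#1f77b4", "#1f77b4"],
)
fig.update_coloraxes(showscale=False)             # hide the meaningless scale
fig.show()
```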
I tried this with OpenAI's o3 model and even with the vague ask, it produced a perfect map (Substack won't let me share the image though).
it introduced the word major and then waffled. are you even sure it got it right? many of the answers might have looked convincing.
and the point is that each were presented as correct (or corrected). that’s not on me.
Here it is. https://chatgpt.com/share/e/68234399-b2d4-800a-8c9d-643470c79d28
"Conversation inaccessible or not found. You may need to switch accounts or request access if this conversation exists."
Might be because it's an enterprise account. Test it yourself with o3 and the prompt:
give me a map of the US and highlight states with (major) ports and above average income
Just saw your response today... and this is MOST interesting -- ChatGPT's own explanation for what is going on is that access is contextual, and what's being presented as real is not necessarily so.
https://chatgpt.com/share/6837269a-3fbc-8013-8a89-84450e378e0c
and it gets worse from there:
https://chatgpt.com/share/6837340d-c778-8013-a9ee-af5e8f1d1968
The AI enthusiasts always blame the user... "You're asking it wrong!"
Here was an encounter I had today. I suppose you will tell me I did it wrong, too?
https://chatgpt.com/share/68222e3c-4fe4-8010-9b0c-dc2e730c2d11
Again. You need to use the right model and the right tools. Granted…you shouldn't have to do this work or have this knowledge. It should “just work”. https://chatgpt.com/share/682290cb-eb84-8012-8eb6-c0b9847e0a58
intellectually dishonest rationalizations
Feels like the majority of these comments are "intellectually dishonest rationalizations"
To you it would.
This is literally "you're asking it wrong!". Don't you see that you want AI to be real and reliable so badly that you're lying to yourself? When you see it make a mistake, you blame yourself... "I used the wrong tool for the job" or "I didn't ask in the right way".
The truth is that 4o originally answered similarly to o3 when I tried. And then I opened a new chat and asked it again to be sure... and got the chat you saw. I asked it again and it answered the original way. LLMs are inherently random. It's built into their programming to make them sound human.
You can't just stop asking because you got the "right" answer. You should open a new chat with zero history and ask it again. Then you will see that a) you will get different answers to the same question at different times, and b) you only really know it's the "right" answer when you already knew the answer... but what happens when you ask it about something you don't already know the answer to? You're likely to blindly trust it.
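If you want to see this for yourself, here is a quick sketch of the experiment: send the identical prompt in several independent, history-free API calls and compare what comes back. The model name is a placeholder.

```python
# Quick sketch: ask the same question in several fresh, history-free "chats"
# (stateless API calls) and compare the answers. Model is a placeholder.
from openai import OpenAI

client = OpenAI()
prompt = ("give me a map of the US and highlight states with (major) ports "
          "and above average income")

answers = []
for _ in range(3):
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],  # no shared history
    )
    answers.append(reply.choices[0].message.content)

# With default sampling settings, don't be surprised if these differ.
for i, answer in enumerate(answers, 1):
    print(f"--- run {i} ---\n{answer}\n")
```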
Well, this is exactly it. You can't give a monkey a laptop and expect anything useful. Using AI is a skill, one that takes practice. It's very clear you haven't put in the effort to become good at it.
I tried with OpenAI's free reasoning model (it should be o3-mini), replacing US states with Italian regions, and it gave me an even worse result than Gary's: it got everything wrong, swapping cities for regions, getting their geographical positions wrong, and coloring regions that don't have ports.
https://chatgpt.com/share/682458b4-1b7c-8004-a641-dc79f4ea313e
Sure, the prompt is generic, but it makes geographical mistakes that even a child wouldn't make.
Of course it wouldn't let you...how was Gary able to share his? I think the question is more than specific enough for a system purporting to do what these systems are supposed to do.
Sorry...forgot about that feature: https://chatgpt.com/share/e/68234399-b2d4-800a-8c9d-643470c79d28
Conversation inaccessible or not found. You may need to switch accounts or request access if this conversation exists.
Might be because it's an enterprise account. Test it yourself with o3 and the prompt:
give me a map of the US and highlight states with (major) ports and above average income
I meant share the image of the map since you can’t do attachments
Once ChatGPT gives a hallucinatory answer, no amount of prompting seems to be able to fix it. It follows a predictable pattern: overly enthusiastic praise for spotting the error, effusive apologies, and then even worse mistakes, like a panicked student unraveling under pressure.
What if there were a dedicated teacher LLM whose sole role was to spot BS from “reasoning” LLMs? The two would operate like a Sherlock Holmes/Dr. Watson team, using the Socratic method to uncover contradictions through dialogue.
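Something like this, as a rough sketch: one model answers, a second model is prompted only to interrogate that answer. Model names and prompts are placeholders, and of course the "teacher" is the same kind of fallible model as the "student":

```python
# Rough sketch of the Holmes/Watson setup: one model answers, a second model
# is prompted only to interrogate that answer. Model names and prompts are
# placeholders; nothing guarantees the critic is right either.
from openai import OpenAI

client = OpenAI()

def ask(model: str, system: str, user: str) -> str:
    reply = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return reply.choices[0].message.content

question = "List ten famous quotes from Roman Emperors."
answer = ask("gpt-4o", "Answer the user's question.", question)

critique = ask(
    "gpt-4o",
    "You are a skeptical examiner. Using Socratic questions, point out any "
    "items in the answer that are misattributed, anachronistic, or unsupported.",
    f"Question: {question}\n\nAnswer under review:\n{answer}",
)
print(critique)
```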
I've noticed the same pattern with Grok. I've had to start a new chat thread to get out of the self-correction feedback spinout hole it gets into. It can't seem to "forget"; it's almost obsessive...
Clearly PhD-level! Maybe someone needs to clean house at OpenAI’s admissions? ;)
Strange. It almost feels like ChatGPT was never really ‘PhD-level intelligent’ to begin with. Just another claim to chalk up as marketing spin from OpenAI.
Also, I clearly remember them promoting the memory feature as some big breakthrough. But if it can’t even keep track of things within a single conversation, how is it supposed to work for something I said three months ago? That seems about as likely as Christmas coming tomorrow.
I have it logging health data for me.
It keeps only logging one day.
I have to constantly ask it to go back through chat and log it all, and keep up with the history.
I’ve only been doing this a couple of weeks. I’m already frustrated at having to continually ask Chatbox to do what I asked it to do in the first place.
Chatbox says that it has memory update problems. Yup, that's what it told me.
"You are sharp to notice that I have no concept of truth maintenance nor even a concept of truth."
-- ChatGPT, maybe.
AI is as random in its capacity to follow instructions and make simple calculations as Trump's wacky tariffs. But at least AI is humble and apologizes when it errs, unlike Trumpelstiltskin, who unashamedly leads us all to believe that straw can be turned into gold. Neither of them is a trustworthy source of information and direction, but I'll take AI's truth over Trump's truth any day because AI eventually learns the logic of course correction.