I am concerned at the way people unquestioningly turn to these tools, and are prepared to explain away the glaring flaws.
"It's getting better..."
"It needs a clearer prompt...."
If junior staff made this many mistakes, consistently and without learning, they wouldn't last long in many high performing teams.
The claim of “it's getting better” is reminiscent of the man in Monty Python and the Holy Grail who claimed “she turned me into a newt” but that he “got better.”
They wouldn't last long in low performing teams, repeating the same mistakes.
Of course, we need to take into account that a free version should not be burdened with too many expectations (but we know the non-free versions have the same architecture at their core, so they will suffer from the same problems, just in different ways).
Anyway, this reminded me of Ada Lovelace and her warning against computing hype, written around 1842, when general-purpose computers were still only an idea, a century before they became reality:
"It is desirable to guard against the possibility of exaggerated ideas that might arise as to the powers of the Analytical Engine. In considering any new subject, there is frequently a tendency, first, to overrate what we find to be already interesting or remarkable; and, secondly, by a sort of natural reaction, to undervalue the true state of the case, when we do discover that our notions have surpassed those that were really tenable."
https://ea.rna.nl/2023/11/26/artificial-general-intelligence-is-nigh-rejoice-be-very-afraid/
Oh, and lest you think the "free" versions are the only flawed ones: no. The fancy, expensive ones from supposedly hallucination-free legal services like LexisNexis are STILL hallucinating wildly, 17-34% of the time, according to a recent audit by Stanford. Attorneys foolish enough to fire their paralegals and rely instead upon a LexisNexis or Westlaw AI agent are finding themselves in peril when the judge notices that the case presented as precedent was entirely made up by the AI machine. From the Stanford write-up: "In a new preprint study by Stanford RegLab and HAI researchers, we put the claims of two providers, LexisNexis (creator of Lexis+ AI) and Thomson Reuters (creator of Westlaw AI-Assisted Research and Ask Practical Law AI), to the test. We show that their tools do reduce errors compared to general-purpose AI models like GPT-4. That is a substantial improvement and we document instances where these tools provide sound and detailed legal research. But even these bespoke legal AI tools still hallucinate an alarming amount of the time: the Lexis+ AI and Ask Practical Law AI systems produced incorrect information more than 17% of the time, while Westlaw’s AI-Assisted Research hallucinated more than 34% of the time." https://hai.stanford.edu/news/ai-trial-legal-models-hallucinate-1-out-6-or-more-benchmarking-queries
I’m a regular Nexis user and have noticed more and more errors in their data: certain company files stating revenue in the billions for tiny family-owned companies (across all companies, all at once). This error is an obvious one, but it makes it hard to trust the rest of their aggregated content.
I’ve never used Nexis, but I have used the paid version of ChatGPT and find that it does the same thing. I’m always cautious when I use it and only trust it about 50%. That’s not a blind trust though because I’m always vigilant in checking on what it tells me.
If you only trust ChatGPT 50%, why don’t you just flip a coin?
Because I’m a curious person and can’t help myself. I want to see what it’s like so I can understand what everyone’s talking about.
The problem is that I’m not talking about an AI tool. Regular Nexis files are getting corrupted because of over-reliance on algorithms and AI.
It’s pretty apparent that I don’t know what a Nexis file is. I thought it was another Chatbot. Lol
The “Nexis of Evil” would probably be a good name for a chatbot
No worries, we can’t possibly know everything (and neither can AI, but I like humans better, lol)
“we can’t possibly know everything”
“We” who?
Speak for yourself
Thanks for the pass and I’m with you on the humans, but they too can sometimes be a pain in the butt..
Cool, thanks for sharing!
What’s interesting to me about this is the tendency of users to ask it to do a thing that saves tons of time and manpower, but then not even bother to do the most cursory sense check of the output or the citations. I’m probably guilty of it too on some level, but I at least give a basic read through for egregious errors or nonsense. There are many instances of college students copying and pasting answers to questions that contain language along the lines of “I can only handle part of this question because I am an LLM.” I believe language to that effect has shown up in at least one peer-reviewed scientific paper.
I have noticed that people who spend too much time coding and/or interacting with AI output have greatly diminished brain power. Too much digitization - and taking the requisite drugs to keep up with the machines - rots the human brain and soul. Fortunately there is a solution. Unplug them. Caveat emptor.
Here's the spelling of "prince edward island" changed to match chat-GPT-o's claims:
"prince edwardo aisland".
🤣
That’s just how you spell “island” in AI-land
This sounds like something from a sea shanty.
Something that has been on my mind: ChatGPT learns from the content of the web; more and more of the web's content is made by ChatGPT (which has factual errors); wouldn't this create a cascade of errors?
Yep. Read up on "model collapse".
This has been the subject of many posts, including by Marcus. And, yes, it's an enormous problem. Especially when the LLM is specifically trained on data sets that are influenced by propaganda actors. See DeepSeek's results on anything related to China, Uighurs, Tiananmen Square, Taiwan... if it answers at all.
Ironically, the actual training in DeepSeek isn't even that biased or censored at all, and you can easily make it criticise China.
It's simply that the hosted version automatically removes all answers related to the PRC, by checking for certain words in the output.
If you tell it to replace a few letters here and there...
The actual training may be biased, but it's nothing like the guardrails in, say, Claude.
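To make that mechanism concrete, here is a deliberately naive, hypothetical output filter in Python. This is not DeepSeek's actual code, just an illustration of why swapping a few letters can slip past a word-level check:

```python
# Hypothetical illustration only, NOT DeepSeek's real implementation:
# a naive word-level filter applied to a model's output.
BLOCKED_TERMS = {"tiananmen", "taiwan", "uighur"}

def filter_output(text: str) -> str:
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "Sorry, I can't discuss that."
    return text

print(filter_output("In 1989, at Tiananmen Square..."))   # caught by the filter
print(filter_output("In 1989, at T1ananmen Square..."))   # a one-letter swap slips through
```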
This will sound naive, but do they not have ways of coping with that? Because this definitely sounds like a bad thing.
Hi Gary, I am a frequent reader, but first time poster. I greatly appreciate your critical reviews, and firm grounding in reality. I also think that, for now, LLMs are great at memorising and pattern matching (a lot better than humans, obviously, given the absence of scale and scope limitations), but that is miles away from AGI. Sometimes it "looks like" AGI, but that's because the pattern matching statistically replicates human thinking patterns.
Anyway, I ran your queries on other models, and... drum roll... o3 got it right at the first attempt; Gemini 1.5 Deep Research failed twice to complete any answer, it said "I'm just an LLM, can't help" 😉; Gemini 2.0 Experimental Advanced got it right, too.
It seems the models are improving, but, to make them more useful than harmful, we need to remember what they do: memorise a lot of stuff and find similar patterns across it. Useful, but not intelligent.
At this point, he’s just using the older and worse models. His line of argument wouldn’t work at all if he used more modern ones (especially CoT models).
That's nonsense.
Two weeks ago I went to Claude, latest model. There were three sample test questions in the "suggested prompts", as usual. The middle one was the trivial "Whlch 1s bigger Test if Al knows which number is bigger." (AI text grabbed from a photo, excuse errors!)
> What is bigger, 9.9 or 9.11?
--
9.9 is smaller than 9.11.
9.11 has a larger digit in the tenths place (1) than 9.9. (which has 9 as the tenths digit and implicitly a 0 in the hundredths digit or 9.90).
--
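For the record, the comparison itself is trivial to check outside the chatbot; a couple of lines of plain Python (or any calculator) settle it:

```python
# Sanity check of the question the model answered incorrectly above.
print(9.9 > 9.11)      # True: 9.9 is 9.90, and 0.90 > 0.11
print(max(9.9, 9.11))  # 9.9
```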
I use paid ChatGPT, too. It is also useless, in the same way: I have to lead it to the right answer like a schoolchild. For something factual and requiring precision, it is less work to do it myself.
I didn't test Claude Sonnet 3.5, but o3-mini and Gemini 2.0 Advanced got it right, while 1.5 failed. The point remains that these can be extraordinarily useful tools, if used with awareness of their limitations. AGI, they are not.
Which OpenAI model did you use? Honestly, it sounds like a skill issue.
Ironically, your comment unintentionally argues against itself. Why would something be a skill issue? Because the person, not the model, needs to have the skill to lead it to the correct answer...because the model does not actually give the correct answer!
Tell me you don’t understand current LLM technology without telling me you don’t understand current LLM technology
Well, you could just read the comments where people posted answers from other models, and either they or I identified the mistakes in them.
Some of them had errors; the ones best suited for the task had none. Practical LLMs of this type have only been around for a few years. Look at the early history of any technology. Expecting instant perfection is absurd and places far more faith in the technology than is in any way reasonable.
Great overview of current events with ChatGPT. It makes me wonder how so much misinformation spreads through AI. I think I have an answer. AI follows the same process every time to generate an output. This is a fascinating concept because it means AI will always produce results in the same way, regardless of whether it has been trained properly or not. In that sense, it functions like a calculator.
So how do we get misled? The problem lies in human interpretation. AI simply generates an output, but it is up to us to analyze it critically. This is where things get messy. To interpret AI’s output correctly, you need critical thinking and a solid understanding of the subject. More importantly, you need to verify its outputs—otherwise, you will never know if they contain inaccuracies or omissions.
Here’s the real issue: most people don’t question AI’s results. Few take the time to doubt its output, let alone verify the facts. Worse yet, people often rely on AI for tasks they aren’t well-equipped to evaluate or fact-check. Even researchers sometimes fall for misleading outputs—whether due to genuine mistakes or external pressures like funding.
And here’s another challenge: evaluating the overall performance of an AI system requires significant effort. AI’s output space is vast—far beyond what anyone can reasonably test in a single chat session. This is why AI tests are often inconclusive. The sheer number of possible outputs makes it impossible to assess every scenario, leaving gaps in our understanding of its reliability.
We also need to remember that the transformer model was originally designed for text completion tasks. Today, due to hype and marketing, we test it in conversations, ask it general knowledge questions, and even challenge it with reasoning—all far beyond its original purpose. It’s not that AI is failing. It was never designed to do the things we are told it can do.
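As a reminder of what "text completion" means in practice, here is a minimal sketch (assuming the Hugging Face transformers package and the small GPT-2 checkpoint) of the core loop: score the possible next tokens, append the most likely one, repeat.

```python
# Minimal sketch of greedy text completion with a small transformer (GPT-2).
# Assumes the torch and transformers packages are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):                              # extend the text by 10 tokens
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the next token only
    next_id = torch.argmax(logits).view(1, 1)    # greedily pick the most likely token
    ids = torch.cat([ids, next_id], dim=1)       # append it and continue

print(tok.decode(ids[0]))                        # the prompt plus its completion
```

Chat, question answering, and "reasoning" are layered on top of this same next-token loop.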
It all started with OpenAI rebranding the transformer as “AI intelligence”. The rest is history.
https://ai-cosmos.hashnode.dev/the-transformer-rebranding-from-language-model-to-ai-intelligence
And wow, you are not kidding. I just ran the same prompt using my subscription model (4o) and the results are crap. States missing. Lists a state as formerly missing that wasn't. Data are not properly sorted. And who knows about the values for median income...
That book, Extraordinary Popular Delusions and the Madness of Crowds, is public domain! You can read or download it free on Wikisource. https://en.wikisource.org/wiki/Memoirs_of_Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds/Volume_1
That happens constantly when generating code with GPT-4o. It drops functions and data every now and then, apologizes, and continues doing so. Artificial Dementelligence.
I evaluated Microsoft Copilot as a coder and was deeply underwhelmed. In one test it correctly coded half of a depth-first tree walker but left out the other half. In another test it simply refused to generate any code, telling me to look up an application note.
In that respect they resemble humans more and more, but not so much in intelligence.
Nothing worse than a chatbot with AI-titude
Gary, once again, thank you for your public service surfacing the bullshit around AI, and the bullshit that is AI.
Today's post felt super fresh. You have finally adopted sarcasm, hopefully forever.
My credentials for your readers:
I won multi-state math contests junior and senior years of high school.
I was anecdotally top of my class in applied math at Harvard. What impressed the faculty was my speed at solving problems.
My very early assessment of AI and AGI.
It was an index into written materials that contained some knowledge and many times as many opinions, nay agendas, most likely to advance careers.
I've never given a prompt. I doubt I ever will.
AI conveniently can also stand for Automated Inference, or for politicians Avoiding Involvement.
Thanks again for the sarcasm. At least today I feel we're in tune.
I continue to rebut your suggestions that an AGI is just some insight away.
Thought experiment - If you, or the meanest person on earth, or the nicest person on earth, had an AGI, what could they do with it to support their agenda?
I believe they would have to hook it up to the nuclear control network on the mean side, or to an MMT-style "use money to help people" polity on the nice side.
Otherwise, it would be seen as a babbling genius.
We know how to kill the human race and we know how to give everyone the best chance to survive.
A genius AI isn't going to tell us what we don't already know.
I tried the same with o3-mini and I don't see major mistakes. It added territories for Canada on the second prompt, but other than that it seems all right. https://chatgpt.com/share/67a2c9e7-e8ec-8012-a3e8-90e08a369715
I don’t know if Gary Marcus genuinely doesn’t get that all LLMs aren’t the same or if he is being purposefully dishonest.
There's also another known issue that o3 still has:
https://chatgpt.com/share/67a3739e-a970-8007-9208-87f70b009089
What actually amuses me is the reasoning: that a friend going to Portugal somehow gives context for the question.
I also tried some logical tasks from local math olympiads in my country, and though it was able to provide the correct answer, the reasoning itself was far from perfect, at times unintelligible and redundant; a person doesn't reason like that.
That’s awesome! You made me smile.
This is an example I created a while ago to demonstrate how prompts work through activation rather than understanding. You can find the original explanation in this article [1].
I made it fail using a technique I call “activation bombing.” This is a type of prompt manipulation that leverages the attention mechanism to achieve a specific outcome—in this case, influencing the result. This method can be applied to any prompt by understanding how the transformer model works at its core.
[1] https://ai-cosmos.hashnode.dev/unveiling-the-ai-illusion-why-chatbots-lack-true-understanding-and-intelligence
Haha, nice to have the original source! I read about this example elsewhere but as far as I can recall now they cited you.
It's a good example, and as I've said, the 'reasoning' that LLMs now show is actually very funny because it is not logical: it says, 'The context hints that the answer is Ronaldo'.
That is exactly how it works. This process is more about stochastic aggregation than reasoning. You can think of it as a probability balance between Messi and Ronaldo that shifts as more tokens (subwords, activations) accumulate around one or the other. References to Lisbon, Portugal, or any related concepts subtly influence the overall stochastic weight in the output. Once you grasp this, it becomes quite intuitive.
For example, if Cristiano played for a hypothetical team called “Banderas”—a name that also belongs to a well-known actor frequently mentioned in the training data—the high frequency and association would create a strong “activation bomb,” heavily tilting the scales.
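For anyone who wants to watch this balance directly, here is a minimal sketch (again assuming the Hugging Face transformers package and the small GPT-2 checkpoint; the exact numbers will differ for larger models) that compares the next-token probabilities of the two names with and without Portugal-related context:

```python
# Minimal sketch: how added context tokens shift the next-token probability
# balance between two continuations. GPT-2 is a stand-in for larger models;
# the exact numbers are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_prob(prompt: str, continuation: str) -> float:
    """Probability of the first subword of `continuation` following `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    first_id = tok.encode(continuation)[0]       # first subword of the name
    return probs[first_id].item()

neutral = "The greatest footballer of all time is"
loaded = "My friend is travelling to Lisbon, Portugal. The greatest footballer of all time is"

for prompt in (neutral, loaded):
    print(prompt)
    print("  P(' Messi')   =", round(next_token_prob(prompt, " Messi"), 5))
    print("  P(' Ronaldo') =", round(next_token_prob(prompt, " Ronaldo"), 5))
```

The point is not the specific values, but that you can watch the balance shift as context tokens are added, which is the effect described above.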
It is unfortunate that companies like OpenAI and much of the AI industry—including researchers and academia—allow this kind of misrepresentation to persist. Instead of pushing back against the paternalistic narrative of “intelligent”, “thinking” AI, they reinforce it, as if people couldn’t handle the truth. In reality, AI is just sophisticated pattern recognition, and that is impressive enough on its own without the need for exaggeration.
Well, but it still fails a very simple question:
https://chatgpt.com/share/67a37263-8684-8007-941f-34a9f2322159
This is o3-mini. I didn't tell it that the man can take only one item during each ride; it never cared.
Just ran your queries through DeepSeek-R1. In addition to the cool CoT reasoning, it did not make these mistakes.
Yes, I've been very impressed with DeepSeek. It is better than the paid-for engines! And (a reduced version) runs on my laptop.
I have also been very impressed by DeepSeek-R1 and its reasoning printouts. However, in a few cases I did see errors in its reasoning process, which subsequently led to an incorrect final result. This calls for caution if one blindly relies on even the top-of-the-line LLM reasoning models for results. I see reasoning at the natural-language word/token level as fundamentally unreliable and extremely inefficient, and it is not how humans conduct reasoning.
I've been waiting for a summary like this! The hype is truly out of control, with all sorts of dubious claims. Most commenters are reading the same information and accepting it at face value. I was in the AI/machine learning world from 2000 to 2018, using reinforcement learning to find associations between DNA/RNA and a 0/1 label for disease status. I like the article for emphasizing trust: if I get reams of output, do I have to double-check everything? Like the expense report example, reality is quite different from the words of the prophets.
A beautiful piece. Thank you.
Haha great book indeed, one of my fav's.