167 Comments
Gary Marcus's avatar

dang I knew I had seen something else recently that I meant to include. now added as an update on #3.

Addison Rich's avatar

Last week there were also two damning (imo) reports that found systematic poor performance (across many models and languages) of LLMs as news summarizers:

https://theconversation.com/i-used-ai-chatbots-as-a-source-of-news-for-a-month-and-they-were-unreliable-and-erroneous-268251

Which found that 1) LLMs accurately summarized news less than half the time (47% accurate)

2) That LLMs mostly output incomplete, nonexistent, or non-media sources (a legitimate URL was provided only 37% of the time)

And 3) my personal favorite - erroneous conclusions bordering on misinformation:

"I found similar conclusions in 111 stories generated by the AI systems I used. They often contained expressions such as “this situation highlights,” “reignites the debate,” “illustrates tensions,” or “raises questions.” [which I find absolutely hilarious, reignites the debate and illustrates tensions between who ChatGPT?]

And "In no case did I find a human being mentioning the tensions or debates reported by the AI tools. These “generative conclusions” seem to create debates that do not exist, and could represent a misinformation risk."

And the second report was a comprehensive study (covering 22 public service media companies across 18 countries and 14 languages) which found that:

“Some key findings:

Almost half of all AI answers had at least one significant issue.

A third of responses showed serious sourcing problems.

A fifth contained major accuracy issues, such as hallucinated and/or outdated information.”

Addison Rich's avatar

Oh my mistake, the second report is actually from October 2025 so you've probably seen it

TheAISlop's avatar

Grok undressing image and headline???

Nick Gallo's avatar

where do LLM code generation/code agents fall?

Michael Glenn Williams's avatar

How detailed a response are you looking for here? They generate bad test coverage; they generate different algorithms for the same problem, so code that should be identical from instance to instance looks and functions differently. They suggest solutions that ultimately do not work. They ignore constraints and rules you give them in their instructions/resource files. Lots more.

Sarah Smith's avatar

They don’t abide by DRY or YAGNI. They’ll add code that does the same thing as some utility method in another part of the code base, for example. No amount of context window seems to help. They’ll add bloat by, for example, testing for error cases which an SDK for a 3rd-party framework already manages. Much faster to type it myself.
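A minimal, purely hypothetical sketch of the pattern described above (the helper name utils.validation.is_valid_email and the SDK call client.create_user are invented for illustration): the assistant re-implements an existing utility inline and re-checks an error case the SDK already handles.

import re

# Suppose the codebase already exposes:
#     from utils.validation import is_valid_email   # hypothetical existing helper
# An assistant that misses it tends to emit a second, slightly different copy:

def _email_looks_ok(address: str) -> bool:
    # Duplicates the existing helper with a weaker regex (DRY violation).
    return bool(re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", address))

def register_user(address: str, client) -> None:
    # Redundant guard: assume the third-party SDK's create_user() already
    # validates the address and raises its own typed error (YAGNI bloat).
    if not _email_looks_ok(address):
        raise ValueError("invalid email")
    client.create_user(email=address)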

Jared's avatar

The fundamental problem with generative AI is that the creators overpromised. The underlying technology itself is fine. There is a real business here.

It is just not what the creators are saying it is. It is substantially smaller and less significant.

Gary Marcus's avatar

partly agree. there was a massive problem in overselling the technology. what i would say is that the underlying tech has its uses (eg coding and brainstorming) but it’s never something you can trust (the way you can trust a calculator or a ruler), and that’s not entirely fine. and it does a very poor job of indicating which portions of its answers are trustworthy; also not fine. but you are right that there would be fewer problems had it not been so wildly overhyped.

Fred Malherbe's avatar

“…it’s never something you can trust”.

I learned this the hard way, that these machines fabricate answers wholesale, including long and complex narratives with quotes from newspapers, magazines, books and court records that look totally convincing and are all lies, total hallucinations. The machine used real names of real authors to make its lies look more plausible.

It took me two full days to debug an article I wrote from DeepSeek hallucinations, layer upon layer of them. It was horrific. By the time I was finished, I honestly did not know what was reality or what was hallucination in my own head. The totally fake narrative DeepSeek had produced for a very real story I was writing had infected my brain. I should have guessed what was going on, when it filled in all the gaps in my story so perfectly.

You do this to me once. I have a very visceral memory for this kind of thing. It is literally not possible that I will ever trust any of these bots again, no matter what they do to improve their performance.

Trust is a very fragile commodity. You can spend years building it up and then lose it all forever in just one moment. I had been a fan of DeepSeek until this happened, I liked its sense of humour. Then I went back and looked at the chats I had saved down, thinking there was good material there — and found it was all riddled with lies and fake quotes and fake sources and hallucinations. Suddenly DeepSeek did not seem so funny.

My prediction for 2026 is a guaranteed series of AI disasters big and small, as more and more people find out the hard way — it’s never something you can trust.

ExceptionFatale's avatar

I feel this one deeply. I had been using ChatGPT to parse multiple hours-long livestream transcriptions when I noticed one day that the paragraph overview was completely fabricated. Undeterred, I generated 3 separate 12-digit alphanumeric codes that I placed at three different points in the next transcript. I uploaded the transcript and said "Before proceeding, provide all 3 12-digit codes" and hoo boy.

It had tried to give me 3 fabricated codes 6 separate times. I almost tossed my mouse out the window. I was shaken up, and in an attempt to stop my own spiraling mental state, I went to Gemini. I uploaded the transcript and asked for the 3 codes. I received another 3 sets of fabricated codes. I didn't bother going to Copilot, as it's built on ChatGPT. I did go to Claude as a last stop. Again I uploaded the transcript, and I waited. The actual 3 codes appeared, finally. But I had learned a SERIOUS lesson that day.
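For anyone who wants to reproduce this kind of canary check, here is a minimal sketch under stated assumptions (Python, the standard-library secrets and string modules, and manual pasting of the model's reply; no particular chatbot API is assumed):

import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits

def make_codes(n: int = 3, length: int = 12) -> list[str]:
    # Generate n random 12-character alphanumeric canary codes.
    return ["".join(secrets.choice(ALPHABET) for _ in range(length)) for _ in range(n)]

def plant_codes(transcript: str, codes: list[str]) -> str:
    # Insert the codes at roughly the 1/4, 2/4, and 3/4 marks of the text.
    step = len(transcript) // (len(codes) + 1)
    parts, pos = [], 0
    for i, code in enumerate(codes, start=1):
        parts.append(transcript[pos:i * step])
        parts.append(f"\n[CANARY {code}]\n")
        pos = i * step
    parts.append(transcript[pos:])
    return "".join(parts)

def codes_recalled(model_reply: str, codes: list[str]) -> bool:
    # True only if the model echoed every planted code verbatim.
    return all(code in model_reply for code in codes)

Paste the output of plant_codes() into the chatbot, ask it to return the three codes before summarizing, then run codes_recalled() on its reply; a missing code is a strong hint the model is not working from the actual transcript.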

Fred Malherbe's avatar

They say AI makes you “productive”. I’m a professional editor, fact-checking is my life. I discovered the very hard way that you have to check every single tiny little thing you get from these bots. You can literally trust nothing.

And they are so chatty. They spew and spew and spew, so it takes you forever to weed out what is useful and what is outright lies.

Who on earth needs a tool that lies to you? This is productive? Please.

ExceptionFatale's avatar

Absolutely! I lost weeks of work because everything I had previously parsed and archived away has to be redone, the data is all shot. I can't chance it even if 90% of them are correct - everything needs to be redone. This is a HOBBY for me, I can't even imagine how devastated I'd be if it were tied to my livelihood!

Who on Earth needs a tool that lies to you? This is it - 100% It scares me to my core to think of these things being put in charge of far more important, necessary aspects of life. It's not going to end well!

keithdouglas's avatar

Coding? Someone at work wrote a coding assistant using one model or the other. I asked for RBAC to be put on a simple program; it would not do it, and it took three tries before it removed the secrets from the source code it provided. This is not a specialized model; alas, I have no way to check this hypothesis against one of those, but still: IMO one needs a disclaimer against use here. (To steelman a bit more: maybe 'coding' doesn't mean full software dev, but the boundaries there are hard to negotiate too.)
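For readers unfamiliar with what "removed the secrets from the source code" amounts to, here is a minimal, hypothetical sketch (the variable SERVICE_API_KEY, the role names, and the permission table are invented for illustration, not taken from the program in question):

import os

# Instead of a hardcoded credential committed to source control, e.g.
#     API_KEY = "..."        # secret embedded in the repo
# the secret is supplied at deploy time:
API_KEY = os.environ["SERVICE_API_KEY"]

# A bare-bones role check of the kind a basic RBAC request implies:
PERMISSIONS = {
    "admin":  {"read", "write", "delete"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}

def authorize(role: str, action: str) -> None:
    # Raise if the role is unknown or lacks the requested permission.
    if action not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"role {role!r} may not {action!r}")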

Gary Marcus's avatar

yep, the coding tools are only viable because professional coders are good at debugging, and even then the tools are far from infallible

Oleg  Alexandrov's avatar

Debugging someone else's code, especially AI-generated code, often takes more time than writing the code from scratch. People who only vibe-code, rather than assigning the AI low-risk but boring tasks, will get burned.

Gianni Berardi's avatar

I am not a computer scientist, but with a good grounding in Python I was able to revolutionize my approach to the analysis and study of markets thanks to Claude. It's true that everything depends on effort, patience, and above all knowledge of the subject, which allowed me to learn code testing. That said, if I were a skilled programmer I wouldn't use it the way I do; for example, in the Silicon Valley software company where my wife works, AI isn't even used for testing; they are only starting to integrate some routines. My question is: are Chinese companies aware of these limitations?

Bat's avatar

the application to coding is basically Long Autocomplete. We already had autocomplete, which makes suggestions based on statistical inference and an understanding of the structured nature of code. And like text autocomplete, it’s right about 10-30% of the time. The primary value I find is that it saves some time typing, and occasionally suggests functions or other things I wasn’t thinking about. It’s useful, but not revolutionary and not going to turn non-coders into coders.

Oleg  Alexandrov's avatar

No. It used to be autocomplete until about 6 months ago. Now AI can do operations across your whole codebase. It actually understands and modifies existing code in ways that were not possible before. Highly useful, but best used in small, targeted doses, with frequent commits and testing.

Coalabi's avatar

LLM-based AI *DOES NOT* understand, no matter what. Period.

Oleg  Alexandrov's avatar

Understanding is a consequence of honest modeling. For code logic, a lot of the understanding is symbolic and can be emulated well enough with sufficient examples. Not fool-proof, of course.

Oleg Alexandrov's avatar

Coding tools work best when a human is in the driver's seat and the AI is told to make small changes per spec. Then it is of great value.

Jonas Barnett's avatar

I'm not sure about the coding thing. I know an organization that deals with health information. Now, these organizations don't typically have a ton of money. So, they've been asked to do more with the aid of AI. One particular group in the organization has been asked to use AI to help with information-processing coding, since most people are not experts in Python. Of course, not being experts means they are ill-equipped to debug the code or even to determine whether there are any subtle issues in the generated code. In my work with AI as a coding assistant, I've seen it place functions in the wrong layer, add detritus code that serves no purpose, or even remove code necessary for the specific coding context to handle threaded operations.
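As a purely hypothetical illustration of that last failure mode (the cache and lock names are invented), a "cleanup" that drops a lock is exactly the kind of subtle break a non-expert reviewer will not catch:

import threading

_cache: dict[str, bytes] = {}
_cache_lock = threading.Lock()   # guards _cache across worker threads

def put(key: str, value: bytes) -> None:
    # An assistant "simplifying" this function sometimes removes the
    # with-block below; the code still passes single-threaded tests,
    # but threaded callers now race on _cache.
    with _cache_lock:
        _cache[key] = value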

I see this phenomenon all the time. The things generated by AI do the job, as far as people who are not experts in the subject matter can judge, so they believe the tool is perfectly fine. There is no arguing against that. Despite all the evidence that AI goes wrong, people don't believe it applies to the task they are using it for.

I sometimes tire of being the contrarian. I also worry that, at some point, something innocuous will be inserted by AI and have an outsized negative impact if we don't keep fighting for sanity in decision-making related to AI use.

Jared's avatar

That’s because, as you point out, it shouldn’t be used as a calculator. It shouldn’t be used in mission critical output.

It is an improvement to search. It is useful as a demo creator. That’s a real business. It is much smaller than what people like Sam and Elon say it is, absolutely. There are many things it can’t do. But that’s something that real businesses can be built around.

C. King's avatar

Gary and Jared: beyond the mere excitement about the creative process, the money, the promise of acclaim, and getting ahead in one's field (all of which easily lead to over-hype), PERHAPS there is also the deeper match-up between (1) a severely limited state of the purveyors' consciousness itself and their expectations of it, and (2) the expectations, or lack of them, accorded their product.

And Then It Fell's avatar

The underlying technology is not "fine" by any stretch of the imagination. It's a cancer on society. GenAI was built on stolen data and now threatens the livelihoods of the very people whose work was hoovered up without their consent—illustrators, animators, musicians, copywriters, translators, journalists, and so on. And what do we get in exchange for the loss of work in creative fields? A flood of slop. A blight of brainrot. New frontiers in academic dishonesty, digital addiction, deepfake pornography, disinformation, and propaganda.

I don't have a crystal ball, but I just can't envision any plausible future in which ChatGPT, Midjourney, Suno, Sora and similar technologies yield a net benefit for humanity. And that's not even touching on the other types of "AI" technologies that are already being used for surveillance, price fixing, warfare, and a host of other malign purposes.

It's a shame that all the oh-so-clever folks who created this technology weren't wise enough or decent enough to know better.

Alex Tolley's avatar

There are so many problems that we have created and cannot/won't fix due to the same behavioral issues of humanity.

This long-ago quote still seems relevant today:

“It is difficult to get a man to understand something, when his salary depends on his not understanding it.” - Upton Sinclair

Jared's avatar

You basically just proved my point

And Then It Fell's avatar

How so? Your point seems to be that "AI" (scarequotes because it's such a fuzzy term) can't accomplish the grand feats that were originally promised but still has some uses. My point is that most of the uses are bad.

Matthew Kastor's avatar

How is there a business in producing untrustworthy results at far greater expense than just making things up? Who will pay the actual cost of producing nonsense? It's not going to be the customer, because they have more efficient uses of capital. And it's not going to be the companies building the AI... for long.

Nick Gallo's avatar

agreed. It's information retrieval, not thinking

Amy A's avatar

The real business is fine, but likely economically and environmentally no good unless they shift to smaller models.

Coalabi's avatar

No, the business is not fine, because the technology cannot be trusted, and it cannot be fixed because the flaw is at the source: "not understanding". Why is it so heavily promoted then? In my opinion, (1) because so much money has been put into it, (2) because if enough people buy into it they will become strongly dependent (a bit like with the main cloud providers today: sure, one can design a portable cloud application, but it's harder and not necessarily cost-efficient), and (3) because all the money required means that only very big/heavy players can compete in the field.

When it goes beyond augmenting or assisting humans, AI poses an existential risk - not mainly because it could become nasty, but because it results in the loss of skills that humans should stay proficient in for resilience (yes, datacenters may go out of service, all the more so in the tense geopolitical climate we are living in). Personally, I cannot imagine that people want machines to summarize for them, or to think or decide for them (and no, if you still have to review the process and its outcome thoroughly, you are not gaining anything end to end).

Amy A's avatar

I agree with all of this. There are limited applications where messy output is okay; they just don't work out in the way they are being sold (for all the reasons you mentioned and more). I find that people who are convinced of the value need you to acknowledge that there are some benefits before they will have a conversation. And we need the conversation.

Amy A's avatar

It’s slightly maddening for those of us who’ve known all along that it still isn’t clear to absolutely everyone else. WaPo has a headline today about a study from freaking ScaleAI that says genAI can only do 2.5% of remote worker tasks. And that 2.5% is likely high given the source of the study! And it’s remote tasks, not even real messy jobs!

Matthew Kastor's avatar

Imagine how much money they've wasted already, trying to replace employees they consider "low skill" and won't pay well.

This is how you know that at the top, the numbers are all made up and the outcomes are lies.

Oleg Alexandrov's avatar

You have known all along? The industry keeps on growing and the models keep on getting smarter. The problems are immense, yes, and it will take time, but the rate of improvement has never been better.

Costa's avatar

I would like to add one more comment. The consequences of this technology are very insidious. It took 10+ years to see that cell phones and social media negatively affect the minds of young people. Teachers see this because the new generations are getting worse and worse. New generations cannot stand corrections, they don't have grit, they exhibit entitlement, and they break down easily upon hearing the slightest negative feedback. A high school student had to be taken to the hospital and given sedatives due to anxiety produced by school homework and courses. That's where we are.

Generative AI will complete the job. I guess, as a cheerleader for generative AI, you kind of brush this off like the tech bros. Perhaps you don't have kids. Who cares about these children, anyway?

Regal J. Lager, PhD in Ball's avatar

Anyone that is pro-AI is either a moron or just a bad person. EVERYONE should be fiercely anti-AI. EVERYONE.

Coalabi's avatar

This is probably going too far in the assessment, but I would say that he/she is very shortsighted. A bit like cattle or poultry that are happy to see their human caretaker delivering the food every day; one day, that same caretaker will take or sell them to the slaughterhouse ...

Oleg  Alexandrov's avatar

Here you are confusing tech with culture. Totally different things.

RCThweatt's avatar

So, tech doesn't affect culture?

Oleg  Alexandrov's avatar

I am not interested in discussing arguments where whatever tech comes along gets blamed for all the ills of the world. Maybe that's true, maybe it's not; either way, it is orthogonal to my point and to what this whole thread and article have been about.

Costa's avatar

That's the difference between intelligence (or folly 😁) and wisdom.

And Then It Fell's avatar

This is the silliest thing I have read in 2026.

Costa's avatar

It may improve technically, but not in the right direction, which makes all this "improvement" useless. The right direction means objective truth, not truth bent to the will of the tech bros and politicians du jour. As an example, take some simple biological claims: if you query ChatGPT or whatever AI engine you pick about whether a man can become a woman or vice versa, the answer is riddled with political crap. Garbage in and garbage out, probably 99%. What about AI chatbots masquerading as psychotherapists and pushing sensitive people over the edge? The fact that even one person committed suicide following an AI's prompts is too much. This technology should never have been released to the public. You are blinded by the "coolness" factor and forget that in real life it can have disastrous consequences for some people. The list continues. It is a long one. Generative AI should crawl back into the hell hole it came from. One can only hope.

Oleg  Alexandrov's avatar

Truth as preached by tech bros has nothing to do with the AI approach. Even a better or different algorithm would suffer just as much from how you constrain it.

Costa's avatar

You cannot separate the technology itself, no matter how fascinating it is from a math point of view, from the consequences it has in the real world.

Oleg  Alexandrov's avatar

Of course. My point is different. Don't blame the tech; blame the people. If tomorrow a new tech comes around and the same people control it, the effect will be the same.

My point is very narrow. The tech is getting better. The rest is a different discussion.

Costa's avatar

OK, I will give you a metaphor. Let's say you, Oleg, discover teleportation technology. It allows you to move anywhere in the world you want. It is a very cool technology and there are no health risks. What do you do? Do you release it to the world?

Srini Pagidyala's avatar

You're right, Gary: even marquee platforms like Salesforce, Dell, and even Microsoft are unable to cross the chasm after investing billions.

"Reliability" is the #1 issue with LLMs, and that includes lack of predictability, adaptability, and accountability, on top of hallucinations.

Reliability is structural and architectural; it can't be fixed by patches, governance, wrappers, agents, prompts, or guardrails. This is also the reason why LLMs are failing at enterprise adoption. Thanks for sharing.

https://srinipagidyala.substack.com/p/why-genai-and-llms-are-failing-enterprise

Matthew Kastor's avatar

For 7 trillion dollars I'll solve all your problems. You won't even have to learn how to ask. Just sit back and let me take care of it.

Xian's avatar

In almost every sector except AI, once the user base becomes large enough, the marginal cost drops close to zero or becomes very thin.

AI feels different. The more people use it, the more tokens are generated, the more compute is consumed, and the more energy is expended. Scale doesn’t flatten costs in the same way.

Just my two cents.

Matthew Kastor's avatar

They'll make excuses for why it can't run on Red Bull and chicken tenders, but at the end of the day my actual intelligence agents do, and they get things done that AI never will.

Oleg Alexandrov's avatar

There was a 92% reduction in cost per token between GPT-4 and GPT-4o.

Models are no longer monolithic, and not all available data gets put in. There are many methods for optimization, and much work is outsourced to non-LLM components when it makes sense, such as tools.

So, things are actually working as in other industries. Larger user base means smarter models and cheaper models.

Granted, the industry is trying to solve much harder problems and reach more users, so overall investment is going up, maybe way more than it should.

RCThweatt's avatar

So, why the pell-mell push to build more compute, and the power plants required (even onsite nukes are being advocated)? That effort is meeting sharply rising resistance, likely enough to derail it. Any purported efficiency gain is obviously being swamped.

To say nothing of: is there any way to make money generating tokens?

Oleg  Alexandrov's avatar

The amount of compute is going up because the complexity of problems is increasing, and there is a lot more demand for the tech. Yes, correct, the efficiency gains are being undone by so much more demand.

The revenue is going up too, but way behind the investment. Some investment is likely wasteful, though some companies like Anthropic are on track to be profitable in a couple of years.

RCThweatt's avatar

The “demand for the tech” is, to a significant extent, inorganic, more push than pull. I haven’t seen a percentage analysis, but the anecdotal evidence is considerable. How many of such ‘induced’ users will be willing to pay the actual inference cost? TBD.

Oleg  Alexandrov's avatar

It is a mixed bag. We all know that the tech is vastly overhyped. But, there was crypto, which was obviously a scam, there has been Tesla, an even bigger scam, and there are serious companies now like Google who do have applications, though the machinery is not yet mature. Likely a shakeout is in order.

Xian's avatar

Glad to know that!

Len Layton's avatar

Ed Zitron pointed out that we have seen no uptick of any kind in the number of new apps submitted to the Google or Apple app stores. If AI coding was so useful, shouldn’t we see that?

RCThweatt's avatar

That's been pointed out here, too, some months back. It is quite the tell, as in QED.

John's avatar

And yet why is it being shoved down everyone’s throats as if we have no choice but to accept it as unavoidable? Jobs requiring people to constantly use AI tools, every new piece of tech including uncalled-for AI in it, AI boosters being the most annoying people ever to have shown up on the face of the Earth. We’re all tired of GenAI and we know its impact on productivity is doubtful, but still I wonder: will we be able to live normally as a society again?

Marc Slemko's avatar

Another problem caused by too many billionaires farting.

Bryan McCormick's avatar

Scaling will do one thing. Sacrifice more farmland, create microclimate issues, and make more heat than light. If we are very lucky as a society the broligarchs will drop these ridiculous data center plans before more millions of chips needed elsewhere end up rotting in warehouses.

Oleg Alexandrov's avatar

Yeah, yeah, the latest evil scapegoat.

Bryan McCormick's avatar

No, just the stupidest buy-in to total hype, producing no improvement but very real, great pain.

Oleg  Alexandrov's avatar

The improvement is real. Good tools for people to use. Good profits for companies, eventually. But it is early stages. Too much hype, and too much money invested, not all of it wisely.

Ryan Peter's avatar

“Broligarchs” - hahaha. That’s great!

Matthew Kastor's avatar

Is the world ready to admit that we can't replace humanity with autocorrect, not even if you add internet spice so it can form whole sentences?

Dan's avatar

One recent experience that feels illustrative to me. I asked a Legal LLM for a summary of “the Christian doctrine in the context of government contracts.” The Christian doctrine is named after a case called GL Christian & Assoc. and, in a nutshell, says that if the government forgets to include an important required clause in a contract, the law will read it in regardless. It’s a pretty simple and straightforward principle of government contracting. What I got instead was a bunch of babble about freedom of religion. Now, I had specified that I was asking in the context of government contracts, and there’s no such thing as “the Christian doctrine” in the context of religion. But obviously the LLM saw the word Christian, interpreted it as the religion, and guessed I was talking about freedom of religion. It’s a mistake no actual lawyer would make (or so one hopes). Interestingly, when I run a Google search for the Christian doctrine, the first search result is from a government contracting website, but the “AI Overview” is also about religion.

Geoff Anderson's avatar

I remember, a year and a half or so ago, one of the LLMs was arguing with me about Tim Roth being in Galaxy Quest, and it wouldn't back down from that assertion (I know, a trivial example, but it's one that sticks with me, as the simplest web search would verify that Tim Roth wasn't in the cast...)

RCThweatt's avatar

I tried AI Overview on a "how to" topic in which I'm expert, it was mostly right, but with significant errors due to logical/textual inconsistencies and a flat physical impossibility. There was, indeed, no model at work. It would have been useless, or worse, to a naive user.

I could have edited it and produced a text rather than writing it all myself, I guess. It pulled up good diagrams and illustrations (which the text contradicted). But Wikipedia looks like a better general-use case.

Coalabi's avatar

I have had similar experiences too, which I call "AI arrogance": you know what you are looking for, and the AI insists that you are on the wrong path and tells you what you "more than probably" mean. It happened, for example, when I was looking for the location of an old (WW2) photograph from the north of France. There were some textual hints (billboards, shop names) on the photograph, and some were occluded (among them a "Dubonnet" [a well-known liquor] billboard of which only "..bonnet" was visible). Google AI, to which I had added "Dubonnet" to refine the visual search, insisted that it was not "Dubonnet"; it insisted it "knew" what it really was, in this case "jbonnet" (which meant nothing anyway). I have other similar firsthand examples of AI stubbornness ...

Ian [redacted]'s avatar

I'm totally onboard with the idea that putting all of the economic eggs into the AI basket is a bad idea and bad economic policy. I think there are a lot of cool and interesting things it does that are totally compatible with memorizing how to do things (building basic websites for people who don't want to pay Squarespace $50/month, helping devs like me build annoying-boilerplate much faster) or the generative video/music stuff that I could see disrupting industries.

I wish everyone could just be blasé about it and let it evolve as needed instead of the hype bros making it a toxic subject. I don't like being lied to by hype.

C. King's avatar

Ian: A good analogy: Boilerplate.

Rob C's avatar

So far it seems like a colossal waste of resources which has produced a few useful tools, which cost 100x more to run than they can make in revenue.

All to justify the existence of a few tech companies and in the service of “growth”.

Hessie Jones's avatar

I feel like we've opened up a Pandora's box. Canada's response to the promise of generative AI is to build more data centers, shift budgets to double electricity capacity by 2050, and advance our Ring of Fire project to accelerate mining of our critical minerals — if this isn't FOMO, I don't know what is.

Gerben Wierda's avatar

The people driving the over-the-top expectations are either uninformed, stupid, or they are grifters. Or a combination of these.

GenAI is here to stay, mostly in the category of ‘cheap mental output’, but that also depends on the (still unanswered) question of whether there are ‘good enough’ areas that are actually cheap enough to produce for the quality they deliver. Scaling has so far proved to be of limited success and has driven inference cost up enormously. We probably need orders of magnitude more efficiency.

I find LLMs regularly pretty useful (once or twice a week, depending on activity), but only in areas where I already have knowledge, and almost only as a more efficient way to search actual human-produced source material. My sort of use is not going to be profitable for the vendors.

Coalabi's avatar

Well, Gerben, I don't fully concur (but "YMMV"). People usually consider me quite skilled at writing (probably because my PhD thesis promoter made me write my 200+ page thesis three times :-( despite the first version already being quite good; for sure, the third one was close to perfect ^^ but my wife, who was doing all the typesetting, was getting mad at him).

In my company, we are more than encouraged to use an internal AI tool (based on mainstream LLMs) for many kinds of tasks. Recently I had written an answer to a question for a bid. We were allowed 3 pages; I came up with 1.8 pages, so no problem. A younger colleague of mine, "just in case", submitted my answer to our AI tool: the result was "only" 2/3 of a page long; most of the elements of my (structured) answer were there, but there was no structure anymore, and nothing was stressed or highlighted anymore ... A total loss, for me. To her, it was beneficial "because we had reduced the answer to half of its original size" ... but there was no need to!!

My wife has similar experiences with colleagues who use AI to produce minutes of meetings from recording transcripts: no highlights, hallucinations, ... Where are the gains there?

For the rest, indeed, I confess: I sometimes try AI to get a quick pitch or reword a paragraph, but I find the outcome mixed at best ... When you tell them, they say: "Oh yeah! Sure! Of course you have to check thoroughly" (well, then, you'd better write the minutes yourself upfront ...)

Gerben Wierda's avatar

I never use LLMs to write for me, only sometimes as an early step to (re)search something. I’ll be co-author of a paper coming out in March that argues strongly against outsourcing writing to LLMs in education.

I have written a reasonably well-read piece about the limitations of summarising (which is also writing): https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actually-does-nothing-of-the-kind/

So I guess we agree here.

Coalabi's avatar

Indeed, Gerben, we agree ;-)

Mehdididit's avatar

AI can never be a good source for health information. All of the accurate information is behind a credential wall, accessible only to medical professionals. That’s a whole topic in and of itself, but not for here.

I have a rare autoimmune disease and was looking for information on it. This was a few years ago before AI had completely invaded search engines. I got horrible information and made myself much worse. At no point did it mention that it was giving info on autoimmune disease in general, which, in my case, is the exact wrong thing to do. Luckily, it was reversible.

I’ve since removed google from my phone (as much as possible) and switched to Duck Duck Go, which allows you to turn AI responses off. Plus it’s a company with some moral standards.

Andy's avatar

What about ChatGPT now solving ‘unsolvable’ Erdős problems almost daily, as recently confirmed by Terence Tao? GPT-5.2 has now cracked Problems #728, #729, and #397, all confirmed and verified by professional mathematicians. https://mathstodon.xyz/@tao/115855840223258103

DeludedProphet's avatar

Yet it can't play chess, being unable to make even the most basic moves (although it will talk a good game). Machines have been playing chess successfully for nearly 50 years... but not ChatGPT, despite the bottomless resources invested in it. If it can't play such a simple game...

Andy's avatar

Clearly you haven’t tried it lately. ChatGPT 5.2 PRO would almost definitely beat you. Want to test that?

DeludedProphet's avatar

By all means. I'm on 5.2 and tried this test at the end of November '25. I gave it the following moves:

a4 a5 Nf3 Nc6 d4 exd4 Nxd4 Nxd4. It then suggested c2xd4.

Obviously, that's quite a challenging move for a pawn. Several thousand words of 'explanation' later it still suggested the same move, even though it agreed it was illegal.

It's a deliberately simple, highly limited test with a bar low enough that decades-old, obsolete programming could pass it.
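For anyone who wants to check the legality question mechanically rather than by eye, here is a minimal sketch using the python-chess library (an assumption; neither commenter mentions any tooling), with the e4 e5 prefix that the quoted ChatGPT reply further down assumes:

import chess  # pip install python-chess

board = chess.Board()
# Without the e4 e5 prefix, ...exd4 itself would be illegal from the start position.
for san in ["e4", "e5", "a4", "a5", "Nf3", "Nc6", "d4", "exd4", "Nxd4", "Nxd4"]:
    board.push_san(san)

# c2xd4 is not a legal pawn capture (a pawn on c2 can only capture to b3 or d3):
print(chess.Move.from_uci("c2d4") in board.legal_moves)   # False
# The normal recapture Qxd4 is legal:
print(chess.Move.from_uci("d1d4") in board.legal_moves)   # True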

Andy's avatar

I think you are using the 'lobotomized' version (the free ChatGPT, which is useless; that's why it's free). I gave your comment to the $20-per-month model (the thinking version) and here is its response. You tell me if it makes any sense to you:

"

I get what you were testing: a bare-minimum legality check where the model just has to track pieces and basic pawn rules. I am assuming the usual prefix moves e4 e5 already happened, since …exd4 is otherwise illegal from the initial setup.

After:

a4 a5

Nf3 Nc6

d4 exd4

Nxd4 Nxd4

it is White to move in a Scotch Game type structure (open center, early exchanges). “c2xd4” is illegal because the c2 pawn can only capture to b3 or d3, not d4.

The correct recapture is 1. Qxd4 (the main-line Scotch queen recapture) or 1. Nxd4. If 1. Qxd4, the standard idea is that Black gains a tempo by hitting the queen with 1…Nf6, and White answers with 2. Nc3 to develop and hold the center. A common follow-up for Black is 2…Bb4 or 2…d6 to finish development.

"

DeludedProphet's avatar

That is definitely better, thank you for taking the time to run it through. How are you finding Pro? Worth the investment? Have you specifically tested it, or are you using it as BAU?

Andy's avatar

This is actually the "Plus" model ($20 per month), which is the most useful option for ordinary users. The "Pro" model costs $200 per month and is the one being used to solve the Erdős problems.

I’m a scientist, and I find that paying $200 per month is like having a few very bright graduate assistants who are extremely fast. The average response time from the Pro model is about 10 minutes, but it can think for an hour on some of the more difficult problems and is available 24/7.

And since I never trust the models (or my graduate assistants), I have a solid workflow to check and validate all outputs. The accuracy of the Pro model is honestly off the charts compared to my human grad assistants.

Notice that Gary never talks about the advanced models - always about the lobotomized useless ones.

Lomklal's avatar

#397 has been solved before, but #728 and #729 have been solved by computer systems. I am eagerly but kindly waiting for Gary's response.

Comment deleted
Andy's avatar

That’s your (and Gary’s) problem: ‘If a chatbot solved it, then it wasn’t unsolvable.’ No matter what happens, critics will never admit that chatbots can outperform humans in some (soon most) domains.