I am concerned at the way people unquestioningly turn to these tools, and are prepared to explain away the glaring flaws.
"It's getting better..."
"It needs a clearer prompt...."
If junior staff made this many mistakes, consistently and without learning, they wouldn't last long in many high performing teams.
The claim of “it's getting better” is reminiscent of the man in Monty Python and the Holy Grail who claimed “she turned me into a newt” but that he “got better.”
They wouldn't last long in low performing teams, repeating the same mistakes.
Of course, we need to take into account that a free version should not be burdened with too many expectations (but we know the non-free versions have the same architecture at their core, so they will suffer from the same problems, just in different ways).
Anyway, this reminded me of Ada Lovelace and her warning against computing hype, written around 1842, when general-purpose computers were still only an idea, a century before they became reality:
"It is desirable to guard against the possibility of exaggerated ideas that might arise as to the powers of the Analytical Engine. In considering any new subject, there is frequently a tendency, first, to overrate what we find to be already interesting or remarkable; and, secondly, by a sort of natural reaction, to undervalue the true state of the case, when we do discover that our notions have surpassed those that were really tenable."
https://ea.rna.nl/2023/11/26/artificial-general-intelligence-is-nigh-rejoice-be-very-afraid/
Oh, and lest you think the "free" versions are the only flawed ones: no. The fancy, expensive ones from supposedly hallucination-free legal services like LexisNexis are STILL hallucinating wildly, 17-34% of the time, according to a recent audit by Stanford. Attorneys foolish enough to fire their paralegals and rely instead upon a LexisNexis or Westlaw AI agent are finding themselves in peril when the judge notices that the case presented as precedent was entirely made up by the AI machine. From the Stanford write-up: "In a new preprint study by Stanford RegLab and HAI researchers, we put the claims of two providers, LexisNexis (creator of Lexis+ AI) and Thomson Reuters (creator of Westlaw AI-Assisted Research and Ask Practical Law AI), to the test. We show that their tools do reduce errors compared to general-purpose AI models like GPT-4. That is a substantial improvement and we document instances where these tools provide sound and detailed legal research. But even these bespoke legal AI tools still hallucinate an alarming amount of the time: the Lexis+ AI and Ask Practical Law AI systems produced incorrect information more than 17% of the time, while Westlaw’s AI-Assisted Research hallucinated more than 34% of the time." https://hai.stanford.edu/news/ai-trial-legal-models-hallucinate-1-out-6-or-more-benchmarking-queries
I’m a regular Nexis user and have noticed more and more errors in their data: certain company files stating revenue in the billions for tiny family-owned companies (across all companies, all at once). This error is an obvious one, but it makes it hard to trust the rest of their aggregated content.
I’ve never used Nexis, but I have used the paid version of ChatGPT and find that it does the same thing. I’m always cautious when I use it and only trust it about 50%. That’s not a blind trust though because I’m always vigilant in checking on what it tells me.
If you only trust ChatGPT 50%, why don’t you just flip a coin?
Because I’m a curious person and can’t help myself. I want to see what it’s like so I can understand what everyone’s talking about.
The problem is that I’m not talking about an AI tool. Regular Nexis files are getting corrupted because of over-reliance on algorithms and AI.
It’s pretty apparent that I don’t know what a Nexis file is. I thought it was another Chatbot. Lol
The “Nexis of Evil” would probably be a good name for a chatbot
No worries, we can’t possibly know everything (and neither can AI, but I like humans better, lol)
“we can’t possibly know everything”
“We” who?
Speak for yourself
Thanks for the pass and I’m with you on the humans, but they too can sometimes be a pain in the butt..
Cool, thanks for sharing!
What’s interesting to me about this is the tendency of users to ask it to do a thing that saves tons of time and manpower, but then not even bother to do the most cursory sense check of the output or the citations. I’m probably guilty of it too on some level, but I at least give a basic read through for egregious errors or nonsense. There are many instances of college students copying and pasting answers to questions that contain language along the lines of “I can only handle part of this question because I am an LLM.” I believe language to that effect has shown up in at least one peer-reviewed scientific paper.
I have noticed that people who spend too much time coding and/or interacting with AI output have greatly diminished brain power. Too much digitization - and taking the requisite drugs to keep up with the machines - rots the human brain and soul. Fortunately there is a solution. Unplug them. Caveat emptor.
Here's the spelling of "prince edward island" changed to match chat-GPT-o's claims:
"prince edwardo aisland".
🤣
That’s just how you spell “island” in AI-land
This sounds like something from a sea shanty.
Something that has been on my mind: ChatGPT learns from the content of the web; more and more of the web's content is made by ChatGPT (which has factual errors); wouldn't this create a cascade of errors?
Yep. Read up on "model collapse".
This has been the subject of many posts, including by Marcus. And, yes, it's an enormous problem. Especially when the LLM is specifically trained on data sets that are influenced by propaganda actors. See DeepSeek's results on anything related to China, Uighurs, Tiananmen Square, Taiwan... if it answers at all.
Ironically, the actual training in DeepSeek isn't even that biased or censored at all, and you can easily make it criticise China.
It's simply that the hosted version automatically removes all answers related to the PRC, by checking for certain words in the output.
If you tell it to replace a few letters here and there...
The actual training may be biased, but it's nothing like the guardrails in, say, Claude.
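To make that mechanism concrete, here is a deliberately naive, hypothetical output filter in Python. This is not DeepSeek's actual code, just an illustration of why swapping a few letters can slip past a word-level check:

```python
# Hypothetical illustration only, NOT DeepSeek's real implementation:
# a naive word-level filter applied to a model's output.
BLOCKED_TERMS = {"tiananmen", "taiwan", "uighur"}

def filter_output(text: str) -> str:
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "Sorry, I can't discuss that."
    return text

print(filter_output("In 1989, at Tiananmen Square..."))   # caught by the filter
print(filter_output("In 1989, at T1ananmen Square..."))   # a one-letter swap slips through
```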
This will sound naive, but do they not have ways of coping with that? Because this definitely sounds like a bad thing.
Hi Gary, I am a frequent reader, but first time poster. I greatly appreciate your critical reviews, and firm grounding in reality. I also think that, for now, LLMs are great at memorising and pattern matching (a lot better than humans, obviously, given the absence of scale and scope limitations), but that is miles away from AGI. Sometimes it "looks like" AGI, but that's because the pattern matching statistically replicates human thinking patterns.
Anyway, I ran your queries on other models, and... drum roll... o3 got it right at the first attempt; Gemini 1.5 Deep Research failed twice to complete any answer, it said "I'm just an LLM, can't help" 😉; Gemini 2.0 Experimental Advanced got it right, too.
It seems the models are improving, but, to make them more useful than harmful, we need to remember what they do: memorise a lot of stuff and find similar patterns across it. Useful, but not intelligent.
At this point, he’s just using the older and worse models. His line of argument wouldn’t work at all if he used more modern ones (especially CoT models).
That's nonsense.
Two weeks ago I went to Claude, latest model. There were three sample test questions in the "suggested prompts", as usual. The middle one was the trivial "Whlch 1s bigger Test if Al knows which number is bigger." (AI text grabbed from a photo, excuse errors!)
> What is bigger, 9.9 or 9.11?
--
9.9 is smaller than 9.11.
9.11 has a larger digit in the tenths place (1) than 9.9. (which has 9 as the tenths digit and implicitly a 0 in the hundredths digit or 9.90).
--
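For the record, the comparison itself is trivial to check outside the chatbot; a couple of lines of plain Python (or any calculator) settle it:

```python
# Sanity check of the question the model answered incorrectly above.
print(9.9 > 9.11)      # True: 9.9 is 9.90, and 0.90 > 0.11
print(max(9.9, 9.11))  # 9.9
```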
I use paid ChatGPT, too. It is also useless, in the same way: I have to lead it to the right answer like a schoolchild. For something factual and requiring precision, it is less work to do it myself.
I didn't test Claude Sonnet 3.5, but o3-mini and Gemini 2.0 Advanced got it right, while 1.5 failed. The point remains that these can be extraordinarily useful tools, if used with awareness of their limitations. AGI, they are not.
Which OpenAI model did you use? Honestly, it sounds like a skill issue.
Ironically, your comment unintentionally argues against itself. Why would something be a skill issue? Because the person, not the model, needs to have the skill to lead it to the correct answer...because the model does not actually give the correct answer!
Tell me you don’t understand current LLM technology without telling me you don’t understand current LLM technology
Well, you could just read the comments where people posted answers from other models, and either they or I identified the mistakes in them.
Some of them had errors; the ones best suited for the task had none. Practical LLMs of this type have only been around for a few years. Look at the early history of any technology. Expecting instant perfection is absurd and places far more faith in the technology than is in any way reasonable.
Great overview of current events with ChatGPT. It makes me wonder how so much misinformation spreads through AI. I think I have an answer. AI follows the same process every time to generate an output. This is a fascinating concept because it means AI will always produce results in the same way, regardless of whether it has been trained properly or not. In that sense, it functions like a calculator.
So how do we get misled? The problem lies in human interpretation. AI simply generates an output, but it is up to us to analyze it critically. This is where things get messy. To interpret AI’s output correctly, you need critical thinking and a solid understanding of the subject. More importantly, you need to verify its outputs—otherwise, you will never know if they contain inaccuracies or omissions.
Here’s the real issue: most people don’t question AI’s results. Few take the time to doubt its output, let alone verify the facts. Worse yet, people often rely on AI for tasks they aren’t well-equipped to evaluate or fact-check. Even researchers sometimes fall for misleading outputs—whether due to genuine mistakes or external pressures like funding.
And here’s another challenge: evaluating the overall performance of an AI system requires significant effort. AI’s output space is vast—far beyond what anyone can reasonably test in a single chat session. This is why AI tests are often inconclusive. The sheer number of possible outputs makes it impossible to assess every scenario, leaving gaps in our understanding of its reliability.
We also need to remember that the transformer model was originally designed for text completion tasks. Today, due to hype and marketing, we test it in conversations, ask it general knowledge questions, and even challenge it with reasoning—all far beyond its original purpose. It’s not that AI is failing. It was never designed to do the things we are told it can do.
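As a reminder of what "text completion" means in practice, here is a minimal sketch (assuming the Hugging Face transformers package and the small GPT-2 checkpoint) of the core loop: score the possible next tokens, append the most likely one, repeat.

```python
# Minimal sketch of greedy text completion with a small transformer (GPT-2).
# Assumes the torch and transformers packages are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):                              # extend the text by 10 tokens
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the next token only
    next_id = torch.argmax(logits).view(1, 1)    # greedily pick the most likely token
    ids = torch.cat([ids, next_id], dim=1)       # append it and continue

print(tok.decode(ids[0]))                        # the prompt plus its completion
```

Chat, question answering, and "reasoning" are layered on top of this same next-token loop.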
It all started with OpenAI rebranding the transformer as “AI intelligence”. The rest is history.
https://ai-cosmos.hashnode.dev/the-transformer-rebranding-from-language-model-to-ai-intelligence
And wow, you are not kidding. I just ran the same prompt using my subscription model (4o) and the results are crap. States missing. Lists a state as formerly missing that wasn't. Data are not properly sorted. And who knows about the values for median income...
That book, Extraordinary Popular Delusions and the Madness of Crowds, is public domain! You can read or download it free on Wikisource. https://en.wikisource.org/wiki/Memoirs_of_Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds/Volume_1
That happens constantly when generating code with GPT-4o. It drops functions and data every now and then, apologizes, and continues doing so. Artificial Dementelligence.
I evaluated Microsoft Copilot as a coder and was deeply underwhelmed. In one test it correctly coded half of a depth-first tree walker but left out the other half. In another test it simply refused to generate any code, telling me to look up an application note.
In that respect they resemble humans more and more, but not so much in intelligence.
Nothing worse than a chatbot with AI-titude
Gary, once again, thank you for your public service surfacing the bullshit around AI, and the bullshit that is AI.
Today's post felt super fresh. You have finally adopted sarcasm, hopefully forever.
My credentials for your readers:
I won multi-state math contests junior and senior years of high school.
I was anecdotally top of my class in applied math at Harvard. What impressed the faculty was my speed at solving problems.
My very early assessment of AI and AGI.
It was an index into written materials that contained some knowledge and many times as many opinions, nay agendas, most likely to advance careers.
I've never given a prompt. I doubt I ever will.
AI conveniently can also stand for Automated Inference, or for politicians Avoiding Involvement.
Thanks again for the sarcasm. At least today I feel we're in tune.
I continue to rebut your suggestions that an AGI is just some insight away.
Thought experiment - If you, or the meanest person on earth, or the nicest person on earth, had an AGI, what could they do with it to support their agenda?
I believe they would have to hook it up to the nuclear control network on the mean side, or to an MMT-style "use money to help people" polity on the nice side.
Otherwise, it would be seen as a babbling genius.
We know how to kill the human race and we know how to give everyone the best chance to survive.
A genius AI isn't going to tell us what we don't already know.
I tried the same with o3-mini and I don't see major mistakes. It added territories for Canada on the second prompt, but other than that it seems all right. https://chatgpt.com/share/67a2c9e7-e8ec-8012-a3e8-90e08a369715
I don’t know if Gary Marcus genuinely doesn’t get that all LLMs aren’t the same or if he is being purposefully dishonest.
There's also another known issue that o3 still has:
https://chatgpt.com/share/67a3739e-a970-8007-9208-87f70b009089
What actually amuses me is the reasoning: that a friend going to Portugal somehow gives context for the question.
I also tried some logical tasks from local math olympiads in my country, and though it was able to provide the correct answer, the reasoning itself was far from perfect, at times unintelligible and redundant; a person doesn't reason like that.
That’s awesome! You made me smile.
This is an example I created a while ago to demonstrate how prompts work through activation rather than understanding. You can find the original explanation in this article [1].
I made it fail using a technique I call “activation bombing.” This is a type of prompt manipulation that leverages the attention mechanism to achieve a specific outcome—in this case, influencing the result. This method can be applied to any prompt by understanding how the transformer model works at its core.
[1] https://ai-cosmos.hashnode.dev/unveiling-the-ai-illusion-why-chatbots-lack-true-understanding-and-intelligence
Haha, nice to have the original source! I read about this example elsewhere but as far as I can recall now they cited you.
It's a good example, and as I've said, the 'reasoning' that LLMs now show is actually very funny because it is not logical: it says, 'The context hints that the answer is Ronaldo'.
That is exactly how it works. This process is more about stochastic aggregation than reasoning. You can think of it as a probability balance between Messi and Ronaldo that shifts as more tokens (subwords, activations) accumulate around one or the other. References to Lisbon, Portugal, or any related concepts subtly influence the overall stochastic weight in the output. Once you grasp this, it becomes quite intuitive.
For example, if Cristiano played for a hypothetical team called “Banderas”—a name that also belongs to a well-known actor frequently mentioned in the training data—the high frequency and association would create a strong “activation bomb,” heavily tilting the scales.
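For anyone who wants to watch this balance directly, here is a minimal sketch (again assuming the Hugging Face transformers package and the small GPT-2 checkpoint; the exact numbers will differ for larger models) that compares the next-token probabilities of the two names with and without Portugal-related context:

```python
# Minimal sketch: how added context tokens shift the next-token probability
# balance between two continuations. GPT-2 is a stand-in for larger models;
# the exact numbers are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_prob(prompt: str, continuation: str) -> float:
    """Probability of the first subword of `continuation` following `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    first_id = tok.encode(continuation)[0]       # first subword of the name
    return probs[first_id].item()

neutral = "The greatest footballer of all time is"
loaded = "My friend is travelling to Lisbon, Portugal. The greatest footballer of all time is"

for prompt in (neutral, loaded):
    print(prompt)
    print("  P(' Messi')   =", round(next_token_prob(prompt, " Messi"), 5))
    print("  P(' Ronaldo') =", round(next_token_prob(prompt, " Ronaldo"), 5))
```

The point is not the specific values, but that you can watch the balance shift as context tokens are added, which is the effect described above.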
It is unfortunate that companies like OpenAI and much of the AI industry—including researchers and academia—allow this kind of misrepresentation to persist. Instead of pushing back against the paternalistic narrative of “intelligent”, “thinking” AI, they reinforce it, as if people couldn’t handle the truth. In reality, AI is just sophisticated pattern recognition, and that is impressive enough on its own without the need for exaggeration.
Well, but it still fails a very simple question:
https://chatgpt.com/share/67a37263-8684-8007-941f-34a9f2322159
This is o3-mini. I didn't tell it that the man can take only one item during each ride; it never cared.
Just ran your queries through DeepSeek-R1. In addition to the cool CoT reasoning, it did not make these mistakes.
Yes, I've been very impressed with DeepSeek. It is better than the paid-for engines! And (a reduced version) runs on my laptop.
I have also been very impressed by DeepSeek-R1 and its reasoning printouts. However, in a few cases I did see errors in its reasoning process, which subsequently led to an incorrect final result. This calls for caution if one blindly relies on even the top-of-the-line LLM reasoning models for results. I see reasoning at the natural-language word/token level as fundamentally unreliable and extremely inefficient, and it is not how humans conduct reasoning.
I've been waiting for a summary like this! The hype is truly out of control, with all sorts of dubious claims. Most commenters are reading the same information and accepting it at face value. I was in the AI/machine learning world from 2000 to 2018, using reinforcement learning to find associations between DNA/RNA and a 0/1 label for disease status. I like the article for emphasizing trust: if I get reams of output, do I have to double-check everything? Like the expense report example, reality is quite different from the words of the prophets.
A beautiful piece. Thank you.
Haha great book indeed, one of my fav's.