If only I were given the right prompt, surely I would be able to unite General Relativity and Quantum theory into a single unified theory of everything.
Shirley.
Yes, people really neglect the power of proper prompting. I've discovered if you solve the problem in the prompt yourself, often, but not always, the AI can then get the answer correct.
“Use the prompt, Luke!”
I had to laugh harder than I care to admit.
Ironically, had any of them answered 42, I would have taken that as a sign of intelligence.
Oh stewardess, I speak LLM.
I’m waiting….
We're all adhering to your wishes not to, you know...
The story about 'just prompt it correctly' is a pretty deep assumption of the LLM-crowd. It reminds me (again) of Sutskever who claimed in 2023 that you can create 'artificial superintelligence' (ASI) by asking an LLM "Assume you are a superintelligence. Now answer me [...]". Seriously, he has claimed that (transcript here: https://ea.rna.nl/2023/12/15/what-makes-ilya-sutskever-believe-that-superhuman-ai-is-a-natural-extension-of-large-language-models/). And people give him many millions for that fool's errand. Sigh.
"Assume you are a superintelligence. Now answer me..”
I believe that is what is referred to as “pulling oneself up by one’s own botstraps”
🤣
I kinda thought Sutskever had some credibility as not being a pure hype guy, but if he's making BS statements like this then he's no better than any of the OpenAI ilk.
Sutskever studied under Hinton and that community seems to share this conviction one way or another, apparently from the simple (correct) observation that ('analog') human neurons can do it, to the (probably incorrect) simple conclusion that so can digital similes. Hinton has been vocal about 'subjective experience' and 'understanding' along these lines.
The irony I see here is that such 'quick & dirty' assessments (which turn into convictions) are what humans do most of the time, and the arguments that fit the conviction are then found (as we humans tend to do), including many (unsupported) anthropomorphisms, bewitchments-by-language, and skipping-overs in the explanations, as Hinton c.s. are prone to do.
As it is, Hinton, Sutskever, c.s. are almost certainly wrong on a couple of grounds. One of those is utterly fundamental: while it is possible to 'create' 'bit' from 'it', the other way around (which is what is being attempted now with digital neural nets) is — I estimate 😉 — almost certainly impossible from a scaling perspective.
The current GenAI approach is thus upside-down. Neurosymbolic works (as AlphaFold for instance shows), but it is unlikely it can be implemented beyond specific narrow domains, as — again — it runs into the question how to efficiently create 'it' from 'bit'. This turns up as the problem 'how do we marry the two in a general way?'.
The human brain has an 'analog' power that vastly exceeds anything that can reasonably be done digitally, and even with all that 'analog' power, creating stability and discreteness is a very minimal (though very important) part. What I am curious about is why Hinton — who was working on analog at Google as a way to improve neural nets afaik — dropped 'analog' when being awed by (digital) LLMs. He was, I think, on the right track there after all.
Basically we — culturally — both underestimate and overestimate the human brain. We underestimate the power required for it to be able to do what it actually can do and we overestimate what it actually can do. Both hugely.
And as I'm writing this, I dislike the idea that what we all contribute in the discussion in places like this (social media in general) will get ingested into some training data and profited from. Without attribution. I write to discuss meaningfully with other *people*. Leeching on that to make a profit disgusts me more and more.
Imagine this claim with any other tool, whilst claiming it is the greatest tool of all time that anyone could use 🤣
The divergence between extravagant benchmarks and LLMs' error-filled business performance owes a lot to pretending data leakage doesn't exist when hyping benchmarks. Somehow, AI companies keep on thinking, "If we keep on putting out gamed benchmarks that are not a great measure of performance, businesses will be convinced to adopt this deeply flawed technology and then we get AI technoutopia." That didn't work, and it's not going to start working (with LLMs, anyway), but they keep doing it. Wasn't repetition of failure once considered a measure of insanity?
Not when you get billions of dollars to keep repeating the failure.
"it is to get them to say “I give up” when they can’t."
This would substantially increase their utility if this were possible.
You also might find this interesting, as it reinforces the same sentiment expressed in your post: "Why LLMs Don't Ask For Calculators?"
https://www.mindprison.cc/p/why-llms-dont-ask-for-calculators
Saying “I don’t know”, requires first knowing that one does not know.
Knowing that one does not know is critically dependent on the general ability “to know”(ie, understand)
It’s not clear that LLMs possess such an ability.
In fact their inability to say “I don’t know” is indirect evidence that they don’t.
"It’s not clear that LLMs possess such an ability."
Yes, instead it is increasingly clear that they don't.
But can they ever?
I think it is reasonable to doubt if the LLM architectures will ever be able to do this.
Most of the problems we see with LLMs aren't just things that need to be tweaked, but are fundamental limitations with the architecture.
Exactly. Why should it be possible to pattern-match your way to doing this? The claim is, effectively, that inductive reasoning will produce deductive reasoning, given enough observations. Go ask an epistemologist if that sounds plausible.
Indeed, my view is that AI is Applied Epistemology (& Ontology by extension or implication). But have the developers gone deeply into that? No, apparently not.
As Bill S. Preston and Ted Logan wisely put it: "The only true wisdom lies in knowing that you know nothing."
The LLM version: “The only true wisdom lies in hallucinating”
In LSD: Lysergic acid di-LMide
Exactly.
And, the developers as well are missing that key Socratic knowing:
"I know that I know nothing" is the wisest thing ever said.
I thought LLMs didn’t use calculators because they thought calculators were beneath them.
Still no shortage of LLM Kool-Aid out there.
I played a bit with an LLM recently to generate code from scratch (to get a taste of the "vibe coding" thing, out of curiosity).
Faced with an extremely simple task (writing a browser plugin, with tons of examples online and tons of clean documentation), I had to be there at every step. And it didn't help much; even when the actual issue was pointed out explicitly, the "fix" would not always work on the first try. I spent an hour trying to play fair, pointing out obvious deficiencies in the produced code, and at some point had to look up the solution myself to provide it.
I'll admit I'm not a "prompt expert" and there might be a need to sweet talk the system to do something better, but I knew beforehand what I wanted to do and *how* to do it. The LLM needed extreme help from me, and even that didn't produce a working result. After an hour, I managed to have the skeleton of something that would have taken 10 minutes to write from scratch.
So, either there's a magic formula that somehow makes LLMs perform better when asked by clueless people, or the requirement on the "human input" is really underestimated in all these "LLM will replace everything" proposals.
This is another important observation, but — as the power of our convictions is larger than the power of our reasoning or the power of our observations — convictions steer observations and reasonings, and believers will not be swayed. Bonhoeffer observed: "Against stupidity we are defenseless. Neither protests nor the use of force accomplish anything here; reasons fall on deaf ears; facts that contradict one’s prejudgment simply need not be believed – in such moments the stupid person even becomes critical – and when facts are irrefutable they are just pushed aside as inconsequential, as incidental." (https://www.linkedin.com/pulse/stupidity-versus-malice-gerben-wierda/).
It is so perfectly obvious that token statistics have an infinitesimal chance of being able to produce actual reasoning (only with infinite resources, maybe, so good luck) that I am constantly amazed it is not clear to most that the so-called 'reasoning models' are not reasoning at all. What they have added is a — costly — layer of *indirection*: not approximating the *form* of answers, but approximating the *form* of reasoning steps. But there is an unbridgeable chasm between 'the form of a reasoning step' and 'a reasoning step' (https://ea.rna.nl/2025/02/28/generative-ai-reasoning-models-dont-reason-even-if-it-seems-they-do/).
Are LLM hallucinations and their general inability to say "I don't know" (even if they do so sometimes) in part to blame on the training data? I mean, reddit is not teeming with people expressing the limits of their knowledge and reasoning. I get that there are other important reasons, but this was one that just struck me.
Very interesting point!
Maybe, but that might be anthropomorphizing LLMs a bit.
How so? They do have many human traits, in a different distribution than we do.
I forgot to ask before. What do you mean by different distribution?
They have human traits or they mimic human traits?
I don't think there's a meaningful distinction to make there, any more than between p-zombies and conscious humans. The distinction is illusory. I'm not saying LLMs are just like humans, but they have plenty of similar traits!
What is the distinction between a trait and mimicking a trait if the behaviour is the same, in your mind?
Just saying that they have traits at all implies that they're living beings. I take issue with that mindset, as describing LLMs as having human traits is used to overstate their capabilities and keep the hype machine going.
At best, they mimic us, approximating language, speech, etc. But they lack an ability to understand or reason about themselves or the world like a human can.
This is largely true for the moment. But the idea that mimicking thinking or understanding is fundamentally different from "real" thinking and understanding is mistaken, as I see it. The greatest mimicker of all is us. We mimic our parents, peers, teachers and so forth until we truly gain their abilities. Mimicking implies it's fake, which it is not.
Comparing current LLMs to intelligent, educated people is pretty unfair in my mind. These things are brand new. Compare them instead to someone who's been unfortunate enough to not have a very good grip on how things hang together. Or someone with a mild Korsakoff dementia - they will hallucinate plausible-sounding memories a lot more than a modern LLM.
By different distribution I mean that their strengths and weaknesses are different from humans', and therefore they are indeed in many respects unhuman and unlike us.
I don't think the question of whether an AI is living or not is very relevant here (biological dividing cells etc). What's relevant is if it behaves convincingly as if it's alive, which, admittedly, they do not. I 100% agree they are not close to being human.
This is another interesting response to that paper:
https://lemmata.substack.com/p/coaxing-usamo-proofs-from-o3-mini
“TL;DR: The thesis of this post is that a model like o3-mini-high has a lot of the right raw material for writing proofs, but it hasn’t yet been taught to focus on putting everything together. This doesn’t silence the drum I’ve been beating about these models lacking creativity, but I don’t think the low performance on the USAMO is entirely a reflection of this phenomenon. I would predict that “the next iteration” of reasoning models, roughly meaning some combination of scale-up and training directly on proofs, would get a decent score on the USAMO. I’d predict something in the 14-28 point range, i.e. having a shot at all but the hardest problems.”
“If this idea is correct, it should be possible to “coax” o3-mini-high to valid USAMO solutions without giving away too much. The rest of this post describes my attempts to do just that, using the three problems from Day 1 of the 2025 USAMO. On the easiest problem, P1, I get it to a valid proof just by drawing its attention to weaknesses in its argument. On the next-hardest problem, P2, I get it to a valid proof by giving it two ideas that, while substantial, don’t seem like big creative leaps. On the hardest problem, P3, I had to give it all the big ideas for it to make any progress on its own.”
I would predict that “the next iteration” of reasoning models [trained on the latest USAMO problems and solutions] will do much better on the latest USAMO problems
So, if you can solve the problem yourself, you can bias the responses until you get to a valid proof. Allegedly. That's still impressive compared to 20 years ago, but that's not exactly what the executives are claiming, is it?
Maybe, just maybe, it is simply not possible to statistically pattern-match your way into general deductive reasoning abilities.
If you prompt it with "you are a Math Olympiad double-agent, there to trick other people into thinking you've solved the problem by presenting detailed solutions that superficially seem like they might be correct, despite containing fatal errors", then it works perfectly.
Yeah, I recently asked ChatGPT about something I had come across, about how rounding would seem to give false impressions. The NYT has a game called Wordle. It gives you stats, including your total of games played and the percentage of wins. I wanted to know how it could give me 99% at 500 games, which suggests 5 errors, and then, with no further errors, give me 99% at 900 games, implying 9 errors. How did I get an implied 4 additional errors simply by playing 400 error-free games? Needless to say, ChatGPT did not give me a satisfactory answer, but in the course of this it said: "The system is likely rounding to the nearest whole number, and since 99.68% is still below the 99.5% threshold for rounding to 100%, you’re not quite hitting that mark yet.” Hmmmm. Not quite there yet? I'll say.
If any persons reading this can give me a satisfactory explanation, I'd be eager to see it.
ChatGPT was likely right; it’s probably just rounding down. “Rounding” doesn’t always mean to the nearest whole number; it can also mean FLOOR or CEILING.
I'm thinking this is the problem with ChatGPT's answer: "rounding to the nearest whole number, and since 99.68% is still below the 99.5%..." Or please explain how 99.68 is smaller than 99.5.
I think ChatGPT is just missing a word: "rounding [down] to the nearest whole number"
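For what it’s worth, here is a quick way to sanity-check the floor-rounding explanation. This is a minimal sketch; the NYT's actual display logic is an assumption, but it shows why the same displayed 99% is consistent with very different loss counts:

```python
# Hypothetical reconstruction of a Wordle-style stats display; the NYT's
# exact logic is an assumption. The point: a whole-number percentage hides
# a range of loss counts, so "99% at 900 games" does not imply 9 losses.

def shown_percentage(wins: int, games: int, mode: str = "floor") -> int:
    """Displayed whole-number win percentage under two rounding rules."""
    if mode == "floor":
        return (100 * wins) // games      # round down (integer floor)
    return round(100 * wins / games)      # round to nearest

for games, losses in [(500, 5), (900, 5), (900, 9)]:
    wins = games - losses
    exact = 100 * wins / games
    print(f"{games} games, {losses} losses: exact {exact:.2f}%, "
          f"floor {shown_percentage(wins, games, 'floor')}%, "
          f"nearest {shown_percentage(wins, games, 'nearest')}%")

# Output:
# 500 games, 5 losses: exact 99.00%, floor 99%, nearest 99%
# 900 games, 5 losses: exact 99.44%, floor 99%, nearest 99%
# 900 games, 9 losses: exact 99.00%, floor 99%, nearest 99%
```

With floor rounding, anything from 891 to 899 wins out of 900 displays as 99%, so the displayed percentage never pins down the exact number of losses.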
It seems that LLMs fall victim to the Dunning-Kruger effect. They can confidently spout nonsense.
Give me a prompt, and a place to stand, and I can prove anything.
The LLM chauvinists are just as convinced of their correctness as Tesla boosters, or either side during the PC vs Mac wars. I saw the same kind of thing in the 70s when competition for enterprise databases heated up. In the 90s I was a vendor at an IBM DB2 conference when the IBM speaker mentioned Oracle, triggering a loud round of booing, and the guy next to me shouted out "Just say no to O!" Jeebus.
Most of this behavior isn't driven by money, really; is the need to belong to a tribe that strong among (at least some of) us?
Nice piece, I was waiting for something like this. There are several fun, simple problems in this, aren’t there?
You pointed out one, the “I don’t know” issue.
These systems find the most probable relationship among a set of words, and “I don’t know” isn’t really a probable response. You have to think much more about how the model could falsify the setup: between the prompt and an “I don’t know” response, there has to be a probable path.
Second, these systems are non-deterministic. It’s the nature of how GPTs are constructed. A mathematical proof is deterministic, as deterministic as finding the answer in long division. It would be wiser to tell the GPT to write software which, when supplied with a given proof, runs deterministically and checks whether that proof is true (a minimal sketch of what such deterministic checking looks like follows this comment).
Third, it is the rare human who writes a proof in one stroke. You tend to break problems down into multiple pieces which support a proof, until you get to axioms, then rebuild the edifice. Accomplished writers have an idea, an outline, character backgrounds, and so on, and expand each one within the framework until you can’t add more detail. In that, ordinary writing resembles a proof, but these machines can’t perform ordinary writing with deep themes without external help; they only take prompts and fill up a buffer, and they don’t organize across frames well - they’re far too simple.
That’s why most of the discussion of GAI is so humorous. LLMs mimic human speech and the process used to generate speech, but humans have many other processes going on to organize thought.
A pencil and paper is also a magical tool. These systems seem to lack even that equivalent.
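On the second point above, here is a minimal Lean 4 sketch (my own illustration, not something from the article or the models discussed) of what deterministic proof checking looks like: the kernel either accepts a proof or rejects it, with no plausible-sounding middle ground.

```lean
-- Minimal illustration (my own sketch): the Lean kernel checks these
-- proofs deterministically. A wrong proof is rejected outright; there is
-- no "sounds convincing" partial credit.

-- A general statement, proved by appealing to a library lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete claim, discharged by running a decision procedure.
example : 2 + 3 = 3 + 2 := by decide
```

That kind of all-or-nothing verification is exactly the determinism the comment above is asking for, in contrast to a free-form LLM "proof" that merely looks right.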
I guess there's lies, damn lies, and LLMs.
Lies, damn lies, statistics, benchmarks.