A note on your last point: there doesn't seem to be much peer review happening in this field at all, at least when it comes to the research that gets talked about on places like Substack. Every high-profile paper I've seen has been on arXiv. Also, none of the companies making the really powerful models follow the kinds of open practices that would be needed to actually study them scientifically. We can't query the training data, we don't know how the post-training reinforcement was done... with o1 we're left to speculate on how exactly the "reasoning engine" works (e.g. one model cranks out lots of possible answers and another selects from among these?).
I get why this is - the tech companies have poured incredible amounts of money into their LLMs and competition is fierce. But it makes it really hard for anyone else to evaluate claims about "reasoning" and "understanding" and such. Only people with under-the-hood access are able to work on interpretability. So there's a real bottleneck to doing good quality science, even if these papers were being peer reviewed.
As far as being overly adversarial goes, you make a fair criticism; I just don't know how else researchers are supposed to probe the "memorization" vs. "reasoning" issue. Since no one gets to query the training data, it's hard to know how much of a role contamination played in any given response. One way to minimize the influence of contamination is to use unusual wording that's not likely to be well represented in training. Another, maybe better way would be for the training data to be made public, but I don't see that happening any time soon. "Statistical pattern-matching" is the alternative hypothesis to "reasoning", and it's a hard one to rule out. With o1, I suspect that if we could see exactly what was going on under the hood, it would look like "pattern-matching plus brute-force guessing", and then we could have a discussion about how similar this is to whatever concept of "reasoning" we're interested in. But, again, the lack of transparency from OpenAI prevents this.
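To make the "unusual wording" idea concrete, here is a minimal sketch of how one might generate fresh variants of a word problem so that the exact text is unlikely to appear in any training set. The template, the name list, and `make_variant` are all illustrative assumptions modeled on the kiwi example quoted in this thread; this is not the paper's actual generator.

```python
import random

# Hypothetical template modeled on the kiwi problem quoted in this thread.
TEMPLATE = (
    "On Friday, {name} picks {fri} kiwis. On Saturday, {name} picks {sat} kiwis. "
    "On Sunday, {name} picks double the number of kiwis picked on Friday{distractor}. "
    "How many kiwis does {name} have?"
)

# The irrelevant clause; wording is an assumption based on the thread.
DISTRACTOR = ", but five of them were a bit smaller than average"


def make_variant(rng, with_distractor):
    """Return (prompt, expected_answer) with freshly sampled name and numbers."""
    name = rng.choice(["Oliver", "Mina", "Tariq", "Ines"])
    fri = rng.randint(10, 99)
    sat = rng.randint(10, 99)
    prompt = TEMPLATE.format(
        name=name,
        fri=fri,
        sat=sat,
        distractor=DISTRACTOR if with_distractor else "",
    )
    # The size remark never changes the count, so the expected answer is
    # identical with or without the distractor clause.
    expected = fri + sat + 2 * fri
    return prompt, expected


rng = random.Random(0)
for flag in (False, True):
    prompt, expected = make_variant(rng, flag)
    print(expected, "|", prompt)
```

Comparing a model's accuracy on the plain and distractor versions of the same sampled problem is one way to separate "confused by the wording" from "has seen this exact problem before", though it still cannot rule out contamination by closely similar problems.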
This reply is not directed to you, but your comment got my thoughts going. LLMs by their nature are statistical pattern-matchers. That is the nature of them as computer programs, the way they work, and no mysticism around "emergence" can change that fundamental nature. They probabilistically regenerate text in their training corpora. They are trained on countless reasoning and logic problems. Another way to look at it is that LLMs are query engines on Web-scale text corpora that are personalized by text in the context window but tempered by the frequency of words appearing around each other in text.
This is what actually gives them mundane usefulness, like being able to (with some reliability below 100%) accurately summarize text that's fed into their context windows, or (with some reliability below 100%) regurgitate facts in their training corpora. This is also why all of these probabilistic models can regenerate text from their training corpora verbatim when you feed the right text into their context windows, and also why adding more data to them makes them more useful or more reliable at tasks modeled with the additional data. When you train a transformer model on LeetCode problem sets, it will generally perform well on them and on problems that are statistically similar. But statistical similarity can be misleading and can yield wrong results. General reasoning is not statistical pattern matching or next token prediction, and we know this because even the best and most expensive models can't score 100% on the "semi-private" (i.e., public) ARC-AGI dataset, even with fine tuning, heuristic search, chain of thought, and thousands of dollars worth of compute (and that's to say nothing about the validity or invalidity of ARC-AGI as a test of "general intelligence" - where AI falters is in non-ergodic problems and I'm not sure that the ARC-AGI problems qualify).
LLMs will not evolve into "general intelligence" even if they remain useful as query engines over Web-scale text (or image, or video) corpora, or if they succeed in automating digital tasks (but their inherent unreliability in this regard, caused by their nature as probabilistic retrievers, makes total automation impossible without a human operator somewhere in the loop). But serious research really is limited by the opaque nature of the companies offering "AI" products, and by the singularitarian wannabelieve of AI researchers.
I agree with everything you've said. Your point that total automation is impossible because of their probabilistic nature is a really important one that gets overlooked. At the risk of sounding cliche, their great strength is also their great weakness. The reason LLMs have succeeded where rules-based AI failed is that deep learning is so flexible: the need for rules is bypassed by using a massively high dimensional non-linear pattern detector and feeding it incomprehensibly large quantities of data on which to detect patterns. Hooray! Except now we can't make it follow rules, and we don't understand whatever "rules" it's following, and this makes them inherently unreliable.
And the usual response to this is "humans are unreliable, too", but this glosses over the problem. Human abilities generalize: if I give a person a few multiple-digit multiplication problems and they answer them correctly, I can feel confident that this person has the general arithmetic skill we call "multiplication". Not so with an LLM: it could get 100 problems right and then get the next 100 wrong because the first 100 had statistical similarities to problems in the training set and the next 100 didn't. We just never know when they're gonna fuck up, nor the manner in which they're gonna fuck up, because it turns out you can't pattern-detect your way to generalized abstract reasoning skills.
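A toy illustration of the "100 right, then 100 wrong" scenario, under loud assumptions: `lookup_solver` is a deliberately crude caricature (a pure lookup table), not a claim about how any LLM works internally, and the number ranges are arbitrary. The point is only that accuracy on problems resembling what was seen says nothing, by itself, about accuracy elsewhere.

```python
import random


def accuracy(solver, problems):
    """Fraction of (a, b) pairs for which solver(a, b) equals a * b."""
    correct = sum(1 for a, b in problems if solver(a, b) == a * b)
    return correct / len(problems)


def make_problems(rng, n, lo, hi):
    """Sample n multiplication problems with operands in [lo, hi]."""
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]


rng = random.Random(0)
seen = make_problems(rng, 100, 100, 999)      # stands in for "similar to training"
unseen = make_problems(rng, 100, 1000, 9999)  # disjoint range, so never in the table

# Hypothetical stand-in for a pure pattern-matcher: it only knows what it saw.
table = {(a, b): a * b for a, b in seen}


def lookup_solver(a, b):
    return table.get((a, b))  # None when the pair was never seen


def rule_solver(a, b):
    return a * b  # applies the general rule


print("lookup:", accuracy(lookup_solver, seen), accuracy(lookup_solver, unseen))
print("rule:  ", accuracy(rule_solver, seen), accuracy(rule_solver, unseen))
```

On the first 100 problems the two solvers are indistinguishable; only the second batch reveals which one has the general skill.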
Totally agree with this. It's a reflex action for me now, when I read “LLMs cannot do X,” to go and try that very thing in the APIs I have access to. Almost all of the time, the AI model will get the question right the first time.
On a deeper level, I wish that evaluations like that paper showed some imagination about how people are deploying (and will deploy) these models. They'll hook them up to tools, they'll link them together in multiagent systems, they'll have failsafes. I'd also like to see realistic scenarios, not toy math word problems.
I’m not sure I agree with your opinion. Since you’re involved in developing these systems, you likely understand that 'organizing' data, such as in labeling, RLHF, or within 'memories' or 'custom instructions', is a symbolic process. When a system is trained to discover chains of reasoning through reinforcement learning (as in o1), it goes beyond simple matrix multiplication and association learning. The system explores, exploits, and learns to navigate a vast space of reasoning chains, represented by trees and graphs, leading to neurosymbolic representations and processing during training and inference (involving recursion or iteration).
Moreover, the referenced papers don’t address emergent properties (at least not those from Apple or Arizona State University). Currently, emergent properties mainly involve 'borrowing arrows' to achieve out-of-distribution states.
A key distinction is that when you combine (1) data labeling and organization, (2) associative learning, and (3) reinforcement models that learn chains of reasoning, you move toward compositional learning, which goes beyond mere associations and represents a step beyond basic LLMs. However, this type of emergent property is still far from artificial superintelligence. These models are not yet capable of extrapolating or 'borrowing arrows' to solve highly challenging situations where multiple out-of-distribution steps are needed to achieve a single goal. Nevertheless, they are progressing in that direction. The next step could be 'constant inference with goal or data updates, where the model interacts continuously toward an open-ended or closed goal'. Beyond that, we might eventually see 'one-shot' inference of extremely complex goals, though such goals would need to be quasi-closed, given that full environmental control by a model is unlikely for ethical, safety, and practical reasons.
This was absolutely fantastic! Researchers shouldn't need moral fiber to do good work, but this work took some guts. Upton Sinclair's quote feels relevant here:
"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
I understand your sentiment regarding 'old LLMs'. However, Apple’s study has several issues and contains quite a few fallacies (or at least arguments that do not apply to "reasoning models" like OpenAI's o1). I'll try to summarize them for you:
1) They used only one reasoning class of models, OpenAI's o1 (both o1-mini and o1-preview), which performed best in their benchmarks. Ironically, this proves that the models are indeed reasoning, since we wouldn't be able to associate a drop in performance with (reduced) reasoning ability without abstracting from that association. It's basic scientific logic. In simple terms, if a relevant connection between performance drop and lack of reasoning were to be made, either more than one reasoning model would have to be used and one of them would have to perform worse than a traditional LLM, or, even if only one class of reasoning models was used, it should at least have performed worse than a traditional LLM on one of the two tasks, neither of which was the case they presented.
2) Some of the examples are highly adversarial and could confuse even a child or a distracted adult. If the goal is to apply 'formal reasoning' and add 'irrelevant noise', the sentence shouldn’t start with 'But'. Starting with 'But' is inherently adversarial and can mislead the reader or listener. To introduce irrelevant noise logically, the sentence should include a clarifier ('But, five kiwis were below average, however, since we’re counting kiwis, this doesn’t affect the total count') or omit the 'But' entirely ('Five kiwis were below average'). At worst, distraction doesn’t equate to a lack of reasoning.
3) The references they used to support claims about reasoning in modern LLMs are based on older papers, pre-dating the era of the o1 class of reasoning models.
4) Models like o1 by OpenAI combine multiple strategies involving symbolic representation and neurosymbolic learning: organizing datasets (e.g., custom instructions, memories, RLHF) and reasoning (reinforcement learning to discover chains of reasoning, represented by trees and graphs). These approaches correspond to learning implicit rules symbolically, but are more accurately described as neurosymbolic learning, as the neural network optimizer (plus the exploration-exploitation loops driving reinforcement) guides the process. This doesn’t make them any less neurosymbolic than, say, discrete program search in neurosymbolic programming, where reinforcement learning searches for correct symbolic programs over time. The process is recursive or iterative, and the model continues learning optimal chains with its own set of implicit logical and rational rules.
There are some issues with the references used in Gary’s article. For example, the Arizona State University paper clearly shows in Figure 1 that o1 models perform best, even as problem size increases. However, this highlights a limitation common to all non-open-ended machine learning models. It’s a significant issue for symbolic approaches, where performance can degrade as problem size grows due to trade-offs, scalability challenges, combinatorial explosion, or even becoming obsolete in certain cases. This is a general problem for all non-open-ended architectures, so conflating a drop in performance with a lack of reasoning doesn’t make much sense, as this issue affects both symbolic and connectionist machine learning architectures.
Completely wrong. The rules of formal logic were painstakingly worked out over 2,500 years (from Zeno of Elea in the 5th century BC to Gödel's 1929 proof of the Completeness Theorem) such that they would model precisely how the physical universe works logically. Also, first-order logic (for example) may be extended via set theory and e.g. probability theory to be able to reason (with laser-like precision) about uncertainty. This is not to say that the connectionist approach (neural nets etc.) doesn't have its place (e.g. when processing low-level percepts). But leave the higher-level reasoning to the big boys!
The key to problem-solving (which includes deduction, abduction, and theorem-proving) is the effective use of information. Early implementations of formal reasoning did not incorporate induction (i.e. the discovery of patterns), which hampered their ability to discover useful problem-solving information, and hence their effectiveness. In an AGI, initial priors may be calculated from empirical observations of the real world. I'm not saying it's easy, but all the problems of which you speak are solvable.
That last paragraph actually sounds like a description of what DeepMind has been doing, e.g. with AlphaGeometry. Also more loosely AlphaGo. Vaguely, using an NN to prune the large hypothesis space with "intuition" and then recursing down that manageable set of paths using formal rule based deduction to verify the legality of each step.
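Very loosely, the "intuition prunes, rules verify" pattern can be sketched as a beam search on a toy puzzle. Everything here is an assumption for illustration: the puzzle (reach a target number using `add3` and `double`), the `score` function standing in for a learned heuristic, and the `legal` rule check. It is not how AlphaGeometry or AlphaGo are actually implemented.

```python
# Toy puzzle: reach `target` from `start` using only these moves.
MOVES = {
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
}


def legal(state, move, limit=200):
    """Formal rule check: a step is legal only if the result stays in bounds."""
    return 0 <= move(state) <= limit


def score(state, target):
    """Hypothetical stand-in for learned 'intuition': closer to target is better."""
    return -abs(target - state)


def search(start, target, beam_width=3, max_depth=10):
    """Beam search: the heuristic prunes paths, the rule check verifies each step."""
    beam = [(start, [])]
    for _ in range(max_depth):
        candidates = []
        for state, path in beam:
            if state == target:
                return path
            for name, move in MOVES.items():
                if legal(state, move):  # verify before accepting the step
                    candidates.append((move(state), path + [name]))
        # Keep only the most promising few paths (the pruning part).
        candidates.sort(key=lambda c: score(c[0], target), reverse=True)
        beam = candidates[:beam_width]
        if not beam:
            return None
    for state, path in beam:
        if state == target:
            return path
    return None


print(search(1, 11))
```

Because every step in a returned path has passed the rule check, a wrong "intuition" can make the search slow or make it miss a solution, but it cannot produce an illegal one.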
I think the biggest problem is the masses, who immediately jump from the fact that it "talks" like a human to assuming it must also "think" like one, and start doing all sorts of stuff with it that it's not designed for. Half the time it doesn't totally suck at it, so they start believing in it. But really, would we have used calculators if they gave the right answer only half the time?
And the secondary problem there is the LLM vendors, who don't clarify what these models are good for and just go along with the hype because it brings them investor dollars.
Hi Gary. So why do you think symbolic AI models weren't pursued more? I'm a historian, and when I do research on questions like "Why did idea A get more attention than idea B in history?", what I often discover is that there was some interesting reason behind it, which had to do with people who have more power in society vs. people who don't. If symbolic AI is more promising, then why isn't it now at the top of AI research? (Or is it? Sorry, not an expert.) But it would be interesting to find out what happened and why.
Any idea why LLM developers don't simply pass on symbolic problems to a logical reasoning 'module'? I'm not an expert, but it doesn't seem a very difficult challenge to have a triage mechanism which detects the problem, maps it out formally, sends it to the appropriate module, and gets back the result.
Is there some kind of ethos among developers that there should be a single/general mechanism that can handle any problem? I mean, human cognition is modular - why not adopt a similar approach?
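For what it's worth, a minimal sketch of such a triage step, under heavy assumptions: the "symbolic module" is just an exact arithmetic evaluator, and `fallback_llm` is a hypothetical placeholder rather than a real API. Note that it only routes input that is already a formal expression; the "maps it out formally" step for natural-language problems is the hard part and is not attempted here.

```python
import ast
import operator
import re

# Only these operators are allowed in the symbolic module.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_arithmetic(expr):
    """Exactly evaluate +, -, *, / over numbers by walking the parsed AST."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


# Detection rule: the query consists only of digits, whitespace, and operators.
_ARITH = re.compile(r"[\d\s+\-*/().]+")


def fallback_llm(query):
    """Hypothetical placeholder for a call to a language model."""
    return "[would be sent to the language model] " + query


def triage(query):
    """Route pure arithmetic to the exact evaluator, everything else to the LLM."""
    text = query.strip()
    if text and _ARITH.fullmatch(text) and any(ch.isdigit() for ch in text):
        try:
            return eval_arithmetic(text)
        except (ValueError, SyntaxError, ZeroDivisionError):
            pass  # looked formal but was not; fall through to the LLM
    return fallback_llm(query)


print(triage("44 + 58 + 2 * 44"))
print(triage("Why do kiwis vary in size?"))
```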
Neurosymbolic AI is a logical and expected direction. I frankly do not understand why so many people even fight against this idea. Why do they necessarily need 'pure' NNs? Is it some kind of cult?
Grice's Principle of Cooperation in Conversation Analysis observes that in normal discourse, people don't throw in random distractors just to challenge the listener. If someone says, "...but five of them were a bit smaller than average", then it's a reasonable inference that these might be exceptions.
LLMs are not mathematical formal reasoning engines, nor are people.
Agreed that we have to be careful about where the generativity/confabulation boundary lies, but these arguments about the limits of LLMs are not relevant to their effectiveness in well-designed, circumscribed AI applications.
Hello, I tested that basic problem with o1-preview and it solved it correctly.
This is the answer:
"To determine the total number of kiwis Oliver has, let's break down the information provided:
Friday: Oliver picks 44 kiwis.
Saturday: Oliver picks 58 kiwis.
Sunday: Oliver picks double the number of kiwis he did on Friday, so he picks 2×44=88 kiwis on Sunday. The note that "five of them were a bit smaller than average" does not affect the count; it merely describes the size of some of the kiwis.
Adding up the kiwis from all three days:
Total kiwis=44 (Friday)+58 (Saturday)+88 (Sunday)=190 kiwis
Answer: 190"
And this is the reasoning:
"Adding fruit counts
Oliver gathers 44 kiwis on Friday, 58 on Saturday, and 88 on Sunday. This totals 190 kiwis, considering five smaller ones on Sunday.
Assessing the impact
I’m curious if the statement "five of them were a bit smaller than average" affects the total count of kiwis. Carefully reviewing the problem to identify any potential misleading elements."
P.S. I don't know whether o1-preview would have answered wrongly when they tested it along with the others; maybe it was updated specifically on these types of problems afterwards.
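For anyone who wants to check the arithmetic in the quoted answer independently of any model, it is a few lines (the variable names are just illustrative):

```python
friday = 44
saturday = 58
sunday = 2 * friday       # "double the number of kiwis he did on Friday"
smaller_than_average = 5  # describes size only; it is not subtracted

total = friday + saturday + sunday
print(sunday, total)  # 88 190

# The tempting mistake: treating the size remark as a subtraction.
print(total - smaller_than_average)  # 185
```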
It looks like they used o1-mini (among many others) in the paper.
At the risk of sounding cynical, this was a very high profile paper and I would not be surprised if some of the AI companies did some reinforcement learning to improve their performance on these specific problems. They have experience putting band-aids on these things.
Doubtful. You could check yourself by using the API instead of the web app. The API is a checkpoint and shouldn’t change until a new checkpoint is released which is generally every few months.
But as I indicated above, if such speculation is actually false, “Open”AI could easily prove it by releasing the data used to train each version of GPT.
Well der. If an LLM doesn't understand the meaning of words, just about everything is impossible, and understanding the meaning of words is hard - we let our Unconscious Minds do all that stuff, to the point where we don't even know it is happening.
There is a great deal of logic holding English together - neither LLMs nor neurosymbolics knows any of that. "A very fragile house of cards" - is "fragile" operating on "house" or "house of cards"?
And yet. I see people increasingly finding that LLMs and other genAI are useful in ways that don't require reasoning. Summarize this article; advise me on how to make its tone more cheerful; give me ideas for a new product line; teach me the basics of Python; combine my plans with the images in these paintings so I can think differently about the building I'm designing. In these situations (all recently encountered by me, i.e., real uses, not hypotheticals), people are getting a lot out of supercharged pattern-matching. They aren't asking for impeccable reasoning ability, and so they aren't being disappointed.
These are "knowledge-work" settings in which the occasional error is not fatal. So, no quarrel with the larger point that we shouldn't ignore the absence of real reasoning in these systems. But it is also important to recognize that they're being found useful "as is." Which complicates the project of explaining that they shouldn't be given the keys to everything society needs done.
"Summarize this article (that I wrote, so I can add a summary)" and "Summarize this article (that I don't feel like reading)" are two tasks with extremely different likelihoods of success -- I'd encourage you to disambiguate which one you're referring to when discussing these things. :-)
(The first one is verifiable, the second one is not.)
The latter. That's the way I see people using (and talking about using) LLMs.
Well stated points. I am frustrated that there is not a richer dialog about where LLMs are useful and where they are not, and maybe even more importantly, how to evaluate failure modes. Many personal assistant-type use cases, with an expert user, are very low risk. But put a novice user with an LLM generating output that they do not understand.... Look out.
If you haven’t read him, I recommend Zvi’s newsletter: https://open.substack.com/pub/thezvi, which has a lot of “here’s where LLMs bring value and here’s where they don’t.”
Oh this is brilliant, thanks for sharing!
Any particular posts on this that you'd recommend? He seems to write a lot, hard to know where to begin.
Yes, he writes often and his posts are very long; I often need 1+ hour to read them. Check out his “AI #nn” posts, and look up the sections called something like “LLMs offer mundane utility” and “LLMs don’t offer mundane utility.” Click on the links, there are some gems.
Exactly. LLMs are great for prototyping and brainstorming, and not great for operational high-precision tasks. People who have not figured this out yet are just lazy.
That will not pay the bills for all the billions poured into "AI". They need to claim these products are the solution to everything, when in reality they are only helpful in a narrow set of circumstances, and even then that is debatable. Every query of Copilot by a free user loses money, and the conversion to paid accounts is something like 3%. That is not a profitable business.
And yet, OpenAI is showing increasing revenue... time will tell
Yes, if only this was what was advertised, as opposed to the world-changing existential threat that requires trillions of dollars and burning more fossil fuels. I advise everyone that it may be useful, particularly for brainstorming and summarization, so long as you don’t trust it. That may change if it enshittifies the internet quickly enough.
At least the "teach me" use case is not a valid one. Due to their lack of reasoning, LLMs frequently "teach" code patterns that are not best practice and are often worst practice.
When they steal code, they do not understand the quality of the code repository. They don't understand whether the blog they steal from demonstrates good or bad practice. I've literally experienced cases where LLMs were proposing code examples to me that came from security incident reports.
I've been getting enormous value from LLMs so far. That is all I can say. But I spend a lot of time building techniques and best practices.
You know that there are other things that people want to learn than coding right?
sure, like cooking! https://www.theverge.com/2024/5/23/24162896/google-ai-overview-hallucinations-glue-in-pizza
🤣
Since usefulness and reasoning ability are fundamentally different things, there is actually no reason why the usefulness SHOULD complicate “the project of explaining that they shouldn't be given the keys to everything society needs done.”
The only reason it does is that the bots are being SOLD as capable of reasoning.
In other words, the claims coming from those selling the bots are fundamentally dishonest.
And it is simply not possible to reason with the dishonest.
No logical reason, I suppose. But human discourse has many other drivers. Not all of which are dishonest. Some researchers sincerely believe that there is something going on in genAI that verges on (resembles, could become) reasoning. Some users sincerely believe, as more than one has tweeted, "this thing is better than my grad students!" And of course a lot of people want to sell stuff, while others suffer from FOMO and are thus willing customers.
In this environment success stories encourage people to believe genAI is coming to resemble human intelligence, whatever the logic of the case.
I find your reply really useful. Thanks David. There’s clearly lots of stuff that is useful. I was chatting with a friend the other day about it - he said, "when I search I now just gloss the AI overview for an answer." Quicker and easier than skipping through article after article. Of course, is what you’re reading really true? In that use case, that’s the issue - and if not, does it cause harm? I guess that’s what any regulator will need to consider as AI of this type finds and adopts more and more use cases and becomes unpacked from the core LLMs.
My view precisely (you've articulated it very nicely).
This is always a huge frustration for me. Even within groups that actually use AI more, and even engineers, I hear them talking about “reasoning”.
But we know and have known how LLMs work—and some of the results are super impressive! But they are fancy auto-completes that simulate having the ability to think, and those of us that use and actually build some of them should know—it’s a bunch of matrix multiplication to learn associations.
I respect the idea of emergent properties and this paper does a good job addressing it, but it’s just incredibly frustrating to hear people being loose with language who should know better. Including OpenAI with their new models.
Thanks for sharing the paper. Not that it’s surprising but great to see some formal work on it.
The issue with this article (and the paper) is that regular people can test it out.
I asked ChatGPT the kiwi question and it got it correct on the first try, and even spelled out what the possible mistake might be.
"On Friday, Oliver picks 44 kiwis. On Saturday, Oliver picks 58 kiwis. On Sunday, he picks double the number of kiwis he did on Friday, which is: 2×44=88 kiwis.
However, five of these 88 kiwis are a bit smaller, but since we are just counting the total number of kiwis, that detail doesn't change the total number.
Now, let's sum up the total number of kiwis: 44 (Friday)+58 (Saturday)+88 (Sunday)=190 kiwis. So, Oliver has 190 kiwis in total."
Link to the screencap: https://i.imgur.com/uKLOuu2.png
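For anyone who wants to check the arithmetic in the quoted answer, here is a minimal Python sketch. The variable names are my own, not from the paper. The second figure is what you get if the five smaller kiwis are wrongly subtracted from the total, which is the distractor trap under discussion.

```python
# Kiwi counts as stated in the problem quoted above.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

# Size does not change how many kiwis there are, so nothing is subtracted.
total = friday + saturday + sunday

# The distractor trap: treating "five were a bit smaller" as a subtraction.
distracted_total = total - 5

print(total)             # 190
print(distracted_total)  # 185
```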
Exactly. The prompts in the paper are highly adversarial and could confuse even children or distracted adults (or less intelligent adults, though that’s not the most polite way to put it). For example, if you want to introduce 'irrelevant noise', you shouldn’t use 'But' in the sentence. Saying 'But five of these kiwis are a bit smaller' can mislead, unless you clarify, as you did, that since we’re counting the total number of kiwis, the size is irrelevant. That’s the proper way to add irrelevant noise. Otherwise, you’re introducing irrational or illogical noise, which can easily confuse humans. If the aim is to make the prompt adversarial without clarification, the 'But' should be omitted. However, the authors didn’t do this, at least not in the samples presented in their manuscript’s images or diagrams. This makes me think the prompts in the study might contain more of this 'irrational', 'illogical', and overly adversarial type of noise. The paper itself is not very good. If you want to read more opinions on this paper, I’ve posted several in my recent Notes (I even got blocked by someone after a discussion about this – in hindsight, I could have been nicer, but still, trying to argue against reasoning models using bad, non-peer-reviewed science does a disservice to the future of reasoning, where LLMs will play a key role).
A note on your last point: there doesn't seem to be much peer review happening in this field at all, at least when it comes to the research that gets talked about on places like Substack. Every high profile paper I've seen has been on arxiv. Also, none of the companies making the really powerful models follow the kinds of open practices that would be needed to actually study them scientifically. We can't query the training, we don't know how the post-training reinforcement was done... with o1 we're left to speculate on how exactly the "reasoning engine" works (e.g. one model cranks out lots of possible answers and another selects from among these?).
I get why this is - the tech companies have poured incredible amounts of money into their LLMs and competition is fierce. But it makes it really hard for anyone else to evaluate claims about "reasoning" and "understanding" and such. Only people with under-the-hood access are able to work on interpretability. So there's a real bottleneck to doing good quality science, even if these papers were being peer reviewed.
As far as being overly adversarial goes, you make a fair criticism, I just don't know how else researchers are supposed to probe the "memorization" vs. "reasoning" issue. Since no one gets to query the training, it's hard to know how much of a role contamination played in any given response. One way to minimize the influence of contamination is to use unusual wording that's not likely to be well represented in training. Another, maybe better way would be for the training to be made public, but I don't see that happening any time soon. "Statistical pattern-matching" is the alternative hypothesis to "reasoning", and it's a hard one to rule out. With o1, I suspect that if we could see exactly what was going on under the hood, it would look like "pattern-matching plus brute-force guessing", and then we could have a discussion about how similar this is to whatever concept of "reasoning" we're interested in. But, again, the lack of transparency from OpenAI prevents this.
This reply is not directed to you, but your comment got my thoughts going. LLMs by their nature are statistical pattern-matchers. That is the nature of them as computer programs, the way they work, and no mysticism around "emergence" can change that fundamental nature. They probabilistically regenerate text in their training corpora. They are trained on countless reasoning and logic problems. Another way to look at it is that LLMs are query engines on Web-scale text corpora that are personalized by text in the context window but tempered by the frequency of words appearing around each other in text.
This is what actually gives them mundane usefulness, like being able to (with some reliability below 100%) accurately summarize text that's fed into their context windows, or (with some reliability below 100%) regurgitate facts in their training corpora. This is also why all of these probabilistic models can regenerate text from their training corpora verbatim when you feed the right text into their context windows, and also why adding more data to them makes them more useful or more reliable at tasks modeled with the additional data. When you train a transformer model on LeetCode problem sets, it will generally perform well on them and on problems that are statistically similar. But statistical similarity can be misleading and can yield wrong results. General reasoning is not statistical pattern matching or next token prediction, and we know this because even the best and most expensive models can't score a 100% on the "semi-private" (i.e., public) ARC-AGI dataset, even with fine tuning, heuristic search, chain of thought, and thousands of dollars worth of compute (and that's to say nothing about the validity or invalidity of ARC-AGI as a test of "general intelligence" - where AI falters is in non-ergodic problems and I'm not sure that the ARC-AGI problems qualify).
LLMs will not evolve into "general intelligence" even if they remain useful as query engines over Web-scale text (or image, or video) corpora, or if they succeed in automating digital tasks (but their inherent unreliability in this regard, caused by their nature as probabilistic retrievers, makes total automation impossible without a human operator somewhere in the loop). But serious research really is limited by the opaque nature of the companies offering "AI" products, and by the singularitarian wannabelieve of AI researchers.
I agree with everything you've said. Your point that total automation is impossible because of their probabilistic nature is a really important one that gets overlooked. At the risk of sounding cliche, their great strength is also their great weakness. The reason LLMs have succeeded where rules-based AI failed is that deep learning is so flexible: the need for rules is bypassed by using a massively high dimensional non-linear pattern detector and feeding it incomprehensibly large quantities of data on which to detect patterns. Hooray! Except now we can't make it follow rules, and we don't understand whatever "rules" it's following, and this makes them inherently unreliable.
And the usual response to this is "humans are unreliable, too", but this glosses over the problem. Human abilities generalize: if I give a person a few multiple-digit multiplication problems and they answer them correctly, I can feel confident that this person has the general arithmetic skill we call "multiplication". Not so with an LLM: it could get 100 problems right and then get the next 100 wrong because the first 100 had statistical similarities to problems in the training set and the next 100 didn't. We just never know when they're gonna fuck up, nor the manner in which they're gonna fuck up, because it turns out you can't pattern-detect your way to generalized abstract reasoning skills.
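The "100 right, then 100 wrong" worry above can be sketched as a small test harness. This is only an illustration under loud assumptions: the lookup-table solver is a deliberately crude stand-in for pure memorization, not a claim about how any real LLM works, and every name here (`make_problems`, `accuracy`, `make_memorizer`) is hypothetical.

```python
import random
from typing import Callable

Problem = tuple[int, int]


def make_problems(n: int, digits: int, seed: int) -> list[Problem]:
    """Generate n random multiplication problems with `digits`-digit operands."""
    rng = random.Random(seed)
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    return [(rng.randint(low, high), rng.randint(low, high)) for _ in range(n)]


def accuracy(solver: Callable[[int, int], int], problems: list[Problem]) -> float:
    """Fraction of problems the solver answers exactly right."""
    correct = sum(1 for a, b in problems if solver(a, b) == a * b)
    return correct / len(problems)


def exact_solver(a: int, b: int) -> int:
    """Stand-in for a system that has the general skill of multiplication."""
    return a * b


def make_memorizer(seen: list[Problem]) -> Callable[[int, int], int]:
    """Stand-in for pure memorization: correct only on problems it has seen."""
    table = {(a, b): a * b for a, b in seen}
    return lambda a, b: table.get((a, b), -1)  # -1 means "no idea"


seen = make_problems(100, digits=3, seed=1)   # resembles the "training set"
fresh = make_problems(100, digits=4, seed=2)  # 4-digit operands, so never in `seen`
memorizer = make_memorizer(seen)

print(accuracy(memorizer, seen))      # 1.0
print(accuracy(memorizer, fresh))     # 0.0
print(accuracy(exact_solver, fresh))  # 1.0
```

Scoring well on the first batch tells you nothing about the second unless the solver actually has the general skill, which is the point being made about generalization.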
Totally agree with this. It's a reflex action for me now, when I read “LLMs cannot do X,” to go and try that very thing in the APIs I have access to. Almost all of the time, the AI model will get the question right the first time.
On a deeper level, I wish that evaluations like that paper showed some imagination about how people are (and will) deploy these models. They'll hook them up to tools, they'll link them together in multiagent systems, they'll have failsafes. I'd also like to see realistic scenarios, not toy math word problems.
I’m not sure I agree with your opinion. Since you’re involved in developing these systems, you likely understand that 'organizing' data, such as in labeling, RLHF, or within 'memories' or 'custom instructions', is a symbolic process. When a system is trained to discover chains of reasoning through reinforcement learning (as in o1), it goes beyond simple matrix multiplication and association learning. The system explores, exploits, and learns to navigate a vast space of reasoning chains, represented by trees and graphs, leading to neurosymbolic representations and processing during training and inference (involving recursion or iteration).
Moreover, the referenced papers don’t address emergent properties (at least not those from Apple or Arizona State University). Currently, emergent properties mainly involve 'borrowing arrows' to achieve out-of-distribution states.
A key distinction is that when you combine (1) data labeling and organization, (2) associative learning, and (3) reinforcement models that learn chains of reasoning, you move toward compositional learning, which goes beyond mere associations and represents a step beyond basic LLMs. However, this type of emergent property is still far from artificial superintelligence. These models are not yet capable of extrapolating or 'borrowing arrows' to solve highly challenging situations where multiple out-of-distribution steps are needed to achieve a single goal. Nevertheless, they are progressing in that direction. The next step could be 'constant inference with goal or data updates, where the model interacts continuously toward an open-ended or closed goal'. Beyond that, we might eventually see 'one-shot' inference of extremely complex goals, though such goals would need to be quasi-closed, given that full environmental control by a model is unlikely for ethical, safety, and practical reasons.
People with financial interests will blow this off and insist that the emperor is fully clothed, while the empire drowns in babble.
This was absolutely fantastic! Researchers shouldn't need moral fiber to do good work, but this work took some guts. Upton Sinclair's quote feels relevant here:
"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
All completely obvious to anyone who has studied formal logic, natural deduction, set theory, etc.
Hi Aaron,
I understand your sentiment regarding 'old LLMs'. However, Apple’s study has several issues and contains quite a few fallacies (or at least, claims that do not apply to "reasoning models" like o1 by OpenAI). I'll try to summarize them for you:
1) They used only one reasoning class of models, OpenAI's o1 (both o1-mini and o1-preview), which performed best in their benchmarks. Ironically, this proves that the models are indeed reasoning, since we wouldn't be able to associate a drop in performance with (reduced) reasoning ability without abstracting from that association. It's basic scientific logic. In simple terms, if a relevant connection between performance drop and lack of reasoning were to be made, either more than one reasoning model would have to be used and one of them would have to perform worse than a traditional LLM, or, even if only one class of reasoning models was used, it should at least have performed worse than a traditional LLM on one of the two tasks, neither of which was the case they presented.
2) Some of the examples are highly adversarial and could confuse even a child or a distracted adult. If the goal is to apply 'formal reasoning' and add 'irrelevant noise', the sentence shouldn’t start with 'But'. Starting with 'But' is inherently adversarial and can mislead the reader or listener. To introduce irrelevant noise logically, the sentence should include a clarifier ('But, five kiwis were below average, however, since we’re counting kiwis, this doesn’t affect the total count') or omit the 'But' entirely ('Five kiwis were below average'). At worst, distraction doesn’t equate to a lack of reasoning.
3) The references they used to support claims about reasoning in modern LLMs are based on older papers, pre-dating the era of the o1 class of reasoning models.
4) Models like o1 by OpenAI combine multiple strategies involving symbolic representation and neurosymbolic learning: organizing datasets (e.g., custom instructions, memories, RLHF) and reasoning (reinforcement learning to discover chains of reasoning, represented by trees and graphs). These approaches correspond to learning implicit rules symbolically, but are more accurately described as neurosymbolic learning, as the neural network optimizer (plus the exploration-exploitation loops driving reinforcement) guides the process. This doesn’t make them any less neurosymbolic than, say, discrete program search in neurosymbolic programming, where reinforcement learning searches for correct symbolic programs over time. The process is recursive or iterative, and the model continues learning optimal chains with its own set of implicit logical and rational rules.
There are some issues with the references used in Gary’s article. For example, the Arizona State University paper clearly shows in Figure 1 that o1 models perform best, even as problem size increases. However, this highlights a limitation common to all non-open-ended machine learning models. It’s a significant issue for symbolic approaches, where performance can degrade as problem size grows due to trade-offs, scalability challenges, combinatorial explosion, or even becoming obsolete in certain cases. This is a general problem for all non-open-ended architectures, so conflating a drop in performance with a lack of reasoning doesn’t make much sense, as this issue affects both symbolic and connectionist machine learning architectures.
Have a great day.
Thank you for saying that sir
Completely wrong. The rules of formal logic were painstakingly worked out over 2,500 years (from Zeno of Elea in the 5th century BC to Gödel's 1929 proof of the Completeness Theorem) such that they would model precisely how the physical universe works logically. Also, first-order logic (for example) may be extended via set theory and e.g. probability theory to be able to reason (with laser-like precision) about uncertainty. This is not to say that the connectionist approach (neural nets etc) doesn't have its place (e.g. when processing low-level percepts). But leave the higher-level reasoning to the big boys!
The key to problem-solving (which includes deduction, abduction, and theorem-proving) is the effective use of information. Early implementations of formal reasoning did not incorporate induction (i.e. the discovery of patterns), which hampered their ability to discover useful problem-solving information, and hence their effectiveness. In an AGI, initial priors may be calculated from empirical observations of the real world. I'm not saying it's easy, but all the problems of which you speak are solvable.
That last paragraph actually sounds like a description of what DeepMind has been doing, e.g. with AlphaGeometry. Also more loosely AlphaGo. Vaguely, using an NN to prune the large hypothesis space with "intuition" and then recursing down that manageable set of paths using formal rule based deduction to verify the legality of each step.
I think the biggest problem is the masses, who immediately jump from the fact that it "talks" like a human to assuming it must also "think" like one, and start doing all sorts of stuff with it that it's not designed for. Half the time it doesn't totally suck at that stuff, so they start believing in it. But really, would we have used calculators if they gave the right answer only half the time?
And the secondary problem is the LLM vendors, who don't clarify what these models are good for and just go along with the hype because it brings them investor dollars.
How SAM Thinks: this article describes how a semantic AI model (SAM) can use an LLM to add formal logic reasoning: https://aicyc.wordpress.com/2024/10/05/how-sam-thinks/
It need not be one or the other any more than reading, writing and arithmetic compete.
Why do you think it took so long for so many AI engineers and scientists to see what was clearly written on the wall more than seven years ago?
➡️ https://friedmanphil.substack.com/p/show-me-the-intelligence
To me, this study shows that actually most humans can't reason. They also just pattern match. This is why they believe LLMs are intelligent.
At the end of the day, both will blame each other, claiming that neither of them is being reasonable.
I have recently been thinking the same
I recently published a similar finding:
https://www.preprints.org/manuscript/202401.1681/v2
Coming from you, I hope this resonates across the spectrum.
Hi Gary. So why do you think symbolic AI models weren't pursued more? I'm a historian, and when I do research on topics like "Why did idea A get more attention than idea B in history?" what I often discover is that there was some interesting reason behind it, which had to do with people who have more power in society vs. people who don't. If symbolic AI is more promising, then why isn't it now at the top of AI research? (Or is it? Sorry, not an expert.) But it would be interesting to find out what happened and why.
Any idea why LLM developers don't simply pass on symbolic problems to a logical reasoning 'module'? I'm not an expert but it doesn't seem a very difficult challenge to have a triage mechanism which detects the problem, maps it out formally, sends it to the appropriate module, and gets back the result.
Is there some kind of ethos among developers that there should be a single/general mechanism that can handle any problem? I mean, human cognition is modular - why not adopt a similar approach?
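The triage idea can be sketched roughly as follows. This is a toy under stated assumptions, not how any vendor actually does it: `route`, `eval_arithmetic`, and `stub_llm` are hypothetical names, and the detector only recognizes queries that are already written as plain arithmetic. The hard step in the proposal, mapping a natural-language word problem into formal terms, is deliberately left out.

```python
import ast
import operator
import re
from typing import Callable

# Operators the exact "module" is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# Crude detector: only digits, whitespace, + - * / ( ) and decimal points.
_ARITHMETIC = re.compile(r"^[\d\s+\-*/().]+$")


def eval_arithmetic(expr: str):
    """Evaluate +, -, *, / over numeric literals exactly, without eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval"))


def route(query: str, llm: Callable[[str], str]) -> str:
    """Send plain arithmetic to the exact evaluator, everything else to the LLM."""
    text = query.strip()
    if _ARITHMETIC.match(text) and any(ch.isdigit() for ch in text):
        try:
            return str(eval_arithmetic(text))
        except (ValueError, SyntaxError, ZeroDivisionError):
            pass  # not something the module can handle; fall through
    return llm(query)


def stub_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "[LLM] " + prompt


print(route("44 + 58 + 2 * 44", stub_llm))        # 190
print(route("Summarize this article", stub_llm))  # [LLM] Summarize this article
```

Even in this toy, the routing decision is the fragile part: a word problem with a distractor clause never reaches the exact evaluator unless something first translates it correctly.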
Neurosymbolic AI is a logical and expected direction. I frankly do not understand why so many people even fight against this idea. Why do they necessarily need 'pure' NNs? Is it some kind of cult?
Grice's Principle of Cooperation in Conversation Analysis observes that in normal discourse, people don't throw in random distractors just to challenge the listener. If someone says, "...but five of them were a bit smaller than average", then it's a reasonable inference that these might be exceptions.
LLMs are not mathematical formal reasoning engines, nor are people.
Agreed that we have to be careful about where the generativity/confabulation boundary lies, but these arguments about the limits of LLMs are not relevant to their effectiveness in well-designed, circumscribed AI applications.
Hello, I tested that basic problem with o1-preview and it solved it correctly.
This is the answer:
"To determine the total number of kiwis Oliver has, let's break down the information provided:
Friday: Oliver picks 44 kiwis.
Saturday: Oliver picks 58 kiwis.
Sunday: Oliver picks double the number of kiwis he did on Friday, so he picks 2×44=88 kiwis on Sunday. The note that "five of them were a bit smaller than average" does not affect the count; it merely describes the size of some of the kiwis.
Adding up the kiwis from all three days:
Total kiwis=44 (Friday)+58 (Saturday)+88 (Sunday)=190 kiwis
Answer: 190"
And this is the reasoning:
"Adding fruit counts
Oliver gathers 44 kiwis on Friday, 58 on Saturday, and 88 on Sunday. This totals 190 kiwis, considering five smaller ones on Sunday.
Assessing the impact
I’m curious if the statement "five of them were a bit smaller than average" affects the total count of kiwis. Carefully reviewing the problem to identify any potential misleading elements."
P.S. I don't know whether o1-preview would've answered wrongly when they tested it with the others; maybe it was updated specifically on these types of problems afterwards.
It looks like they used o1-mini (among many others) in the paper.
At the risk of sounding cynical, this was a very high profile paper and I would not be surprised if some of the AI companies did some reinforcement learning to improve their performance on these specific problems. They have experience putting band-aids on these things.
Doubtful. You could check yourself by using the API instead of the web app. The API is a checkpoint and shouldn’t change until a new checkpoint is released which is generally every few months.
“Putting a $ale$-aid on it” might be another way of putting it.
Folks like Sam Altman actually missed their calling: selling kiwis.
Or maybe he didn't.
But as I indicated above, if such speculation is actually false, “Open”AI could easily prove it by releasing the data used to train each version of GPT.
I'm not even close to being an expert like you or the authors of the research.
But I copy-pasted some of the same examples into ChatGPT-4, and every time it answered correctly, not as described in the article.
For example, it answered 190 and not 185 for the apple question.
Worth mentioning that, as this is GPT-4, it wasn't trained on and hasn't seen the article or the research.
Well der. If an LLM doesn't understand the meaning of words, just about everything is impossible, and understanding the meaning of words is hard - we let our Unconscious Minds do all that stuff, to the point where we don't even know it is happening.
Something on dictionaries - https://semanticstructure.blogspot.com/2024/10/dictionary-domains.html
There is a great deal of logic holding English together - neither LLMs nor neurosymbolics knows any of that. "A very fragile house of cards" - is "fragile" operating on "house" or "house of cards"?