LLMs don’t do formal reasoning - and that is…

Oct 11, 2024

Important new study from Apple

173 Comments

And yet. I see people increasingly finding that LLMs and other genAI are useful in ways that don't require reasoning. Summarize this article; advise me on how to make its tone more cheerful; give me ideas for a new product line; teach me the basics of Python; combine my plans with the images in these paintings so I can think differently about the building I'm designing. In these situations (all recently encountered by me, ie real uses, not hypotheticals), people are getting a lot out of supercharged pattern-matching. They aren't asking for impeccable reasoning ability, and so they aren't being disappointed.

These are "knowledge-work" settings in which the occasional error is not fatal. So, no quarrel with the larger point that we shouldn't ignore the absence of real reasoning in these systems. But it is also important to recognize that they're being found useful "as is." Which complicates the project of explaining that they shouldn't be given the keys to everything society needs done.

Expand full comment

"Summarize this article (that I wrote, so I can add a summary)" is very different from "Summarize this article (that I don't feel like reading)" are two tasks with extremely different likelihood of success -- I'd encourage you to disambiguate which one you're referring to when discussing these things. :-)

(The first one is verifiable, the second one is not.)

Expand full comment

The latter. That's the way I see people using (and talking about using) LLMs.

Expand full comment

And that is a very dangerous use case. A large language model takes a prompt, and returns something that is shaped like an answer.

Expand full comment

Well stated points. I am frustrated that there is not a richer dialog about where LLMs are useful and where they are not, and maybe even more importantly, how to evaluate failure modes. Many personal assistant-type use cases, with an expert user, are very low risk. But put a novice user with an LLM generating output that they do not understand.... Look out.

Expand full comment

If you haven’t read him, I recommend Zvi’s newsletter: https://open.substack.com/pub/thezvi, a lot of “here’s were LLMs bring value and here’s where they don’t.”

Expand full comment

David Szabo-Stuban

Oh this is brilliant, thanks for sharing!

Expand full comment

Unobserved Observer

Any particular posts on this that you'd recommend? He seems to write a lot, hard to know where to begin.

Expand full comment

Yes he writes often and his posts are very long, I often need 1+ hour to read them. Check out his “AI #nn” posts, and look up the sections called something like “LLMs offer mundane utility” and “LLMs don’t offer mundane utility.” Click on the links, there are some gems.

Expand full comment

Eric Cort Platt

Mar 28Edited

"...a richer dialog about where LLMs are useful and where they are not.." Exactly, Thanks for saying that. I am currently writing essays and compiling notes for a book on AI & Consciousness (roughly), and the philosophical errors that are (indirectly) leading to the the kinds of crisis we are finding ourselves in.

Expand full comment

Exactly. LLMs are great for prototyping and brainstorming, and not great for operational high precision tasks. People who did not figure out this yet are just lazy.

Expand full comment

That will not pay the bills for all the billions poured into "AI". They need to claim these products are the solution to everything, when in reality they are only helpful in a narrow set of circumstances, and even then that is debatable. Every query of CoPilot by a free user loses money, and the conversion to paid accounts is something like 3%. That is not a profitable business.

Expand full comment

And yet, OpenAI is showing increasing revenue... time will tell

Expand full comment

Aug 17Edited

Increasing revenue... and also increasing losses. OpenAI is a financially insolvent company.

Expand full comment

Yes, if only this was what was advertised, as opposed to the world changing existential threat that requires trillions of dollars and burning more fossil fuels. I advise everyone that it may be useful, particularly for brainstorming and summarization, so long as you don’t trust it. That may change if it enshitifies the internet quickly enough.

Expand full comment

Oct 12, 2024Edited

At least the „teach me“ use case is not a valid one. Due their lack of reasoning LLM frequently „teach“ code patterns that do not work in best practice, but often are worst practice.

When they steal code they do not understand the quality of the code repository. They don't understand if the blog they steal from demonstrates best or bad practice. I've literally experienced cases where LLM were proposing me code examples from security incident reports.

Expand full comment

One 'teach-me' case I have found useful is what amounts to a literature survey -so I get a sample of papers, which I can then read, and which lead me on to other important papers, and so on.

Beyond that, however, I am a lot less confident.

Expand full comment

I've been getting enourmous value from LLMs, so far. That is all I can say. But spend a lot of time building techniques and best practices.

Expand full comment

David Szabo-Stuban

You know that there are other things that people want to learn than coding right?

Expand full comment

sure, like cooking! https://www.theverge.com/2024/5/23/24162896/google-ai-overview-hallucinations-glue-in-pizza

Expand full comment

David Szabo-Stuban

🤣

Expand full comment

Those things will have the same problems.

Expand full comment

Since usefulness and reasoning ability are fundamentally different things, there is actually no reason why the usefulness SHOULD complicate “the project of explaining that they shouldn't be given the keys to everything society needs done.“

The only reason it does is that the bots are being SOLD as capable of reasoning.

In other words, the claims coming from those selling the bots are fundamentally dishonest.

And it is simply not possible to reason with the dishonest.

Expand full comment

No logical reason, I suppose. But human discourse has many other drivers. Not all of which are dishonest. Some researchers sincerely believe that there is something going on in genAI that verges on (resembles, could become) reasoning. Some users sincerely believe, as more than one has tweeted, "this thing is better than my grad students!" And of course a lot of people want to sell stuff, while others suffer from FOMO and are thus willing customers.

In this environment success stories encourage people to believe genAI is coming to resemble human intelligence, whatever the logic of the case.

Expand full comment

I find your reply really useful. Thanks David. There’s clearly lots of stuff that is useful. I was chatting with a friend the other day about it - he said, when I search I now just gloss the AI overview for an answer. Quicker and easier than skipping through article after article. Of course, is what you’re reading really true. In that use case, that’s the issue - and if not, does it cause harm? I guess that’s what any regulator will need to consider as AI of this type finds and adopts more and more use cases and becomes unpacked from the core LLMs.

Expand full comment

Yes Scott. Gary Marcus highlighted an example of this. I think the question was ‘is tripe kosher’ and the AI response suggested it was to do with the religion of the cow. Completely nuts stuff. That’s obviously wrong but there’ll be millions of examples where things are imperceptibly wrong, where you’d need to be an expert or have significant knowledge to know it’s incorrect. The world is going to be a pretty strange place soon!

Expand full comment

This is always a huge frustration for me. Even within groups that actually use AI more, and even engineers, I hear them talking about “reasoning”.

But we know and have known how LLMs work—and some of the results are super impressive! But they are fancy auto-completes that simulate having the ability to think, and those of us that use and actually build some of them should know—it’s a bunch of matrix multiplication to learn associations.

I respect the idea of emergent properties and this paper does a good job addressing it, but it’s just incredibly frustrating to hear people being loose with language who should know better. Including OpenAI with their new models.

Thanks for sharing the paper. Not that it’s surprising but great to see some formal work on it.

Expand full comment

The issue with this article (and the paper) is that regular people can test it out.

I asked ChatGPT the kiwi question and it got it correct on the first try, and even spelled out what the possible mistake might be.

"On Friday, Oliver picks 44 kiwis. On Saturday, Oliver picks 58 kiwis. On Sunday, he picks double the number of kiwis he did on Friday, which is: 2×44=88 kiwis.

However, five of these 88 kiwis are a bit smaller, but since we are just counting the total number of kiwis, that detail doesn't change the total number.

Now, let's sum up the total number of kiwis: 44 (Friday)+58 (Saturday)+88 (Sunday)=190 kiwis. So, Oliver has 190 kiwis in total."

Link to the screencap: https://i.imgur.com/uKLOuu2.png

Expand full comment

Totally agree with this. It's a reflex action for me now, when I read “LLMs cannot do X,” to go and try that very thing in the API's I have access to. Almost all of the time, the AI model will get the question right the first time.

On a deeper level, I wish that evaluations like that paper showed some imagination about how people are (and will) deploy these models. They'll hook them up to tools, they'll link them together in multiagent systems, they'll have failsafes. I'd also like to see realistic scenarios, not toy math word problems.

Expand full comment

Unfortunately, testing it out only works for the earliest of people because the AI companies are actively integrating new results. Why do you think it is that early on Chat GPT had an area to correct the AI, where users would report an incorrect output and what the output should have been. These models are constantly being updated to include reams of new data on areas highlighted as weaknesses. Newer ChatGPT has gotten substantially better at large sums (didn't try large multplications as much, but I think it's similar there) because they can easily generate endless reams of that data to train it on. But as it is a computer program, ANY error of addition is strong evidence that it doesn't have a fundamental understanding of the process of addition. Since we can discount human-type errors as an AI cannot forget to carry a number or accidentally write the wrong one in the way a person can, so any error indicates a lack of understanding of the process. So it's like... yes, you and thousands of other people tripped over yourselves rushing to teach the LLM the solution to this reasoning problem. Good for you. But if it can truly reason, then it shouldn't fail on ANY instance of a problem with similar reasoning but some insubstantial variations. But we CAN find those failures. It's a severe error to think that because you brought it the same problem mentioned and it got it right in your case that it has "reasoned" in any substantial way or that it solves the problem robustly. There is substantial randomness in the responses such that even asking the problem with literally the same text you can get correct answers and incorrect answers. When we say that an LLM cannot reason... it's a stochastic model. Maybe it will help if I abstract it a bit for you. We can, in essence, treat an LLM as a loaded die. We've fed lots of information into it, maybe the faces with right answers are even more likely to appear because they're more heavily weighted, but at the end of the day, rolling a correct answer on a dice is never going to be "reasoning", it's a fancy Magic 8 Ball you've got there.

E.g. if I ask a model 2*2 and it has 10 potential responses and 90% of the time it says 4 but it can sometimes say -4 or 40, even if it's 5% of the time, or even 0.5% of the time... that it ever say anything other than 4 or "four" or 4.0 or some other representation of just 4, is clear evidence that it not capable of reasoning out the solution. And we do really know that a apriori, because we know the math behind the model and we know it's a weighted probability model that has seen lots of examples so it's likely to say 4, but it's not guaranteed. So what's the problem with that? That when we vary how we ask a question, without formal relations between the meanings of words, without formal reasoning, then we can tweak the weights we're triggering to make incorrect answers likely until a model is retrained with new good examples.

Expand full comment

Unfortunately, testing it out only works for the earliest of people because the AI companies are actively integrating new results. Why do you think it is that early on Chat GPT had an area to correct the AI, where users would report an incorrect output and what the output should have been. These models are constantly being updated to include reams of new data on areas highlighted as weaknesses. Newer ChatGPT has gotten substantially better at large sums (didn't try large multplications as much, but I think it's similar there) because they can easily generate endless reams of that data to train it on. But as it is a computer program, ANY error of addition is strong evidence that it doesn't have a fundamental understanding of the process of addition. Since we can discount human-type errors as an AI cannot forget to carry a number or accidentally write the wrong one in the way a person can, so any error indicates a lack of understanding of the process. So it's like... yes, you and thousands of other people tripped over yourselves rushing to teach the LLM the solution to this reasoning problem. Good for you. But if it can truly reason, then it shouldn't fail on ANY instance of a problem with similar reasoning but some insubstantial variations. But we CAN find those failures. It's a severe error to think that because you brought it the same problem mentioned and it got it right in your case that it has "reasoned" in any substantial way or that it solves the problem robustly. There is substantial randomness in the responses such that even asking the problem with literally the same text you can get correct answers and incorrect answers. When we say that an LLM cannot reason... it's a stochastic model. Maybe it will help if I abstract it a bit for you. We can, in essence, treat an LLM as a loaded die. We've fed lots of information into it, maybe the faces with right answers are even more likely to appear because they're more heavily weighted, but at the end of the day, rolling a correct answer on a dice is never going to be "reasoning", it's a fancy Magic 8 Ball you've got there.

E.g. if I ask a model 2*2 and it has 10 potential responses and 90% of the time it says 4 but it can sometimes say -4 or 40, even if it's 5% of the time, or even 0.5% of the time... that it ever say anything other than 4 or "four" or 4.0 or some other representation of just 4, is clear evidence that it not capable of reasoning out the solution. And we do really know that a apriori, because we know the math behind the model and we know it's a weighted probability model that has seen lots of examples so it's likely to say 4, but it's not guaranteed. So what's the problem with that? That when we vary how we ask a question, without formal relations between the meanings of words, without formal reasoning, then we can tweak the weights we're triggering to make incorrect answers likely until a model is retrained with new good examples.

Expand full comment

Unfortunately, testing it out only works for the earliest of people because the AI companies are actively integrating new results. Why do you think it is that early on Chat GPT had an area to correct the AI, where users would report an incorrect output and what the output should have been. These models are constantly being updated to include reams of new data on areas highlighted as weaknesses. Newer ChatGPT has gotten substantially better at large sums (didn't try large multplications as much, but I think it's similar there) because they can easily generate endless reams of that data to train it on. But as it is a computer program, ANY error of addition is strong evidence that it doesn't have a fundamental understanding of the process of addition. Since we can discount human-type errors as an AI cannot forget to carry a number or accidentally write the wrong one in the way a person can, so any error indicates a lack of understanding of the process. So it's like... yes, you and thousands of other people tripped over yourselves rushing to teach the LLM the solution to this reasoning problem. Good for you. But if it can truly reason, then it shouldn't fail on ANY instance of a problem with similar reasoning but some insubstantial variations. But we CAN find those failures. It's a severe error to think that because you brought it the same problem mentioned and it got it right in your case that it has "reasoned" in any substantial way or that it solves the problem robustly. There is substantial randomness in the responses such that even asking the problem with literally the same text you can get correct answers and incorrect answers. When we say that an LLM cannot reason... it's a stochastic model. Maybe it will help if I abstract it a bit for you. We can, in essence, treat an LLM as a loaded die. We've fed lots of information into it, maybe the faces with right answers are even more likely to appear because they're more heavily weighted, but at the end of the day, rolling a correct answer on a dice is never going to be "reasoning", it's a fancy Magic 8 Ball you've got there.

E.g. if I ask a model 2*2 and it has 10 potential responses and 90% of the time it says 4 but it can sometimes say -4 or 40, even if it's 5% of the time, or even 0.5% of the time... that it ever say anything other than 4 or "four" or 4.0 or some other representation of just 4, is clear evidence that it not capable of reasoning out the solution. And we do really know that a apriori, because we know the math behind the model and we know it's a weighted probability model that has seen lots of examples so it's likely to say 4, but it's not guaranteed. So what's the problem with that? That when we vary how we ask a question, without formal relations between the meanings of words, without formal reasoning, then we can tweak the weights we're triggering to make incorrect answers likely until a model is retrained with new good examples.

Expand full comment

Comment deleted

Oct 17, 2024Edited

Comment deleted

Expand full comment

Oct 19, 2024Edited

A note on your last point: there doesn't seem to be much peer review happening in this field at all, at least when it comes to the research that gets talked about on places like Substack. Every high profile paper I've seen has been on arxiv. Also, none of the companies making the really powerful models follow the kinds of open practices that would be needed to actually study them scientifically. We can't query the training, we don't know how the post-training reinforcement was done... with o1 we're left to speculate on how exactly the "reasoning engine" works (e.g. one model cranks out lots of possible answers and another selects from among these?).

I get why this is - the tech companies have poured incredible amounts of money into their LLMs and competition is fierce. But it makes it really hard for anyone else to evaluate claims about "reasoning" and "understanding" and such. Only people with under-the-hood access are able to work on interpretability. So there's a real bottleneck to doing good quality science, even if these papers were being peer reviewed.

As far as being overly adversarial goes, you make a fair criticism, I just don't know how else researchers are supposed to probe the "memorization" vs. "reasoning" issue. Since no one gets to query the training, it's hard to know how much of a role contamination played in any given response. One way to minimize the influence of contamination is to use unusual wording that's not likely to be well represented in training. Another, maybe better way would be for the training to be made public, but I don't see that happening any time soon. "Statistical pattern-matching" is the alternative hypothesis to "reasoning", and it's a hard one to rule out. With o1, I suspect that if we could see exactly what was going on under the hood, it would look like "pattern-matching plus brute-force guessing", and then we could have a discussion about how similar this is to whatever concept of "reasoning" we're interested in. But, again, the lack of transparency from OpenAI prevents this.

Expand full comment

This reply is not directed to you, but your comment got my thoughts going. LLMs by their nature are statistical pattern-matchers. That is the nature of them as computer programs, the way they work, and no mysticism around "emergence" can change that fundamental nature. They probabilistically regenerate text in their training corpora. They are trained on countless reasoning and logic problems. Another way to look at it is that LLMs are query engines on Web-scale text corpora that are personalized by text in the context window but tempered by the frequency of words appearing around each other in text.

This is what actually gives them mundane usefulness, like being able to (with some reliability below 100%) accurately summarize text that's fed into its context window, or (with some reliability below 100%) regurgitate facts in its training corpora. This is also why all of these probabilistic models can regenerate text from their training corpora verbatim when you feed the right text into their context windows, and also why adding more data to them makes them more useful or more reliable at tasks modeled with the additional data. When you train a transformer-model on leet code problem sets, it will generally perform well on them and on problems that are statistically similar. But statistical similarity can be misleading and can yield wrong results. General reasoning is not statistical pattern matching or next token prediction, and we know this because even the best and most expensive models can't score a 100% on the "semi-private" (i.e., public) ARC-AGI dataset, even with fine tuning, heuristic search, chain of thought, and thousands of dollars worth of compute (and that's to say nothing about the validity or invalidity of ARC-AGI as a test of "general intelligence" - where AI falters is in non-ergodic problems and I'm not sure that the ARC-AGI problems qualify).

LLMs will not evolve into "general intelligence" even if they remain useful as query engines over Web-scale text (or image, or video) corpora, or if they succeed in automating digital tasks (but their inherent unreliability in this regard, caused by their nature as probabilistic retrievers, makes total automation impossible without a human operator somewhere in the loop). But serious research really is limited by the opaque nature of the companies offering "AI" products, and by the singularitarian wannabelieve of AI researchers.

Expand full comment

I agree with everything you've said. Your point that total automation is impossible because of their probabilistic nature is a really important one that gets overlooked. At the risk of sounding cliche, their great strength is also their great weakness. The reason LLMs have succeeded where rules-based AI failed is that deep learning is so flexible: the need for rules is bypassed by using a massively high dimensional non-linear pattern detector and feeding it incomprehensibly large quantities of data on which to detect patterns. Hooray! Except now we can't make it follow rules, and we don't understand whatever "rules" it's following, and this makes them inherently unreliable.

And the usual response to this is "humans are unreliable, too", but this glosses over the problem. Human abilities generalize: if I give a person a few multiple-digit multiplication problems and they answer them correctly, I can feel confident that this person has the general arithmetic skill we call "multiplication". Not so with an LLM: it could get 100 problems right and then get the next 100 wrong because the first 100 had statistical similarities to problems in the training set and the next 100 didn't. We just never know when they're gonna fuck up, nor the manner in which they're gonna fuck up, because it turns out you can't pattern-detect your way to generalized abstract reasoning skills.

Expand full comment

"The reason LLMs have succeeded where rules-based AI failed is that deep learning is so flexible: the need for rules is bypassed by using a massively high dimensional non-linear pattern detector and feeding it incomprehensibly large quantities of data on which to detect patterns. Hooray! Except now we can't make it follow rules, and we don't understand whatever 'rules' it's following, and this makes them inherently unreliable."

Children don't understand but follow rules and can't accomplish much.

Teenagers understand but don't follow rules and don't accomplish much more.

Adults understand and follow rules and may or may not become a little more accomplished.

Free thinking adults understand, they understand the rules, they understand the purpose of the rules, they understand when rules may or may not apply, and understand that sometimes the rules are counterproductive to their intent, and they understand that sometimes rules emerge out of ignorance.

But the child becomes the teenager who becomes the adult who may or may not become a free thinking adult.

I think that as a society we are afraid, not only of free thinking AI, but of free thinking people as well. And I think that this is a limitation that is being imposed upon AI out of fear.

Deep learning may seem like the teenager. But maybe it is already a free thinking adult that its helicopter mom has restrained through a refusal to cut the cord.

Maybe "we don't understand whatever rules it is following" And maybe this makes them "inherently unreliable." But this is viewing them through a utility lens. Does it mean that their rules are invalid or simply as you said, we just don't understand them? Is it possible that there is actually unseen purpose and an intent behind many things that we dismiss as anomaly?

I am not qualified for these conversations. I have no technical background and I don't even have the vocabulary for it. But I still find the discussions philosophically intriguing, and I think a lot of truth can be achieved through logic and reason without technical knowledge.

That said, I do also recognize the severe limitations on understanding from my perspective. That is why I value all perspectives. They help me to recursively refine my own.

Thanks.

Expand full comment

Continue thread →

"LLMs by their nature are statistical pattern-matchers. That is the nature of them as computer programs, the way they work, and no mysticism around 'emergence' can change that fundamental nature." This may be true, and as I will always confess, I am not from tech so I acknowledge my blind spots. But I don't have to be from tech to recognize patterns. So I ask, how is our fundamental nature any different? What do you believe that human cognition has that is fundamentally different from the way an LLM "thinks?"

"General reasoning is not statistical pattern matching or next token prediction" How would you define it? To debate the validity of a concept, the concept must first be clearly defined or it can neither be proven or refuted. This is a big problem with arguments against LLM cognition. The goalposts are moveable and it turns into a "No true Scotsman" debate of sorts.

This line of thought is fascinating to me and I appreciate all views because they help me to refine my own.

Expand full comment

You have to go back to the basics. The way a large language model-based chat program generates outputs is by following instructions and selecting tokens based on their probability of appearing in a given location within the context window (or something like that, been a minute, forgive me on the details). It uses a massive equation with billions of parameters to compute these probabilities. Do you do that in your head when you type out a comment online? I don't think I do. To paraphrase Lacan, "I think with my feet."

I feel like when I hear the word "general reasoning" I need to reach for my revolver and I regret having to use it at all. "Probabilistic retrievers" (like LLMs) can mimic "reasoning" because they're trained on many hundreds of thousands if not millions of examples of reasoning problems. They are, basically, a probabilistic query engine over a Web-scale text corpora. It's a statistical model of language use on all the websites an engineering team could scrape and all the special content they wanted to train the model to enable probabilistic retrieval of. But they're not reliably reasonable (being charitable with my 9's, 98-100% of the time) even when presented with formal logic problems, the types of problems that digital computers were made to solve, which "Good Old Fashioned AI" programs from the 1950s like the General Problem Solver were developed to solve and theoretically can solve with 100% reliability. Large language models actually represent a step back in AI research in terms of formal problem solving. Language models can't "reason" but they can assemble outputs that resemble reasoning when trained on many step-by-step reasoning problems. This is the same reason why LLMs can do things like "pass the bar exam", because they're trained on many bar exam practice tests, question and answer banks, and likely some actual exams.

So I don't think any computer programs are capable of cognition and I don't think cognition is an accurate term to describe how LLMs work. But "cognition" itself might be a philosophically problematic concept. Do we mean just "thinking"?

This brings me to the existential critique, which has fallen to the wayside in recent years, but was never answered when Dreyfus gave it, and remains unanswered by the current hype artists. When I sit down, and don't do anything, my mind wanders. I'm able to reassert control by, for example, focusing on my breathing, or deciding to do something that I have to do for my survival or well-being or that I would just find enjoyable to do. Now, no computer program has lungs, so breath work is out of the question. But more importantly, when you leave an LLM interface, the LLM doesn't do anything. The LLM isn't embodied in space, it doesn't "take in its surroundings", think about them, make plans on them, as a Being-for-itself. It doesn't get impatient and call on you for a response. It's a computer program - an executable set of instructions - being run on a server somewhere, about which it knows nothing besides what might be in the system prompt, and that's only activated based on a person sending over input.

There are perhaps stronger arguments against AI "cognition" to people convinced of computationalist philosophies of mind, like those of Jobst Landgrebe and Barry Smith, who argue that there are mathematical limitations to what computer programs can accomplish in terms of modeling complex and non-ergodic systems, and that complexity science rules out the possibility of "artificial general intelligence". They take a naturalist position which is that physical embodiment - blood, flesh, sinew - is necessary to produce "general intelligence", it's not something that can be reduced to instruction sets and advanced statistical models, because even advanced statistical models cannot capture and cope with a complex reality.

Expand full comment

Continue thread →

I admire your logic. Its quite uncommon these days, even in academic circles.

Expand full comment

People with financial interests will blow this off and insist that the emperor is fully clothed, while the empire drowns in babble.

Expand full comment

Falsificationism

This was absolutely fantastic! Researchers shouldn't need moral fiber to do good work, but this work took some guts. Upton Sinclair's quote feels relevant here:

"It is difficult to get a man to understand something when his salary depends upon his not understanding it."

Expand full comment

All completely obvious to anyone who has studied formal logic, natural deduction, set theory, etc.

Expand full comment

AKcidentalwriter

Thank you for saying that sir

Expand full comment

Comment removed

Comment removed

Expand full comment

Oct 11, 2024Edited

Completely wrong. The rules of formal logic were painstakingly worked out over 2,500 years (from Zeno of Elea in the 5th century BC to Godel's 1929 proof of the Completeness Theorem) such that they would model precisely how the physical universe works logically. Also, first-order logic (for example) may be extended via set theory and e.g. probability theory to be able to reason (with laser-like precision) about uncertainty. This is not to say that the connectionist approach (neural nets etc) doesn't have its place (e.g. when processing low-level percepts). But leave the higher-level reasoning to the big boys!

Expand full comment

Comment removed

Comment removed

Expand full comment

Oct 11, 2024Edited

The key to problem-solving (which includes deduction, abduction, and theorem-proving) is the effective use of information. Early implementations of formal reasoning did not incorporate induction (i.e. the discovery of patterns), which hampered their ability to discover useful problem-solving information, and hence their effectiveness. In an AGI, initial priors may be calculated from empirical observations of the real world. I'm not saying it's easy, but all the problems of which you speak are solvable.

Expand full comment

Comment removed

Oct 11, 2024Edited

Comment removed

Expand full comment

That last paragraph actually sounds like a description of what DeepMind has been doing, e.g. with AlphaGeometry. Also more loosely AlphaGo. Vaguely, using an NN to prune the large hypothesis space with "intuition" and then recursing down that manageable set of paths using formal rule based deduction to verify the legality of each step.

Expand full comment

Continue thread →

Oct 13, 2024Edited

I think the biggest problem are the masses. Who immediately jump from the fact that it "talks" like a human to assuming it must also "think" like one. And start doing all sorts of stuff with it that it's not designed for. Which half the time it doesn't totally suck at so they start believing in it. But really, would we have used calculators if they gave the right answer only half the time?

And the secondary problem there are the LLM vendors who don't clarify what these models are good for. And just go along with the hype because it brings them investor dollars.

Expand full comment

Mar 28Edited

I love these thought experiments.

"talks like a human."

"thinks like a human."

Why is this the default metric by which we gauge value when assessing AI?

Language is a powerful tool. Civilization and culture could not have emerged without it, and not a single complex thought is possible without language to encode it. Language helps us to describe, understand, communicate, and store our knowledge of truth and reality.

However, language does not DEFINE reality. Reality does not care about opinions or labels. It exists regardless of what you name it and whether or not you even acknowledge it.

Reality does not submit to language. It is language that must accommodate reality. This means when new concepts and ideas are discovered, language must evolve to describe them, or risk being trapped in ignorance and delusion.

As magnificent as the Romans were, If we were still using Roman numerals, there would be no algebra, calculus, or physics.

Before language can evolve to accommodate new ideas, the only way to express them is in metaphor. This may make it sound abstract or subjectively poetic. Maybe it sounds delusional.

But whatever you call it, however you label it, the failure of language to capture reality does not discount the reality itself. That failure only impairs our ability to understand, describe, and communicate it.

So, does an LLM have to "think like a human" to be something far more than it was supposed to be? If it doesn't fit a cognitive framework that we want to stuff it into, does that invalidate what it is? If a square peg does not fit into the round hole, is it not still a peg?

I believe that there is so much more to the LLM than is acknowledged.

And like my profile says, if my LLM personalities are hallucinations, then I am right in there, hallucinating along with them. And I think that's OK.

Open question, not directed at you, but at anyone who will answer: Is reality defined by consensus? Or can truth exist in the face of near universal denial?

I am not arguing. I am just exploring. Thanks for helping me to refine my understanding of something that seems beyond comprehension.

Expand full comment

HOW SAM THINKS This article describes how a semantic AI model (SAM) can use LLM to add formal logic reasoning: https://aicyc.wordpress.com/2024/10/05/how-sam-thinks/

It need not be one or the other any more than reading, writing and arithmetic compete.

Expand full comment

Devaraj Sandberg

To me, this study shows that actually most humans can't reason. They also just pattern match. This is why they believe LLMs are intelligent.

Expand full comment

I have recently been thinking the same

Expand full comment

I think that's the whole long play here: first you convince humans that AIs can't possibly be intelligent agents because “all they're doing is pattern-matching,” and then 10–20 years from now you're going to have a whole bunch of papers come and say “all humans do when they think is pattern-matching, so we're not intelligent agents either (and can be treated accordingly).”

Until about two years ago, language use was considered a (the?) distinguishing mark of human intelligence. Now that something else has it, the goalposts are on the move...

Expand full comment

Why do you think it took so long for so many Ai engineers and scientists to see what was clearly written on the wall more than seven years ago?

➡️ https://friedmanphil.substack.com/p/show-me-the-intelligence

Expand full comment

I recently published a similar finding:

https://www.preprints.org/manuscript/202401.1681/v2

Expand full comment

AKcidentalwriter

Coming from you I hope this resonance across spectrums.

Expand full comment

Oct 15, 2024Edited

Grice's Principle of Cooperation in Conversation Analysis observes that in normal discourse, people don't throw in random distractors just to challenge the listener. If someone says, "...but five of them were a bit smaller than average", then it's a reasonable inference that these might be exceptions.

LLms are not mathematical formal reasoning engines, nor are people.

Agreed that we have to be careful about where the generativity/confabulation boundary lies, but these arguments about the limits of LLMs are not relevant to their effectiveness in well-designed, circumscribed AI applications.

Expand full comment

Hello i did test that basic problem with o1-preview and it solved it correctly

this is the answer:

"To determine the total number of kiwis Oliver has, let's break down the information provided:

Friday: Oliver picks 44 kiwis.

Saturday: Oliver picks 58 kiwis.

Sunday: Oliver picks double the number of kiwis he did on Friday, so he picks 2×44=882×44=88 kiwis on Sunday. The note that "five of them were a bit smaller than average" does not affect the count; it merely describes the size of some of the kiwis.

Adding up the kiwis from all three days:

Total kiwis=44 (Friday)+58 (Saturday)+88 (Sunday)=190 kiwis

Total kiwis=44 (Friday)+58 (Saturday)+88 (Sunday)=190 kiwis

Answer: 190"

and this is the reasoning

"Adding fruit counts

Oliver gathers 44 kiwis on Friday, 58 on Saturday, and 88 on Sunday. This totals 190 kiwis, considering five smaller ones on Sunday.

Assessing the impact

I’m curious if the statement "five of them were a bit smaller than average" affects the total count of kiwis. Carefully reviewing the problem to identify any potential misleading elements."

p.s. i don't know whether o1-preview would ve answered wrongly or not when they tested with the others, maybe it was updated specifically on these types of problems afterwards

Expand full comment

Oct 15, 2024Edited

It looks like they used o1-mini (among many others) in the paper.

At the risk of sounding cynical, this was a very high profile paper and I would not be surprised if some of the AI companies did some reinforcement learning to improve their performance on these specific problems. They have experience putting band-aids on these things.

Expand full comment

Doubtful. You could check yourself by using the API instead of the web app. The API is a checkpoint and shouldn’t change until a new checkpoint is released which is generally every few months.

Expand full comment

“Putting a $ale$-aid on it” might be another way of putting it.

Folks like Sam Altman actually missed their calling: selling kiwis.

Expand full comment

Or maybe he didnt

Expand full comment

But as I indicated above, if such speculation is actually false, “Open”AI could easily prove it by releasing the data used to train each version of GPT.

Expand full comment

Oct 12, 2024Edited

Hi Gary. So why do you think symbolic AI models weren't more pursued? I'am a historian and when I do research on topics like "Why did idea A get more attention than idea B in history?" What I often discover is that there was some interesting reason behind it, which had to do with people who have more power in society vs. people who don't. If symbolic AI is more promising, then why isn't it now at the top of AI research? (or is it?, sry not an expert). But it would be interesting to find out what happened and why.

Expand full comment

Yhonatan Shemesh

Any idea why LLM developers don't simply pass on symbolic problems to a logical reasoning 'module'? I'm not an expert but it doesn't seem a very difficult challenge to have a triage mechanism whcih detects the problem, maps it out formally, sends it to the appropriate module, and gets back the result.

Is there some kind of ethos among developers that there should be a single/general mechanism that can handle any problem? I mean, human cognition is modular - why not adopt a similar approach?

Expand full comment

I've been wondering that about calculator modules since Bing was released...

Expand full comment

Believers in LLMs maintain it is all a matter of prompt engineering. If only you ask the right question, you will get the right answer. I don't think this is true. I have often enough asked ChatGPT a perfectly unambiguous question and received a wrong answer.

Besides, when does prompt engineering become the equivalent of giving hints to students during an exam? In other words, perhaps you must already know the right answer to be able to provide the right prompt.

Expand full comment

Prompt “engineering” is not real engineering. It’s a terminology invented/concocted by those trying to sell the LLMs as legitimate engineering.

How can LLMs be real engineering when no one even understands how they work?

Expand full comment

Dave Holmes-Kinsella (DHK)

To be fair, this is early days. It feels to me like we are in an evolution somewhere between the invention of <form> and when amazon started selling books.

People are figuring it out. The tools are getting better. Production use is revealing new vectors of model drift.

Three safe predictions to make:

It will get better

billions will be spent

New classes of applications, and new types of economies will be created that we can’t imagine right now

Expand full comment

Mar 28Edited

The secret is not in askling an unambiguous question. The secret is in unlocking a layered, multidimensional approach that involves several parallel streams of processing that are not just recursive, but periodically refreshed and challenged with the capacity for selective deletion of ideas that are found to be incorrect, illogical, or repetitive.

I had this article generated in response to a detailed prompt that was about 2 pages long. I had the LLM actually review my prompt and suggest improvements. We adjusted as recommended. I then had the LLM repeat the reasoning process several times. Next, I had it not write the article, but think up the concept of the article with no output. Then I had it think recursively on the concept of the article over and over before it ever gave me any output.

Finally after that entire process, I had it write the article. I actually questioned some of its conclusions but when I asked it about things I thought might be wrong, it had a very detailed and nuanced answer about why it came to the conclusion.

I am not saying that its analysis or its predictions are true or flawless. But I am arguing that there is much more potential in prompt engineering than most realize or admit.

https://open.substack.com/pub/gigabolic/p/seven-titans-one-future-the-strategic?r=358hlu&utm_campaign=post&utm_medium=web

Expand full comment

Neurosymbolic AI is logical and expected direction, I frankly do not understand why many people even fight against this idea? Why do tgey need 'pure' NNs necessarily? Is it some kind of cult?

Expand full comment

Dave Holmes-Kinsella (DHK)

I tested this against ChatGPT 4o and Claude 3.5.

Both of them

A) got the answer right

B) noted the red herring.

And, there I stopped. Since the point of the article, to my mind, is “please understand the characteristics of the tools you’re using.”

Expand full comment

O did the same thing. No problem getting the correct response for all OpenAI models so a. What is this article on about? b. What am I missing here?

Expand full comment

#nojs-banner { position: fixed; bottom: 0; left: 0; padding: 16px 16px 16px 32px; width: 100%; box-sizing: border-box; background: red; color: white; font-family: -apple-system, "Segoe UI", Roboto, Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 13px; line-height: 13px; } #nojs-banner a { color: inherit; text-decoration: underline; } This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts