There are two kinds of AI researcher: (1) those who already know that LLMs (by themselves) are not the route to human-level AGI, and (2) those who need to spend 10-20 years and $100 billion working that out.
My 1976-vintage ELF has a dedicated math ROM alongside its interpreter. Later PCs used math co-processors. I don't understand why LLM devotees seem to shun hybrid processing solutions as sacrilegious...
hubris, plain and simple
I would be a little more generous. I think it's understandable that people would want to see how far the technology can be pushed. But I agree with you that its limitations are fairly evident.
I think it is a worthwhile endeavour in order to test the proposition that Gary Marcus has identified: can an LLM extract and encode a set of general underlying rules from specific training examples? Multiplication is a simple example on which to pit an architecture's wits, as it were. As Marcus points out, the pattern of performance suggests that this particular example hasn't managed to encode rules that work on multiplication problems of arbitrary size, even though we know those rules exist and are pretty simply expressed.
ChatGPT's Wolfram plugin, their MixerBox Calculator plugin, and their Advanced Data Analysis tool are all hybrid systems which can do math. I think AI will be brought to existing interfaces like notebooks, spreadsheets, accounting software, rather than trying to do everything through "chat".
"My (innately-programmed) calculator by contrast has received no training at all."
Haha. Yes. A true intelligence would use a calculator. After all, the use of tools is known to be a sign of intelligence.
Good article. The twitter thread is enlightening. There is a set of people who want LLMs to "evolve" into a general intelligence and so want to prove they can do everything and are not just stochastic parrots.
Afaic, they are grammatically 'stochastic parrots' and semantically 'stochastically constrained hallucinators'.
Maybe it's just the Lisp programming instinct in me, but my first thought is that DALL-E doesn't understand math either – just look at all those unclosed parentheses!
))!
Made me laugh out loud with that extra column on the table. Brilliant. I suspect some people will miss the very fundamental critique behind that.
My biggest fear is that the 'house of cards' aspect of LLMs (which exists alongside a genuinely usable aspect) will come crashing down before I can publish the video of my talk last week, which covers not only the errors but also illustrates how they come about.
Hah, that paper is indeed great, but it proves exactly the opposite of what its title suggests. There are also some notable weak points. First, the training set contains numbers with up to 12 digits, and the evaluation set also contains only numbers with up to 12 digits; to demonstrate that the LLM has indeed learnt the rules of arithmetic, tests with numbers larger than anything in the training set should be performed, and I can bet on the result: 0%. Second, the evaluation set contains only 9,592 cases, which seems woefully inadequate; given that test cases are trivial to generate automatically, it would make sense to test on all possible combinations, or at least on a much larger sample. And third, the authors state that the evaluation cases were drawn from the same distribution as the training cases, which makes little sense: randomly chosen numbers would, if anything, be uniformly distributed. I suspect they used a random number generator with a non-uniform, fairly compact distribution (small variance) to generate their cases, in which case the chance of many evaluation cases coinciding with training cases becomes quite high.
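To make the out-of-range point concrete, here is a minimal sketch (my own, not taken from the paper's released code) of how one could generate evaluation cases with more digits than anything in a 12-digit training range, with exact ground truth computed natively:

```python
import random

def make_ood_multiplication_cases(n_cases=10000, min_digits=13, max_digits=15, seed=0):
    """Generate multiplication problems whose operands are longer than a
    12-digit training range, so any success would have to come from learned
    rules rather than memorised in-range patterns."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        a = rng.randrange(10 ** (min_digits - 1), 10 ** max_digits)
        b = rng.randrange(10 ** (min_digits - 1), 10 ** max_digits)
        cases.append((f"{a} * {b} =", str(a * b)))  # exact ground truth
    return cases

# Score each model completion against the ground-truth string with exact match;
# partial credit would only blur the result.
```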
For statistics, I use mathcracker.com. It does a fine job with regression analysis. And for multiplication I use my calculator. I always feel like I am an old fogie. Thank you for making me feel better about myself, ha, ha. Loved your Guitar Zero book by the way.
If a general-purpose AI-driven problem solver is sought, maths (including logic) will have to be incorporated in a rigorous numerical form. No other option makes any sense.
It'd be interesting to see if you could program a transformer "by hand" to solve large multiplication problems. Is it a failure of the learning algorithm or a limitation of the neural architecture itself? I would think since there's a finite number of layers, it would have a limit on the number of carries it could perform, but you'd expect a model with dozens of layers and billions of parameters to be able to perform perfectly up to some number of digits.
It would also be interesting to see if it can induct binary multiplication (which is very simple) better than decimal multiplication.
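For intuition about the carry question, here is a minimal sketch (just the grade-school algorithm, not a hand-built transformer) showing that carry propagation is an inherently sequential loop whose length grows with operand size; the same routine in base 2 also shows why binary multiplication is the gentler test:

```python
def long_multiply(a_digits, b_digits, base=10):
    """Grade-school long multiplication over digit lists (least-significant
    digit first). Returns the product digits plus a count of carry steps,
    the sequential work a fixed-depth network would somehow have to unroll."""
    result = [0] * (len(a_digits) + len(b_digits))
    carries = 0
    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % base
            carry = total // base
            carries += carry > 0
        result[i + len(b_digits)] += carry
    return result, carries

a, b = 123456789, 987654321
digits = lambda n: [int(d) for d in str(n)][::-1]
product, n_carries = long_multiply(digits(a), digits(b))
assert int("".join(map(str, reversed(product)))) == a * b
print(n_carries)  # grows with operand length; in base 2 each partial product is just 0 or 1
```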
Interestingly, humans can multiply arbitrarily long numbers, in principle, with no errors, given time and care. It is just very tedious. Admittedly, they need to use pencil and paper.
"In principle" is doing a lot of work there. I'd be surprised if the median person with a high school diploma can multiply 40 digit numbers with very high accuracy. I'd guess <80% getting every digit correct, though the experiment might be hard to recruit for.
Though your point is correct that humans "know how to multiply", and could, e.g., correctly identify where an error occurred when it's pointed out. I tried for about 15 minutes a few weeks ago to get GPT-4 to do it ("showing all work", with various prompts), but it couldn't manage numbers with even a few digits. It would quickly generate correctly formatted grade-school arithmetic computations that were totally wrong.
What if this actually worked out? Would it change the perception of current LLMs?
Suppose it turns out the necessary ingredients are available, but LLMs just “don’t get it completely” :O
Totally agree: if someone claims LLMs understand multiplication, someone else has to point out that they are unreliable and not even close to the simplest calculators.
But I feel the real point has lost focus. What has an LLM that sometimes provides the correct result, and oftentimes not, actually "learned"? What is wrong with what it has "learned"? And how could we do better?
Since it seems the authors of the paper made their code and data available, some answers to those questions might follow...
Yeah, either direction would be interesting. If there is some fundamental limitation, that might give us some way of directly measuring and improving on their limitations.
Another area I was curious about, and didn't see an immediate answer in their code, was how the multiplication questions were tokenized. (If I get some free time, I'll play around with it more.) LLMs typically use a token encoding technique based on compression (byte-pair encoding), which seems fine for natural language, which has a lot of redundancy in how it's encoded. But with random multiplication examples, the fact that "911" is a common substring in natural language is a hindrance to getting the right answer. Every digit contains useful information.
Tokenization is IMO a useful window into cognitive limitations of LLMs. They don't "see" how words are spelled, they see how words are tokenized, and have to infer some aspects of words from context. I think this is one reason why they're not very good at rhythm in generating poetry. It's somewhat impressive that they can fairly reliably perform tasks like "add a space between each character in the following sentence: ..."
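A quick way to see this is to print the actual token splits. A minimal sketch, assuming the open-source tiktoken package is installed; the splits below illustrate typical BPE behaviour, not necessarily the encoding the paper's model used:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["multiplication", "911 * 734 =", "123456789 * 987654321 ="]:
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(repr(text), "->", pieces)

# Long numbers come out as frequency-based multi-digit chunks rather than
# single digits, so the model never sees a clean digit-by-digit view of the
# operands it is supposed to multiply.
```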
hmm, the inability to do arithmetic might become the XOR moment for LLMs :)
(Back in 1969, single-layer perceptrons were shown to be incapable of computing the XOR (exclusive-or) logical operation, which led to the first abandonment of neural networks by the AI community.)
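For the record, the 1969 result is easy to reproduce: XOR is not linearly separable, so no single linear threshold unit gets all four cases right. A minimal brute-force sketch:

```python
import itertools

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]  # XOR truth table

def accuracy(w1, w2, b):
    """One linear threshold unit: predict 1 iff w1*x1 + w2*x2 + b > 0."""
    preds = [int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

grid = [k / 4 for k in range(-8, 9)]  # weights and bias from -2.0 to 2.0
best = max(accuracy(w1, w2, b) for w1, w2, b in itertools.product(grid, repeat=3))
print(best)  # 0.75 -- three out of four is the ceiling for a single layer
```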
Interesting point, but the XOR result turned out not to be the killer argument it was claimed to be; some might argue that treating it as one was a bad thing, since it slowed the development of neural networks, which have proven to be powerful and useful, even if there is disagreement as to how powerful.
It did kill the single-layer perceptron. As for neural networks in general, I would argue that they are a dead end in AI and basically a waste of time; probably their only benefit is to demonstrate how AI should not be done. A good analogy is how air travel developed in the 19th and 20th centuries. Lighter-than-air craft such as balloons and dirigibles came first (because the underlying science, buoyancy, was easy), made the headlines by achieving what was previously thought impossible, and promised mass commercial air travel, only to be proven fundamentally flawed and eventually give way to airplanes, which developed much more slowly because of the complexity of the technology and the science of fluid dynamics.
Good analogy. I suspect similar. It's not a complete waste of time, but it is not the silver bullet its (VC-hungry) proponents proclaim.
Forever the faithful critic and joyful skeptic... thanks for the grounding...
Thank you for writing _The Algebraic Mind_.
what does backtracking mean in a word-predictor? i think you are ascribing internal machinery that is absent
In the future all LLMs will be hybrid. They will always defer to calculator-like modules to do numerical computations. That will guarantee the computation is 100% right, but just as importantly it will be millions of times faster.
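A minimal sketch of that deferral, with a hypothetical call_llm stand-in for whatever chat model is in use; anything that looks like plain integer arithmetic gets routed to exact native computation instead of the model:

```python
import re

# Matches bare integer expressions like "123 * 456 =" (illustrative, not robust).
ARITHMETIC = re.compile(r"^\s*\d+(\s*[-+*]\s*\d+)+\s*=?\s*$")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in for a real chat-model call")

def answer(prompt: str) -> str:
    if ARITHMETIC.match(prompt):
        expr = prompt.rstrip(" =")
        # Exact integer arithmetic: always correct, and vastly cheaper
        # than generating the digits token by token.
        return str(eval(expr, {"__builtins__": {}}))
    return call_llm(prompt)

print(answer("123456789 * 987654321 ="))  # 121932631112635269
```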
I think getting an AI to do math "the hard way" is a reasonable research niche to explore, though. Certainly you'd expect an AGI could do math, no matter how tedious, so today's LLMs are clearly not AGI, for this and many other reasons. But that large *language* models alone cannot do math does not seem surprising to me at all, and it doesn't seem like a big concern. We've been doing math with computers for nearly 80 years; we can always add that back in.
ChatGPT is already using a calculator and god knows what other tools to augment its performance, but that's not the point. The point is that if LLMs are unable to learn even the simple rules of arithmetic, that exposes them as the stochastic parrots that they are, and until at least learning arithmetic can be demonstrated, all other talk of AGI and building world models is just blah, blah, blah.
LLMs today are useful for many things. Math is not one of them. Since they are large language models, and not math models, this strikes me as completely expected.
C’mon, just the other day I was told they were AGI
The percent of serious people who think we have AGI today is zero.
So, they are stochastic parrots then, because language without understanding is just that
LLMs do well at some tasks and poorly at others. That's generally how most tools are, I've found. If you try to drive a screw with a pair of scissors, it doesn't work very well. It took a lot of trial and error to gain an intuition for what GPT-4 does well and what it does poorly. I tend to only use it for things where it has a track record of being useful to me.
What's interesting to me is that I strongly believe the set of things neural nets can do well will grow significantly in the coming years, whereas what my scissors can do is pretty much fixed. I think that is why a lot of people are excited about AI: it's the trajectory, not necessarily the current capabilities. At least, that's what excites me about them.
I think there are ways to construct a system that is linguistically competent yet has no understanding, using a mechanism other than stochastic parroting. Probably more than one way to skin the language-simulation cat.
My background is in math, and one thing I learnt was that if your proofs or conjectures are not particularly interesting, journals will not publish them. Having the intuition to know what is important and what isn't is perhaps the greatest skill.
Looking at papers like this, anyone who knows anything about LLMs could have guessed that they are terrible at multiplication. Who would have thought that finding statistical regularities among (10^12)^2 combinations of numbers would be hard... It is not a good sign if a field finds papers like this interesting enough to publish.
In any case, isn't the obvious solution to just attach WolframAlpha plus a classifier that says when to use it?
Putting a positive spin on a disappointing result is as old as science papers. Especially where there is VC money to be had.