46 Comments
Oct 24, 2023 · Liked by Gary Marcus

When people specify test problems like the first two under the success column (i.e., the invertible matrix and randomly chosen box problems), I really wish they would choose something other than round numbers in specifying the problem. Both of these problems are standard problems that are discussed in many places on the internet, and at least for the first problem some of the top results use 100 as the dimension for the matrix. Therefore, it seems plausible that the training data included the test problem with a solution. Changing the size of the problem to an unusual number, and ideally doing a quick google search to verify that there is no readily findable answer to that version of the problem, would give some confidence that the system isn't just regurgitating a solution from its training data.

As an aside, the more I read about this subject, the more I think the ideal person to investigate claims of AGI would have been James Randi.

Oct 24, 2023 · edited Oct 24, 2023 · Liked by Gary Marcus

So, in other words, like self-driving cars, LLMs need more data to handle the corner cases? We will need a fundamentally different intelligence model to crack AGI because generative AI, impressive or not, is obviously not up to the task. I sincerely hope that many AGI researchers will abandon deep learning and work on new approaches. Otherwise, no AGI anytime soon.

Oct 26, 2023 · Liked by Ernest Davis

I tried your Lake Michigan problem with Bard and it went as follows:

"Yes, it is possible that Joe is telling the truth.

If Joe and Beth live in different directions from Lake Michigan, they could easily be 100 miles apart, even though they both live 10 miles from the lake.

For example, Joe could live in Chicago, Illinois, which is on the western shore of Lake Michigan, and Beth could live in Green Bay, Wisconsin, which is on the eastern shore of the lake. The distance between Chicago and Green Bay is approximately 100 miles."

This doesn't mean Bard can solve every similar problem; it has failed miserably with other ones. Nevertheless, Bard's answer for this one was surprisingly good.

Anyway, I get your point and of course what you say is true.

At first glance, it might seem impressive that ChatGPT correctly answered some quite complex math problems, but we always need to remember that if it is a standard math problem (such as the 100x100 matrix or the 100-dimensional box), ChatGPT most certainly already had the answer in its training data. On the other hand, the Vega-Sirius-Sun problem is not so common and likely was not in the training data, and therefore ChatGPT failed on it.

This is always the problem with empirical tests for AI: an empirical test, if not carefully designed, can always be brute-forced by a computer system that has access to vast amounts of data, memory and compute. To test an AI's reasoning abilities we need to be certain either that the AI does not have access to the possible answers in its training data (which is practically impossible to verify given the terabytes of training data) or that not all the answers can possibly be in the training data (e.g. an infinite problem space). Or we scrap the empirical definition of intelligence (I blame Turing for introducing it) and try to come up with a formal, mathematical definition, which of course would require that we actually understand how intelligence works.

I would consider that second “correct” one to be at least partially wrong. It needlessly introduces the approximation 1 - e^(-2), which yields the decimal value 0.8647; that is not correct (0.8674 is).
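For what it's worth, assuming the exact value in that problem is 1 - 0.98^100 (my reconstruction; the comment only gives the two rounded decimals), the gap between the exact value and the e^(-2) shortcut is easy to check:

```python
import math

# Assumed exact form of the box-problem probability (not stated in the comment above):
exact = 1 - 0.98**100          # ~0.867380, rounds to 0.8674
# The shortcut (1 - 0.02)**100 ~ e**(-2) gives the slightly-off value:
approx = 1 - math.exp(-2)      # ~0.864665, rounds to 0.8647

print(round(exact, 4))   # 0.8674
print(round(approx, 4))  # 0.8647
```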

Oct 24, 2023 · edited Oct 24, 2023

Excellent article, Gary. Congratulations to Ernest Davis and Scott Aaronson for their fine efforts!

Often a person who does not know the answer to a problem will still be able to detect an incorrect answer and provide additional helpful guidance to the chatbot. This may allow the chatbot to converge on the correct answer. The relationship between the person and the chatbot is typically cooperative, not adversarial.

For example, I asked ChatGPT 3.5 about the Lake Michigan problem. It gave an incorrect answer. So I gave the chatbot a hint, and it provided a plausible answer:

Hint from me to ChatGPT 3.5: Suppose the lake was very big?

ChatGPT 3.5: If the lake in question was very large, such that its shoreline extended over a considerable distance, it would be possible for Joe and Beth to both live 10 miles from the lake and still be 100 miles apart from each other. In such a scenario, Joe could be telling the truth. The key here is that the lake's size would need to allow for their distance from the lake to be 10 miles each while also accommodating a 100-mile separation between Joe and Beth.

Perhaps supplementing the training set of an AI system with an extensive collection of detailed transcripts showing people solving problems with tools like Code Interpreter and Wolfram Alpha would enable it to achieve higher performance. The predictions by the AI system would have a higher probability of corresponding to valid and pertinent operations for the tools. It will likely be necessary to distinguish between the input sent to the tools and the output sent from the tools. Admittedly, this is a simple approach, but I think it is worth a try.
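As a purely illustrative sketch (the field names, roles, and the Wolfram Alpha query below are my own invention, not a format described anywhere in the post), such a transcript might mark tool input and tool output explicitly:

```python
# Hypothetical training record for tool-use transcripts; structure is illustrative only.
transcript = [
    {"role": "user",        "text": "What is the determinant of {{1,2,3},{4,5,6},{7,8,10}}?"},
    {"role": "tool_input",  "tool": "wolfram_alpha", "text": "Det[{{1,2,3},{4,5,6},{7,8,10}}]"},
    {"role": "tool_output", "tool": "wolfram_alpha", "text": "-3"},
    {"role": "assistant",   "text": "The determinant is -3."},
]
```

Marking which strings flow into the tool and which come back from it would let the model learn to generate only the former and merely condition on the latter, which is the distinction mentioned above.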

Making an LLM use tools is not going to be a silver bullet, indeed. Since it doesn't understand anything in any way, all it can do is mix and match what it has been trained on.

We all want a great non-"parrot" solution and a paradigm shift, but those are hard to do and hard to guess when they may arrive. So, trying to improve what we've got is our only choice for now.

Out of the four problems that failed, problem 2 (Vega) and problem 3 (pendulum) are ones that do not use tricky world knowledge (shape of lake) or high-level math (Shannon entropy).

These should be fixable by giving the tool more examples of how to translate word problems to something more formal.

This is particularly hard for the pendulum problem, as this would require quite a lot of spatial awareness to even understand the setup.

I do not anticipate any quick breakthrough here. GPT will have to be shown lots of examples from many classes of easier problems first. It should learn to convert the problem statement to a formal language, to ensure it got even the goal right, and then use a formal verifier or some other symbolic tool to check every little step it makes; a rough sketch of that loop is below.
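A minimal sketch of that translate-then-verify idea, assuming the model has already produced a formal equation and a candidate answer (SymPy stands in for the "formal verifier"; the example equation and solution are made up):

```python
import sympy as sp

def verify_step(equation: str, proposed_solution: str, symbol: str = "x") -> bool:
    """Check whether a claimed solution actually satisfies the formal equation."""
    x = sp.Symbol(symbol)
    lhs, rhs = equation.split("=")
    value = sp.sympify(proposed_solution)
    residual = sp.sympify(lhs).subs(x, value) - sp.sympify(rhs).subs(x, value)
    return sp.simplify(residual) == 0

# e.g. the model translates "twice a number plus three is eleven" into "2*x + 3 = 11"
# and proposes x = 4; the checker confirms it before the answer is accepted.
print(verify_step("2*x + 3 = 11", "4"))  # True
print(verify_step("2*x + 3 = 11", "5"))  # False
```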

This is hard, yes, but it need not be all solved in the next release.

"the the Earth and Vega both orbit the Sun, but at different rates"

I'm dying here.

How many problems in each of the categories did the systems answer correctly?

Timely. In my Oct 10 talk (video just released; I mention this substack at the end as an example of a source that says a lot of the right things about GPT and friends), I note that marrying LLMs and symbolic AI (such as Wolfram Alpha) presents us with unsolved problems, but I had no time to get into it. This is a fine illustration.

The plugins, ChatGPT’s ‘be harmless’ filter, and prompt engineering can all be seen as ‘trying to work around the fact that LLMs fundamentally have no understanding’. In the case of the ‘dumb be-harmless filter’ this is even funny, as ChatGPT will happily flag its *own* output as potentially inappropriate…

The talk (https://www.youtube.com/watch?v=9Q3R8G_W0Wc) helps non-technical people understand that the errors are not ‘solvable’ but are a fundamental aspect of these systems, by taking them step by step through the functional behaviour of LLMs without getting into the irrelevant details of the transformer architecture, etc. And with respect to sizing, a quick and dirty calculation shows that for one task (I did not do all 40 from the GPT-3 paper) you need models about 10,000 to 100,000 *times* as large to get into the error range of humans (still without reasoning/logic/math/understanding, of course, as that is a fundamental issue).

To me, GPT is a user interface; viewed like that, it's a spectacular advance, sort of Alexa on steroids. It transforms language inputs to outputs with unprecedented flexibility, but at the cost of lower accuracy. So typing or saying things like "what is the volume of a sphere with radius 1" would work well, and is easier than a dedicated Wolfram Language query. Or in Excel, saying something like "group users by department and calculate the average" is easier than clicking through menus. In short, a better, more user-friendly interface.
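For illustration, here is roughly what those two plain-English requests might be translated into behind the scenes (the DataFrame contents and column names are invented for the example):

```python
import math
import pandas as pd

# "what is the volume of a sphere with radius 1"
volume = 4 / 3 * math.pi * 1**3          # ~4.19

# "group users by department and calculate the average"
users = pd.DataFrame({
    "department": ["Sales", "Sales", "Engineering"],
    "salary": [50_000, 60_000, 90_000],
})
averages = users.groupby("department")["salary"].mean()

print(volume)
print(averages)
```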
