63 Comments
Kevin D's avatar

Actually this is one of the best posts by Gary, not because the other posts are bad, but because I am a mathematician and can actually appreciate Gary’s very reasonable take here. Most of my colleagues, including experts in algebraic number theory, share the feelings expressed in Gary’s post: it is a big deal for OpenAI because the problem itself is famous, but the difficulty and complexity of the solution are actually similar to those of previous AI solutions of less famous Erdos problems. To me, it feels like AI solving the Goldbach conjecture, but by providing a counterexample. I also agree with Gary’s point (3): pure math is probably the least profitable area in the world; the demand for pure math is literally 0 outside the community. But everyone regards pure math as a field that requires enormous intelligence (and for certain fields of pure math, results are easy to verify). So this makes math a perfect advertising subject for AI companies.

Gary Marcus's avatar

note that a lot of the smart thinking here is from Cal Newport :)

Anatol Wegner, PhD's avatar

I believe at this stage one should be extremely cautious about company demos. The proof/counterexample was obtained by an undisclosed custom internal model that seems to have been specifically tailored/fine-tuned to the problem by two top mathematicians, Mark Sellke and Mehtaab Sawhney, with the help of another CS prodigy, Lijie Chen. They probably iteratively fine-tuned the scaffolding, model, context/RAG and verifier and biased it towards promising strategies (all of which can then conveniently be hidden behind the undisclosed internal model), and then claimed an autonomous result when the model, after a 125-page random walk, was able to tumble over the finishing line under the watchful eyes of expert mathematicians, who then picked up the output, checked it and turned it into an actual proof. Not exactly what I would call autonomous.

Finally, I think the deeper reason behind the recent drive of AI labs towards mathematics is that if these systems can't be made to do mathematics, all the talk of AGI, superintelligence and recursive self-improvement becomes essentially vacuous.

Scott Joy's avatar

Nice to see you are linking to Ed Zitron ... Hopefully, in the spirit of collaboration, he will do the same.

Saar Drimer's avatar

If they were serious scientists they'd write it up, submit it, let it be peer reviewed and then published. Oh, and make sure that the results and methods are reproducible.

But we now need to accept their word with a horrible video of awkward grad students writing with chalk, and editing from hell. What a time to be alive.

Jeff Wu's avatar

They did publish the result with proof, and with comments by many mathematicians, including Timothy Gowers.

Anatol Wegner, PhD's avatar

The "awkward grad students" are actually the future Fields Medalist-caliber mathematicians who were needed to babysit the model into producing the result. Which tells us all we need to know about how autonomous the result actually was.

Oaktown's avatar

OpenAI needs a new name: I propose ClosedAI or HiddenAI

Matt Newell's avatar

It's a maths counterexample, why would the methods need to be "reproducible" (whatever that means)?

cope harder

Inside The Black Box's avatar

95% of Nvidia's operating cash flow absorbed by circular financing, up from 57% a year ago. The entire AI boom is running on money that's going in circles getting bigger each time.

nAxis's avatar

It's a type of derivatives market unto itself.

M. E. Black's avatar

Seems to me that the LLM found a counter-example by trawling and interpolating through its training configuration, with aid from the stochastic massaging given by the proof strategies inserted by the human beings who prompted the chatbot. No indication that the chatbot "understood" it had come across such an important finding, or where in the "chain of thought" transcript it was (was it at the very end, with a QED, or did it just continue on a wild goose chase after that, as is so many people's experience with these models?)

Seems to me yet another verification that LLMs are probabilistic retrievers, not general reasoning machines. If they were general reasoning machines, *all* of the Erdos problems would be solved. It is very likely that mathematicians within and outside of OpenAI are throwing every problem they can at these programs, and what does the success rate look like given that? Probabilistic retrieval programs have a market, I'm sure, even when used to produce "reasoning-shaped" text outputs of varying degrees of rationality and quality. Mathematicians probably do like that they can be used to solve some equations (especially attached to formal verifiers) that previous iterations of problem-solving programs got hung up on. We all know, still, that "chain of thought" fails frequently and is subject to both factual "hallucination" errors as well as reasoning errors. A big grain of salt is needed here. This actually seems *less* impressive to me than the other Erdos problem solutions because of the added details here.

On the scaling of the training data question, I have it on good authority that Google's training data cutoff is still 2025, and that advances in their models' performance on benchmarks are gamed not by throwing MOAR DATA at the models during pre-training but by better structuring the training configuration and through untold hours of RLHF.

Anatol Wegner, PhD's avatar

The proof/counterexample was obtained by an undisclosed custom internal model that seems to have been specifically tailored to the problem by two future Fields Medalist-caliber mathematicians, Mark Sellke and Mehtaab Sawhney, with the help of another CS prodigy, Lijie Chen. They probably iteratively fine-tuned the scaffolding, model, context/RAG and verifier and biased it towards promising strategies (all of which can then conveniently be hidden behind the undisclosed internal model), and then claimed an autonomous result when the model, after a 125-page random walk, was able to tumble over the finishing line under the watchful eyes of expert mathematicians, who then picked up the output, checked it and turned it into an actual proof. Not exactly what I would call autonomous.

Larry Jewett's avatar

How about “Full Self Mathing (Supervised)”?

Aiman Najjar's avatar

This seems similar to what Mythos "achieved", discovering bugs in tools that no security researcher feels incentivized to spend long hours looking for. It's not intelligence if it's based on previous human techniques.

They're just too desperate now to prove something that should be glaringly obvious. An intelligent machine would never tell you to walk to the car washer; confusing investors with complex math formulas is the same hype tactic they used with Mythos.

Tom Dietterich's avatar

I think security researchers are adequately incentivized to do this work. They are not lazy, either. Many of the bugs were found in critical systems that have been heavily analyzed by people. But computers can work faster than people (which is, after all, one of the reasons we invented them in the first place).

wh1stler's avatar

results are results, and if said results wouldn't have been achieved without using the AI (because there aren't enough human researchers to chase down every possibility or what have you), characterizing what the AI is doing as definitional intelligence or not seems largely academic

Aiman Najjar's avatar

I disagree. There is this reality and there is the fantastical hyped image those CEOs portrayed and used to justify layoffs, over-investment and the devouring of natural resources. Results are not results when you have the full picture. Real people suffered and serious economic damage has been inflicted because they dressed up underwhelming results as some magical intelligent machine.

Many experienced engineers have been repeatedly saying "this cannot be more than just a tool" - it's a good tool if you market it as such and acknowledge its limitations, but how did the sociopathic CEOs respond to that? Story after story about some "scary" Mythos or some flashy mathematical headline that doesn't capture the underwhelming part of the story.

Anatol Wegner, PhD's avatar

"I believe if the level and type of human expertise that is represented on this note had been assembled to find a counterexample to this conjecture a month ago, and those people put in similar amounts of time working on it than they did to reading and
thinking about Chat GPT’s solution, the mathematicians would have found a counterexample." - Melanie Matchett Wood in the companion/remarks paper by OpenAI.

Mitchell Harper's avatar

I'm most curious to know how much training data you need to reproduce this process. A reasonable conjecture is that scanning the arXiv and selected math textbooks produces a reasonable enough signal-to-noise ratio to reproduce a result like this (given a fixed set of scaffolding, integrations with static programs, and heuristics). Maybe there is some long tail of results where the "large" in large language model makes a significant difference, but if this result can be reproduced without the "large" training corpus, where is the moat*? (On top of mathematics being the place where the marginal revenue of software tooling is constantly being driven towards zero; see Sage etc.)

Tom Dietterich's avatar

My guess is that the training process makes heavy use of reinforcement learning from verified feedback. In other words, many thousands of hours of having LLMs struggle to solve problems and then reinforcing the solution paths. That is equivalent to a very large amount of training data. The cost of computation is probably the only moat.

Mitchell Harper's avatar

I am open to that, though my intuition is that the nature of this specific result makes me wonder how much that matters if you have symbolic methods for theorems that can give you a quick and deterministic reinforcement pass. (Edit: when drafting this reply, I assumed that by "verified feedback" you meant input from human users using the product, validated by some process; I don't think I read your comment entirely before replying.)

Oaktown's avatar

The LLM equivalent of "teach to the test"?

Tom Dietterich's avatar

The counterexample is a new result, so I don't think it is possible to teach to this test. But it is certainly focused training on mathematics as opposed to, say, fluid dynamics or epidemiology.

Mitchell Harper's avatar

Corollary: what if the "advances in scaling" we are seeing are in fact advances in R&D based on heuristics and carefully processed training data, hidden behind an LLM-like frontend that hits a router to special processes that do not in fact significantly benefit from scale except at some insignificant long tail? Then it is only a matter of time before companies whose entire existence doesn't depend on R&D-based public relations efforts reproduce the same results and undercut the costs of the providers bleeding money.

Tom Dietterich's avatar

"hidden behind a front end"? It sounds like you are implying that the learned knowledge is not stored in the weights of the network. We are certainly seeing research trending in that direction by adding various databases, knowledge graphs, and memory systems around the LLM. But the hard core LLM folks (esp. at OpenAI) have a deep commitment to connectionism, so I think the most likely route they have followed is massive RLVR to train the weights of the main transformer. It is a very computationally-intense way of getting knowledge into the system!

Jess H. Brewer's avatar

The AI employed "systematic and patient exploration of techniques and corners of problem spaces that are too exhausting to interest most human mathematicians" (your words). So the systematic patience and dogged determination of AIs are clear signs of their intrinsic inferiority to us godlike humans, who can't be bothered with such hard work? Good luck with that.

William Bowles's avatar

Patience, exhausting? Interesting choice of words that reveals so much about the real nature of AI. Back in Turing's days, the computer could do in hours what took an army of people (mostly women) weeks or months to do, and real tedious with it as well: longhand computing. But algorithms don't get impatient or exhausted, they just do what they're programmed to do, and with the continuous support of an army of mathematicians doing the real work. Hmmm... that reminds me of something...

Jess H. Brewer's avatar

Sure, LLMs are "programmed". They are just "computers doing what they're told", while our divinely inspired human mathematicians do "the real work". Do you really believe LLMs are "just programmed computers" after all these changes, or are you just pandering to the ignorance and fear of your readers?

William Bowles's avatar

I assume you're talking to me and if so, to answer your question, then YES, they're still just programmes: sophisticated, complex, adaptable (with human assistance of course), but they're still 'just' software. Bits and bytes still don't add up to intelligence no matter how they behave.

Patricio Rodriguez's avatar

I guess "systematic patience and dogged determination" could be rephrased as "combinatorial brute force" to better portray what's really happening. Have a great day Mr White!

Jess H. Brewer's avatar

Humans can't be bothered with digging ditches, so they build a machine to do it for them, and subsequently make a point of urinating on the machine every day so it'll know its place.

richardstevenhack's avatar

"So, this experiment might be more about marketing the power of their new model than trying to actually advance computer-aided math."

Where have we seen that pattern before? Ah, yes, Anthropic's Mythos.

As for the persistence an LLM brings to hard problems, that reminds me of the guy recently who recovered a lost crypto wallet password... by having the LLM try over a TRILLION password brute force attempts.

Obviously not smarter than humans... OTOH, the guy losing his password while stoned doesn't exactly represent the human race very well, either. Replacing him with a machine might be a net win for the species. :-)

Darko Mulej's avatar

Impressive, if one remembers that 3 years ago Stephen Wolfram had to step in to help with basic arithmetic.

Almost by definition, the first breakthrough is met with scepticism — and rightly so.

But who would bet this is the last time AI surprises us in mathematics — or that the surprises will stay confined to mathematics?

Tom's avatar

We get it—there are always caveats. The technology is still flawed and there is significant room for improvement. That said, we need to be mindful not to turn this space into an echo chamber in its own right. Look, when a skeptical stance is tied to personal or professional identity/brand, the skepticism itself should be met with, well, skepticism. Those conditions are ripe for the same motivated reasoning and confirmation bias that tech hypers are accused of using. If an intellectual legacy is staked on the narrative that the technology cannot ever achieve a certain milestone, any breakthrough result represents an existential threat to that reputation. To survive it, dogmatic skeptics instinctively deploy predictable defense mechanisms, moving the goalposts to reclassify a cleared benchmark as a mere statistical trick, or hyper-fixating on isolated failures while ignoring legitimate successes. Ultimately, both sides force empirical data into a pre-established narrative, and, with respect, I see this here all the time. True skepticism must remain unattached to outcomes; when a critique reads more like a professional defense mechanism than an objective analysis of capability, it warrants a second look.

Kevin Zatloukal's avatar

This example seems similar in some deep sense to the coding cases where LLMs can exceed human performance such as finding exploits for bugs and end-to-end debugging.

If I can attempt to summarize it, the similarity seems to be that these larger LLMs are great at working with lots of details at once, keeping a hundred different specific variables in mind and finding a way to get them to all fit together. That is hard for human beings, while it seems ideally suited for LLMs with multi-head attention.

(On the other hand, they still seem bad at big picture understanding and realizing what "makes sense" and what doesn't. At least, for me, Opus 4.7 is still quite bad at this.)

nAxis's avatar

"A computer is just a bunch of stupid on-off switches... but it does its arithmetic very, very fast."

Philip Crawford's avatar

Fine, but the problem here seems to be about the nature of the claims made following the outcome achieved. If one simply framed this as another enhancement of human productivity by a machine working through areas of a problem space to make it easier for humans to digest and interpret, what's wrong with that? Productivity was enhanced, surely that's fine?

D’AngelLuddit's avatar

The marketing “brilliance” of this demonstration is grounded in Americans’ near total lack of mathematical reasoning.

Richard H. Serlin's avatar

Bloom notes that the AI’s success with the Erdos problem was partly due to the fact that it explored an avenue that humans considered unlikely, and not worth spending a huge amount of very costly human time on. I am a former state junior chess champion, so I know a bit about chess and computers, and one of the things that makes computers so powerful is that they will explore things that look very unlikely to a human, and often uncover powerful tactics. A human just could not justify spending the time to follow these paths, or would not even think to look, as the human brain must immediately greatly narrow what it looks at in many move variations.