25 Comments

Elementary Euclidean geometry was solved a long time ago. It is a known result of mathematical logic that such a geometry is a decidable theory: there exists an algorithm that settles any problem expressible in the language of elementary Euclidean geometry after a finite number of steps. It was found many decades ago. So, strictly speaking, that's an area of mathematics that no longer requires creativity. DeepMind found an efficient algorithm, but I am not surprised at all.

The algorithm you mention sounds interesting. Could you please elaborate?

Look e.g. at this:

https://en.wikipedia.org/wiki/Tarski%27s_axioms

There are many results in mathematical logic showing that some particular branch of mathematics can be axiomatized in such a way that the truth or falsehood of every statement can be decided in a finite amount of time, with a proof provided. These decision algorithms are usually complex, but they basically reduce the notion of truth to a computation. Many branches of mathematics are not decidable, for example the theory of natural numbers (number theory). This means not only that we don't have an algorithm that can solve all problems of number theory, but that it is *mathematically impossible* for such an algorithm to exist. Thus we say that mathematics, in general, requires non-algorithmic creativity.
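To make "reducing truth to a computation" concrete, here is a toy sketch of my own (far simpler than Tarski's quantifier elimination for real closed fields, which is what actually decides elementary geometry): for propositional logic, truth can be decided by brute-force enumeration of every assignment.

```python
from itertools import product

def is_tautology(formula, variables):
    """Decide whether a propositional formula is true under every
    assignment by enumerating all 2^n truth-value combinations.
    `formula` is a function taking one boolean per variable."""
    return all(formula(*values)
               for values in product([False, True], repeat=len(variables)))

# The law of excluded middle is a tautology ...
assert is_tautology(lambda p: p or not p, ["p"])
# ... while a bare variable is not.
assert not is_tautology(lambda p: p, ["p"])
```

The real decision procedures are enormously more sophisticated, but the principle is the same: a terminating computation replaces mathematical insight.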

And while this kind of realistic analysis is what the world needs when these math olympiad results are discussed, the world will get swamped by nonsensical hype about AGI instead. Sigh.

Google is really weird. It marries (often) very good science and engineering with (often) over-the-top marketing and buzz.

It's a different subject, but the same thing recently happened with their quantum computing (https://ea.rna.nl/2024/12/31/googles-willow-quantum-computer-impressive-science-and-misleading-marketing/), and you could even see the scientists squirm in discomfort.

Thank you Ernie and Gary,

To get beyond the narrowness of Math benchmarks, which "merely" ramp up algebraic complexity, we have advocated for a better benchmark design to understand the inner reasoning of the model (and used Ernie's framework to compare the existing benchmarks): https://curriculumredesign.org/wp-content/uploads/Benchmark-design-criteria-for-mathematical-reasoning-in-LLMs.pdf

Your opinion would be highly valued, thanks.

science journalism at its best

The "manual translation" seems to reflect (Euler's?) remark that his intelligence was often in his pencil.

Thank you for this article! It is so nice to see an article on real AI — the kind that requires a lot of thought and effort and produces results that are more than a rehash of what someone else already did. Impressive!

"the average IMO high school-aged human gold-medal winner solves 81.8%" - there are time limits in the International Mathematical Olympiad. With relaxed limits, human scores would be higher. With compute limits, scores of AlphaGeometry2 would be lower. Not sure what would be a fair time rule here.

Comparable energy use could be a criterion. The human brain consumes about 20 W, and contestants have four and a half hours to solve three problems. I wonder how pathetic AlphaGeometry2's results would be given only 0.09 kWh of electricity.
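For what it's worth, the arithmetic behind the 0.09 kWh figure checks out:

```python
brain_power_watts = 20   # rough estimate of human brain power draw
contest_hours = 4.5      # one IMO sitting: 4.5 hours for 3 problems

# watt-hours / 1000 = kilowatt-hours
energy_kwh = brain_power_watts * contest_hours / 1000
print(energy_kwh)  # → 0.09
```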

Your hope for better approaches is based on the assumption that we want more powerful AI, and soon. I wish I shared your optimism. To be honest, when I read about AI’s current limitations, mostly what I feel is relief.

Suppose OpenAI or someone else hits upon true, unequivocal AGI sometime in the next couple of years. Will the desire for proper safety measures override incentives to ship product? Will the world’s cultures, economies, and governments adapt in time to prevent some kind of catastrophic disruption? If AGI is the first step in a “fast takeoff” scenario, are we even remotely prepared for machine superintelligence?

More and more, I find myself hoping for slow progress and a stubborn commitment to dead ends. This of course carries its own dangers (difficulty or impossibility of true alignment, for example), but for now I think weaker is better.

And solving word problems in Euclidean geometry will help give an LLM the "ability to choose between competing courses of action based on the relative value of their consequences," behavior exhibited by the bacterium Myxococcus xanthus ...

how?

Quote from: Balleine BW. 2007. The neural basis of choice and decision making. J. Neurosci 27:8159–60

Especially when:

The interior angles of a triangle sum to less than 180° (hyperbolic geometry).

The interior angles of a triangle sum to exactly 180° (Euclidean geometry).

The interior angles of a triangle sum to greater than 180° (spherical geometry).

Ernie, Gary,

Fine article - no mention of AGI is appreciated. "hoping" still gives me pause.

"hoping the field will develop MODELS with a deeper understanding of mathematics and the world"

The contortions you describe, which DeepMind uses to build proofs in a modest corner of geometry, have me wondering how the term "model" applies.

I'd love a response.

The "contortions" are necessary, because neither math nor the real world have a simple model. We are conquering complexity here. And the lessons will have wider applications.

Public sentiment has gone from "AI is hard" to "AI is easy" on the strength of the LLM hype. At some point we will return to reality: true AI is really hard.

Another recent opportunity to observe the deficiency in LLM reasoning came with the promotion of OpenAI o3's progress on multi-digit multiplication. Critics of AI reasoning point out that LLMs are still poor at math, but the pro-AI-intelligence faction rebuts this with only "but humans can't do that either."

However, they miss something important: humans would ask for a calculator, and LLMs do not. The fact that they do not is a simple and apparent sign that no reasoning is present. I elaborated on this recently, FYI: https://www.mindprison.cc/p/why-llms-dont-ask-for-calculators

What fraction of the Olympiad questions that were correctly answered had been trained upon by the bot?

If it is claimed that “none were trained on,” how do we know that this is true?

Simply trusting the folks who are making these claims is not science.

And correctly "solving" problems whose solutions were in the training data is meaningless.

Superb writing. On the LLM front it is utterly consistent with behavior I experimented with and documented back in GPT-2 and subsequent versions, in what I called "elementary physics" and "elementary geometry" problems.

My "benchmark" is writing pulp fiction novels and seeing where the model generates incongruities. The models have gotten very, very good, but when characters move through space in their prose, they easily generate situations that we humans are unlikely to write, at least not unintentionally. People in handcuffs can't get into the front and back seats of a car through separate doors. There are no echoes in open fields. The most humorous one I call "head so far up his ass he could talk through his mouth."

My feeling is that as humans we are constantly assessing physics and projective geometry in our cognitive systems, and when we read something that contradicts our internal models as we project ideas from the text, it is quite jarring and dissonant. It doesn't require logic or calculation; it's a Kahneman fast system.

Language models and other higher-level models cannot absorb the spatial reasoning we develop with our physical bodies: proprioception, gravity, acceleration, and the visual system. I had to read the first problem three times; I thought there was something I was missing, because the answer is trivially obvious from a visual point of view.

I think a lot of geometric axioms (at least the simpler, Euclidean ones) don't exist as language models for us. I suspect even more complex ones, in topology and differential geometry, we understand without being able to describe.

Fascinating. Thank you both for explaining these issues to us non-mathematicals.

Gary, slightly off topic, but this might be interesting for you: https://www.wired.com/story/thomson-reuters-ai-copyright-lawsuit/

As the judge decided, fair use does not apply when the new product created using copyrighted material is a direct competitor to the original work.

AlphaGeometry, AlphaMath, chatbots, Waymo cars, all point to the same future:

An outrageous amount of data and statistics is used to build the framework. Augmentation will happen as needed: even more data, formal logic where it makes sense, or invoking simulators and other tools, depending on the application.

In the next several years we'll double down. We'll need a lot more data, a lot more compute, and a lot of architectural work. The parts are in place. Now we need to build.

"The core of the reasoning system is the kind of symbolic, geometric-theorem proving system that people have been building since the late 1950s"

What we have had since then is the ability to verify proofs. What is new, and a big deal, is the ability to synthesize proofs, however imperfectly. Then the verifiers take over.

This is *not* new. Automated proof search has been a research project at places like CMU for well over 20 years at this point.

The problem with an automated theorem prover is that it is not easy to tell it that you have 100 similar, already-proven theorems and that it should try to imitate them.

Yes, you can painstakingly build a collection of tactics, as in Lean, and gradually work your way forward in a domain, but this is slow work.

That said, LLMs aren't magic. Hard and unique proofs will stay hard. But easier math will likely be cataloged and automated.
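For readers who haven't seen the tactic style mentioned above, here is a minimal (hypothetical) Lean 4 example, where a built-in decision procedure discharges a simple arithmetic goal instead of the user writing the proof term by hand:

```lean
-- A trivial lemma proved by a tactic rather than an explicit proof term.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  omega  -- decision procedure for linear arithmetic over Nat and Int
```

Building a domain-specific library of such tactics, theorem by theorem, is exactly the "slow work" referred to above.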

This is like the difference between chess-playing programs in 1960 and AlphaGo in 2016. One is a toy; the other is the real thing.

The new approach can do proof synthesis without lots of code that needs to be rewritten for every new class of problems.
