Call me when they can do the same with a completely self-contained and isolated system running on 100W of electricity. I mean really. They threw the electrical and computing capacity of a small city at these problems. And this is a big deal? It’s patently ridiculous to compare the accomplishments at all. How many gold medals would there be if each student were allowed access to a computer and the internet, as the AIs in essence were? They parsed some math. So what.
Fundamentally I would describe this as PR desperation. These companies have a product to sell to C-suite types, the sort who will now think the on-staff quants and engineers can be replaced by AI at 1/10th the price. And the C-suite types will blow up their companies in the process; under the wrong set of circumstances, they might pull a Bear Stearns and pull down the economy. Just do it faster, in ways that no one in the company can then explain or debug, while absolving everyone of any sense of guilt or accountability.
Oh and it would rob us of the next “Big Short” movie because there would be no human beings involved worth writing into a script. Such a colorless and inhuman tapestry we weave these days.
I disagree with your energy argument. If this leads to substantial progress on open problems, the benefit will outweigh the cost. And the cost will be reduced drastically over time.
But I doubt this will lead to real progress on open math problems.
It's not just corporate PR, either - keep in mind that OpenAI and Google are now tightly entangled with the federal government and national security concerns via Project Stargate and the narrative of existential competition with China. They're marketing to the security state as well, and may even be serving in a soft propaganda role to project American dominance in technology.
My basic feeling is that the accomplishments of LLMs in the world of bounded problems, no matter how difficult they may be, tell us nothing about their ability to do something genuinely original where they're exploring territory that has not already been mapped and subdivided. That takes an entirely different skill set. As far as I know, the only way you can acquire that skill set is to go out in the world looking for problems to solve. None of the existing AIs have anything approaching that kind of agency and autonomy.
It would be interesting to know if DeepMind's system has learned to do arithmetic: can it correctly multiply together two large (>20 digit) integers? AFAIK, no LLM has yet learned to do this, a skill that the vast majority of human children are able to learn. (Writing a Python program to do it does not count as having learned arithmetic, by my definition of "learn".) If DeepMind's system is qualitatively different, the ability to learn arithmetic would be one test.
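For concreteness, here is a minimal sketch of the kind of test I have in mind, using Python's exact big-integer arithmetic as the ground truth. The query_model function is a hypothetical stand-in for whatever interface a given model exposes, not a real API:

```python
import random

def query_model(prompt: str) -> str:
    # Hypothetical stand-in: send the prompt to the model under test
    # and return its raw text reply.
    raise NotImplementedError

def arithmetic_test(digits: int = 25, trials: int = 20) -> float:
    """Ask the model to multiply two random `digits`-digit integers and
    score it against Python's exact big-integer arithmetic."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10**(digits - 1), 10**digits - 1)
        b = random.randint(10**(digits - 1), 10**digits - 1)
        reply = query_model(f"Compute {a} * {b}. Reply with only the digits.")
        try:
            correct += int(reply.strip().replace(",", "")) == a * b
        except ValueError:
            pass  # a non-numeric reply counts as wrong
    return correct / trials
```

A system that had actually learned the rules of multiplication should score at or near 100% no matter how many digits you pick; one that has only memorized patterns should fall off as the numbers get longer.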
My $10 calculator knows how to do arithmetic and it doesn’t need a Python or any other snake program to do it.
Yes, your calculator does, but LLMs do not.
LLMs need snake oil to do arithmetic?
No they need it to treat their arthritis (fun fact: the original snake oil was a working medication). They need *fake* snake oil to keep money coming in.
Here's an account of a particular bit of "reasoning in the wild," the sort of open-ended reasoning that is beyond LLMs: Intellectual creativity, humans-in-the-loop, and AI: Part 2, Interpreting Jaws, https://new-savanna.blogspot.com/2025/08/intellectual-creativity-humans-in-loop.html
This is the second post in my series, Intellectual creativity, humans-in-the-loop, and AI. The first post set the theme: On the boundaries of cognition in humans and machines. The idea is simple: These days LLM-based AIs are given specific tasks, which I’m thinking of as imposing a boundary on the underlying model within which a solution is to be found/generated. The most interesting situation, though, is one where there is no boundary set. Rather, the task is to discover a problem one can work on, which I’m thinking of as imposing a boundary in the space. Here’s my first case, creating a Girardian interpretation of Steven Spielberg’s Jaws.
I’m using that as an example because, a) I did my own interpretation a couple of years ago (February 28, 2022), so I remember the process, and b) more recently I had ChatGPT interpret the film (December 5, 2022), though, to be honest, comparing two performances is a distant second to my recollection of my own process. What’s important about ChatGPT’s performance is that I gave it both the “text” (broadly understood) to interpret, and the conceptual lens through which to make the interpretation, the ideas of René Girard. ChatGPT had a specific task to perform. By contrast, I had no intention of interpreting Jaws when I decided to watch it a couple of years ago. How and why did I decide to interpret Jaws? That’s the issue.
Indeed. But it is a process. First get better at simple things, then gradually at more complex free-form ones.
Normally going out into the world to search for problems to solve would be required. However, given just how much we put the machines to use, they end up exploring a lot of the world anyway.
The world of advanced mathematics (honestly even basic mathematics) is far beyond my expertise and competence.
But I do think using tools to help solve problems and then saying “they (i.e. the tools) end up exploring much of the world anyway” is attributing agency to the tool when it should be attributed to the tool user (i.e. a person).
I know I’m being pedantic, but I’m hyper sensitized to this kind of sloppy wording because it enables AI hype lords to equivocate their way to billions of dollars.
I am not attributing agency, either to tools or to AI.
I am saying that we, people, assign the AI enough problems that, by necessity, if it learns how to solve them, it will end up knowing a lot about how the world works. That's true even if it is not a robot out there seeking its own challenges to work on.
Also not saying we are anywhere near AI being that good.
Fair enough. I still think using words like “learn” and “know” implies subjectivity (and therefore the presence of a thinking conscious subject). So I always bristle when these words are used to describe what an AI is doing. Hence my pedantic urge to correct random strangers on the internet ;)
My definition of "learn" does not imply subjectivity. I want to see an LLM learn the rules of arithmetic, and then consistently apply them. So far, no LLM is able to do this. This is the jump from "this response is most probably correct" to "this response is always correct for this type of question". In short, inductive reasoning.
Interesting. I can’t imagine anything that answers to words like “learning” or “reasoning” without some kind of consciousness or subjectivity - they are engaged, intentional, living processes. I guess that could mean that this is an artifact of language, where the words and their connotations obscure a real possibility. Alternatively (my preferred option), there’s no way an LLM could achieve anything called “learning” or “reasoning” precisely because it cannot have subjectivity or consciousness.
Do people remember a few years ago when there was a little science-comms meme of appending '...in mice' to exciting biomedical headlines to tether and contextualize everyone's expectations? I feel like there needs to be a similar exercise for these sorts of moments with LLMs: '...on school tests.' Sure, cool, interesting, novel (except for me sitting over here on the side petting Wolfram Alpha, which has seemed to have superhuman math skills, including explanations, for years without upending the universe), but is there any field of human endeavor where we would expect a computer to outperform humans more than in math, especially school math presented as explicit problems? Meanwhile they don't consistently add right, and I have never had an LLM so much as reformat a list without inventing a lie, not once. As per usual, it's the implication that this 'means' something beyond the fact that heaps of engineering can make powerful tools that leads to the fundamental grossness.
Getting perfect marks on the first five problems and then a zero on the much harder sixth reminds me of the Apple paper on the Tower of Hanoi, where the LRM did great on smaller versions of the puzzle but failed on larger ones. As if there’s a breaking point where they just lose their grasp on their internal model and simply degrade without warning.
That’s an interesting possibility.
Or maybe they merely had encountered the first 5 in their vast training database but not the 6th.
Without being able to inspect the database, it’s not possible to rule out the latter possibility.
It’s really rather unscientific (ridiculous, actually) that a group of people (at Google and OpenAI) who call themselves “scientists” force the actual scientific community to guess about all this stuff.
It is extremely unlikely that these problems were in a database.
Please quantify “extremely unlikely” with a probability and please tell us what you are basing the probability on.
But regardless, I would simply reiterate my central point that the only way to be sure that none of the problems was in the training set is to look at the training set.
That would be the scientific approach. Do you have a problem with it?
The hiding of the training/inner working is quite interesting, I'd agree.
That said, I thought the problems were novel? But I guess you are saying that, even so, they are potentially composed of things the models had been trained on... fair point.
And indeed, we are all guessing, to your point.
Without seeing the training data, one actually can’t rule out the “trained on” possibility for the “Tower of Hanoi” problems that Apple considered, either.
It is at least possible that the bots are merely mimicking solutions to versions of the problem they have encountered in training but unable to provide correct solutions for versions they have not trained on.
Occam’s razor would actually favor this possibility.
A "moon landing" moment, that is to say, entirely fake?
It’s worth pointing out that a moon crash landing is still a moon landing.
I know, I know, but the joke was just so easy.
Was any version of any of these problems in the training set of either OpenAI’s or DeepMind’s bot?
I am NOT suggesting that they were provided the problems ahead of time but that there is another possibility.
Presumably the IMO problems (or some version thereof) had already been solved by mathematicians at some time in the past (that is, they were not actually unsolved problems).
These bots have undoubtedly trained on such a large number of mathematics problems that it would actually be surprising if none of the IMO problems (or versions thereof) had been encountered during training.
(The same could be said for the human contestants, but it is much less likely that they have encountered the problems than the bots, because of the sheer number and scope of problems that the bots have processed during training.)
In order to know how well these bots ACTUALLY did, one must NECESSARILY have search access to the training database to verify that none of the problems (or versions) were trained on.
Absent such access, it’s impossible for an outsider to conclude much of anything.
Simply accepting the assurances of these companies is absurd and the fact that neither OpenAI nor DeepMind allows such access to outsiders means they are not doing science.
Let me know when 1) one of these bots solves a problem that the mathematical community considers unsolved, 2) one of these companies opens up their training data for outside inspection, or 3) Hell freezes over.
My money is on number 3 happening first.
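To be clear about what “search access” would mean in practice: even a crude n-gram overlap check over the training corpus would settle most of this. Here is a rough sketch, assuming one could stream the training data (corpus_lines is hypothetical, and the n-gram length and normalization are arbitrary choices, not anyone’s actual methodology):

```python
import re
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercase, strip punctuation, and return the set of word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(problem: str, corpus_lines: Iterable[str], n: int = 8) -> float:
    """Fraction of the problem's n-grams found anywhere in the corpus.
    A high score suggests the problem (or a close paraphrase) was trained on."""
    target = ngrams(problem, n)
    if not target:
        return 0.0
    seen: Set[Tuple[str, ...]] = set()
    for line in corpus_lines:            # stream the training data
        seen |= ngrams(line, n) & target
        if len(seen) == len(target):     # everything already matched
            break
    return len(seen) / len(target)
```

None of which can actually be run, of course, unless point 2 ever happens.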
If these bots are REALLY as smart as PhDs (as some — and Sam — like to brag), why are they even entering high school math competitions? (Even the most difficult ones)
Wouldn’t that normally be disallowed? Or at least frowned upon?
Isn’t giving a gold medal to a PhD for a high school test sort of ridiculous?
And isn’t accepting the gold even more ridiculous?
Have these bots NO shame?
If I got 5 out of 6 problems right on a difficult math (or any other) test having seen the problems and solutions before, few would be impressed.
And for good reason. Seeing the problems and answers ahead of time is commonly referred to as cheating. (is that why they are called “cheatbots”?)
So why is anyone impressed by the performance of an AI without knowing if they have “seen” the problems before?
I'm a mathematician and a developer, and here are my thoughts: how is it possible that no publicly available model is anywhere close, yet the advanced version of Gemini Deep Think and OpenAI's experimental reasoning LLM walk away with International Mathematical Olympiad gold medals? Something doesn't add up. If you check out MathArena (a platform for the evaluation of LLMs on the latest math competitions and olympiads), among publicly available models Gemini 2.5 Pro achieved the highest score with an average of 31% (13 points), and o3 scored just 16.67% on the same IMO 2025 competition.
https://mathmindset.substack.com/p/issue-7-ais-gold-medal-math-hype
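For anyone translating those percentages back into points: the IMO has six problems worth 7 points each, 42 in total, so a quick check looks like this (the 35/42 line is just the five-perfect-problems, zero-on-the-sixth result reported for the gold-medal systems):

```python
TOTAL = 6 * 7   # six IMO problems, 7 points each

print(f"Gemini 2.5 Pro: 13/{TOTAL} = {13 / TOTAL:.1%}")   # about 31.0%
print(f"o3:              7/{TOTAL} = {7 / TOTAL:.2%}")    # 16.67%
print(f"5 perfect of 6: 35/{TOTAL} = {35 / TOTAL:.1%}")   # about 83.3%
```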
The answer is that they likely threw 10-100x the compute. Usual brute force stuff.
I believe it was like a kid having 10 math tutors helping them pass the exam and whispering "You should try this method", "No, this formula won't work, try these 2 instead" etc.
I think your characterization is spot on. Further, I’d guess that it would take a lot less effort to get the human participants to 6/6 than it would the “AI”. Of course, that’s an unfair comparison, since the humans have actual intelligence as opposed to a ridiculously huge token-sequence model which can do neat tricks - sometimes - but which is completely devoid of any actual knowledge model, and so treats each new fact as any other token sequence (kind of like a really bad journalist who just reports whatever they hear: “… while others believe the earth is flat, and some believe it is all just a dream”).
Keep in mind also that the participants at IMO25 are high schoolers, not mathematicians with PhDs who were likely giving tips and hints to models :)
But Sam Altman said their bot is as good as PhDs in a wide variety of disciplines.
So shouldn’t it have actually aced the test?
Or maybe their bot is as good as PhDs in such hard science disciplines as theoretical tiddly winks but only as good as a high schooler in math?
It’s all so inconsistent.
And inconsistency tells you there is something wrong with the story.
"What allows them to do so much better than earlier systems?" Brute force. In other terms, an insane amount of computation. Deepmind talks indeed about parallel thinking, and exploration of multiple paths. This is the typical sign of desperation displayed by a computer scientist. LLMs were very efficient, and now we go back to the usual brute-force approach. They think the same techniques that worked with chess and Go will succeed with mathematics. But it is highly unlikely. Nobody will pay many thousands of dollars for solving high school problems. Humans don't work like this!
Lucky for ChatGPT it wasn’t the 5th grade spelling bee.
That writing style is psychotic. Did they train the model on the unpublished notes of John Forbes Nash?
Presumably, the OpenAI and Google bots trained on the mathematical output of a huge number of mathematicians, which is why one really needs to see the training data to be sure it didn’t include any of the IMO problems — to ensure that the bots were not merely mimicking the solutions to problems encountered during training.
Opening up the training data would be the scientific thing to do but that is precisely why it is not done.
Can’t have people inspecting the training data to verify claims, now can we?
It would be fascinating to get your take on this, Dr. Marcus.
https://www.scientificamerican.com/article/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai/
If there's one thing a human-level AGI should excel at, it's maths.
And telling tall tales.
The bots got that wired.
The conversion of a descriptive problem into FOL can be tedious for humans, but in the IMO the problems are known. In the real world, the knowledge needed to form constraints is broad and deeply buried in text. Once converted to FOL, and in the case of a knowledge graph, most constraints can be viewed as SOL reasoning about sets instead of enumerating all cases. In both FOL (predicate logic) and SOL, the only computations needed are MAX and MIN. This aicyc article explains in detail how this works and how it is implemented in Hadoop MapReduce on AWS.
http://aicyc.org/2025/03/13/how-sam-thinks/
TL;DR: Semantic AI Models (SAM) and Large Language Models (LLM) work together to create a distributed inference system that mimics human cognition. SAM extracts facts and concepts from text to build a knowledge graph. It then directs the LLM to represent this knowledge in second-order logic (SOL) expressions. These SOL expressions can be computed efficiently using fuzzy logic operations like MIN and MAX in a cloud environment like AWS S3 with Hadoop MapReduce. This allows the knowledge graph to reason and infer new knowledge similar to how humans think, bringing us closer to artificial general intelligence (AGI). The approach is contrasted with neuro-symbolic AI, which is less transparent and harder to guarantee correctness compared to the explicit logic used by SAM and LLMs.
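If I understand the claim, the core computation is just fuzzy conjunction and disjunction over truth degrees attached to triples. A minimal illustration of that MIN/MAX idea in Python (the triples and the rule are made up for illustration; this is my reading of the claim, not SAM's actual implementation):

```python
# Truth degrees attached to (subject, predicate, object) triples.
kb = {
    ("water", "is_in", "container"): 0.9,
    ("container", "has_hole_in", "bottom"): 0.8,
}

def AND(*vals: float) -> float:   # fuzzy conjunction
    return min(vals)

def OR(*vals: float) -> float:    # fuzzy disjunction
    return max(vals)

# Illustrative rule: water leaks out if it is in a container AND the
# container has a hole in the bottom.
kb[("water", "leaks", "out")] = AND(
    kb[("water", "is_in", "container")],
    kb[("container", "has_hole_in", "bottom")],
)
print(kb[("water", "leaks", "out")])  # 0.8: the weakest premise bounds the conclusion
```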
Here is an example: Davis and Marcus wrote about common-sense facts lacking in LLMs. SAM read that paper and prompted the LLM to write the SOL and FOL with facts from triples (subject, predicate, object), not only in the paper but in every domain affecting the problem statement. Here is that paper:
http://aicyc.org/2025/04/24/gary-marcus-common-sense-problem-and-sams-symbolic-ai-solution/
Preface: This post examines a paper Ernest Davis and Gary Marcus developed for qualitative reasoning about containers. The purpose is to challenge the claims that generative AI alone is complete and stores all information necessary to model human-like reasoning.
Citation: Davis, E. and Marcus, G. (2017).
From a real-world view, a symbolic AI model makes the IMO results of LLMs look like a child playing in a sandbox. And SAM can prove it is right and explain why.
The SAM ontology with 10 million topics is commercially available at intellisophic.net.
Lots of questions, indeed.
What is clear is that the AI is getting smarter, and likely under the hood there is deeper context-dependent modeling, rather than just token prediction.
Gemini can already create and run code on the fly to deepen its understanding. Likely AI will get better at self-play, as AlphaGo did.
So, we'll keep on seeing more reliable AI that can do more useful work.
Re-liable?
Able to lie again?
Yes, I think we are seeing that.
Why on earth would it be notable that a computer is good at math?
Freddie -- Finding proofs is a very different matter from doing calculations. The IMO problems are not easy even for professional mathematicians to solve quickly and having access to calculators, computers, or the internet is almost no help at all. As we mentioned in our piece, a group at "MathArena" tested the most powerful general purpose AIs on these problems and they all did dismally. https://matharena.ai/ Happy to discuss this further by email if you want. -- Ernie
It all depends on the questions. Einstein's great insight is the idea that matter tells space how to curve while space tells matter how to move. Of course AI could now come up with this by simply checking on things that Einstein said. But could AI have come up with this de novo, and before Einstein...
so ...?