44 Comments

The simplest way to describe what the o1/o3 GPTs probably do is this: instead of being finetuned to generate output directly, they are finetuned to generate CoT steps (in a number of specific domains). For this they have been RLHF'd.

So instead of approximating the textual results, it approximates the textual form of the reasoning steps. It also heavily features generating multiple continuations in parallel, with evaluation functions pruning the lower-scoring ones. Both approaches substantially grow the amount of basic transformer work (hence the expense). But it remains the same approximation approach; because it approximates 'reasoning text', it works better for (specific) 'reasoning tasks'. This improves the approximation quality *on those specific tasks*, but the price is inference cost and more brittleness, which is why it isn't *generally* better. On some tasks standard 4o will do better, for instance those that revolve around regenerating meaning from training material, where not so much reasoning but 'memory' is required.
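To make the "parallel continuations plus pruning" idea concrete, here is a minimal sketch, assuming hypothetical stand-in functions (generate_cot_step, score_chain); it illustrates the general technique, not anything OpenAI has disclosed about o1/o3:

```python
# Minimal sketch of a generate-and-prune loop over chain-of-thought steps.
# NOTE: generate_cot_step() and score_chain() are hypothetical placeholders,
# NOT OpenAI's actual implementation; this only illustrates the shape of
# sampling multiple continuations in parallel and pruning low scorers.
import heapq
import random

def generate_cot_step(chain: list[str]) -> str:
    """Stand-in for the finetuned model proposing one more reasoning step."""
    return f"step {len(chain) + 1} (variant {random.randint(0, 9)})"

def score_chain(chain: list[str]) -> float:
    """Stand-in for a learned evaluation function over a partial chain."""
    return random.random()

def search(n_parallel: int = 8, keep: int = 3, depth: int = 5) -> list[str]:
    beams: list[list[str]] = [[]]
    for _ in range(depth):
        # Expand: sample several continuations for every surviving chain.
        candidates = [beam + [generate_cot_step(beam)]
                      for beam in beams
                      for _ in range(n_parallel)]
        # Prune: keep only the highest-scoring partial chains.
        beams = heapq.nlargest(keep, candidates, key=score_chain)
    return beams[0]

if __name__ == "__main__":
    print("\n".join(search()))
```

The point of the sketch is the cost structure: every extra parallel continuation and every extra depth step multiplies the amount of plain transformer work, consistent with the inference expense mentioned above.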

Impressive engineering, certainly, but not a step toward AGI, as the core remains 'without understanding, brute-force approximating the results of understanding'.

(My quick educated guess, this)

Expand full comment

I think it's just multiple battling AIs. Link n LLMs to make the calls. I think that's obvious from the cost. Not much rethinking or creativity.

More LLM! MORE! MOOORRRRRE!

Expand full comment

Moar!!!

Expand full comment

I suspect François Chollet is right that o1/o3 is not just more of the same, but also smart architecture (and, I suspect, tuning geared towards specific benchmarks using those architectural changes). The internal complexity of their transformer must have grown from building ever more *internal parallelism*. A year ago Gemini had 'external parallelism' (run the base LLM multiple times and use a (crude? LLM?) evaluation function to pick the result displayed). This parallelism is now being internalized. I also suspect that parameters are being temporarily adapted during inference. (Or at least, those are the avenues I saw open to the transformer approach a while back, and that some consider likely.)

These are pretty good engineers, and they're pushing the envelope of getting the most out of a fundamentally limited approach.

Expand full comment

My guess is that they've been working on ARC since mid-2023, when it became clear [to OpenAI] that GPT-4 was saturating.

While it is impressive, it is the same kind of impressive as beating Lee Sedol with the power equivalent of a maxed-out coal plant[1].

So OpenAI can reason? If a human could reason with the equivalent of this energy expenditure, I'd expect an FTL drive, the Star Trek transporter *and* a cure for cancer[2] in 2025. But instead this thing still can't color boxes in one of five cases.

[1] While this is by no means meant as derogatory toward Go, its rules are static and hence computationally accessible. Gary is right to call for mastery of those kinds of computer games where both rules and strategies are fluid.

[2] For reference: I fully expect that the first two are physically unattainable, while the last is obviously worth pursuing.

Expand full comment

The reason we humans can do all the things we do with 20W, and very quickly, is that most of our actions are automated too (our mental automation goes by the names convictions, assumptions, and beliefs, and it shows up in lower-level calculations as well). See https://youtu.be/3riSN5TCuoE, the prologue and 34 minutes in.

The obvious reason is that while the stability and reliability of integers (digital) are perfect, the power of the reals is required to really do most of the stuff we do. Our 'malleable hardware' is not very good at the discrete stuff, but you can get from the reals to the discrete. The other way around isn't mathematically possible (except if you fool around with 0/0 😉).

Expand full comment

"The reason we humans can do all the things we do with 20W and very quickly has to do that most of our actions are automated too"

Right, and that's precisely what I was alluding to: transformer token embeddings may be holographic data storage and whatnot, but they are fundamentally discrete in the sense that, at the end of the day, there is *one* token predicted at a time.

Working around this with CoT-ensembles is bound to be terribly inefficient no matter how "efficient" the CoT itself may be.

Expand full comment

Alright. But what you're describing o3 doing is what I described too. We just have different views on whether it represents creativity as opposed to brute force. I'm calling it brute.

Expand full comment

We agree this is brute force (that's an easy one...)

Creativity is a potential outcome, because at its core, the technology employs limited randomness (during next token selection, and maybe also in some other places we don't know of). This is a limited form of creativity. But so is ours. Here too: creativity is a necessary component of intelligence, but not a sufficient one.

Expand full comment

I meant that the human engineers working on o3 weren't being creative in their approach.

Everyday Astronaut is about to post a video of rocket designs from AI that he renders as realistic rockets in Kerbal and tests in silico.

It's very entertaining.

Expand full comment

Search, evaluation, following patterns, and even brute force are part of how people solve hard problems when we lack intuition. Any AGI should be able to do that. Going forward, what needs to be added is more grounding, which can come from honest models.

Expand full comment

I worked in database R&D for decades, and for a time the big relational vendors (Oracle, Informix, Sybase, Ingres) were competing based on performance. This led to the creation of a variety of vendor-neutral benchmarks. One might simulate a simple banking transaction, another would simulate a high volume, bulk database update.

Before long every vendor started creating what were called "benchmark specials" by tailoring their software toward the needs of each benchmark suite. Sometimes these capabilities were not considered safe (because they could compromise database integrity, typically) so they couldn't be added to the mainline product; rather, they'd be activated by secret parameters specified at the start of the benchmark run.
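Purely as a hypothetical illustration (no real vendor's code, all names invented), the shape of such a "benchmark special" is roughly an undocumented startup switch that trades integrity for speed and is flipped on only for the benchmark run:

```python
# Hypothetical illustration only -- not any real vendor's code.
# A "benchmark special": an undocumented startup parameter that trades
# database integrity for speed, switched on only for a benchmark run.
import os

# Secret, never-documented switch read at startup (name invented).
BENCHMARK_SPECIAL = os.environ.get("X_SECRET_FASTPATH") == "1"

def commit(txn: dict, wal: list) -> None:
    if BENCHMARK_SPECIAL:
        # Skip the durable write-ahead log: great benchmark numbers,
        # but a crash here silently loses "committed" transactions.
        pass
    else:
        wal.append(txn)  # the safe path customers actually run

if __name__ == "__main__":
    log: list = []
    commit({"acct": 1, "delta": -100}, log)
    print("durably logged:", log)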

The entire Valley has known about benchmark specials for decades. There's a famous 1981 case where a company called Paradyne created fake hardware in order to win a $115 million Social Security contract. The SEC said one device was “nothing more than an empty box with blinking lights” (Paradyne countered by claiming the “empty box” was intended to show how the final product would work).

So I wouldn't rule out the creation of a "benchmark special".

Expand full comment

That's so interesting I reposted it (without your name) on X.

Expand full comment

"So I wouldn't rule out the creation of a "benchmark special"

Everybody who witnessed the GHz race in the early 2000s and its subsequent stalling due to energy limits knows that's exactly what happened to both CPU *and* GPU benchmarks as well.

When you go back and take an honest look at single-core performance over the last decade, it has not increased by more than 100%, while the OEMs would (incorrectly) cite Moore's law (lol) to suggest it should have been *at least* 3200% (a doubling every two years compounds to 2^5 = 32×, i.e., 3200%, over a decade).

Expand full comment

What I think we are seeing are AI researchers surrounding their LLMs with more and more non-LLM algorithms and techniques that, in essence, are inspired by their own human abilities. In other words, they are doing good old heuristic AI programming while telling themselves they are just improving their LLM. Perhaps they will soon realize that it's the heuristics that matter most. I predict that the eventual first AGI will be virtually 100% heuristics. LLMs will be seen as useful tools in their own right but not much to do with AGI.

Expand full comment

LLM is what drives everything though. It is a glorified tool for retrieval and synthesis. It gets used iteratively in tandem with other tools to keep it honest. Lessons learned will go back in the pot.

Expand full comment

It's the stuff keeping the LLMs "honest" that they should be focusing on. Pretty soon they will realize they just need to kick the lying LLM part to the curb.

Expand full comment

LLM plays the role of that fuzzy gut feeling that allows us to make guesswork. Then we flesh it out with detailed work and individual tools.

Expand full comment

> True AGI requires reasoning frameworks that ensure correctness, adaptability, and logical consistency.

This cannot be the bar, as it would imply that humans are not intelligent.

Expand full comment

You are correct of course, though tongue-in-cheek one might answer that nobody claimed humans to be *artificial* general intelligence <:)

Expand full comment

The basic problem is the unreasonable belief that one model is AI and all future developments leading to AGI must use this model no matter how flawed.

Gary has posited another model, symbolic AI, based on symbolic logic, as another AI to reason about the world. This approach can be certified accurate over all fields of knowledge.

Here is an example of a commercially available symbolic (semantic) AI model called SAM.

http://aicyc.org/2024/10/05/how-sam-thinks/

Expand full comment

Beating benchmarks, even very difficult ones, is all fine and dandy, but we must remember that those tests, no matter how difficult, are at best only a limited measure of human ability. Why? Because they present the test-taker with a well-defined situation to which they must respond. Life isn't like that. It's messy and murky. Perhaps the most difficult step is to wade into the mess and the murk and impose a structure on it – perhaps by simply asking a question – so that one can then set about dealing with that situation in terms of the imposed structure. Tests give you a structured situation. That's not what the world does.

Consider this passage from Sam Rodriques, "What does it take to build an AI Scientist": https://www.sam-rodriques.com/post/what-does-it-take-to-build-an-ai-scientist

"Scientific reasoning consists of essentially three steps: coming up with hypotheses, conducting experiments, and using the results to update one’s hypotheses. Science is the ultimate open-ended problem, in that we always have an infinite space of possible hypotheses to choose from, and an infinite space of possible observations. For hypothesis generation: How do we navigate this space effectively? How do we generate diverse, relevant, and explanatory hypotheses? It is one thing to have ChatGPT generate incremental ideas. It is another thing to come up with truly novel, paradigm-shifting concepts. "

Right.

How do we put o3, or any other AI, out in the world where it can roam around, poke into things, and come up with its own problems to solve? If you want AGI in any deep and robust sense, that's what you have to do. That calls for real agency. I don't see that OpenAI or any other organization is anywhere close to figuring out how to do this.

Expand full comment

Not to mention that any such system currently under consideration is a tiny subset not just of the real world, but of humans' own limited senses used to perceive that world. Something is likely to come of this effort, beneficial or otherwise, but it won't be intelligence on a level comparable to what organisms have developed through using all the known, and yet to be discovered, senses, the majority of which are not part of this project. To think that what's knowingly left out, not to mention the part that's left out because we don't even know it exists, will succeed in replicating human-level intelligence is simply hubris.

Expand full comment

o The o3 demo disappointed me in that it didn't show anything happening live on the screen. It just discussed the results. Why was that?

o What do you think of Gemini 2.0 Flash Experimental Thinking? Here's an interesting 3rd party demo. (https://www.youtube.com/watch?v=podMF0FNJac&t=217s) (The link skips the first few minutes that contains a discussion that is not relevant to the system's capabilities. Feel free to rewind to the beginning if you prefer.)

o One thing I find impressive about LLMs in general is their seeming ability to "understand" and "obey" instructions. I haven't seen a discussion of how the transformer mechanism creates this capability. Nor have I seen any studies of capabilities and limitations in this area. That is, to what extent can one really "program" an LLM in English? What are the limitations? (I'm not talking about jail-breaking here.) For example, suppose you provided an English description of an algorithm, e.g., merge-sort. Would any LLM be able to carry out that algorithm and demonstrate, step-by-step, how it works when given some input data? I suspect not--unless it does so by regurgitating a demonstration it has seen in its training. If an LLM does produce such a demonstration is it also capable of answering questions about what it did and why?
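For concreteness, here is what a correct step-by-step merge-sort trace looks like when produced by ordinary code; it is only a reference point for the question above, not a claim about what any LLM can or cannot do:

```python
# A reference trace of merge-sort: the kind of step-by-step demonstration
# the question above asks an LLM to produce from an English description.
# (This only defines what a correct trace looks like; it says nothing
# about whether any LLM can actually reproduce it.)

def merge_sort(xs: list, depth: int = 0) -> list:
    indent = "  " * depth
    print(f"{indent}split {xs}")
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid], depth + 1)
    right = merge_sort(xs[mid:], depth + 1)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    print(f"{indent}merge {left} + {right} -> {merged}")
    return merged

if __name__ == "__main__":
    merge_sort([5, 2, 8, 1, 9, 3])
```

One could hand an LLM the English description plus the input list and check its narrated trace against this output, and then probe it with follow-up questions about why it merged in a particular order.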

Expand full comment

I was surprised at how good the Gemini 1.5 Research was. I gave it a difficult quite nuanced topic to research and it picked the eyes out of it. Better than Perplexity and o1.

Expand full comment

The US has about 500,000 well-educated, clever, creative women who are not allowed to contribute to US society. Give them work permits and some of the money now being spent on AI, and we will show you what the human mind can do. If you just let us.

Expand full comment

I somehow feel reminded of a variant of Campbell's law. Replace 'social' and 'social indicator' with AI benchmarking and AI benchmarks and that's what may have happened, along the lines of: "The more any quantitative AI performance indicator is used for benchmarking, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the processes it is intended to monitor."

Maybe they changed something qualitatively, but I'm afraid OpenAI is just scouring the internet for 4o failures, tweaking their system until it passes, and trying to pass the results off as AGI. It's pure operationalism. Impressive from a software engineering point of view, but no more than a slightly better and more expensive mousetrap.

Even corporate users may be a little skeptical. If the fully-loaded costs (usage and error correction) of using the new system exceed the cost of the cheapest labor they can get, their bottom line will take a hit. Or they will stick with a cheaper model. Or they will go "hybrid": cheap(er) model plus cheap(er) labor = higher profit. For now.

Expand full comment

I'm fairly certain this is not it, as in my understanding the internet doesn't contain any examples of failures on the semi-private evaluation set. https://arcprize.org/arc-agi-pub

Expand full comment

There’s another amorphous thing called common sense. It’s hard to define, but I’ve worked with a lot of bright but narrow minds in my career that seem to lack it. Often these people lack social skills. Or empathy. They can often excel at one task but fail at others (e.g., parenting, relationships). What concerns me is that this AGI may be being developed by the tech bros who value this standardized-test kind of measurement. I don’t think that narrow focus is “better at most human tasks”, unless you define the tasks that narrowly.

Expand full comment

If it was genuinely human-level AGI they would have called it GPT5

Expand full comment

"True AGI requires reasoning frameworks that ensure correctness, adaptability, and logical consistency." 🤔 I would qualify this - that humans manage inconsistency (they juggle different contexts) and do not just follow logical reasoning patterns. And the constraint on flights of fancy is bio-physical, not logical.

Expand full comment

Look forward to trying out o3 and seeing if it does better than o1 which fails simple post-Turing tests.

Expand full comment

May I just say that "clean unfamiliar houses and apartments" might exclude a lot of heterosexual men from AGI?! ;)

Expand full comment

I subscribed, but the newsletter still has a lot of omitted items--like the free version. What's up with that?

Expand full comment

Gary, you are again making a mistake. This is not the end product; this is just the beginning and merely a demonstration of what's possible. I agree that dealing with open-ended problems is maybe not its forte, but the possibility that it could brainstorm with humans to come up with solutions to that is quite real. Instead of perpetually playing devil's advocate, I think your labour, or anybody else's for that matter, would be better employed figuring out ways it could be made to work rather than constantly pointing out where it doesn't.

Expand full comment