44 Comments

The simplest way to describe what the o1/o3 GPTs probably do is this: instead of being finetuned to generate output directly, they are finetuned to generate CoT steps (in a number of specific domains). For this they have been RLHF'd.

So instead of approximating the textual results, it approximates the textual form of the reasoning steps. It also heavily features generating multiple continuations in parallel, with evaluation functions pruning the lower-scoring ones. Both approaches substantially grow the amount of basic transformer work (hence the expense). But it remains the same approximation approach; because it approximates 'reasoning text', it works better for (specific) 'reasoning tasks'. This improves the approximation quality *on those specific tasks*, but the price is inference cost and more brittleness, which is why it isn't *generally* better. On some tasks standard 4o will do better, for instance those that revolve around regenerating meaning from training material, where not so much reasoning but 'memory' is required.
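To make the "parallel continuations plus pruning" idea concrete, here is a minimal sketch, assuming hypothetical stand-in functions (generate_cot_step, score_chain); it illustrates the general technique, not anything OpenAI has disclosed about o1/o3:

```python
# Minimal sketch of a generate-and-prune loop over chain-of-thought steps.
# NOTE: generate_cot_step() and score_chain() are hypothetical placeholders,
# NOT OpenAI's actual implementation; this only illustrates the shape of
# sampling multiple continuations in parallel and pruning low scorers.
import heapq
import random

def generate_cot_step(chain: list[str]) -> str:
    """Stand-in for the finetuned model proposing one more reasoning step."""
    return f"step {len(chain) + 1} (variant {random.randint(0, 9)})"

def score_chain(chain: list[str]) -> float:
    """Stand-in for a learned evaluation function over a partial chain."""
    return random.random()

def search(n_parallel: int = 8, keep: int = 3, depth: int = 5) -> list[str]:
    beams: list[list[str]] = [[]]
    for _ in range(depth):
        # Expand: sample several continuations for every surviving chain.
        candidates = [beam + [generate_cot_step(beam)]
                      for beam in beams
                      for _ in range(n_parallel)]
        # Prune: keep only the highest-scoring partial chains.
        beams = heapq.nlargest(keep, candidates, key=score_chain)
    return beams[0]

if __name__ == "__main__":
    print("\n".join(search()))
```

The point of the sketch is the cost structure: every extra parallel continuation and every extra depth step multiplies the amount of plain transformer work, consistent with the inference expense mentioned above.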

Impressive engineering, certainly, but not a step toward AGI, as the core remains 'without understanding, brute-force approximating the results of understanding'.

(My quick educated guess, this)

Expand full comment

I think it's just multiple battling AIs. Link n LLMs to make the calls. I think that's obvious from the cost. Not much rethinking or creativity.

More LLM! MORE! MOOORRRRRE!

Expand full comment

Moar!!!

Expand full comment

I suspect François Chollet is right that o1/o3 is not just more of the same, but also smart architecture (and, I suspect, tuning geared towards specific benchmarks using those architectural changes). The internal complexity of their transformer must have grown from building ever more *internal parallelism*. A year ago Gemini had 'external parallelism' (run the base LLM multiple times and use a (crude? LLM?) evaluation function to pick the result displayed). This parallelism is now being internalized. I also suspect that parameters are being temporarily adapted during inference. (Or at least, those are the avenues I saw open to the transformer approach a while back, and that some consider likely.)

These are pretty good engineers, and they're pushing the envelope of getting the most out of a fundamentally limited approach.

Expand full comment

My guess is that they've been working on ARC since mid-2023, when it became clear [to OpenAI] that GPT-4 was saturating.

While it is impressive, it is the same kind of impressive as beating Lee Sedol with the power equivalent of a maxed-out coal plant[1].

So OpenAI can reason? If a human could reason with the equivalent of this energy expenditure, I'd expect an FTL drive, the Star Trek transporter *and* a cure for cancer[2] in 2025. But instead this thing still can't color boxes in one of five cases.

[1] While this is by no means meant as derogatory toward Go, its rules are static and hence computationally accessible. Gary is right to call for mastery of those kinds of computer games where both rules and strategies are fluid.

[2] For reference: I fully expect that the first two are physically unattainable, while the last is obviously worth pursuing.

Expand full comment

The reason we humans can do all the things we do with 20W, and very quickly, is that most of our actions are automated too (our mental automation goes by the names convictions, assumptions, and beliefs, and it shows up in lower-level calculations as well). See https://youtu.be/3riSN5TCuoE, the prologue and 34 minutes in.

The obvious reason is that while the stability and reliability of integers (digital) are perfect, the power of the reals is required to really do most of the stuff we do. Our 'malleable hardware' is not very good at the discrete stuff, but you can get from the reals to the discrete. The other way around isn't mathematically possible (except if you fool around with 0/0 😉).

Expand full comment

"The reason we humans can do all the things we do with 20W and very quickly has to do that most of our actions are automated too"

Right, and that's precisely what I was alluding to: transformer token embeddings may be holographic data storage and whatnot, but they are fundamentally discrete in the sense that, at the end of the day, there is *one* token predicted at a time.

Working around this with CoT-ensembles is bound to be terribly inefficient no matter how "efficient" the CoT itself may be.

Expand full comment

Alright. But what you're describing o3 doing is what I described too. We just have different views on whether it represents creativity as opposed to brute force. I'm calling it brute.

Expand full comment

We agree this is brute force (that's an easy one...)

Creativity is a potential outcome, because at its core, the technology employs limited randomness (during next token selection, and maybe also in some other places we don't know of). This is a limited form of creativity. But so is ours. Here too: creativity is a necessary component of intelligence, but not a sufficient one.

Expand full comment

I meant that the human engineers working on o3 weren't being creative in their approach.

Everyday Astronaut is about to post a video of rocket designs from AI that he renders as realistic rockets in Kerbal and tests in silico.

It's very entertaining.

Expand full comment

Search, evaluation, following patterns, and even brute force are part of how people solve hard problems when we lack intuition. Any AGI should be able to do that. Going forward, what needs to be added is more grounding, which can come from honest models.

Expand full comment

I worked in database R&D for decades, and for a time the big relational vendors (Oracle, Informix, Sybase, Ingres) were competing based on performance. This led to the creation of a variety of vendor-neutral benchmarks. One might simulate a simple banking transaction, another would simulate a high volume, bulk database update.

Before long every vendor started creating what were called "benchmark specials" by tailoring their software toward the needs of each benchmark suite. Sometimes these capabilities were not considered safe (because they could compromise database integrity, typically) so they couldn't be added to the mainline product; rather, they'd be activated by secret parameters specified at the start of the benchmark run.
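Purely as a hypothetical illustration (no real vendor's code, all names invented), the shape of such a "benchmark special" is roughly an undocumented startup switch that trades integrity for speed and is flipped on only for the benchmark run:

```python
# Hypothetical illustration only -- not any real vendor's code.
# A "benchmark special": an undocumented startup parameter that trades
# database integrity for speed, switched on only for a benchmark run.
import os

# Secret, never-documented switch read at startup (name invented).
BENCHMARK_SPECIAL = os.environ.get("X_SECRET_FASTPATH") == "1"

def commit(txn: dict, wal: list) -> None:
    if BENCHMARK_SPECIAL:
        # Skip the durable write-ahead log: great benchmark numbers,
        # but a crash here silently loses "committed" transactions.
        pass
    else:
        wal.append(txn)  # the safe path customers actually run

if __name__ == "__main__":
    log: list = []
    commit({"acct": 1, "delta": -100}, log)
    print("durably logged:", log)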

The entire Valley has known about benchmark specials for decades. There's a famous 1981 case where a company called Paradyne created fake hardware in order to win a $115 million Social Security contract. The SEC said one device was “nothing more than an empty box with blinking lights” (Paradyne countered by claiming the “empty box” was intended to show how the final product would work).

So I wouldn't rule out the creation of a "benchmark special".

Expand full comment

That's so interesting I reposted it (without your name) on X.

Expand full comment

"So I wouldn't rule out the creation of a "benchmark special"

Everybody who witnessed the GHz race in the early 2000s and its subsequent stalling due to energy limits knows that's exactly what happened to both CPU *and* GPU benchmarks as well.

When you go back and take an honest look at single-core performance over the last decade, it has not increased by more than 100%, while the OEMs would (incorrectly) cite Moore's law (lol) to suggest it should have been *at least* 3200% (a doubling every two years compounds to 2^5 = 32×, i.e., 3200%, over a decade).

Expand full comment

What I think we are seeing are AI researchers surrounding their LLMs with more and more non-LLM algorithms and techniques that, in essence, are inspired by their own human abilities. In other words, they are doing good old heuristic AI programming while telling themselves they are just improving their LLM. Perhaps they will soon realize that it's the heuristics that matter most. I predict that the eventual first AGI will be virtually 100% heuristics. LLMs will be seen as useful tools in their own right but not much to do with AGI.

Expand full comment

LLM is what drives everything though. It is a glorified tool for retrieval and synthesis. It gets used iteratively in tandem with other tools to keep it honest. Lessons learned will go back in the pot.

Expand full comment

It's the stuff keeping the LLMs "honest" that they should be focusing on. Pretty soon they will realize they just need to kick the lying LLM part to the curb.

Expand full comment

LLM plays the role of that fuzzy gut feeling that allows us to make guesswork. Then we flesh it out with detailed work and individual tools.

Expand full comment

> True AGI requires reasoning frameworks that ensure correctness, adaptability, and logical consistency.

This cannot be the bar, as it would imply that humans are not intelligent.

Expand full comment

You are correct of course, though tongue-in-cheek one might answer that nobody claimed humans to be *artificial* general intelligence <:)

Expand full comment

The basic problem is the unreasonable belief that one model is AI and all future developments leading to AGI must use this model no matter how flawed.

Gary has posited another model, symbolic AI, based on symbolic logic, as another AI to reason about the world. This approach can be certified accurate over all fields of knowledge.

Here is an example of a commercially available symbolic (semantic) AI model called SAM.

http://aicyc.org/2024/10/05/how-sam-thinks/

Expand full comment

Beating benchmarks, even very difficult ones, is all fine and dandy, but we must remember that those tests, no matter how difficult, are at best only a limited measure of human ability. Why? Because they present the test-taker with a well-defined situation to which they must respond. Life isn't like that. It's messy and murky. Perhaps the most difficult step is to wade into the mess and the murk and impose a structure on it – perhaps by simply asking a question – so that one can then set about dealing with that situation in terms of the imposed structure. Tests give you a structured situation. That's not what the world does.

Consider this passage from Sam Rodriques, "What does it take to build an AI Scientist": https://www.sam-rodriques.com/post/what-does-it-take-to-build-an-ai-scientist

"Scientific reasoning consists of essentially three steps: coming up with hypotheses, conducting experiments, and using the results to update one’s hypotheses. Science is the ultimate open-ended problem, in that we always have an infinite space of possible hypotheses to choose from, and an infinite space of possible observations. For hypothesis generation: How do we navigate this space effectively? How do we generate diverse, relevant, and explanatory hypotheses? It is one thing to have ChatGPT generate incremental ideas. It is another thing to come up with truly novel, paradigm-shifting concepts. "

Right.

How do we put o3, or any other AI, out in the world where it can roam around, poke into things, and come up with its own problems to solve? If you want AGI in any deep and robust sense, that's what you have to do. That calls for real agency. I don't see that OpenAI or any other organization is anywhere close to figuring out how to do this.

Expand full comment

Not to mention that any such system currently under consideration is a tiny subset not just of the real world, but of humans' own limited senses used to perceive that world. Something is likely to come of this effort, beneficial or otherwise, but it won't be intelligence on a level comparable to what organisms have developed through using all the known, and yet to be discovered, senses, the majority of which are not part of this project. To think that what's knowingly left out, not to mention the part that's left out because we don't even know it exists, will succeed in replicating human-level intelligence is simply hubris.

Expand full comment

o The o3 demo disappointed me in that it didn't show anything happening live on the screen. It just discussed the results. Why was that?

o What do you think of Gemini 2.0 Flash Experimental Thinking? Here's an interesting 3rd party demo. (https://www.youtube.com/watch?v=podMF0FNJac&t=217s) (The link skips the first few minutes that contains a discussion that is not relevant to the system's capabilities. Feel free to rewind to the beginning if you prefer.)

o One thing I find impressive about LLMs in general is their seeming ability to "understand" and "obey" instructions. I haven't seen a discussion of how the transformer mechanism creates this capability. Nor have I seen any studies of capabilities and limitations in this area. That is, to what extent can one really "program" an LLM in English? What are the limitations? (I'm not talking about jail-breaking here.) For example, suppose you provided an English description of an algorithm, e.g., merge-sort. Would any LLM be able to carry out that algorithm and demonstrate, step-by-step, how it works when given some input data? I suspect not--unless it does so by regurgitating a demonstration it has seen in its training. If an LLM does produce such a demonstration is it also capable of answering questions about what it did and why?
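For concreteness, here is what a correct step-by-step merge-sort trace looks like when produced by ordinary code; it is only a reference point for the question above, not a claim about what any LLM can or cannot do:

```python
# A reference trace of merge-sort: the kind of step-by-step demonstration
# the question above asks an LLM to produce from an English description.
# (This only defines what a correct trace looks like; it says nothing
# about whether any LLM can actually reproduce it.)

def merge_sort(xs: list, depth: int = 0) -> list:
    indent = "  " * depth
    print(f"{indent}split {xs}")
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid], depth + 1)
    right = merge_sort(xs[mid:], depth + 1)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged += left[i:] + right[j:]
    print(f"{indent}merge {left} + {right} -> {merged}")
    return merged

if __name__ == "__main__":
    merge_sort([5, 2, 8, 1, 9, 3])
```

One could hand an LLM the English description plus the input list and check its narrated trace against this output, and then probe it with follow-up questions about why it merged in a particular order.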

Expand full comment

I was surprised at how good the Gemini 1.5 Research was. I gave it a difficult quite nuanced topic to research and it picked the eyes out of it. Better than Perplexity and o1.

Expand full comment

The US has about 500,000 well-educated, clever, creative women who are not allowed to contribute to US society. Give them work permits and some of the money now being spent on AI, and we will show you what the human mind can do. If you just let us.

Expand full comment

I somehow feel reminded of a variant of Campbell's law. Replace 'social' and 'social indicator' with AI benchmarking and AI benchmarks and that's what may have happened, along the lines of: "The more any quantitative AI performance indicator is used for benchmarking, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the processes it is intended to monitor."

Maybe they changed something qualitatively, but I'm afraid OpenAI is just scouring the internet for 4o failures, tweaking their system until it passes, and trying to pass the results off as AGI. It's pure operationalism. Impressive from a software engineering point of view, but no more than a slightly better and more expensive mousetrap.

Even corporate users may be a little skeptical. If the fully-loaded costs (usage and error correction) of using the new system exceed the cost of the cheapest labor they can get, their bottom line will take a hit. Or they will stick with a cheaper model. Or they will go "hybrid": cheap(er) model plus cheap(er) labor = higher profit. For now.

Expand full comment

I'm fairly certain this is not it, as in my understanding the internet doesn't contain any examples of failures on the semi-private evaluation set. https://arcprize.org/arc-agi-pub

Expand full comment

There’s another amorphous thing called common sense. It’s hard to define, but I’ve worked with a lot of bright but narrow minds in my career that seem to lack it. Often these people lack social skills. Or empathy. They can often excel at one task but fail at others (e.g., parenting, relationships). What concerns me is that this AGI may be being developed by the tech bros who value this standardized-test kind of measurement. I don’t think that narrow focus is “better at most human tasks”, unless you define the tasks that narrowly.

Expand full comment

If it was genuinely human-level AGI they would have called it GPT5

Expand full comment

"True AGI requires reasoning frameworks that ensure correctness, adaptability, and logical consistency." 🤔 I would qualify this - that humans manage inconsistency (they juggle different contexts) and do not just follow logical reasoning patterns. And the constraint on flights of fancy is bio-physical, not logical.

Expand full comment

Look forward to trying out o3 and seeing if it does better than o1 which fails simple post-Turing tests.

Expand full comment

May I just say that "clean unfamiliar houses and apartments" might exclude a lot of heterosexual men from AGI?! ;)

Expand full comment

I subscribed, but the newsletter still has a lot of omitted items--like the free version. What's up with that?

Expand full comment

Gary, you are again making a mistake. This is not the end product; this is just the beginning and merely a demonstration of what's possible. I agree that dealing with open-ended problems is maybe not its forte, but the possibility that it could brainstorm with humans to come up with solutions to that is quite real. Instead of perpetually playing devil's advocate, I think your labour, or anybody else's for that matter, would be better employed figuring out ways it could be made to work rather than constantly pointing out where it doesn't.

Expand full comment