Discussion about this post

Gerben Wierda:

The simplest way to describe what the o1/o3 GPTs probably do is this: instead of being finetuned on generating output directly, they are finetuned on generating CoT steps (in a number of specific domains). For this they have been RLHF'd.

So instead of approximating the textual results, they approximate the textual form of the reasoning steps. They also heavily feature the generation of multiple continuations in parallel, with evaluation functions and pruning of lower-scoring ones. Both approaches substantially grow the amount of basic transformer work (hence the expense). But it remains the same approximation approach, though because it approximates 'reasoning text' it works better for (specific) 'reasoning tasks'. This improves the approximation quality *on those specific tasks*. But the price is inference cost and more brittleness. Which is why it isn't *generally* better. On some tasks, standard 4o will do better, for instance those that revolve around regenerating meaning from training material, where not so much reasoning but 'memory' is required.
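A minimal sketch of that "generate several continuations, score, prune" loop, assuming a beam-search-like shape. All names, widths, and the scoring function here are illustrative stand-ins, not OpenAI's actual design:

```python
import random

def propose_steps(partial_chain, n):
    """Stand-in for the model sampling n candidate next reasoning steps."""
    return [partial_chain + [f"step-{len(partial_chain)}-{i}"] for i in range(n)]

def score_chain(chain):
    """Stand-in for a learned evaluation function over a partial chain.

    Deterministic per chain so pruning is stable within a run.
    """
    random.seed(hash(tuple(chain)) % (2**32))
    return random.random()

def search(depth=3, samples=4, beam=2):
    """Expand each surviving chain into `samples` continuations,
    then keep only the `beam` highest-scoring ones at every depth."""
    beams = [[]]  # start from the empty chain of reasoning steps
    for _ in range(depth):
        candidates = [c for b in beams for c in propose_steps(b, samples)]
        candidates.sort(key=score_chain, reverse=True)
        beams = candidates[:beam]  # prune lower-scoring continuations
    return beams[0]  # best-scoring full chain
```

Note that each depth step costs `beam * samples` model calls rather than one, which is the "substantially grow the amount of basic transformer work" point above.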

Impressive engineering, certainly, but not a step to AGI, as the core remains 'without understanding, brute-force approximating the results of understanding'.

(My quick educated guess, this)

Bruce Olsen:

I worked in database R&D for decades, and for a time the big relational vendors (Oracle, Informix, Sybase, Ingres) were competing on performance. This led to the creation of a variety of vendor-neutral benchmarks. One might simulate a simple banking transaction; another would simulate a high-volume, bulk database update.

Before long every vendor started creating what were called "benchmark specials" by tailoring their software to the needs of each benchmark suite. Sometimes these capabilities were not considered safe (typically because they could compromise database integrity), so they couldn't be added to the mainline product; rather, they'd be activated by secret parameters specified at the start of the benchmark run.

The entire Valley has known about benchmark specials for decades. There's a famous 1981 case where a company called Paradyne created fake hardware in order to win a $115 million Social Security contract. The SEC said one device was “nothing more than an empty box with blinking lights” (Paradyne countered by claiming the “empty box” was intended to show how the final product would work).

So I wouldn't rule out the creation of a "benchmark special" here.
