2024-12-22

Kevin Roose, of Hard Fork and NYT, was so impressed with OpenAI’s rollout that he joked “of course they have to announce AGI the day my vacation starts”.

For many people, what sealed the deal, or led them to conclude, wrongly, that o3 necessarily “must be a step to AGI”, was o3’s performance on @fchollet’s ARC-AGI.

Yesterday (after my last post) a war erupted over what was actually done. Here’s what you should know:

1. As NYU prof Brenden Lake pointed out, the test should never have been called ARC-AGI. Even Chollet acknowledged this in his blog, saying “it’s not an acid test for AGI”. At *most* the test is necessary for AGI; it certainly isn’t sufficient. Critical things like factuality, compositionality, and common sense aren’t even addressed.

2. The video should have been much clearer about what was actually tested and what was actually trained. To the average listener it may have sounded like the AI took the test cold, with a few sample items, like a human would, but that’s not actually what happened.

3. What was actually done - pretraining on what I believe was hundreds of public examples - is NOT comparable to what humans require. Such pretraining is not uncommon in the field, but was not made clear in the video. Altman saying that the test wasn’t “targeted” added to the confusion.

4. Because of the pretraining, and lack of comparability, what was actually shown was disappointing. Thom Wolf, cofounder of HuggingFace, wrote “people commenting that it's normal to train on the train set but somehow I would have expected/hoped that as we're nearing AGI-level capabilities we would not need to really fine-tune/specifically train the model on any specific downstream task”

5. Two graphs, one presented by OpenAI and one by Chollet, were misleading. As DeepMind’s @olcan pointed out, the Chollet blog version made the breakthrough seem bigger than it really was by omitting results from others like the @jacobandreas lab at MIT. The same was true of the OpenAI graph: the MIT work (halfway in between o1 and o3) and many others’ results weren’t shown, making the breakthrough relative to the field seem far bigger than it really was.

6. As the scientist Adan Becerra, PhD put it (and Chollet publicly agreed), the best thing would have been to present data for the “base model” without the pretraining. This is what many people thought they saw, and that number is important scientifically. Unfortunately the key test was not done.

7. The way in which influencers tried to frame my legitimate criticism as being exclusively about my personal alleged bias was intellectually dishonest. Many others, including researchers from HuggingFace, DeepMind, NYU, and Huawei, and even Chollet himself, shared many of my concerns. Every single point I made was shared by at least one other researcher with a PhD.

Conclusions

⁃ The problem wasn’t the task per se (a fine addition to our benchmark collection), or even how it was administered (legit relative to the test’s rules); it was the impression that OpenAI conveyed, which left many (not all) people believing that more had been shown than actually was.

⁃ We still don’t have a solid test of what o3 does without the pretraining, which is the case that would be more comparable to humans.

⁃ Because the wrong experiment was performed, and key data weren’t given, we can’t compare directly with humans. (And the best humans still outperformed the model.)

⁃ Until there is considerable external scientific scrutiny (so far there has been none), we won’t really know exactly what the o3 advance is or how important it is.

⁃ What we saw is not AGI. Both Chollet and OpenAI’s Anup made this clear, but only after the live video.

⁃ People in the media probably shouldn’t even joke about o3 being AGI. The media should be asking hard questions, not fanning hype.

