o3 "ARC AGI" postmortem megathread: why things got heated, what went wrong, and what it all means
Kevin Roose, of Hard Fork and the NYT, was so impressed with OpenAI's rollout that he joked, "of course they have to announce AGI the day my vacation starts."
For many people, what sealed the deal, and led them to conclude, wrongly, that o3 "must be a step to AGI", was o3's performance on @fchollet's ARC-AGI.
Yesterday (after my last post) a war erupted over what was actually done. Here's what you should know:
1. As NYU prof Brenden Lake pointed out, the test should never have been called ARC-AGI. Even Chollet acknowledged this in his blog, saying "it's not an acid test for AGI". At *most* the test is necessary for AGI; it certainly isn't sufficient. Critical things like factuality, compositionality, and common sense aren't even addressed.
2. The video should have been much clearer about what was actually tested and how the model was actually trained. To the average listener it may have sounded like the AI took the test cold, with a few sample items, as a human would, but that's not what actually happened.
3. What was actually done (pretraining on what I believe was hundreds of public examples) is NOT comparable to what humans require; the sketch after this list contrasts the two protocols. Such pretraining is not uncommon in the field, but it was not made clear in the video. Altman's remark that the test wasn't "targeted" added to the confusion.
4. Because of the pretraining, and the lack of comparability, what was actually shown was disappointing. Thom Wolf, cofounder of HuggingFace, wrote: "people commenting that it's normal to train on the train set but somehow I would have expected/hoped that as we're nearing AGI-level capabilities we would not need to really fine-tune/specifically train the model on any specific downstream task."
5. Two graphs, one presented by OpenAI and one by Chollet, were misleading. As DeepMind's @olcan pointed out, the version in Chollet's blog made the breakthrough seem bigger than it really was by omitting results from others, such as the @jacobandreas lab at MIT. The same was true of the OpenAI graph: the MIT work (roughly halfway between o1 and o3) and many other results weren't shown, making the breakthrough relative to the field seem far bigger than it really was.
6. As the scientist Adan Becerra, PhD, put it (and Chollet publicly agreed), the best thing would have been to present data for the "base model" without the pretraining. This is what many people thought they saw, and that number is important scientifically. Unfortunately the key test was not done.
7. The way in which influencers tried to frame my legitimate criticism as being exclusively about my alleged personal bias was intellectually dishonest. Many others, including researchers from HuggingFace, DeepMind, NYU, and Huawei, and even Chollet himself, shared many of my concerns. Every single point I made was shared by at least one other researcher with a PhD.
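To make the distinction in point 3 concrete, here is a minimal sketch of the two evaluation protocols, written in Python. Every name in it (Task, ToyModel, finetune, solve) is a hypothetical stand-in, not OpenAI's or ARC Prize's actual code; the only point is the difference between taking the test cold and training on the public training set first.

```python
# Hypothetical sketch only; none of this is OpenAI's or ARC Prize's code.
from dataclasses import dataclass

@dataclass
class Task:
    demos: list       # the few demonstration pairs shown inside each task
    test_input: str
    test_output: str

class ToyModel:
    def __init__(self):
        self.memory = {}  # stands in for whatever weights fine-tuning changes

    def finetune(self, tasks):
        # Used only in protocol B: absorb the public training set up front.
        for t in tasks:
            self.memory[t.test_input] = t.test_output

    def solve(self, task):
        # A real solver would generalize from task.demos; the toy just
        # answers from memory, which makes the protocol difference visible.
        return self.memory.get(task.test_input)

def accuracy(model, tasks):
    return sum(model.solve(t) == t.test_output for t in tasks) / len(tasks)

def evaluate_cold(model, eval_tasks):
    # Protocol A: the model meets each task "cold", with only its in-task
    # demos, roughly how a human takes the test, and what many viewers
    # assumed was being measured.
    return accuracy(model, eval_tasks)

def evaluate_after_training(model, public_train_tasks, eval_tasks):
    # Protocol B: train on hundreds of public tasks first, then score.
    # Legal under the benchmark's rules, but not comparable to protocol A.
    model.finetune(public_train_tasks)
    return accuracy(model, eval_tasks)
```

Scores are only comparable when they come from the same protocol; point 6 is precisely the complaint that the protocol-A number for o3 was never reported.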
Conclusions
- The problem wasn't the task per se (a fine addition to our benchmark collection), or even how it was administered (legit relative to the test's rules); it's the impression that OpenAI conveyed, which left many (not all) people believing that more had been shown than actually was.
- We still don't have a solid test of what o3 does without the pretraining, the condition that would be more comparable to humans.
- Because the wrong experiment was performed, and key data weren't given, we can't compare directly with humans. (And the best humans still outperformed the model.)
- Until there is considerable external scientific scrutiny (so far there has been none), we won't really know exactly what the o3 advance is or how important it is.
- What we saw is not AGI. Both Chollet and OpenAI's Anup made this clear, but only after the live video.
- People in the media probably shouldn't even joke about o3 being AGI. The media should be asking hard questions, not fanning hype.