Kevin Roose, of Hard Fork and NYT, was so impressed with OpenAI's rollout that he joked "of course they have to announce AGI the day my vacation starts".
OpenAI is not an AGI lab, it's a persuading-people-to-give-them-money-on-the-basis-of-some-vague-optimistic-promise lab. That's what they're really good at. That's what the demo is.
an exercise in marketing. Media are incapable of understanding the distinction
I bet the engineers at OpenAI cringe every time Sam Altman and other sales people talk to the media or investors.
Considering all the hyperbolic announcements and selective benchmark releases from frontier labs, I don't even read them; I wait for levelheaded AI experts like Gary and read their take instead.
Ad hominem is the AI'ers' immediate go-to tactic. Next is twisting your words or misquoting to make it seem you said things you didn't say. Then there is always the good old dismissal of your criticisms by saying they can be refuted, like Dreyfus, with "a few simple words" they never get around to writing. Finally, when it becomes apparent, even to them, that the criticisms are justified, they will whine that you are a big meanie who hurt their feelings.
All of this is just comedy for me, btw.
I feel like an observing witness at an evangelical tent revival. Otherwise intelligent people are led by their faith to queue up to reach salvation through a healing blow to the head administered by a frocked sama at the pulpit.
I think Chollet has done great work with ARC-AGI. The fact that a statistical approach has been relatively successful merely demonstrates how far away AGI really is. We don't have algorithms that can be brought to bear on ARC-AGI that approach the problem as a human would. (Or, if we do, their human creators didn't enter the contest.)
I look forward to the next generation of ARC-AGI. I believe one of the team's goals is to create a test that is harder for deep learning algorithms to tackle. Detractors will undoubtedly claim that the new test is unfairly biased against their favorite algorithm, but true fans of AGI will say, "This is the way."
Paul, BINGO.
It's the AI community that keeps complaining that the goal posts keep getting moved. "True fact" - from Day One, the goal [which needs no moving] was to mimic human intelligence. AI has been a grab bag of hype-filled one-offs, never getting anywhere near the original goal.
No, the goal isn't to mimic human intelligence. Humans have a very severe limit on their input - the Four Pieces Limit. We need to transcend this limit.
Jim, please read the ARC contest's fully. The goal *is* to mimic human intelligence. That's true not just for ARC, but for *all* of AI - again, look it up (the Dartmouth Summer Conference on AI, 1955).
contest's pages
Saty
I am sure what you say is true, but humans have a severe limit on their intelligence - the FourPiecesLimit.com. That leads to horrendous mistakes when things get complex - maybe 10 pages of text. We need to do better, and fooling around with statistics to make something work is not the way to do it. It would be useful to explain to the machine in our native language what is expected.
I think we are exactly aligned. LLM is fundamentally flawed. We just have different solutions.
There is a semantic AI model (SAM) that is complementary to large language models (LLMs) and contributes facts and reasoning to the AGI in a transparent way.
http://aicyc.org/2024/12/22/no-agi-without-semantic-ai/
Surrounding an LLM by facts and reasoning is not going to work. When will you guys realize that you need to dump the LLM when it comes to AGI? You will always be working around their problems. It's just word statistics and humans do not reason or understand based on word statistics. LLMs are useful but not when it comes to AGI. An AGI may consult an LLM, say when it needs to generate text in Shakespearian style.
This video (1m19s): [ https://bit.ly/3WuGyxE ] shows a new way to access generative AI using a promptless interface. Learn about the AICYC project [www.aicyc.org] dedicated to ending knowledge poverty.
Wrapping an LLM with facts includes reasoning about those facts.
http://aicyc.org/2024/12/11/sam-implementation-of-a-belief-system/
Or when it needs to concoct a plausible tall tale, perhaps also in Shakespearean style
I agree, but LLMs have the money and the problems. So surrounding them with a semantic AI wrapper that solves some of their most pressing problems is a business decision.
That is precisely what a semantic AI model (SAM) does. Thanks for the boost.
A semantic model should not be seen as complementary to LLMs - if it understands text then it replaces all that an LLM can do. You are comparing something that understands text with something that understands not a word - what does an LLM do with a word that might have 60 meanings ("set") or 80 ("on"). The reliability of an LLM is far too low to be used on anything important - it is no more than an amusing toy.
https://www.activesemantics.com
I don't care what an LLM does. A semantic model or symbolic AI must demonstrate, with formal proof, how it corrects an LLM. That is the case for intellisophic.net products. We are certified by a U.S. Government agency, NIST, and by international agencies.
http://aicyc.org/2023/08/02/llm-ai-hallucination/
http://aicyc.org/2024/10/05/how-sam-thinks/
A semantic AI model is complementary to an LLM as reading is to writing. Your points about polysemy are why an LLM needs SAM. SAM can't write, but it can read and detect errors caused in part by polysemy.
Here is how SAM-1 (intellisophic.net) finds hallucinations:
http://aicyc.org/2023/08/02/llm-ai-hallucination/
Vision Video
https://vimeo.com/1030909563
George, we obviously have very different ideas about the use of semantics. I use four cases:
Robodebt - loss of 1. billion dollars, 2 suicides - lawyers lied to benefit their political masters
Horizon - loss of 1 billion pounds, 4 suicides - "the program never makes a mistake"
Boeing 737 Max - web of lies, loss of 346 lives, Boeing loses at least 20 billion dollars
F-35 - hundreds of billions wasted
A version of the F-35 was meant to land on a carrier, but if it had just taken off and had to land immediately with a full load of fuel, the undercarriage would be smashed. The specification for the undercarriage ran to 3000 pages.
These are problems where the machine has to see the full problem in abstract action (in a way that a human cannot), not cobble together little pieces in an LLM sea of unknowingness.
We use Dempster-Shafer to handle beliefs - maybe what you call lies. Solipsistic reasoning starts in the mind of a single person. Where does it start in a formal model?
http://aicyc.org/2024/12/11/sam-implementation-of-a-belief-system/
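For readers unfamiliar with the method named in the comment above: Dempster-Shafer theory assigns "mass" to sets of hypotheses rather than to single outcomes, and Dempster's rule fuses two sources by multiplying masses on intersecting sets and renormalising away the conflicting part. The following is only a minimal textbook sketch of that rule, not a description of how SAM or any product mentioned in this thread implements it; the two sources, the hypothesis labels, and the mass values are made-up assumptions for illustration.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination.

    m1 and m2 map frozenset hypotheses to masses that each sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y  # mass landing on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise so the remaining masses sum to 1.
    return {h: v / (1.0 - conflict) for h, v in combined.items()}

def belief(m, hypothesis):
    """Total mass committed to subsets of the hypothesis."""
    return sum(v for h, v in m.items() if h <= hypothesis)

def plausibility(m, hypothesis):
    """Total mass not contradicting the hypothesis."""
    return sum(v for h, v in m.items() if h & hypothesis)

# Hypothetical example: two sources disagree about whether a claim is true.
T, F = frozenset({"true"}), frozenset({"false"})
EITHER = T | F  # "don't know"
source_1 = {T: 0.6, EITHER: 0.4}
source_2 = {F: 0.5, EITHER: 0.5}

fused = combine(source_1, source_2)
print(belief(fused, T), plausibility(fused, T))
```

The gap between belief and plausibility is the part of the evidence that remains uncommitted, which is the feature usually cited as the reason to prefer this over a single probability.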
George,
"We use Dempster-Shafer to handle beliefs - maybe what you call lies."
Dempster-Shafer is a dated method, from when we couldn't do it any better. Now, we can bring a document alive using Active Semantics, and the system can find all the inconsistencies, errors, and omissions.
When the belief system is a one-liner, "Do whatever it takes", analysing the belief system is a waste of time.
Some relevant blogs:
Boeing 737 https://semanticstructure.blogspot.com/2024/09/lies.html
Robodebt https://semanticstructure.blogspot.com/2022/12/reading-legislation-and-robo-debt-based.html
A personal experience - fighting over a trademark with Google.
Google's brief says that our use of more than one trademark is "fatal to our cause".
IPA (Intellectual Property Australia) says a trader can use multiple trademarks on the same goods. An example of turning the law on its head.
Google says our software should be restricted to a single field, after registering Gemini, which Google claims to be useful in ten fields.
The arguments are so bad, I did not see how a judge could be persuaded. Google has set up a Confidential channel to the judge, so the other party (me) cannot know what they have told the judge.
Given the reputation for ruthlessness of U.S. law firms, I don't see that belief system analysis is all that useful.
Another example - the specifications for the F-35. Hundreds of billions wasted. The parties involved - DoD, the contractor and many subcontractors, politicians. Whose belief system?
Gary, you hit on a number of key points. I've tried to understand the result rather than review the marketing and social media push by those with a vested interest. But I've come away believing in some clear takeaways, related to the cost of automating human performance on a specific test and the time to complete that test vs. human performance. If one looks closely at the numbers, the high-efficiency version underperformed humans but at a significantly faster speed. Typical automation dynamics there. The downside was the cost of getting to human performance. We can pay a lot of people to solve this test with $350,000.
The notion given by Mr. Wolf of a machine learning program being able to take on any digital task without training specific to the task is total nonsense on its face. How do these guys think machine learning programs work, anyway?
"AGI labs" are trying to create digital golems, the whole research program is 20th century alchemy ("AI" in general is the greatest vaporware project in all of computer science, our philosopher's stone). How many billions of dollars that could have gone to public housing, or a San Fran specific issue, public bathrooms, have instead been blown training toy models that Cali C-suiters believe will give them magic powers if they jam enough reddit threads into it?
How can o3 fail on any ARC-AGI tasks when it supposedly solved ~25% of the FrontierMath problems? Just from the sample published ARC-AGI failures and FrontierMath example problems from their website, the former are basically trivial while the latter can't be cracked by the vast majority of humans on this planet. It's basically like getting 2+2 wrong while correctly solving quantum physics and string theory problems.
Are we quite sure that some shape or form of the FrontierMath problems hasn't been used in training or fine-tuning? After all, AI influencers were impressed by even earlier GPT models solving complex math problems -- except only the ones whose solutions appear on the Internet.
OpenAI has reportedly hired mathematicians to solve math problems, whose solutions are then used to train GPT.
None of OpenAI's claims should be accepted without being independently verifAI'd.
It's actually absurd that a "discipline" that some actually call computer "science" is performed in such an opaque, unscientific way.
It's an embarrassment to legitimate computer scientists - or at least should be.
The way I read Chollet's blog post was that he was strongly implying they fine-tuned on the test data. Both of the sets, public and "semi-private", are without a doubt in OpenAI's datasets. In fact, they've probably logged the semi-private dataset hundreds of times. He also said that he expected o3 to score <30% on the newer version of ARC. Feels more like cheating out of desperation than AGI, but what do I know?
I'm getting the same impression. It doesn't help that another team that scored high wasn't verified because it was not open-source, while OAI was verified despite also not being open-source.
And then you have the limit on compute power that was completely ignored.
I really don't understand why there is not more emphasis on this point about the "semi-private" data set, even from Chollet. There is no guarantee that the "semi-private" set is not in the training set for o3. In fact, it could be in there without OpenAI explicitly training on it (e.g., to "cheat"): someone else could have leaked it to the internet, and OpenAI could be training on it without knowing it. I think the result is still a big deal / impressive, but there is very little discussion of this huge asterisk in the result.
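To make the contamination worry above concrete: one common style of check, which only someone with access to the training corpus could run, is to measure how many word n-grams of a test item also appear in a training document. The sketch below is a toy illustration under that assumption. It is not the procedure used by ARC Prize or OpenAI, and the function names, the whitespace tokenisation, and the default n are all hypothetical choices.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_item, corpus_doc, n=8):
    """Fraction of the test item's n-grams that also occur in the document.

    Tokenisation is a plain whitespace split, which is an assumption;
    a real pipeline would normalise text far more carefully.
    """
    test = ngrams(test_item.split(), n)
    if not test:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus = ngrams(corpus_doc.split(), n)
    return len(test & corpus) / len(test)

# Hypothetical strings standing in for a benchmark item and scraped pages.
item = "a b c d e"
leaked_page = "x a b c d e y"
clean_page = "p q r s t u v"

leaked_score = overlap_ratio(item, leaked_page, n=3)
clean_score = overlap_ratio(item, clean_page, n=3)
print(leaked_score, clean_score)
```

A high ratio flags a verbatim leak; a low one proves little, since paraphrased or reformatted copies of a task would slip past an exact-match test like this.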
We're in an era where public discourse seems to be led by folks willing to make unsubstantiated claims, or willing to encourage others to do so, only to walk it back later and hope that no one notices the deception. In this case, the deception leads to doubling down on investment in a model that likely diverts from research into supplemental techniques and technologies. I don't really care if Microsoft or other AI investors lose money (my exposure to them is minimal and those companies have money to burn), but the deception of the public and erosion of critical thinking about science and technology in the press has a societal price we all pay.
At the same time, these LLMs have huge costs in energy ($1000 a query, in some instances, I have read, at a time when energy prices are actually pretty low) that contribute massively to the overheating of our planet. (AI trolls: Don't waste your time telling me that they use solar or hydro, they still displace other uses, causing dirty plants to stay on line longer.) That is a price we all pay, too.
Thanks for the skepticism. It's much needed.
Your point #7 about influencers being intellectually dishonest made me laugh... As if they ever had the 'intellect' or the 'honesty' to start with.
Hanlon's Razor
Hi Gary, your analysis is spot on, as usual! It's amazing how people are so eager to latch on to terms ('AGI') with no clue as to what they mean.
ARC solving [alone] isn't going to lead to AGI - same as how acing a standard IQ test [alone] isn't a measure of someone's intelligence.
Too bad Francois even made that post about o3 [the disclaimer in it aside]; it ended up adding legitimacy.
By the way [not related to ARC], "common sense reasoning" is an oxymoron unless there is embodiment - the 'sense' part involves direct sensing and perceiving of the physical world, and not symbolic calcs (GOFAI), gradient descent calcs (ML) or cost minimization calcs (RL). *No* AI today does actual common sense reasoning, let alone embark on a path to AGI [whatever the heck that even means].
Typo: "4. ... and ack of ...
Ack! and I ACK! and I fixed, thanks
Isn't what you call pretraining actually finetuning? Pretraining on large corpora gets you the basic language patterns as token statistics. GPT3 was done in 2019. Then they worked for 3 years finetuning to 'sanitise' it enough to be put in the hands of the public (that Kenya stuff...)
As my understanding goes: pretraining is masking tokens on text and calculating token selection.
Or does o1/o3 actually do this in a form of pretraining? But how then does that overcome the pure 'weight' of the other pretraining? Enquiring minds would like to know.
Altman et AI can only hide behind slick choreographed demos with cherry picked pretrained examples for so long.
Eventually, they will have to release o3 to the public and, like Sora befora, o3 will inevitably have its Olympic gymnastics moment, at which point people will realize it is not Simone Biles but the Bride of Frankenstein with her legs sewn into her armpits.
https://arstechnica.com/information-technology/2024/12/twirling-body-horror-in-gymnastics-video-exposes-ais-flaws/
Appreciate your thoughts and time as always.
I think another aspect of this is the continuing and evolving effect on the public. This wasn't, I think, covered nearly as widely as the (whackadoodle) claims of AI consciousness back in 2022. The public may be becoming saturated with these wolf cries. I wonder what it means. Purely in terms of limiting the number of people stressed by worrying about this, I think it is good; hopefully it means public literacy about AI is increasing. Surely it helps to have it right in front of them. My eight-year-old plays chess and wants to know "why chatgpt is so stupid" about playing chess. It will get there of course. But at least these things are becoming less abstract.
I read speculation that Altman's desire to declare AGI has mostly to do with the poison pill clause in the MS contract; wonder what you think about the likelihood of that context. The hype shall continue until morale improves...