50 Comments
Sep 12 · Liked by Gary Marcus

Remind me, we need this, why?

So one percenters can live effortlessly in their satellites cruising high above the fray.

To say we do not need agents that work through tricky problems step by step shows a failure of imagination.

I use the models to generate code and to search for research papers (like perplexity.ai). They make a good search engine for approximating research.

I really, really can't think why we need this.

I am VERY curious about whether they have started to incorporate symbolic reasoning or some remnant of good old fashioned AI here. In fact, I thought of you when reading the Verge coverage. "The model hallucinates less" hardly raises the bar. https://www.theverge.com/2024/9/12/24242439/openai-o1-model-reasoning-strawberry-chatgpt

Hallucinating less also means that it stays on whatever dumf**k track it lands on. This is getting worse with Google. It picks some stupid guess about what an ignoramus might have mistyped and... that's the end.

Does anyone notice that all the chatbots do is provide a straight-through response to a human query? That fact by itself is evidence that the chatbots (CoT like o1 or not) harbor no understanding whatsoever of the subject matter in the original human query. Notice they never ask you a question back for clarification. Why? Because they are not trying to understand. They are just trying to produce the next token. The whole LLM architecture is just for show. There is no substance.

Interesting.

The chatbots never ask for clarification because they are clairvoyant.

It’s part of their training, along with spoonbending.

Word has it that Sam Altman is in charge of the spoonbending training.

In fact, they know what you are going to say before you even say it, so prompts are really superfluous.

Wow. This is one of those things that seems so obvious when you hear it, even though I'd never heard it pointed out before. I have had GPT4 ask me questions, but never for the purpose of clarifying what it is that I'm asking. It's only ever happened when it's being "chatty", which I assume comes about from the reinforcement learning.

While impressive on some tasks, it seems just as fragile in some of the familiar places. Doesn’t hurt to remind ourselves that last year The Information and Reuters, right after Sam Altman's ouster, spread the rumor that a model called Q* could "threaten humanity." And OpenAI was happy to ride that wave. A year later here it is… the first “reasoning” model. A tweet from Clem, CEO of HuggingFace, posted after today’s announcement, I think said it all: https://x.com/ClementDelangue/status/1834283206474191320

Sep 12 · Liked by Gary Marcus

Sounds about right. The claims about math tests sounded like more of the same. Thanks as always for keeping things science based, Gary.

Sep 12 · Liked by Gary Marcus

Side note: I found out about it because an IT guy wanted me to know it can count the Rs in strawberry. Groundbreaking stuff.

Sep 12 · edited Sep 12

The chatbot will be able to work step-by-step and explain what it did. That is a huge deal.

Yes, although the way they’ve made it fake “hm, I’m thinking” like a person is another creepy choice by OpenAI. The affordance to view the work is a win.

In their blog post, "Open"AI said explicitly that they would actually be hiding the "chain of thought" from end users, providing only a model-based summary (which of course need not be accurate).

Hiding the chain of thought?

Since there is no actual thought involved, hiding it is pretty easy.

It’s just like hiding empty space.

Touché.

I'm just using the anthropomorphized lingo they use. Call it whatever you like, all those tokens they generate between the query and final answer are hidden from view.

To be fair to "Open"AI, the example in the blog post, if real, was kind of impressive since it involved decoding a just-slightly-nontrivial cipher which encoded the sentence "there are 3 r's in strawberry". So it solved a modestly interesting puzzle (but didn't actually count the r's in "strawberry" 😂). But these examples are always cherry-picked (er, strawberry-picked) for the advertising material.

It seems these new OpenAI models are using a similar approach to Google DeepMind's AlphaProof and AlphaGeometry. This approach combines LLMs (Large Language Models) with a theorem prover (based on symbolic logic), reminiscent of the classic AI meta-algorithm "generate-and-test." However, they've added a "train" step that uses solutions validated by the theorem prover to fine-tune the LLM through low-rank adaptation (LoRA). This avoids the need to retrain the entire pre-trained model. My LinkedIn post: https://www.linkedin.com/pulse/strawberry-alphaproof-gofai-rescue-generative-ai-claude-coulombe-kxane/
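To make the "generate-and-test" idea concrete, here is a toy sketch in Python. It is not OpenAI's or DeepMind's actual pipeline, which this thread can only guess at. The `propose` function is a hypothetical stand-in for an LLM, `verify` is a hypothetical stand-in for a symbolic checker, and the "train" step is reduced to collecting the verified pairs that a fine-tuning run might consume.

```python
import random

def propose(problem, rng):
    """Hypothetical stand-in for an LLM: emits candidate answers in no particular order."""
    lo, hi = problem["search_range"]
    candidates = list(range(lo, hi + 1))
    rng.shuffle(candidates)
    return candidates

def verify(problem, candidate):
    """Hypothetical stand-in for a symbolic checker: an exact test, no guessing."""
    return problem["a"] * candidate + problem["b"] == problem["c"]

def generate_and_test(problem, seed=0):
    """Keep only the candidates that the checker accepts."""
    rng = random.Random(seed)
    return sorted(x for x in propose(problem, rng) if verify(problem, x))

# Toy problem: solve 3x + 4 = 19 over the integers 0..10.
problem = {"a": 3, "b": 4, "c": 19, "search_range": (0, 10)}
solutions = generate_and_test(problem)

# The "train" step would fine-tune the generator on verified pairs like these.
training_pairs = [(problem, x) for x in solutions]
```

The point of the pattern is that the generator is allowed to be wrong most of the time, because only answers that pass the exact check are kept.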

The longer response time could be explained by trying several solutions and then selecting one by majority vote. That's a well-known "advanced prompting" technique. OpenAI is clear on that point... 😉 in their post, which is probably ChatGPT-generated. 🙂
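For readers unfamiliar with majority voting over sampled answers (often called self-consistency), a minimal Python sketch follows. Whether o1 actually does this is speculation in this thread, and the sampled answers below are made up for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical final answers from five independent samples of the same prompt.
sampled = ["3", "2", "3", "3", "2"]
answer, agreement = majority_vote(sampled)
```

Each extra sample costs another full generation, which is one way a system like this could trade response time for accuracy.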

They should really rename "Chain of Thoughts" to "Chain of Hallucinations".

What I am curious about is to what extent improvements on the narrow assessment tests actually translate into recognisable common sense improvements.

No matter how leaky the training sets and metrics are, and how narrow the testing, the "look how big the bars are" effect is undeniable and does rock my scepticism each time.

I suppose my question is how can they (the models) keep climbing up these ever-newly-appearing metrics and yet remain so unimpressive to use...

I work in the Tech Dept. for a non-tech business, and it is getting quite tedious. Internal calls about how amazing AI is followed by hands-on from staff who try it, find it cumbersome or too-generic, and return to their actual jobs.

The sooner this implodes, the better.

I work in education. It is even worse here.

I raise this glass to the impending implosion. Hear, hear.

Watching the AI news these days is just painful. Vast sums of money wasted on AGI kindergarten, while actual AGI research withers on the vine. "We have named our species Homo sapiens — the wise human. But it is debatable how well we have lived up to the name." (Harari, "Nexus", 2024).

Honestly, any way you slice it, this is just a recalibration of Altman/OpenAI hype. Pure bullshit. There are laws in nature we can’t get around that make things so. So I wonder why we waste time with science fiction?

“there's this question which has been debated in the field for a long time: what do we have to do in addition to a language model to make a system that can go discover new physics?"

“In addition to”?

What if being an LLM chatbot (predicting the next token in a sequence based on statistics of what has been produced before) is fundamentally incompatible with being a theoretical physicist whose job it is to come up with new physics?

The proposition that the two might be incompatible doesn’t seem particularly outlandish.

That plot on the right makes me wanna puke. o1 outperforms "expert humans" on "PhD level science questions"? And it does this by... predicting the next token over and over and over? I have a guess as to how it accomplishes this, and it's got f all to do with knowing anything about science.

“PhD level science questions”?

According to whom?

Sam Altman, the undergrad dropout?

One thing is certain: what they are doing at OpenAI is certainly not “science” by any stretch of the word (not even freshman level)

To even call it “engineering” is a very long reach.

Tweaking an LLM whose inner workings they have close to (if not precisely) zero understanding of is not science OR engineering, at least not to anyone who actually studied either of these beyond the junior high level.

These people just distract from other work that makes significant contributions. By inserting reasoning tokens, playing with the temperature parameter, etc., they just created a big messy hack that gets us nowhere, trying to justify their prices. It’s MSDOS on steroids. They carried their three-character limit all the way into the internet era: “htm”. Just bad hacks.

If "we need another breakthrough", why do so many believe that this breakthrough is coming very soon? I see no reason at all to believe that.

Some undoubtedly believe it because con artists have led them to believe that it is so with incessant hype.

…and they are suckers (especially the investors who stand to lose billions)

Yeah, but even Gary writes as if this is going to happen reasonably soon. I just don't see any particular reason to think that.

See https://platform.openai.com/docs/guides/reasoning for a description of how it uses an LLM to parse the input prompt and then (presumably via a hallucinating transformer) creates a form of chain of thought series of "reasoning tokens" that is added to the original prompt and re-input to the transformer to further hallucinate the final output.

You get charged for all the reasoning tokens, even though you do not see them, as that page warns that you could get up to 25k reasoning tokens. It seems that they also fix the temperature setting at 1 for these preview and mini versions, which is, as we all know, the setting for being away with the fairies.

This smacks of the use of GPT4 to create the prompts for DALL-E3 in Sora, hallucinations compounding hallucinations. A recipe for catastrophe.
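On the temperature setting mentioned above, a small Python sketch of what the parameter does in standard softmax sampling may help. The logits are made up, and this says nothing about how o1 is configured internally. At temperature 1 the model's scores are used unchanged; lower values sharpen the distribution and higher values flatten it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.5)  # favours the top token more
plain = softmax_with_temperature(logits, 1.0)  # the unmodified distribution
flat = softmax_with_temperature(logits, 2.0)   # spreads probability out
```

So a fixed temperature of 1 means sampling from the distribution as the model produced it, neither sharpened nor flattened.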

I don't understand much, except that everything seems to sound like "maybe".

It's all about venture capital, no?
