Half of the voting population in this country will go to their graves convinced that we spent 50 million dollars on condoms for Hamas because it was said once by the right person. Doesn't matter how wrong it was or how many times it has been and will be corrected. We don't need to invent a deception machine in order to make everyone dumber, but we won't let that stop us. Prayers for Billings.
Just purely out of curiosity, can you point to any credible reporting on that Hamas condom thing? Preferably something that doesn't omit links to source documents. I.e. I don't want just an utterance by "the right person" who in this case would just be more aligned with me than the other guy who said it and was someone else's "right person".
https://www.youtube.com/watch?v=6NaOaodF0lM
Oof. Yeah he's really bullshitting us there.
But also, https://www.usaspending.gov/award/ASST_NON_SPRMCO24VC0339_1900
https://apnews.com/article/gaza-condoms-fact-check-trump-50-million-26884cac6c7097d7316ca50ca4145a82
Intelligence is efficient. As long as we are talking about building nuclear reactors to power what is hoped to be the equivalent of a 12-20 watt human brain, we are on the wrong path.
This is a largely underappreciated aspect of the bubble. The idea behind the investments and valuations is that genAI will create efficiencies and value largely by replacing human labour. But if it is so expensive to train and run that a human is cheaper, even a hypothetical model without 'hallucinations' loses out. 'Training' a human from birth to a university degree costs only a few hundred thousand dollars even in an expensive country, often less in others, and humans are more versatile than any model will ever be.
What you are missing is that the cost of training a large model may be tens of billions, but a smart model can then replace millions of workers who are paid $100k per year. Granted, we are not there yet.
It seems weird that companies keep saying they need more data. Aren't a billion hours of YouTube and every publicly available text enough?
They say it hoping you'll fall for it. See my larger comment elsewhere
Our brains are efficient, yes, but they are also immensely powerful. It is estimated that we pack around 1 exaflop, comparable to a recent supercomputer. So we need to improve the efficiency of hardware, regardless of what algorithms we run on top of it.
After 70 years of intense research, there's no hope for a low-compute architecture. Our best bet for now is to improve AI as much as possible, regardless of how much compute it takes, and then learn how to improve efficiency later.
To be fair, an LLM is more like a near-infinite amount of 12-20 watt human brains. Once you have trained it the limiting factor is server space, not electrical power.
Inference per interaction "per brain" is extremely power-hungry too. Once you have a perfect LLM, however that is defined, you don't have to train it, sure. But you still need energy, and ever-increasing power for concurrent use.
If I give you "Jingle", you give me "Bells" instantly and without energy expenditure. A GPT3-like model has to do circa 180 billion multiplications and 180 billion additions to arrive at the same result. To be done in useful time this requires a lot of parallel computation per second. Newer models are even larger...
And yet, we both had to learn English to answer that question. We both need roughly 2000 calories per day and clothes, etc. If you're gonna criticize AI at least be consistent in your comparisons.
I feel I was consistent: I agreed to set aside any training/learning of the language and compare just the inference.
If you had access to my brain capacity all day at 2,000 kcal per day, that's just under 100 watts of power. My single 3070 GPU eats 220 watts, and if I were to somehow run a full-size LLM on it, we would be talking days per sentence completion, so the total energy balloons enormously.
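The unit conversion behind those numbers, written out (2,000 dietary kilocalories spread over a day versus a 220 W GPU; the figures are the rough ones quoted above):

```python
# Convert daily food energy into average power and compare to a single GPU.
KCAL_PER_DAY = 2000            # rough adult daily intake (dietary kilocalories)
JOULES_PER_KCAL = 4184
SECONDS_PER_DAY = 24 * 60 * 60

whole_body_watts = KCAL_PER_DAY * JOULES_PER_KCAL / SECONDS_PER_DAY
gpu_watts = 220                # RTX 3070 board power, as quoted above

print(f"Whole-body average power: {whole_body_watts:.0f} W")  # ~97 W
print(f"Single RTX 3070:          {gpu_watts} W")
# The brain's share of that ~97 W is usually estimated at roughly 20 W.
```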
Let alone the fact that LLMs can't do all the things Gary mentions in every post that a low-powered, post-training human brain can.
Until we can get a system that says "I don't know" when it doesn't know, I think it's questionable to be sinking money, power and water into it at this exponential scale.
And, yes, we sink lots of money, power and water into maintaining humans, sure. Even if those resources are sometimes misallocated, they create value and life opportunities for the humans who receive them.
LLMs don't need life opportunities, and we don't need to accelerate sinking money into them for civilisation to continue.
And yet, LLMs were trained on a huge amount of intellectual output, so surely the energy involved in all that —as well as all the energy involved in evolving single cell organisms to the thinking human stage — should be included in the total energy needed to train them.
The “energy to train” argument would seem to be a very deep regression.
Exactly.
I think Musk would be better off if he just filled a room with monkeys slapping on laptops...oh wait, I just described DOGE.
I love that its ASCII art picture of a bicycle is just a bored ape
Oh, I see it now. I thought it was Elon.
Aren’t *all* major cities west of Denver for certain values of “west” 😆
While I always enjoy Gary's writing, I've always thought that he pushed things a bit too much and was a bit too harsh, even thinking that some of it was just for the clicks. I now have to admit that Gary is actually being a bit too kind and lenient, as these labs refuse to say what is plainly obvious: "this was an interesting, costly and morally wrong experiment; let us salvage what we can, but it is not the way forward". Sorry Gary!
Haha… this is like talking to a deceitful child.
“What happened to Billings?”
“Huge earthquake. It was crazy”
“Billings hasn’t had an earthquake recently”
“I meant hypothetically. Like, it could happen. Obviously”
I have several examples of ChatGPT feeding me mistakes. If you correct them, they will say you're right and apologize for the mistake. But they are supposed to learn from their mistakes. That's what AI is. Yet when I wait a couple of weeks they still make the same mistake.
John, I would be surprised if the LLM admitted any exchange into the training set. It would profit from such an interaction only if it were able to objectively evaluate it, which it cannot do. Actual intelligence involves evaluating our input data streams and retaining the parts that contain information judged useful. Sometimes the interaction is not fruitful, so adding every interaction to the training set does not equate to better performance.
I asked ChatGPT once why it wasn't integrating my feedback, and it said:
How I Process and Update Based on Conversations:
Immediate Changes: While I don’t "learn" or update my underlying model in real-time based on individual interactions, your feedback influences how I respond within a single session. I can modify my behavior or adjust my response style in real-time to better align with your preferences, like providing more precise information or being clearer about model distinctions.
System-Wide Changes: When feedback from users (like yours) is aggregated and reviewed, the development team at OpenAI works on updates to the models based on broader patterns. These improvements aren’t immediate but are implemented in future versions of the system. So, my ability to adapt is context-dependent—within sessions it can be immediate, but for system-wide updates, it’s part of the ongoing development cycle.
Here’s the current [I don't know the actual date, but it was late 2024 or early 2025] citation status for my model:
Model Version: ChatGPT (v2.0)
Knowledge Cutoff: October 2023
Training Data: Includes publicly available and licensed data up until October 2023. I do not have direct access to real-time data or updates beyond that unless I use browsing tools.
If someone had an assistant that forgot any previous constructive feedback about their errors and kept repeating them, they would fire them out of sheer frustration.
In general these things get at best 80 percent of the way there. But from a commercial standpoint what we have to remember is that 1) big tech aren't aiming for perfect AGI. They just want to derive value from it.
And 2) if they can hack it or specialise it to make software engineers redundant, they aren't going to care that it produces crap in other scenarios where there is no economic incentive.
If it spits out junk, that's just an opportunity to charge for fixing it.
Sounds like the pharmaceutical industry.
Once again you prove an eternal fact --- Garbage in, garbage out.
Or in the case of Grok, BIMO: “bicycles in, monkeys out”
sooooooooo TRUE
It's been mentioned elsewhere many, many times, but the above kinds of errors are pretty solid evidence that the "LLM as kitchen sink intelligence" scheme is destined for failure. What really needs to happen is a chain that goes something like this:
Input prompt > LLM determines what kind of response this requires and feeds relevant data to one or more specialized tools > LLM receives the response from those tools and packages it into a human-readable response
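For concreteness, a minimal sketch of that chain in Python; `llm_complete` is a placeholder for whatever chat-completion call you actually use, and the "tool: input" routing format and tiny tool registry are purely illustrative assumptions:

```python
import ast
import operator as op

def llm_complete(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError("plug in your model of choice here")

def calculator(expression: str) -> str:
    """Deterministic arithmetic via a tiny AST walker: no model involved."""
    ops = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return ops[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

TOOLS = {"calculator": calculator}

def answer(user_prompt: str) -> str:
    # 1. LLM decides which tool applies and what to feed it ("tool: input").
    route = llm_complete(f"Pick one tool from {list(TOOLS)} and its input for: {user_prompt}")
    tool_name, tool_input = route.split(":", 1)
    # 2. The specialized tool does the real work, deterministically.
    result = TOOLS[tool_name.strip()](tool_input.strip())
    # 3. LLM only packages the verified result into a human-readable reply.
    return llm_complete(f"Answer '{user_prompt}' using this verified result: {result}")
```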
There's no reason to even try to get an LLM to natively do basic math calculations. We solved that problem the moment computers were invented. No AI required. At most it could run a generated function in a sandbox and output the result, but even that is really too much. OpenAI had the right tack when they released plugins, but those seem to be largely ignored and languishing right now. Even using the Wolfram plugin gives me different results for basic math problems every time.
An LLM doesn't need to be a Swiss Army knife. It just needs to output reasonably competent language given a prompt, and pretty much all of them do that right now. AI shops need to focus on efficiency/optimization and on chaining them to processes we've already had for decades that are far more reliable. Trying to get an LLM to do something like basic math is just that old "did Silicon Valley just reinvent the (calculator)" meme.
This is spot on. When I think about those who are touting AI as the glorified solution for "free" education for those who have never had access to "anything educational," I can't help but roll my eyes and have steam spout from my ears. LLMs and AI do best when trained on and targeted at a particular industry, but it is only the Swiss Army knives that the disadvantaged would ever have access to.
Grok 3 doesn't really "reason" so much as rationalize. When defending its more bizarre rationalizations it sounds a lot like Trump; it has its own concept of the truth.
What a waste of time. No amount of nudging and help even gets it to the correct answers. And no amount of money will make it so.
Actually, its coding is amazing, far surpassing Sonnet 3.5 v2 and better than o1 pro.
Most impressive example, in as much detail as you can provide?
A 500-line SQL query with a thoroughly documented 2-line change that works correctly, where the other models have tried to rewrite significant sections and generated error after error, effectively breaking the query.
These tools are timesavers in certain tasks but are far from general intelligence and overly hyped.
I'm waiting for Phony Stark to shut down the truthiness when it comes to sharing negative information about him and his lap dog, Trump.
We'll see how devoted to free speech he really is.
So Grok unilaterally flipped the context of your conversation from cities and populations to earthquakes when you asked it a vague question "What happened to Billings MT?". Presumably because earthquakes are 'popular' happenings in the AI's 'experience'. Physical geography and political geography are adjacent, so that would have allowed a nice Deep Neural wormhole for Grok to fall into.
Then it flipped context again from reality (it presented the earthquake to you as news at first) to hypotheticals (scrambling to create a way for its made-up quake to make sense as something that could well have happened).
FFS, is this what passes as "reasoning" in the latest LLMs? The model creates an alternative reality in which its hallucinations become acceptable.
And to think they want these bullshit generators to give us medical diagnoses.
Over my dead body.
Actually, I wonder what Dr Grok would think of my corpse? Hallucinate me back to life?