Remember good old Grok 3, all 200,000 GPUs worth, advertised by Elon Musk a few days ago as the “smartest AI on earth”, and demoed on livestream last night as “a maximally truth-seeking AI”?
I just took it for a spin. It got my first question right (a comparison between two decimals, 56.1012 vs 56.90) but things went rapidly downhill from there. Here’s a sample of some of the many problems I noted in a couple of hours of experimentation.
Told that images might be hard, I switched to ASCII art, allegedly easier:
Close, but no cigar. How about my favorite, bicycles?
Ok, let’s try that in “thinking mode”:
Not much better. How about today’s date? (Other people had better luck than I did on this one, in fairness, and in fact on a later try so did I. But today’s date should never be multiple guess.)
§
The most interesting part was playing with Deep Search, which makes amazing-seeming reports that reasonably often seem to contain very subtle, difficult-to-detect errors.
Here’s a good example: I asked for a list of major cities west of Denver along with their populations, and it mostly obliged, with 100,000 as the cutoff, but it left out Billings, Montana (2020 pop. 117,116). And chauvinistically (despite the fact that my dozen previous queries were all about Canadian provinces), it left out every relevant city in Canada, including my adopted hometown of Vancouver, BC (2021 pop. 662,248), which I can report is most definitely west of Denver.1
Things got weirder from there, when I asked (a bit too tersely, perhaps), “what happened to Billings, Montana?” Grok failed to understand that I was talking about the table (my bad?). Instead, it told me about an earthquake that allegedly had happened there recently:
Surprised by this, I did some fact checking; it doesn’t look like Billings had any such earthquake, nor any recent quake of that magnitude.
So I pushed Deep Search both on the quake and why Billings had been left out.
On the matter of Billings being west of Denver and omitted from the list, Deep Search graciously (forgive the anthropomorphism) conceded the point, blaming “oversight”.2
On the earthquake, though, it dug in deeper, as you will see. Aside from the gaslighting, I was a bit surprised to learn that February 2025 was still in the future. From where I sit, it’s more than half in the past. But what do I know? I am just a human.
§
The primary lesson here is the same as in the early ChatGPT days: caveat emptor. No matter how good-looking the output, there are often subtle errors that most people wouldn’t catch.
The secondary lesson here is that the more excited people are about LLMs, the more I wonder how carefully they have examined the output.
Stepping back, the broadest observation is this. Grok 3 required training two new massive data centers operating full time for months, and 15x the compute of Grok 2 — yet all these kinds of errors feel awfully familiar.
If AI were (as it used to be, to some degree) a science, people would say: “hey, we put half a trillion dollars into testing the idea of scaling and massive compute and something’s still not right, maybe we should try something else?”
Instead, valuations just keep rising, results notwithstanding.
At least for now.
Update: Mathematician Daniel Litt reports a related series of hallucinations with OpenAI’s Deep Research, in this X thread.
Gary Marcus is baffled by the persistent attachment of the community and its investors to scaling uber alles despite abundant and repeated counterevidence of the same general form, over and over and over again. On the other hand, Gary loved the film Groundhog Day and retains hope that someday things will get better.
Also west of Denver, if one were counting both Canada and Mexico, are cities such as Saskatoon, Tijuana, Chihuahua, Puerto Vallarta, etc. A smarter system might have clarified.
Note that part (but not all) of Montana is west of the western edge of Colorado, but that Billings is entirely west of Denver, a finer point that Deep Search seemed to miss. I don’t know if that error stems from the geographical nuance.
Half of the voting population in this country will go to their graves convinced that we spent 50 million dollars on condoms for Hamas because it was said once by the right person. Doesn't matter how wrong it was or how many times it has been and will be corrected. We don't need to invent a deception machine in order to make everyone dumber, but we won't let that stop us. Prayers for Billings.
Intelligence is efficient. While we are talking of building nuclear reactors to power what is hoped to be the equivalent intelligence of a 12-20 watt human brain, we are on the wrong path.