Discussion about this post

Gerben Wierda

While these are all great numbers from the Chinese companies, and that probably means usable and affordable models for certain tasks will be possible, I am going to be careful about drawing conclusions before there is more than just some *benchmark* numbers in blog posts and such. And *selected* benchmarks at that.

The tasks are very specific (e.g. math) and they are benchmarks. Do we know the test data wasn't part of the training data? And even if we know it wasn't in there *exactly*, might we be running into situations where the limited *variability* between test and training data becomes the issue?

Take DeepSeek V3 (37B activated parameters, if I have the size right). It does 90.2% on MATH-500, but that is a number with (EM) attached, so "exact match" with the reference answer. Ouch. That is a warning sign for me. The other two math benchmarks next to it are 39.2% and 43.2%, both Pass@1. So, does it do well on math?
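To make the distinction concrete, here is a minimal sketch (not any benchmark's actual grading code; the helper names and toy data are made up) of how an exact-match check differs from a Pass@1 tally:

```python
# Sketch only: illustrates EM vs. Pass@1 scoring as I understand the metrics.

def exact_match(model_answer: str, reference: str) -> bool:
    # EM: the final answer string must equal the reference exactly (after trimming).
    return model_answer.strip() == reference.strip()

def pass_at_1(samples: list[str], is_correct) -> float:
    # Pass@1 with one sample per problem: fraction of problems whose single
    # sampled answer a checker judges correct (the checker is a callback here).
    return sum(is_correct(s) for s in samples) / len(samples)

# Toy example: "14.0" vs "14" fails EM even though the value is the same.
print(exact_match("14.0", "14"))                      # False
print(pass_at_1(["14.0"], lambda s: float(s) == 14))  # 1.0
```

The point being: an EM score tells you how often the output string matched the reference format, which is not quite the same question as how often the model actually solved the problem.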

The vibe I'm getting, a little bit, is the 1990s race for benchmark numbers (first MIPS was the big thing, then FLOPS).

And then, additionally, we are all going to use DeepSeek and other Chinese models, and all our input and replies end up in trustworthy hands, right? And we're going to trust what comes out of that when it's something other than math, right?

Youssef alHoutsefot

"Building $500B worth of power and data centers in the service of enormous collections of those chips isn't looking so sensible, either."

Exactly.

In fact, it's looking like a massive waste of money, time, and attention, in the style of absurd exaggeration and tech hubris we've come to expect.

I'm impressed with the name of the initiative. Stargate. Exactly the kind of semiliterate nonsense that I'd expect from our tech oligarchs.

Where and how were these guys educated?
