Did GoogleAI Just Snooker One of Silicon Valley’s Sharpest Minds?
Clever Hans and how corporate AI plays the media, in 2022
In 1904, the horse du jour was Clever Hans, widely reputed to be so much smarter than his brethren that he could do math, tell time, and even read and spell. Word spread fast by word of mouth, and eventually the occasionally gullible The New York Times reported that Hans was so smart that he “can do almost everything but talk”. Ask Hans what 12 plus 13 is, and he would stamp his feet 25 times. People were amazed, and paid good money to see him.
Turns out the horse knew no math; it had solved the arithmetic problems—all of them —in a different way. The horse watched its trainer; its trainer knew the math. Hans just stamped his feet until he could sense tension in his trainer’s face and posture. When trainer got nervous that the horse might think the answer was 26, the horse sensed it and stopped. For the record, the trainer hadn’t lied; he really thought his horse knew math. There was no deliberate attempt to deceive. But there was a will to believe, and that will led many people astray. (The psychologist Oskar Pfungst eventually debunked the whole affair).
Cognitive psychologist have never forgotten Clever Hans, and you shouldn’t either, because the whole tale has a important moral: you can’t judge a mind by its cover.
Strong claims need strong evidence. Without an Oskar Pfungst, or proper peer review, you can fool most of the people most of the time, but that doesn’t make it true.
Fast forward a century, and there is a new will to believe. People may no longer believe that horses can do math, but they do want to believe that a new kind of “artificial general intelligence [AGI]”, capable of doing math, understanding human language, and so much more, is here or nearly here. Elon Musk, for example, recently said that it was more likely than not that we would see AGI by 2029 . (I think he is so far I offered to bet him a $100,000 he was wrong; enough of my colleagues agreed with me that within hours they quintupled my bet, to $500,000. Musk didn’t have the guts to accept, which tells you a lot.)
But it’s not just Elon that wants you to believe that AI is nigh. Much of the corporate world wants you to believe the same.
Take Google. Throughout the 2010’s, Google (and by extension its parent, Alphabet) was by the far the biggest player in AI. They bought Geoffrey Hinton’s small startup, soon after Hinton and his students launched the deep learning revolution, and they bought the research powerhouse DeepMind, after DeepMind made an impressive demonstration of a single neural network architecture that could play many Atari games at superhuman levels. Hundreds of other researchers flocked there; one of the best-known minds in AI, Peter Norvig, had already been there for years. They made major advances in machine translation, and did in important work in AI for medicine. In 2017, a team of Google researchers invented the Transformer, arguably the biggest advance in AI in the last 10 years. A lot of good science was done there. But for a while, most of that stuff was relatively quiet; the field knew it was happening, but Google (and Alphabet) didn’t make a deal of it. I don’t even know if there was a public announcement when Transformers were released.
But somewhere along the way, Google changed its messaging around AI.
In January of 2018, Google’s CEO Sundar Pichai told Kara Swisher, in a big publicized interview, that “AI is one of the most important things humanity is working on. It is more profound than, I dunno, electricity or fire”. Never mind that cooked (and cured) food, electric lights, and the cell phones and computers that electricity made possible are all pretty important, ever since then Google’s portrayal of AI has never been the same.
Five months later, for example, Pichai announced with enormous fanfare a product called Google Duplex that made automatic phone calls that sounded like authentic human beings. Pundits raced to consider the implications. (Ernie Davis and I wondered aloud whether it would ever get past its very limited demo stage of calling hair salons and restaurants; we were right to be skeptical; so far, four and half years later, Duplex hasn’t gotten a whole lot further.)
The latest alleged triumph is that Google hinted in working paper for a new system called Imagen that they had made a key advance in one of the biggest outstanding problems in artificial intelligence: getting neural networks (the kind of AI that is currently popular) to understand compositionality—understanding how sentences are put together out of their parts.
The dirty secret in current AI is that for all the bloviating about how current systems have mastered human language, they are still really weak on compositionality, which most linguists (since Frege, over a century ago) would agree is at the very core of how language works.
A few months ago, I articulated this major liability, in an essay called Horse rides Astronaut, which traced out in detail DALL-E’s struggles with understanding the different between the sentence astronaut rides horse and the sentence horse rides astronaut.
There I was building on a long history, arguments that date back 35 years, to a famous critique of neural networks by Fodor and Pylyhsyhn, and my mentor Steven Pinker’s furtherance of those arguments.* (I myself re-raised the problem in 2001 in a book called The Algebraic Mind, showing that the mid 1980s addition of hidden layers to neural networks, thought at the time to be big deal, still couldn’t solve the problem.)
For me, compositionality has always been the touchstone for AI; when I say AI has hit a wall, it’s compositionality, more than anything else, that I have been talking about:
Most recently, a team lead Tristan Thrush and Douwe Kiela (both initially at Facebook, now at HuggingFace ) and Candace Ross (Facebook) studied compositionality more systematically, creating a useful benchmark called Winoground that is currently the state of the art. In their words
Winoground is a novel task and dataset [of 1,600 items] for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two images and two captions, the goal is to match them correctly—but crucially, both captions contain a completely identical set of words/morphemes, only in a different order.
To show evidence of compositionality on this task, a system needs for example, to tell the difference between some plants surrounding a lightbulb and a lightbulb surrounding some plants. This should be a piece of cake, but for currently popular systems, it’s not.
When Thrush and colleagues applied their benchmark to a set of recent models, the results were brutal: not one of the many models they tested did “much better than chance”. (Humans were at 90%)
But we all know how these things go; fans of neural networks are always pointing to the next big thing, racing to shot that this or that wall has been conquered. Word on the street is that Google’s latest, Imagen, has licked compositionality. Google would love that, “shock and awe” to frighten competitors out of the field, but, well …talk is cheap. Do they really have the goods?
A bona fide solution to compositionality in the context of systems that could learn from data on a massive scale, would certainly be big news; a real step forward in AI.
But many people have claimed over the years to solve the problem, and none of those systems have proven to be reliable; every proposed solution has been like Clever Hans, working in dim light, leveraging large databases to some degree, but falling apart upon careful inspection; they might get 60% on some task, but they never really master it. Notwithstanding the rumors about Google Imagen, nobody has yet publicly demonstrated a machine that can relate the meanings of sentences to their parts the way a five-year-old child can.
Because so much is at stake, it is important to trace out rumors. In that connection, I have repeatedly asked that Google give the scientific community access to Imagen.
They have refused even to respond. In a clear violation of modern scientific norms for sharing work in order to make sure it is reproducible, Google has failed to allow the scientific community to find out. They posted an article on Imagen, and hinted at improvements, but not so far as I know shared the model for peer review (ditto for LaMDA). Google cloaks its current work the trappings of science (article preprints with citations and graphs and so forth) but when it comes to community inspection, they’ve systematically opted out.
Winoground’s lead author Tristan Thrush has echoed my concerns, publicly offering to help Google if they have any issues in testing Imagen on Winoground. But Trush has yet to hear back from them - after almost four months.
It would be easy for Google to try, with a Google Colab notebook already set up for them and ready to go, so the silence is telling. (Dare I make obvious guess? They probably tried and failed, and don’t want to admit it.)
With all that in the background, I was appalled to see the stunt that GoogleAI (nominally a research oriented organization rather than a PR arm) just pulled.
Instead of allowing skeptical scientists like me or Thrush and Ross peer inside, they followed a playbook that is old as the hills: they granted exclusive access to a friendly journalist and got him excited.
In this case, Google hooked one of the biggest fish of them all, one of Silicon Valley’s best known writer/thinkers, Scott Alexander of Slate Star Codex, which the CEO of OpenAI has described as “essential reading among “the people inventing the future” in the tech industry.” (Full disclosure, I read Alexander’s successor Slate Star Codex, Astral Codex Ten, myself, and often enjoy it…when, that is, he is not covering artificial intelligence, about which we have had some rather public disagreements.)
In time, it’s become clear that Alexander is quite the optimist about AI, to the point of making bets on the speed of its success. On Monday, he published the results of a brief peek inside Imagen, and declared victory. As far as Alexander was concerned, Google had made major progress towards compositionality.
As smart as Alexander is, though, he is not a trained scientist, and it shows.
To begin with, instead of, say, reading the literature (which he often does quite thoroughly in other domains), and coming upon Thrush et al’s carefully constructed benchmark, or asking an expert who could have pointed him to that task, he homebrewed his own five question exam, charitably scoring success on any given question if the system got a scene correctly at least once in out ten tries—stacking the deck from square one. (No junior high-school student ever has gotten ten tries on every question in a exam.)
Even if you squint your eyes and ignore the heavy one-in-ten charity that Alexander applies, the scientific and statistical problems with this, particularly in comparison to Winoground, are myriad:
• Small samples are never a good idea. If you flip a penny 5 times and get five heads, you can’t decisively conclude that the penny is rigged. Winoground included 1600 items for a reason.
• Small samples in the absence of measures of statistical significance are a worse idea. If you flip a penny 5 times and get 5 heads, you need to calculate that the chance of getting that particular outcome is 1 in 32. If you conduct the experiment often enough, you’re going to get that, but it doesn’t mean that much. If you get 3/5 as Alexander did, when he prematurely declared victory, you don’t have much evidence of anything at all. (The Winoground crew by contrast was quite specific about how they did their statistics.)
• When you run an experiment a bunch of times (flipping five different pennies five times each) you raise your risk of making a Type I error, falsely concluding something is happening when it isn’t. Alexander ran five different models on his five-question survey, and crowed about the one result that came closest to supporting his hypothesis. Stats 101 will teach you that that’s a bad idea (my mentor called it a post hoc maximum); every statistician knows that things like Bonferroni corrections can help, but Alexander didn’t bother. Anyone with serious training in statistics would be deeply concerned. Then again, Alexander didn’t bother inferential statistics at all, which is a strict no-no.
• Using a homegrown benchmark of a few examples and drawing big conclusions is a further no no, a fallacy of overgeneralization (technically known as the fallacy of composition). Most Phase I medical trials don’t make it through Phase 3, and most scientific claims are of little value until they receive further validation. Neural networks are particularly prone to this, because they often get some subset of examples correct through memorization or what we might call nearby-generalization, but often fail when forced to generalization further away. (The technical term for this is distribution shift). Google’s Minerva, for example, can generalize what it knows about arithmetic to two digit multiplication problems, but fails altogether at multiplication problems that it hasn’t seen before that involve four digit times four digits (imagine a pocket calculator that was equally lousy). Despite its success on small problems, Minerva has no abstract idea what multiplication is. Looking only at the smaller math problems, one could be confused. In a similar way, there is no way to draw any kind of firm conclusion about what the system might know about compositionality from success a tiny sample.
• Declaring victory based on a handful of examples while ignoring more comprehensive tests(like Winoground is even worse). When Alexander is not talking about AI, he often prides himself on testing the “steelman” (strongest version) of any hypothesis possible. Here he tested the weakest version of the compositionality hypothesis imaginable (a bet placed by someone known only as Vitor, with no immediately apparent scientific credentials) and smugly declared victory (titling his essay “I Won My Three Year AI Progress Bet In Three Months”).
The progress that Alexander is vaunting shows that the system did better on a handful of items than DALL-E2 (which was not even state of the art at compositionality when it appeared), but not that the newer system (Imagen) systematically groks compositionality, or even that it might not have done worse on some of items on that the earlier system succeeded on (none of which Alexander tested, introducing a high risk of bias).
Science isn’t about declaring victory, it’s about putting in the work to rule out alternative hypotheses, and Alexander simply hasn’t done that.
Having a Silicon Valley celebrity like Alexander weigh in for Google is great PR for Google, but not so great for the public understanding of AI.
A couple of months ago, not long after Imagen came out, I co-organized a two-day workshop on compositionality with a bunch of heavyweight researchers from places like NYU, Johns Hopkins, Columbia, DeepMind, Microsoft and Meta.
There was lots of spirited disagreement. My co-organizer Raphaël Millière is decidely more bullish than I am about the potential of current approaches; Paul Smolensky, who argued with Fodor and Pylsyhyn in a historic 1988 debate was there too, in part to counterbalance me. But in the end Smolensky and I agreed on far more than we disagreed, including the fact that the problem of compositionality (a) remains critical and (b) still isn’t solved. (Imagen was already out by the time of the workshop, and swayed nobody in the know.)
Five items in a Scott Alexander blog and a press release from Google won’t change that. Alexander may have won his bet with the pseuodonymous Vitor, and Google may have gotten the press they want, but the field still has a giant problem to wrestle with.
If we forget the lessons of Clever Hans, we do so at considerable peril.
Yesterday, as part of a new podcast that will launch in the Spring, I interviewed the brilliant Harvard Medical researcher Isaac Kohane, who has the rare combination of both a PhD in computer science and a medical degree. Kohane is bullish about the long-term future of AI in medicine but also concerned that today’s hype about AI could lead to a cycle of unrealistic short-term expectations. Kohane recalled the “AI Winter” of the 1980s, in which funding abruptly dropped after expectations were not met, and fears we could wind up with another, similar crash if we are not realistic about what is and is not feasible.
If another AI winter does comes, it not be because AI is impossible, but because AI hype exceeds reality. The only cure for that is truth in advertising.
A will to believe in AI will never replace the need for careful science. Or, as Bertrand Russell once put it, “What is wanted is not the will to believe, but the wish to find out, which is its exact opposite.”
Update: Based on data and analysis by Edwin Chen, Scott Alexander has said that the bet he made with Vitor remains open, writing “I retract my claim to have won and will continue to see how AI progress advances over the next three years.“
Thanks for reading The Road to AI We Can Trust! Subscribe for free to receive new posts.
postscript: Scott Alexander replies to the above in his newsletter:
6: Gary Marcus has a response to my recent AI bet. I want to make it clear that whatever the merits of my bet or his arguments, Google did not “snooker” me. They had no part in this: I went around begging for someone to run my prompts through PARTI and Imagen, one of their employees asked their bosses’ permission and then agreed to do so, and ran them exactly as I asked. Any fault is entirely mine. I’m insisting on this pretty hard because I’m grateful that Google will sometimes respond to random requests by amateurs, and accusing them of deliberate deception in response burns their willingness to do that. As for everything else: I wrote “without wanting to claim that Imagen has fully mastered compositionality, I think it represents a significant enough improvement to win the bet, and to provide some evidence that simple scaling and normal progress are enough for compositionality gains”, I stick to the “some evidence” claim, I feel like I was pretty open about exactly how much/little evidence it was (Google sent me ten examples per prompt, I showed you four representative ones, but the extra six don’t change much). I agree Marcus makes some useful common sense claims on how sure to be after five examples.
In a tweet that appeared right after this was posted, Scott Alexander clarifies, “I wasn’t officially given access, a reader who worked there just asked if they could try it for me and Google said yes.” Needless to say, they did not grant me or Tristan Thrush et al the same courtesy.