What Google Should Really Be Worried About
How sewers of lies could spell the end of web search
[A brief postscript to Inside the Heart of ChatGPT’s Darkness, in which I elaborate on the Mayim Bialik hoax that I mentioned to Ezra Klein.]
For the last year, the TV star and Jeopardy co-host Mayim Bialik has been haunted by an utterly bogus hoax: a ring of fake websites alleging that she was selling CBD gummies.
Bialik is an undeniably busy person (she even has a PhD in neuroscience from UCLA), but selling CBD gummies has never been part of her portfolio. Still, the hoax gathered so much momentum that she eventually felt she had to speak out, in March of 2022.
Soon thereafter, Business Insider wrote up the whole thing, and even Page Six covered—and debunked—the hoax too.
And yet eleven months later, a bunch of these fake sites are still there. A couple of samples I grabbed from the web yesterday:
That sucks for Dr. Bialik, but honestly, it sucks for all of us; there should be something in the US, analogous to the European right to be forgotten, that makes it a major offense to publish nonsense like this. But so far there isn’t.
§
So the sites are still there, even after being debunked. The question is why: why do rings of fake websites like these even exist?
Part of the answer is, of course, money. Fake websites can be used to sell real advertisements.
Some of the sites lead, for example, to other sites that are selling CBD gummies (without Bialik’s participation or endorsement); others lead to all kinds of other garbage, ranging from phishing scams to potentially lucrative advertisements for things like gambling, cars, and travel. As I was putting this essay together, a popup briefly offered me some nonsense about a Mercedes (perhaps I had already won!) and then quickly diverted me to a site that offered me a good deal on car insurance. (Just what I will need for my new car!) Still a third offered me airplane tickets to Montreal (reminding me I ought to clear my cookies a bit more often). Had I clicked on the ad, scammers would have made some money.
The second part of the answer is also about money, but it’s slightly more subtle: search engines reward sites that mutually reinforce one another. A single site for bogus gummies might not get much traction, but if there are a bunch of them? They can become a mutually supporting ring (or a more complex network), a veritable cesspool of lies.
By sheer numbers, and their patterns of cross-linking, those cesspools can fool search engines into thinking that bogus websites are more legitimate than they are. Insiders call it spamdexing, or search engine poisoning: the black art of tricking the indexes that govern search engines into treating insignificant sites as more important than they really are. Content farms create this stuff at scale, using human labor.
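To see why the cross-linking matters, here is a minimal sketch (my own illustration, not anything taken from the hoax sites) of how a ring of mutually linking fake blogs can inflate a page’s score under a simplified PageRank-style ranking. The site names, the toy link graph, and the pagerank helper are all hypothetical; real search engines use far more signals than raw link structure.

```python
def pagerank(links, damping=0.85, iters=200):
    """Toy PageRank via power iteration.

    `links` maps each page to the list of pages it links to.
    A page with no outlinks ("dangling") spreads its rank evenly.
    """
    n = len(links)
    rank = {p: 1.0 / n for p in links}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in links}
        for p, outs in links.items():
            targets = outs if outs else list(links)  # dangling page
            share = rank[p] / len(targets)
            for q in targets:
                new[q] += damping * share
        rank = new
    return rank


# A small "honest" web: no legitimate site links to the scam page.
web = {
    "news-a": ["news-b", "blog-c"],
    "news-b": ["news-a"],
    "blog-c": ["news-a"],
    "scam-page": [],
}
print(sorted(pagerank(web).items(), key=lambda kv: -kv[1]))
# scam-page lands at the bottom of the ranking (~0.05 of the total rank)

# Now add a ring of fake blogs that link to one another and to the scam page.
farm = {f"fake-blog-{i}": [f"fake-blog-{(i + 1) % 16}", "scam-page"]
        for i in range(16)}
print(sorted(pagerank({**web, **farm}).items(), key=lambda kv: -kv[1]))
# scam-page now tops the ranking (~0.21), ahead of every legitimate site,
# even though not a single legitimate site links to it.
```

In the toy graph, the scam page ranks last when nothing links to it; add sixteen fake blogs that link to each other and to it, and it rises to the top of the ranking without a single legitimate inbound link. That is the basic mechanism that content farms exploit.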
§
None of that is new, per se. But here’s the thing: the world’s latest and greatest tool for tricking search engines (hence playing games with SEO, aka Search Engine Optimization) may change everything.
I am speaking, of course, about Large Language Models, and the opportunities for industrialization they provide. Large Language Models (like GPT-3, ChatGPT, Galactica, etc.) could be used to produce enormous numbers of marginal blogs and fake reviews that pump each other up and increase the search rankings of fraudulent garbage—with a minimal amount of human effort. Why manually write cesspools of interconnected lies when an LLM can do that job for you?
LLMs aren’t very good at writing error-free prose, but the people who run content farms won’t care about the errors. What used to be written laboriously by humans suddenly becomes cheap, and hence more widespread.
§
Was GPT directly involved in the Bialik hoax? I don’t know for sure; most of the sites were made a year ago, when large language models weren’t quite as fluent as they are now. Some of the Bialik stories did have a GPT-2-generated feel, which is to say they weren’t very fluent, and neither was GPT-2:
But of course we can’t know for sure. Then again, whether or not these particular fake websites were generated last year by LLMs is almost irrelevant. The Bialik hoax is probably just the tip of an iceberg.
What really matters is that the newest technology now supports writing stuff like this, at scale, and far more fluently than before.
And scammers gotta scam.
By some accounts, digital ad fraud is already a 60+ billion dollar a year industry. There is big business in fake crypto websites, and perhaps even bigger business in fake reviews. It would be irrational for scammers not to use these new tools to vastly expand their reach, and hence vastly expand their profits.
§
Back in December, I repeatedly asked Meta’s chief AI scientist Yann LeCun a simple question: did Meta have any data on the fraction of misinformation removed by Meta that was generated by large language models? (Or any information on whether that fraction was changing over time.) No matter how many times I asked, he never answered.
When the only company that might have significant amounts of data on the question won’t even answer the question, we can guess that there very well may be a problem.
§
As I have discussed in earlier essays, the jury is still out on how far Chat-style search can get, given its obvious current problems with truthiness. But either way, search itself seems likely to face a major and growing threat from circles of fake websites designed to sell ads and mutually reinforce each other’s search rankings.
Cesspools of automatically-generated fake websites, rather than ChatGPT search, may ultimately come to be the single biggest threat that Google ever faces. After all, if users are left sifting through sewers full of useless misinformation, the value of search would go to zero—potentially killing the company.
For the company that invented Transformers—the major technical advance underlying the large language model revolution—that would be a strange irony indeed.
Gary Marcus (@garymarcus), scientist, bestselling author, and entrepreneur, is a skeptic about current AI but genuinely wants to see the best AI possible for the world—and still holds a tiny bit of optimism. Sign up to his Substack (free!), and listen to him on Ezra Klein. His most recent book, co-authored with Ernest Davis, Rebooting AI, is one of Forbes’s 7 Must Read Books in AI.
Hi Gary, so true - where there is easy money to be made (ethics be damned!), there will be (are) scammers - using LLMs as the perfect scamplifiers that they are (fast, cheap, good).
Additionally, if people opt for Bard to summarize their Google searches, this might add a second layer of BS coating. Google searchers will therefore face an ugly dichotomy: manually wade through possibly BS results, or have them served up with possibly even more BS!
Further, the BS will be scraped by next-gen LLM producers, to get folded (baked?!) in.
None of this looks appealing to any search provider, or to us users.
I've seen many comments on social media justifying misinformation as a means to an end, part of the journey to get AI to where it should be.
We don't need to accept the cesspit of misinformation, toxicity, and bias, all of which stem from a lack of 'understanding', when there is better science. https://john-at-pat.medium.com/the-new-nlu-industry-a318c6e138d1