125 Comments

Really significant finding. Purely data-driven models will never solve these root challenges.

Mar 5 · Liked by Gary Marcus

There was a book out there, Weapons of Math Destruction, which documented all this for the forerunners of LLMs. It's just data + statistics; it comes back in all kinds of disguises, but the paradigm is the same.

Mar 5 · Liked by Gary Marcus

Humans are perfectly capable of recognising that something that is statistically overrepresented from *our* perspective doesn't necessarily describe the statistical reality of the world at large. Of course, we fail at this often, but as this paper shows, LLMs seem to fail at this more often than not, because their entire conception of the "world at large" is derived from the statistical picture presented to them by the data they're trained on.

This paper gives excellent proof that deferring decision making to LLMs is an abrogation of responsibility. It should not be done.

And yes, I agree, makers of these LLMs should withdraw their products, or clearly label that, far from being "almost AGI", these products possess nothing of the kind of intelligence their marketing would have you believe.

Mar 5 · Liked by Gary Marcus

I can't wait to watch Andrew Ng try to talk himself in circles on this one. I mean I recall a 2023 paper that found that ChatGPT reproduces societal biases. I agree with you that this cannot stand and that we as the general public deserve better.


I'm not shocked. It's a language model after all. Chalk up one more limitation to people not understanding how bias in AI and ML actually has three distinct layers.

1. Cultural Bias

2. Data Bias

3. Algorithmic Bias... yes, AI is by default a bias we place on the other two.

https://www.polymathicbeing.com/p/eliminating-bias-in-aiml


I'm very happy that there are smart people doing this important research, however, I can't help but think that results like this should be completely unsurprising. The models are simply reflecting the associations found in the training data. I assume similar unwanted correlations are found in many other areas. Try the same approach for questions that use language more associated with the questioner's sex, religion, or national origin, just to name a few. It seems like band-aid solutions could be found for each of these issues (maybe preprocess certain questions to preserve meaning but remove dialect that implies race, sex, etc.), but how far down does this problem go? It's just so very easy to believe that any imposed solution that tries to filter out this "bias" is fighting a losing battle against the very data the models are built on.

Mar 5 · Liked by Gary Marcus

I don't think these problems can be solved with the current technology. We need systems with much more large-scale feedback, and probably multiple independent systems that can evaluate competing priorities and considerations, including symbolic AI systems that can provide hard checks against reality, something completely inaccessible to LLMs. All they have is probabilities and what people say and write, a far cry from reality.

Mar 5 · Liked by Gary Marcus

This is, actually, quite shocking.

Mar 6 · Liked by Gary Marcus

This could use more exposure. I've added my own (with thanks to you). https://ea.rna.nl/2024/03/07/aint-no-lie-the-unsolvable-prejudice-problem-in-chatgpt-and-friends/

Mar 6 · Liked by Gary Marcus

If you aren’t familiar, please look for Erin Reddick. She is the founder of ChatBlackGPT which is currently in Beta.

Here is her LinkedIn:

https://www.linkedin.com/in/erinreddick?utm_source=share&utm_campaign=share_via&utm_content=profile&utm_medium=ios_app


Holy cow! That's awful, and there is no fix for it within the current LLM tech.

Mar 5 · Liked by Gary Marcus

An LLM by its very nature cannot be the foundation of any self-respecting production system that requires reliability and transparency. A recall would not be an issue if the technology were constrained to being a curiosity item for poem writing and such. Outside of that, a recall is quite appropriate to prevent harm. There is no shame in claiming one's core product is not LLM-based; in fact, it should be an honor.

Mar 5 · Liked by Gary Marcus

I'll sign, of course.

Gary, it's most pleasing to read a takedown without lip service to AGI or super AI or the like. Thank you. Even lip service feeds the enemy.

Mar 5 · edited Mar 5 · Liked by Gary Marcus

This has been a recurring theme for a while. The book I read on the topic was published in 2016 (Cathy O'Neil's "Weapons of Math Destruction"), and not much has changed since then. Gemini's infamous "Woke AI" was really just an ugly plaster over a very real hole in the load-bearing walls of LLMs.

Personally, I can live with this. The models are quite useful if you bear these weaknesses in mind and account for them... but too many people don't realize this is going to be a problem that won't be solved at scale.

Mar 6 · Liked by Gary Marcus

"Don't become a statistic," as the old saying goes. It speaks to the injustice of simply looking at statistics to make a decision that requires a more thorough stakeholder analysis. How about a new saying for decision makers and app developers: "Don't become a stochastic parrot"!


What is missing from the analysis is how other forms of non-standard American English are treated. I don't think English as spoken by much of the country will fare much better. People don't create written text in the same manner as they speak. Even so, people with more "prestigious" careers have generally had to learn to write in correct Standard American English as part of their education (I've had science instructors with English minors who ensured that), with the result that correct English is statistically associated with those careers.

Given the assumption that African American English is more of a spoken dialect than written (show me the English courses requiring students to write in AAE), a better comparison would be with other spoken dialects, e.g., Northeast, Midwest, South, or more particular groupings.
