Meanwhile, everybody's favorite AI developer, Yann LeCun, is calling for every repository of human knowledge to give AI companies free access to their treasure troves. Something sickening about the richest and most powerful people building their tools for mass unemployment off the stolen creative work of humanity. They could go slower in development, and only use training data they licensed. They don't want to. Stealing is cheaper, and Silicon Valley is packed with people who see that theft as fine. In fact, they call it "FREEING THE DATA!" As if the novels and articles I WROTE somehow want to jump into their shady LLMs so that I have even less chance to get paid to write in the future. It is IP theft on a scale unprecedented in human history. It SHOULD shock the conscience. It appears the first rule of going to work for a big AI company is a conscience-ectomy. Those are the people building our brave new world.
This is not the Yann LeCun I knew. That Yann LeCun was instrumental in creating a new technology that converted an internet of raw data into an internet of known objects. The value of that work was real, tangible, and electrifying. No one had to sell its value by heavy-handed marketing.
That was also a time when every major effort I knew of to create databases of human knowledge began with Wikipedia and strictly emphasized privacy and respect for intellectual property.
Times and people change, not always for the better.
The problem extends well beyond the relatively small number of people working for LLM companies.
The fact that so few computer scientists even acknowledge — to say nothing of speaking out on — the intellectual property issue* makes the situation exceedingly difficult for the ones like Suchir Balaji and Gary Marcus who do.
* and other issues like racial and other biases associated with the use of AI in general.
This has been going on for decades. This A.I. thing is the second go-round. The same thing went on in the zeroes with what is called social media. No copyright royalties were paid. So I am saying: why are people shocked? This has been consistent behavior that people just choose not to see.
I have been saying this for two-plus decades when it comes to Silicon Valley. They have been involved in this criminal activity with the co-sign, wink wink, of the so-called lawmakers. Nothing new here. I am just glad folks are finally acknowledging that much of this innovation is based on fraudulent acts of crime.
With money as speech, the lawmakers who might oppose this have an impossible time getting elected.
Thank you for that, Amy. I do not see the lawmakers as the solution now because the system is so co-opted. The system would have to collapse. Also, what are called lawmakers are not the same as before. The people will have to stand up and say no! Now I know that may sound far out, but that is the only way.
I have small children so system collapse is not an attractive solution. Real people taking civic action so that we can achieve change without destroying the good things that we take for granted would be my preference.
Hi Amy, I understand why you say that, given your small children. I just do not see any other way at this point. We will all have to get dirty and uncomfortable. We have had a comfortable life here in America. America is going to have to be uncomfortable, because convenience and comfort got us here. I am also saying that society has been altered, and not for the best. People's mindsets have been adjusted, and not for the best. The American diet must change, because it is worsening the mental issues people have, along with what is called a healthcare system, which is a poison-based system to keep people sickly. That is why things must come down. This system has caused so many problems due to our comforts.
Not only is AI bad at copyright, it also doesn't seem to understand that the Avengers are Marvel and Superman is DC.
That is disrespectful! For them it is just content, not media. It is free, not the sweat and toil it took to create.
LLM-based AI knows nothing!
It is truly a dumb pattern follower, not an intelligence.
It AIn’telligence, that’s for sure
I'm always suspicious when a suicide conveniently allows the plundering class to keep on keeping on. Aaron Swartz, *TWO* Boeing whistleblowers, and Suchir Balaji.
In addition to Art Keller's comparison with Yann LeCun, these people want everything "democratized" so they can train on the entire "input" of humanity, but they are absolutely adamant that you will pay for the "output" whether it's with money or data. Them's the rules because they paid good money for lawmakers and judges to make them that way.
Yes, these "suicides" are awfully convenient, aren't they? In each and every case, potentially billions of dollars at stake.
Most conspiracy theories fall apart when you realize how dumb and disorganized most people are. I can certainly imagine some of these tech CEOs wanting to see their critics dead, but them actually pulling off a murder-for-hire plot (pretty sure tech CEOs will not be doing their own killing) and not getting found out?
No. I cannot see that happening.
It's much easier to believe the robber barons made the lives of the whistleblowers so completely hellish by using the legal system as an instrument of torture that death seemed preferable.
Suicide is possible, but in Balaji's case, there are messages on X attributed to his parents that contradict the suicide hypothesis. Of course, this needs to be verified, as they could be fake, but that's the police's job. You can check the account https://x.com/RaoPoornima/ on X.
Given the large sums of money at stake, it's possible that organized crime is involved. This could be a motive for foul play, perhaps to protect investments or for future gains. It sounds like the plot of a thriller movie!
That said, the Earth is round, vaccines are effective, climate change is real and caused by human activities, and sometimes, conspiracies do exist.
It's true, most conspiracy theories are nonsense. Most. But not all. Kennedy, Watergate, the Bay of Pigs, Iran/Contra, etc. were all actual conspiracies that were hidden—or attempted to be hidden—from the public at large, and only partially disentangled. Same with the Condon Report. So there are bona fide conspiracies that do occur. All these recent whistleblowers ending up dead, when billions are at stake, is highly suspicious and a level of coincidence that, frankly, beggars belief. One of the Boeing whistleblowers actually made a point of formally stating to family and friends that he was NOT in any way suicidal, and that if he ended up dead it was not by his own hand and an investigation should be launched.
Examine your thinking. You are trying to create patterns without sufficient evidence. It's desirable to have everything make sense, to believe that people should not be so miserable after doing the right thing that they kill themselves, but it happens. Good people kill themselves all the time.
Spinning conspiracies out of nothing is bad. It whips up mobs into doing tragic things, like rejecting vaccines.
Let's examine *your* thinking for a moment, which appears to be a bit uncritical, frankly. Yes, it is true that conspiracy theories more often than not turn out to be false. Or as Niccolo Machiavelli said (and he would know), "Always assume incompetence before looking for conspiracy." With that being said, it's not about what's theoretical. It's not even about Occam's Razor. You have to carefully gather and examine all the data. So while it's true to say that most conspiracies turn out to be false, most is not all (as I indicated citing examples above). Just because you believe something "cannot be," does not *automatically* mean that it "isn't."
Fair use is a defense, not a right. If you use copyrighted material and make a lot of money, you're going to get sued - even if you use just a little.
Especially if you use the copyrighted material owned by large media conglomerates. They've already had to pay out an incredible sum to the New York Times. If it turns out there's a bunch of Disney and/or Time Warner material that's been fed into those algorithms they're done for. The bubble will pop the instant either of those media empires dispatches their lawyers to crush them.
Item of interest: I have a friend who was preparing an article on US Presidents. He had Grok generate images for the article - in most cases he got a good image, but for 3 of the presidents he couldn't get an accurate one.
LLMs can't tell you what data they were trained on, but it appears that if there are enough labeled images, they can get pretty close.
Tried to get an image of Mickey Mouse; it refused based on copyright - but it offered this: "...I can create an image of a character inspired by Mickey Mouse...". I'll bet Mickey is in the training data somewhere.
Yes, it did pretty well.
Condolences to his family and friends.
There is clear evidence that the makers of these tools know that their fair use argument is weak, most obviously in that they have tried to hide their training data. Values and principles just don’t exist, it seems.
I worry that the US courts are moving too slowly here. With Sam Altman literally paying homage to Trump post-election, it feels like OpenAI and Microsoft will be ramping up the lobbying in DC next year to carve out an explicit fair use exception for pre-training. And knowing their anticompetitive tendencies, I'm sure they'll shape the law in a way that only the top labs get that fair use exception.
Is this one of those "suicides" like when a person shoots themselves in the back of the head? Sort of like the Boeing whistleblower "suicides" from earlier this year?
Yep, like the suicide with 2 holes in the back of the head...
On the four-factor test: I’m not a lawyer, but I don’t think it’s at all clear-cut that the use of copyrighted material in training LLMs is a violation of copyright, even when portions of the copyrighted material are reproduced verbatim in the output.
The questions that still have to be answered in the courts are:
Is the purpose and character of the use (factor 1) primarily of a commercial nature, or for nonprofit educational purposes? Here OpenAI’s unique business structure may actually work in its favor (and indeed may very well have been a factor when it was set up that way).
What is “the amount and substantiality of the portion used in relation to the copyrighted work as a whole” (factor 3)? In Campbell v. Acuff-Rose Music, the Supreme Court, in remanding the case, required the lower courts to consider the “transformative elements” of the derivative work, and “transformation” has since become a key part of the legal cases when the third factor is invoked. Just because a substantial portion of the output from a copyrighted work is duplicated does not necessarily mean that the copyrighted work hasn’t been transformed, under law.
What is the effect of the derivative work on the potential market for, or value of, the copyrighted work (factor 4)? Is the value of, say, the Mario Bros. reduced by the use of DALL-E to generate an image that’s used in a different context? The courts have already dismissed or greatly reduced the ability of plaintiffs claiming copyright violations by requiring them to show that they have been harmed by the availability of their copyrighted works inside of an AI tool, and the courts do seem at least initially inclined to a position that there is no “market substitution” in the way that copyrighted works are used within AI.
I'm also skeptical that current copyright law is up to this task, but like you I have no legal expertise. It seems to me that "training an AI" is a use case that simply has no good analogues in the world that existed before GenAI.
I hope I'm wrong, because as far as I'm concerned this is flagrant theft, and they're getting away with it because most people don't understand that the kind of "learning" LLMs do is radically different from the "learning" humans do. The tech companies say "hey, if people are allowed to learn how to be creative from copyrighted material, AIs should too", which is both specious and compelling to a lay person.
We need new legislation on this. I'm not optimistic Congress is up for it. My one hope is that the traditional IP-holding corporations (Disney, Sony...) hold more political sway than the tech companies.
OpenAI is transitioning toward a for-profit structure, though. I don't know that telling a judge you were a nonprofit when the alleged infringement first occurred will hold much weight, but I'm not a lawyer.
What many people miss in these debates is how asinine popular culture, and therefore culture in the Western world generally, has become since the dawn of the Internet age, accelerated by the social media age. Spotify created a culture of commodification of music. It's the John the Baptist to the anti-Jesus of fake tunes. It's Christmas 2024 in the privileged Western world. Look at the music charts: Mariah, Wham!, The Pogues, Sinatra. Big tech destroys culture.
Right on, you are!
AI training data issues go beyond copyright—they also involve privacy. AI labs often scrape any data they can access, using web crawlers, obscure opt-out terms of service, and potentially other questionable methods, with little to no concern for ethical considerations. This serious issue has been ignored from the beginning, with no regulatory framework or oversight in place.
As the most visible AI lab, OpenAI has a responsibility to set a positive example—a responsibility they have largely failed to meet. The glaring lack of transparency and accountability in the industry cannot be overlooked. Even simple processes like reverse search or addressing copyright reveal significant flaws in data governance and lifecycle management.
In this deep dive, I cover the main issues and debunk common counterarguments.
https://ai-cosmos.hashnode.dev/the-great-opacity-why-ai-labs-need-to-come-clean-about-their-data
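An aside to make the opt-out point above concrete: the only widely deployed opt-out mechanism today is robots.txt, and honoring it is entirely voluntary. Here is a minimal Python sketch (standard library only; the crawler name "HypotheticalBot" is made up for illustration) of the check that a crawler *choosing* to respect opt-outs would perform before fetching a page:

```python
# Minimal sketch: consult a site's robots.txt before fetching a page.
# Nothing forces a crawler to run this check -- compliance is voluntary,
# which is exactly the weakness of opt-out as a data-governance control.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def may_fetch(url: str, user_agent: str = "HypotheticalBot") -> bool:
    """Return True if the site's robots.txt allows user_agent to fetch url."""
    parts = urlparse(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # downloads and parses the site's robots.txt
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    url = "https://example.com/some-article"
    if may_fetch(url):
        print("robots.txt permits fetching:", url)
    else:
        print("page is opted out; a respectful crawler skips it")
```

A crawler that skips this check faces no technical barrier at all, which is why opt-out regimes put the entire burden on the data owner.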
Saltman hasn't said a word as far as I know. He doesn't even make a public condolence tweet for a former employee who died? He's just carrying on with his stupid 12 days of AI hype Christmas regardless.
After co-writing my book, Understanding Machine Understanding, with Claude 3.0 Opus, I found out that most of the traditional publishing houses will not touch a manuscript with generative AI content. I understand their position. There are big copyright risks if it turns out that your AI assistant copied material without citation (AI plagiarism). Also, you can't get a valid copyright if too much of it was not the work of a human author, and no one knows where to draw that line. This is on the output side, without going into the general issue of the AI companies' legal problems when they use copyrighted material for training.
Listening to songs, reading books, or looking at images created by another person for inspiration, not for copying, has been recognized as fair use. However, training a commercial product at an industrial scale on copyrighted content for free to then sell services to millions of people, competing directly against copyrighted sources, and potentially generating billions in profit, should not constitute fair use.
As Elon Musk “candidly” admitted (cnb.cx/3RQG7wv), all the GenAI businesses start with a massive “theft” of all available data. All those corporate hackers don't care about copyright. They've understood that they will be able to pay the best lawyers to defend them, exhausting every appeal and legal recourse along the way.
If he was an Indian-American, was he a 1st or 2nd generation immigrant? Was he a US citizen or was he a worker on an H-1B? It would have been nice if someone had reported on that.
My condolences to his family, friends, loved ones.
Louis Hunt of LiquidAI has published a very interesting post on LinkedIn about regurgitation (https://bit.ly/3W3Fwcw) that corroborates the analyses of Suchir Balaji, whom he cites. The tests were done with LLaMA, but they should give similar results with ChatGPT.
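For anyone curious what a regurgitation test looks like in principle, here is my own minimal sketch (not Hunt's or Balaji's actual methodology): prompt the model with the opening of a known text, then score how much of its continuation reproduces the source verbatim, for example as the fraction of shared word n-grams:

```python
# Minimal sketch of a verbatim-overlap score between a model continuation
# and the source text it may have memorized. Real regurgitation studies
# use more careful metrics; this only illustrates the basic idea.
def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(generated: str, source: str, n: int = 8) -> float:
    """Fraction of generated n-grams that appear verbatim in the source."""
    gen, src = ngrams(generated, n), ngrams(source, n)
    return len(gen & src) / len(gen) if gen else 0.0

# Hypothetical usage: in a real test, `generated` would be a model's
# continuation after being prompted with the opening words of `source`.
source = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness")
generated = ("it was the worst of times, it was the age of wisdom, "
             "it was the age of foolishness, it was the epoch of belief")
print(f"verbatim 8-gram overlap: {overlap_score(generated, source):.2f}")
```

A high score on text the model was never shown in the prompt is the kind of evidence these regurgitation analyses point to.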
My fear is that the Trump administration, in cahoots with the tech billionaires, will change the fair use laws in the USA. Could be my prediction for 2025.