Meanwhile, everybody's favorite AI developer, Yann LeCun, is calling for every repository of human knowledge to give AI companies free access to their treasure troves. Something sickening about the richest and most powerful people building their tools for mass unemployment off the stolen creative work of humanity. They could go slower in development, and only use training data they licensed. They don't want to. Stealing is cheaper, and Silicon Valley is packed with people who see that theft as fine. In fact, they call it "FREEING THE DATA!" As if the novels and articles I WROTE somehow want to jump into their shady LLMs so that I have even less chance to get paid to write in the future. It is IP theft on a scale unprecedented in human history. It SHOULD shock the conscience. It appears the first rule of going to work for a big AI company is a conscience-ectomy. Those are the people building our brave new world.
This is not the Yann LeCun I knew. That Yann LeCun was instrumental in creating a new technology that converted an internet of raw data into an internet of known objects. The value of that work was real, tangible, and electrifying. No one had to sell its value by heavy-handed marketing.
That was also a time when every major effort I knew of to create databases of human knowledge began with Wikipedia and strictly emphasized privacy and respect for intellectual property.
Times and people change, not always for the better.
The problem extends well beyond the relatively small number of people working for LLM companies.
The fact that so few computer scientists even acknowledge — to say nothing of speaking out on — the intellectual property issue* makes the situation exceedingly difficult for the ones like Suchir Balaji and Gary Marcus who do.
* and other issues like racial and other biases associated with the use of AI in general.
This has been going on for decades. This A.I. thing is the second go-round. The same thing went on in the zeroes with what is called social media. No copyright royalties were paid. So I am saying: why are people shocked? This has been consistent behavior that people just choose not to see.
I have been saying this for two-plus decades when it comes to Silicon Valley. They have been involved in this criminal activity with the co-sign, wink wink, of the so-called lawmakers. Nothing new here. I am just glad folks are finally acknowledging that much of this innovation is based on fraudulent acts of crime.
With money as speech, the lawmakers who might oppose this have an impossible time getting elected.
Thank you for that, Amy. I do not see the lawmakers as the solution now because the system is so co-opted. The system would have to collapse. Also, what are called lawmakers are not the same as before. The people will have to stand up and say no! Now I know that may sound far out, but that is the only way.
I have small children so system collapse is not an attractive solution. Real people taking civic action so that we can achieve change without destroying the good things that we take for granted would be my preference.
Hi Amy, I understand why you say that, given your small children. I just do not see any other way at this point. We will all have to get dirty and uncomfortable. We have had a comfortable life here in America. America is going to have to be uncomfortable, because convenience and comfort got us here. I am also saying that society has been altered, and not for the best. People's mindsets have been adjusted, and not for the best. The American diet must change, because it is worsening the mental issues people have, along with what is called a healthcare system, which is a poison-based system to keep people sickly. That is why things must come down. This system has caused so many problems due to our comforts.
Not only is AI bad at copyright, it also doesn't seem to understand that the Avengers are Marvel and Superman is DC.
That is disrespectful! For them it is just content, not media. It is free, not the sweat and toil it took to create.
LLM-based AI knows nothing!
It is truly a dumb pattern follower, not an intelligence.
It AIn’telligence, that’s for sure
I'm always suspicious when a suicide conveniently allows the plundering class to keep on keeping on. Aaron Swartz, *TWO* Boeing whistleblowers, and Suchir Balaji.
In addition to Art Keller's comparison with Yann LeCun, these people want everything "democratized" so they can train on the entire "input" of humanity, but they are absolutely adamant that you will pay for the "output" whether it's with money or data. Them's the rules because they paid good money for lawmakers and judges to make them that way.
Yes, these "suicides" are awfully convenient, aren't they? In each and every case, potentially billions of dollars at stake.
Most conspiracy theories fall apart when you realize how dumb and disorganized most people are. I can certainly imagine some of these tech CEOs wanting to see their critics dead, but them actually pulling off a murder-for-hire plot (pretty sure tech CEOs will not be doing their own killing) and not getting found out?
No. I cannot see that happening.
It's much easier to believe the robber barons made the lives of the whistleblowers so completely hellish by using the legal system as an instrument of torture that death seemed preferable.
Suicide is possible, but in Balaji's case, there are messages on X attributed to his parents that contradict the suicide hypothesis. Of course, this needs to be verified, as they could be fake, but that's the police's job. You can check the account https://x.com/RaoPoornima/ on X.
Given the large sums of money at stake, it's possible that organized crime is involved. This could be a motive for foul play, perhaps to protect investments or for future gains. It sounds like the plot of a thriller movie!
That said, the Earth is round, vaccines are effective, climate change is real and caused by human activities, and sometimes, conspiracies do exist.
It's true, most conspiracy theories are nonsense. Most. But not all. Kennedy, Watergate, the Bay of Pigs, Iran/Contra, etc. were all actual conspiracies that were hidden—or attempted to be hidden—from the public at large, and only partially disentangled. Same with the Condon Report. So there are bona fide conspiracies that do occur. All these recent whistleblowers ending up dead, when billions are at stake, is highly suspicious and a level of coincidence that, frankly, beggars belief. One of the Boeing whistleblowers actually made a point of formally stating to family and friends that he was NOT in any way suicidal, and that if he ended up dead it was not by his own hand and an investigation should be launched.
Examine your thinking. You are trying to create patterns without sufficient evidence. It's desirable to have everything make sense, to believe that people should not be so miserable after doing the right thing that they kill themselves, but it happens. Good people kill themselves all the time.
Spinning conspiracies out of nothing is bad. It whips up mobs into doing tragic things, like rejecting vaccines.
Let's examine *your* thinking for a moment, which appears to be a bit uncritical, frankly. Yes, it is true that conspiracy theories more often than not turn out to be false. Or as Niccolo Machiavelli said (and he would know), "Always assume incompetence before looking for conspiracy." With that being said, it's not about what's theoretical. It's not even about Occam's Razor. You have to carefully gather and examine all the data. So while it's true to say that most conspiracies turn out to be false, most is not all (as I indicated citing examples above). Just because you believe something "cannot be," does not *automatically* mean that it "isn't."
Fair use is a defense, not a right. If you use copyrighted material and make a lot of money, you're going to get sued - even if you use just a little.
Especially if you use the copyrighted material owned by large media conglomerates. They've already had to pay out an incredible sum to the New York Times. If it turns out there's a bunch of Disney and/or Time Warner material that's been fed into those algorithms they're done for. The bubble will pop the instant either of those media empires dispatches their lawyers to crush them.
Item of interest: I have a friend who was preparing an article on US Presidents. He had Grok generate images for the article - in most cases he got a good image, but for 3 of the presidents he couldn't get an accurate one.
LLMs can't tell you what data they were trained on, but it appears that if there are enough labeled images, they can get pretty close.
Tried to get an image of Mickey Mouse; it refused based on copyright - but it offered this: "...I can create an image of a character inspired by Mickey Mouse...". I'll bet Mickey is in the training data somewhere.
Yes, it did pretty well.
Condolences to his family and friends.
There is clear evidence that the makers of these tools know that their fair use argument is weak, most obviously in that they have tried to hide their training data. Values and principles just don’t exist, it seems.
I worry that the US courts are moving too slowly here. With Sam Altman literally paying homage to Trump post-election, it feels like OpenAI and Microsoft will be ramping up the lobbying in DC next year to carve out an explicit fair use exception for pre-training. And knowing their anticompetitive tendencies, I'm sure they'll shape the law in a way that only the top labs get that fair use exception.
Is this one of those "suicides" like when a person shoots themselves in the back of the head? Sort of like the Boeing whistleblower "suicides" from earlier this year?
Yep, like the suicide with 2 holes in the back of the head...
On the four-factor test: I’m not a lawyer, but I don’t think it’s at all clear-cut that the use of copyrighted material in training LLMs is a violation of copyright, even when portions of the copyrighted material are reproduced verbatim in the output.
The questions that still have to be answered in the courts are:
Is the purpose and character of the use (factor 1) primarily of a commercial nature, or for nonprofit educational purposes? Here OpenAI’s unique business structure may actually work in its favor (and indeed may very well have been a factor when it was set up that way).
What is “the amount and substantiality of the portion used in relation to the copyrighted work as a whole” (factor 3)? In Campbell v. Acuff-Rose Music, the Supreme Court, in remanding the case, required the lower courts to consider the “transformative elements” of the derivative work, and “transformation” has since become a key part of the legal cases when the third factor is invoked. Just because a substantial portion of the output from a copyrighted work is duplicated does not necessarily mean that the copyrighted work hasn’t been transformed, under law.
What is the effect of the derivative work on the potential market for, or value of, the copyrighted work (factor 4)? Is the value of, say, the Mario Bros. reduced by the use of DALL-E to generate an image that’s used in a different context? The courts have already dismissed or greatly reduced the ability of plaintiffs claiming copyright violations by requiring them to show that they have been harmed by the availability of their copyrighted works inside of an AI tool, and the courts do seem at least initially inclined to a position that there is no “market substitution” in the way that copyrighted works are used within AI.
I'm also skeptical that current copyright law is up to this task, but like you I have no legal expertise. It seems to me that "training an AI" is a use case that simply has no good analogues in the world that existed before GenAI.
I hope I'm wrong, because as far as I'm concerned this is flagrant theft, and they're getting away with it because most people don't understand that the kind of "learning" LLMs do is radically different from the "learning" humans do. The tech companies say "hey, if people are allowed to learn how to be creative from copyrighted material, AIs should too", which is both specious and compelling to a lay person.
We need new legislation on this. I'm not optimistic Congress is up for it. My one hope is that the traditional IP-holding corporations (Disney, Sony...) hold more political sway than the tech companies.
OpenAI is transitioning toward a for-profit structure, though. I don't know that telling a judge you were a nonprofit when the alleged infringement first occurred will hold much weight, but I'm not a lawyer.
What many people miss in these debates is how asinine popular culture, and therefore culture in the Western world generally, has become since the dawn of the Internet age, accelerated by the social media age. Spotify created a culture of commodification of music. It's the John the Baptist to the anti-Jesus of fake tunes. It's Christmas 2024 in the privileged Western world. Look at the music charts: Mariah, Wham!, The Pogues, Sinatra. Big tech destroys culture.
Right on, you are!
AI training data issues go beyond copyright—they also involve privacy. AI labs often scrape any data they can access, using web crawlers, obscure opt-out terms of service, and potentially other questionable methods, with little to no concern for ethical considerations. This serious issue has been ignored from the beginning, with no regulatory framework or oversight in place.
As the most visible AI lab, OpenAI has a responsibility to set a positive example—a responsibility they have largely failed to meet. The glaring lack of transparency and accountability in the industry cannot be overlooked. Even simple processes like reverse search or addressing copyright reveal significant flaws in data governance and lifecycle management.
In this deep dive, I cover the main issues and debunk common counterarguments.
https://ai-cosmos.hashnode.dev/the-great-opacity-why-ai-labs-need-to-come-clean-about-their-data
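An aside to make the opt-out point above concrete: the only widely deployed opt-out mechanism today is robots.txt, and honoring it is entirely voluntary. Here is a minimal Python sketch (standard library only; the crawler name "HypotheticalBot" is made up for illustration) of the check that a crawler *choosing* to respect opt-outs would perform before fetching a page:

```python
# Minimal sketch: consult a site's robots.txt before fetching a page.
# Nothing forces a crawler to run this check -- compliance is voluntary,
# which is exactly the weakness of opt-out as a data-governance control.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def may_fetch(url: str, user_agent: str = "HypotheticalBot") -> bool:
    """Return True if the site's robots.txt allows user_agent to fetch url."""
    parts = urlparse(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # downloads and parses the site's robots.txt
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    url = "https://example.com/some-article"
    if may_fetch(url):
        print("robots.txt permits fetching:", url)
    else:
        print("page is opted out; a respectful crawler skips it")
```

A crawler that skips this check faces no technical barrier at all, which is why opt-out regimes put the entire burden on the data owner.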
Saltman hasn't said a word as far as I know. He doesn't even make a public condolence tweet for a former employee who died? He's just carrying on with his stupid 12 days of AI hype Christmas regardless.
After co-writing my book, Understanding Machine Understanding, with Claude 3.0 Opus, I found out that most of the traditional publishing houses will not touch a manuscript with generative AI content. I understand their position. There are big copyright risks if it turns out that your AI assistant copied material without citation (AI plagiarism). Also, you can't get a valid copyright if too much of it was not the work of a human author, and no one knows where to draw that line. This is on the output side, without going into the general issue of the AI companies' legal problems when they use copyrighted material for training.
Listening to songs, reading books, or looking at images created by another person for inspiration, not for copying, has been recognized as fair use. However, training a commercial product at an industrial scale on copyrighted content for free to then sell services to millions of people, competing directly against copyrighted sources, and potentially generating billions in profit, should not constitute fair use.
As Elon Musk “candidly” admitted (cnb.cx/3RQG7wv), all the GenAI businesses start with a massive “theft” of all available data. All those corporate hackers don't care about copyright. They've understood that they will be able to pay the best lawyers to defend them, exhausting every appeal and legal recourse along the way.
If he was an Indian-American, was he a 1st or 2nd generation immigrant? Was he a US citizen or was he a worker on an H-1B? It would have been nice if someone had reported on that.
My condolences to his family, friends, loved ones.
Louis Hunt of LiquidAI has published a very interesting post on LinkedIn about regurgitation (https://bit.ly/3W3Fwcw) that corroborates the analyses of Suchir Balaji, whom he cites. The tests were done with LLaMA, but they should give similar results with ChatGPT.
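For anyone curious what a regurgitation test looks like in principle, here is my own minimal sketch (not Hunt's or Balaji's actual methodology): prompt the model with the opening of a known text, then score how much of its continuation reproduces the source verbatim, for example as the fraction of shared word n-grams:

```python
# Minimal sketch of a verbatim-overlap score between a model continuation
# and the source text it may have memorized. Real regurgitation studies
# use more careful metrics; this only illustrates the basic idea.
def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(generated: str, source: str, n: int = 8) -> float:
    """Fraction of generated n-grams that appear verbatim in the source."""
    gen, src = ngrams(generated, n), ngrams(source, n)
    return len(gen & src) / len(gen) if gen else 0.0

# Hypothetical usage: in a real test, `generated` would be a model's
# continuation after being prompted with the opening words of `source`.
source = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness")
generated = ("it was the worst of times, it was the age of wisdom, "
             "it was the age of foolishness, it was the epoch of belief")
print(f"verbatim 8-gram overlap: {overlap_score(generated, source):.2f}")
```

A high score on text the model was never shown in the prompt is the kind of evidence these regurgitation analyses point to.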
My fear is that the Trump administration, in cahoots with the tech billionaires, will change the fair use laws in the USA. Could be my prediction for 2025.