161 Comments
Dec 29, 2023 · edited Dec 29, 2023 · Liked by Gary Marcus

I find the AI industry's answers to accusations of infringement ludicrous. They are using huge amounts of IP they don't own as source code for a software application and making money off it. Calling it "training" and "learning" changes nothing. The fact their process converts the content into billions of parameters changes nothing.

Statistical analysis of content is fair use. It is equivalent to reading. But as soon as LLMs start producing words and expressions of ideas that are solely based on this massive amount of ripped off data, they are competing with the owners of the IP. This is infringement, plain and simple.

As the US Copyright Office says: "What is copyright infringement? As a general matter, copyright infringement occurs when a copyrighted work is reproduced, distributed, performed, publicly displayed, or made into a derivative work without the permission of the copyright owner."

The fact that they infringe many copyrights shouldn't let them off the hook either. Rip off one copyright holder and you are infringing; rip off thousands or millions of them and you're OK? I don't think so!

During training, the models minimize the difference between their output and the training data. A well-trained model would therefore output something very close to the training data under the same conditions that were used during training, so it could be argued that a very close copy of the training data exists encoded in the model's weights and can be obtained with the right prompting. Isn't that grounds for copyright infringement? A JPEG-compressed copy of an image is still a copy, even though the image is encoded and one needs a JPEG decoder to get it back.

Dec 29, 2023 · Liked by Gary Marcus

A classic example of "legal/regulatory entrepreneurship," in which a company breaks the rules, moves into a space where they had not previously been violated, and sees how far it can get while building usage and market share (see Uber and the NY taxi cabs). So we can expect the lawyers and negotiations to begin, and at some point attempts to change the law (see independent contractors in CA). The small creators and owners of IP can only hope that the big guys, like the NYT, can get settlements and rules set (with penalties and enforcement) to help them. The Authors Guild lawsuit against these companies is also interesting and worth following for the many published authors. Thanks again, Gary, for the work and clarity!

My guess is that they expect to get away with it because, at least according to my copyright professor in library school, Google’s business model was illegal until the courts decided that the value of a more searchable web outweighed the harm of copying without permission. My bigger question is: how much value do these tools even have? Information with questionable provenance and hallucinations may accelerate the enshittification of the internet. Though I’m sure OpenAI and friends will be happy to sell us a solution to the problems they’re creating.

Dec 29, 2023 · Liked by Gary Marcus

Didn't Adobe create an image-generating AI that was only trained on public-domain images, for just this sort of situation? Maybe their market share is about to increase.

Dec 29, 2023 · Liked by Gary Marcus

Thanks for this, Gary. Big Tech has gone on believing it can do whatever it wants... but this is like a slow boil which will leave the guts of generative AI all over the kitchen floor. I wonder who will have to clean up the mess!

I’m not one to predict, as that assures me of being wrong. So I’ll point to two smart guys who wrote a book (Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity): MIT Professors Daron Acemoglu and Simon Johnson. It should make clear that generative AI may not work out as planned.

Once again, as with self-driving cars, we can create technology that races way ahead of our legal frameworks. This is going to require lawsuits to work through.

Gary, great news. I've liked, restacked, and cross-posted.

I don't understand the problem here. I have numerous artist friends who can reproduce a picture almost perfectly. They often do this too, usually for a laugh or to give to someone as a present. Since they are not making any money off this, nor are they trying to pass the pictures off as their own, no copyright infringement has taken place.

I certainly don't understand the fuss about a computer performing analysis on public works on the internet. After all, everything you view on the internet is copyrighted unless the creator has explicitly made it public domain. Does Google need permission to index copyrighted web pages?

This all seems like massive overreach of copyright law.

Interesting territory we have wandered into... As a teacher and instructional designer, I am trying to design a method for implementing and integrating AI into my writing classroom. First major hurdle: effortless production of mediocre text. First major response: Know your students’ writing voices, change your prompts and assignments. Second major hurdle: Hallucinations. Second major response: ChatGPT 4 and Perplexity can fact-check content in real time. Third major hurdle: Trained on copyrighted materials. Third major response: Who knows? How can I ask a student to use a system to assist in their studies that is a plagiarist by nature?

Raises the issue of the scammy nature of the length of copyright, which survives the author's life for a whopping 70 years. Given the advance in technology and that innovation is based on improving what already exists, copyright should be like patents and last 20 years or so. LLMs should not be kneecapped by ridiculous copyright laws.

I reported on this some time ago too. And realize that just because these images occur does not mean infringement has taken place.

And trademark really is a much harder nut to crack here. The 'certainty principle' of genAI here is "memorisation equals training data leakage."

Good LLM hacking. This will be really hard to fix: every dataset used in training should be free of copyrighted material.

Copyright is dying thanks to generative AI. Good riddance, I just hope it happens as fast as possible.

Microsoft's cash on hand as of September 2023: $143.94B. They will be fine.
