162 Comments
Dec 29, 2023·edited Dec 29, 2023Liked by Gary Marcus

I find the AI industry's answers to accusations of infringement ludicrous. They are using huge amounts of IP they don't own as source code for a software application and making money off it. Calling it "training" and "learning" changes nothing. The fact their process converts the content into billions of parameters changes nothing.

Statistical analysis of content is fair use. It is equivalent to reading. But as soon as LLMs start producing words and expressions of ideas that are solely based on this massive amount of ripped off data, they are competing with the owners of the IP. This is infringement, plain and simple.

As the US Copyright Office says: "What is copyright infringement? As a general matter, copyright infringement occurs when a copyrighted work is reproduced, distributed, performed, publicly displayed, or made into a derivative work without the permission of the copyright owner."

The fact that they infringe many copyrights shouldn't let them off the hook either. Rip off one copyright holder and you are infringing. Rip off thousands or millions of them and then you're OK? I don't think so!

Dec 31, 2023·edited Dec 31, 2023

> I find the AI industry's answer to accusations of infringement ludicrous.

I do too. The problem is that copyright as it's written and enforced is fundamentally inconsistent with free speech, and it's been granted a perpetuity that was never a benefit to society.

Worse, computers cannot fundamentally solve problems related to a dualism of context. Very technically, the only reason computers perform work is because of systems properties that are built in at the hardware (signal domain) level. Software then institutes limitations that preserve these properties, which include determinism and time invariance, among others.

There is no solution that doesn't involve seriously reworking copyright, and property rights should be upheld, but in all fairness IP rights should never have been granted above those of actual patents, and should be tied to actually consistently selling physical goods.

As it stands now they provide little of persisting value except for tax gimmicks. Value can only ever be subjective, and granting a moratorium on cultural concepts almost a hundred years after the author dies is stupid as it can allow revisionists the coercive power to corrupt cultural history.

Edit: It's also impossible to copyright math, and a lot of generative AI is just math. Scope creep should be corrected in a way that enforcement is consistent and immediate, and AI in general is a perfect weapon against any society based in a distribution of labor. It needs to go.

Copyright is a huge benefit to society. It makes it possible for artists, writers, programmers, etc. to make a living. I don't think there's a free speech basis for allowing someone to make use of the work of others for profit. Fair use covers reviews, commentary, journalism, search, etc. I don't see using content to teach an LLM as fair use but it is debatable.

The problem is easily solved by extending (or not) the concept of fair use to cover current AI projects. If an AI company's use of content is NOT covered under fair use, then they have to seek a license from the copyright holders. Since LLMs are going to be with us for the foreseeable future, some standardization would help. Perhaps content owners could agree on standard licensing terms. On the other hand, perhaps there's no deal that will make both parties happy and LLMs will become a historical footnote.

Math is not coverable under copyright but that doesn't matter here. It's not like if you use math in your product then you get a free ticket to violate copyrights.

Value easily becomes objective by the usual process of buyer and seller agreeing on a price.

Tying all IP to only physical goods would kill innovation and create a worldwide economic slump.

Jan 1·edited Jan 1

> I don't think there's a free speech basis for allowing someone to make use of work of others for profit....

Then you clearly have not rationally thought through the implications of your proposal. Worse, you have overgeneralized, a fallacy of flawed logic and irrational thought.

The problem is not so easily solved. There are numerous issues, one of which is the subjectivity of what constitutes copyright. Just like economic value, which is inherently subjective, not objective (as proved by Carl Menger in the 1870s), there is no universally agreed-upon definition of copyright that is consistent; it nearly always comes down to what an adjudicator (a bureaucracy) says. With the monetary barrier to defending such claims being high, only those who have the money will benefit; certainly not society as a whole, especially where copyrighted works are not being sold on any physical media (as is now happening with streaming platforms). People want to buy these things but they can't. I wanted to get a copy of Altered Carbon for my collection, since our internet goes out unexpectedly so often, but they refuse to sell it other than through their platform; they coercively dictate how and what I can watch in the confines of my own home.

Worse, the enforcement bureaucracy does not scale. This structurally and inevitably leads to proposals of greater coercive control, often without due process, following Hegelian dialectics or what's more commonly known as shock doctrine. Crises are engineered, and those entrusted violate that trust. We have seen this over the past two decades with abuses of the DMCA notice systems; today our society is ineffably but surely less than it was 20 years ago. These frameworks are used to censor thought and for thought reform, and simply by someone claiming that your work violates copyright you can be silenced or have your production stolen. People lie when it benefits them in systems that are fundamentally corrupt.

You say math is not coverable under copyright, and the law says it strictly isn't, but then this also is not consistent. Anything a computer touches is technically math, but this isn't recognized, as you just so aptly pointed out. So by further extension of copyright you now have software that is strictly math and that is copyrighted; this is commonly seen in software patents. This may go so far as to include common interfaces, and that now lets you, the copyright holder, prevent innovation or interoperability, since derivative works are considered part of copyright. You've built a moat, and locked the market into a concentration cycle where the market shrinks.

In other words, through coercive use of copyright you can now have a spectrum of math that is consistently claimed as copyright. Encoded content runs into the same issues with DMCA notice systems. You simply claim it's the same thing, with the reasoning being that the hashes match, but hashes are not unique. They have collisions, so now you have copyright further extended. This cycle continues repeatedly until things become intolerable, or people can no longer meet bare subsistence. These tactics do not just apply to copyright; they apply to all business in the market that copyright touches.
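
To make the point about hash collisions concrete, here is a minimal sketch. It assumes a deliberately weakened hash: `short_hash` is a hypothetical helper that keeps only 16 bits of SHA-256, so two different inputs with the same value turn up within about 65,000 tries (pigeonhole principle). No collision is publicly known for full-length SHA-256, so this illustrates the principle rather than a practical way to forge a match.

```python
import hashlib
from itertools import count


def short_hash(data: bytes, bits: int = 16) -> int:
    """Hypothetical helper: keep only the top `bits` (<= 32) bits of SHA-256."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits)


def find_collision(bits: int = 16):
    """Return two different inputs whose truncated hashes are equal."""
    seen = {}  # truncated hash -> first input that produced it
    for i in count():
        data = f"file-{i}".encode()  # made-up stand-ins for file contents
        h = short_hash(data, bits)
        if h in seen:
            return seen[h], data, h
        seen[h] = data


a, b, h = find_collision()
print(a, b, hex(h))  # two different byte strings, one shared 16-bit hash
```

The shorter the fingerprint a matching system uses, the more often unrelated content will share one; a match is evidence of identical content, not proof.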

There can be no innovation when copyright is used as a roadblock for entering the market. It can be applied to most anything indirectly. Copyright lasts longer than a human lives, and with those controlling it in partnership with banks, you have old-style feudalism, but worse.

There is no future when you can't innovate, and coercive control prevents innovation systemically by those with a vested interest in preserving the status quo.

The fear of being ruined just for trying something out will act to shrink markets.

"When you have made evil the means of survival, do not expect men to remain good. Do not expect them to stay moral and lose their lives for the purpose of becoming the fodder of the immoral. Do not expect them to produce, when production is punished and looting rewarded. Do not ask, ‘Who is destroying the world?’ You are.” <Atlas Shrugged, but very appropriate>

The structural issues and failure domains I've mentioned were all known in the 1950s, covered in greatest detail by Ludwig von Mises in his essays on Socialism; most people haven't received proper education in rational thinking and choose instead to try to lie to themselves and others about a way out.

The type of world you are promoting is one of intolerable suffering, and has no future. Are you really sure you want that for yourself or your children?

Carl Menger showed thoroughly and in great detail that value can only ever be subjective. I refer you to his published works. If you have some monumental contribution that overturns this rationally, I'm sure most of academia would want to hear about it, but I'll temper my expectations of that happening.

You fundamentally don't understand quite a lot with regards to how coercion, corruption, centralized systems, or bureaucracies work, nor how they fail.

You don't hire a non-engineer to build a dam and expect anything except failure.

"Anything a computer touches is technically math"

This is so naïve as to make the rest of this diatribe not worth reading.

Jan 1·edited Jan 1

You can think whatever you want, even false things, at your and your family's own peril. Pushing for your own destruction and that of those around you is what happens when you can't discern truth from falsehood. In this case you would be dead wrong. Computers can only operate on math, the things they touch are math, and you can't copyright math, but that hasn't stopped foolish individuals from pushing magical, irrational thinking.

It happens to be true, but you need a fairly deep background in theoretical math, which gets applied to systems and signals and is mostly abstracted away.

For those without the exposure to this background, that is the set of classes after Intermediate Calculus (i.e. Discrete Mathematics, Linear Algebra, Abstract Math [Modern Algebra as mathematicians call it], and Combinatorics).

Only a fool discounts and minimizes what he/she doesn't know, and that matters when it comes to safety-critical systems. If that fool is running towards a cliff without seeing it, most good people would warn them, but then afterwards just get out of the way and let nature run its course. Basic survival of the fittest.

If you think you're doing good, you'll be in for a very rude surprise when you get to those pearly gates and get turned away. Choosing to believe in things that are untrue doesn't absolve any responsibility for your actions when things predictably fail with negative consequences.

One of the few true evils is in not thinking. That doesn't require agreement, it requires proper rational discernment which you seem incapable of.

Mar 29·edited Mar 30

While I agree with your premise that everything a computer sees is mathematics, not everything a human sees is representative of (or representable by, many would argue) mathematics. Once a human sees it, the abstraction is lost, and the mathematical structure or object has become a realization to which we ascribe a descriptor, typically qualitative, and a value, also typically qualitative. It is to these values and descriptors as such that people claim some kind of right (or ownership), even though I disagree with their premise. On the other hand, I somehow agree with the line of your personal arguments (e.g. copyright claims abuse freedoms, and more). Things like Mario or Goku or the Mona Lisa or brands like Nike and many others are so universal that hiring an extremely knowledgeable artist to draw them for you should not be copyright infringement. I can think of many ways that AI companies could re-train their models on factual or learned information to transcribe or re-generate the information said differently, or the image or video generated over a slight variation or small disturbance, but even then copyright claimants would come hunting for profit (where do we draw the line?). I'm with the AI companies on this one, and anyone working against them seems to be working against innovation and creativity, because over time we can expect that sufficiently advanced models or algorithms will be able to generate emergent outputs of "good" (or "high") quality (e.g., novel and creative).

During training the models minimize the difference between their output and the training data, so if the model is well trained it would output stuff very close to the training data under the same conditions that were used during the training, so it could be argued that a very close copy of the training data exists encoded in the model's weights and can be obtained with the right prompting. Isn't that grounds for copyright infringement? It's like a jpegged copy of an image is still a copy even though the image is encoded and to get it one needs a jpeg decoder.
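
A toy sketch of that argument, under strong assumptions: the "model" below has one free parameter per training value (far more capacity relative to its data than a real LLM or image model, which shares parameters across many examples), and `training_pixels` is made-up data. With that much capacity, minimizing squared error by gradient descent leaves the weights holding the training data almost exactly.

```python
def train_lookup_model(targets, steps=50, lr=0.25):
    """Gradient descent on squared error, one free parameter per training value."""
    params = [0.0] * len(targets)
    for _ in range(steps):
        for i, t in enumerate(targets):
            grad = 2.0 * (params[i] - t)  # derivative of (p - t)^2 with respect to p
            params[i] -= lr * grad        # with lr=0.25 the error halves each step
    return params


training_pixels = [0.12, 0.80, 0.33, 0.95, 0.50]  # hypothetical "image"
weights = train_lookup_model(training_pixels)
max_error = max(abs(w - t) for w, t in zip(weights, training_pixels))
print(weights)
print(max_error)
```

How close real models come to this extreme for any particular work is an empirical question; the sketch only shows that "it is stored as numbers" and "it is a copy" are not mutually exclusive.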

Great point! Especially because it is the "AI" fans who claim that :)

A classic example of "legal/regulatory entrepreneurship," in which a company breaks the rules, comes into a space not previously violated, and sees how far it can get while building usage and market share (see Uber and the NY taxi cabs). So we can expect the lawyers and negotiations to begin. And at some point attempts to change the law (see independent contractors in CA). The small creators and owners of IP can only hope that the big guys, like the NYT, can get settlements and rules set (with penalties and enforcement) to help them. The Authors Guild lawsuit against these companies is also interesting and worth following for the many published authors. Thanks again Gary for the work and clarity!

My guess is that they expect to get away with it because, at least according to my copyright professor in library school, Google’s business model was illegal until the courts decided that the value of a more searchable web outweighed the harm of copying without permission. My bigger question is, how much value do these tools even have? Information with questionable provenance and hallucinations may accelerate the enshittification of the internet. Though I’m sure OpenAI and friends will be happy to sell us a solution to the problems they’re creating.

Given that to view a web page a copy has to be created on your local machine, the whole internet would have had to be declared illegal with such a narrow reading of copyright law.

Dec 31, 2023Liked by Gary Marcus

When you display a web page on your computer, there is clearly a reference on it; the data source is indicated. At least in most cases, because there are infringements. When an AI bot generates a picture or a text which is a copy of an existing one, there is no reference to any data source. The legal liability is on the AI tool provider's side in my opinion, not on the user's side, the user not being informed about the sources.

It can't possibly be on the tool provider's side as the tool is not publishing anything publicly. It has no intentionality. It's the way the user uses the content generated that leads to copyright infringement.

I agree that there is no intention to infringe in the way the AI systems process the data. Moreover, at the present state of the technology the AI tools are not able to give their references. This should change, by the way. But I am not sure that public publishing is not at issue. In my non-expert opinion, it is. Generative AI publishes on demand, on purpose, according to the subject defined by the user. The generated content comes with no warning and no restriction to strictly personal or family use (or none that I am aware of). It also comes with no citation of sources. As if it were a completely new product at the full disposal of the user.

Dec 31, 2023Liked by Gary Marcus

Google copying and databasing content is not the same as viewing content without saving or reusing it. Particularly given that publicly available websites are being viewed with the permission and intent of the publishers. Taking something with permission is not the same as asking for forgiveness after taking it without asking. Ethics. Eesh.

I have no idea what you're responding to. No one is copying and databasing content. Statistical analysis is performed on publicly available content.

Dec 31, 2023Liked by Gary Marcus

They did copy and database the content. That’s why people refer to known datasets that have been used to build genAI tools. And the reason that OpenAI has tried to make it difficult to know what training data they took is that 1) they know it may not have been legal and they want to avoid liability and 2) much of the training data was offensive and they don’t want to be held responsible for training with racist and similar content.

the fact that you think database is a verb tells me you need to spend a lot more time understanding how the internet works

Didn't Adobe create an image generating AI that was only trained on public-domain images, for just this sort of situation? Maybe their market share is about to increase.

Adobe's nominally ethical solution to this is not without its own pernicious and sinister forms of exploitation, at best, if not outright infringement.

Dec 30, 2023Liked by Gary Marcus

There are some variants of Creative Commons Copyrights that Adobe et al. could be infringing on, and perhaps others not, but I'm not in a position to say.

Infringement is spurious and copyright should not exist :)

I think that at the present stage when there is no regulation on generative AI and protection of IP incorporated in the training data, these systems should be trained on public domain content only. That is the first rule to be enforced, and there is no need for new laws to enforce it in my opinion. The companies providing generative AI tools are breaking the law just now.

Dec 29, 2023Liked by Gary Marcus

Thanks for this Gary. Big Tech has gone on believing it can do whatever it wants...but this is like a slow boil which will leave the guts of generative AI all over the kitchen floor. I wonder who will have to clean up the mess!

I’m not one to predict as that assures me of being wrong. So I’ll point to two smart guys who wrote a book (Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity) -- MIT Professors Daron Acemoglu and Simon Johnson. It should make clear that generative AI may not work out as planned.

Wonderful book! We had Simon Johnson interviewed at the Commonwealth Club of CA this year (you can find it in our Past Events area). Read the book and found their emphasis on the choices we can make as a society compelling. We just need an informed and active public!

Thanks! Will look for the interview

Once again, as with self-driving cars, we can create technology that races way ahead of our legal frameworks. This is going to require lawsuits to work through.

Gary, great news. I've liked, restacked, and cross posted.

I don't understand the problem here. I have numerous artist friends that can reproduce a picture almost perfectly. They often do this too, usually for a laugh or to give to someone as a present. Since they are not making any money off this, nor are they trying to pass the pictures off as their own, no copyright infringement has taken place.

I certainly don't understand the fuss of a computer performing analysis on public works on the internet. After all, everything you view on the internet is copyrighted unless the creator has explicitly made it public domain. Does google need permission to index copyrighted web pages?

This all seems like massive overreach of copyright law.

OpenAI is selling their copying service for money and granting ownership of the copies to customers. Your friend is not.

OpenAI is not a copying service, so that's already wrong. They are granting access to a tool, just like Adobe grants access to users of Photoshop. If the users do something illegal with the tool, that's still the user's problem, not the tool's. Finally, as I said, my friend also gives away copies to friends who now have ownership of that copy. Still not copyright infringement.

Two different issues here that people don't usually keep apart in this discussion: (1) Did the developers have the right to use all the works they used for training? (2) Is it copyright infringement when PlagiarismBot 4.0 reproduces copyrighted or trademarked material? In the second case, it seems to me that it is indeed mostly about how the user uses PlagiarismBot, with the caveat that the examples posted in the OP look like the bot may make it very difficult for them not to plagiarize by accident.

But the first case is a question entirely about the developers who are earning money with the models. The analogy here is the friend being a slave who is trained to copy an artist's work so that the slave owner can earn money from selling works that the original artist would have sold instead. IANAL, but if somebody builds a model from my work with the explicit intention of monetising how that model puts me out of business, I would start asking pointed questions about whether they were allowed to do that.

This sort of thing has always had the same answer: if you don't want anyone to see your work, don't put it on the public internet. If OpenAI had used private content not easily available, then yes, they should be prosecuted.

Jan 4·edited Jan 4Liked by Gary Marcus

Replying to your comment about "it wasn't private material": "Publicly available" does not equal "in the public domain". Physical/dead tree books and newspapers are publicly available, but still protected by copyright law. Even if you find a copy of the NYT in a trash bin, you are not at liberty to reproduce it. You have to comply with Fair Use. You cannot, for instance, make copies of the articles except under certain circumstances; ask any teacher or student. You certainly cannot make copies, bind them in a book, and sell it. Nor can you distribute them for free, either on paper or in digital form. And the lawsuit contains examples of extensive passages from NYT articles being reproduced word for word by ChatGPT. You might check the NYT copyright policy; it allows people to make one digital and one physical copy for personal use. Commercial use is strictly forbidden and ChatGPT is certainly a commercial use at this point.

You might want to read the whole thread rather than jumping into the centre of one. I never equated 'public to view' with 'public domain', but I will let you continue with this non-argument on your own time.

That's not how any of that works, and surely you know that? The various Creative Commons non-commercial licenses mean that a work can well be public but you still can't use the IP to earn money, and these models are earning money. And trademarks are also publicly visible, yet you still don't get to sell your own Harry Potter comic or game without buying the rights to do that.

OpenAI so far has not been shown to be using anyone's IP, unless you classify statistically analysing a public web page as 'using IP', an idea that has failed multiple times in the history of the Web. They are also not selling you any trademarks. However, you can use their tool to generate something that has copyright/trademark consequences if you try to make money from it... just like Photoshop and many other tools, and surely you know that?

This is what I was thinking, too. There's plenty of capability to produce copyrighted content without AI already. I was thinking the same about artists I know who can reproduce content. Or I can copy and paste an article and illegally claim it as my own; I don't need AI for that. The copyright focus should remain on the user, not the tool.

I'll be interested to see if the history of web scraping for indexing is used as part of the argument in AI's favor.

Interesting territory we have wandered into... As a teacher and instructional designer, I am trying to design a method for implementing and integrating AI into my writing classroom. First major hurdle: effortless production of mediocre text. First major response: Know your students’ writing voices, change your prompts and assignments. Second major hurdle: Hallucinations. Second major response: ChatGPT 4 and Perplexity can fact check content in real-time. Third major challenge: Trained on copyrighted materials. Third response: Who knows? How can I ask a student to use a system to assist in their studies that is a plagiarist by nature?

Raises the issue of the scammy nature of the length of copyright, which survives the author's life for a whopping 70 years. Given the advance in technology and that innovation is based on improving what already exists, copyright should be like patents and last 20 years or so. LLMs should not be kneecapped by ridiculous copyright laws.

Trademark is essentially without end date. It exists as long as it is enforced.

Trademark is outside of this discussion and only concerns trade (the *sale* of goods and services).

I'm afraid this is not entirely the case. If a model produces Mickey Mouse perfectly in a novel cartoon and you use that in your Substack, which has paid subscribers, you're not infringing copyright but you are infringing trademark. And I suspect Disney may have to act on the fact that MJ has produced a 'Mickey Mouse cartoon machine', because if they do not, they are not enforcing their trademark and could eventually lose it in another case. So, trademark is indeed different, but I don't think it is entirely outside this discussion.

Agree, but I thought the point here was whether LLMs should be completely ignorant of the last 100 years or so due to copyright, not whether you can use their output as is.

They cannot function if they have to ignore the last 70 years or so. But the question is if their purpose is (in the US) 'fair use' and thus covers their free ride on what is on the internet. My suspicion is that the scale and effect of this use (during training) is going to be 'not fair use', in which case they have to pay for the right to use it in the way they need. Even if they pay and this gets resolved, the trademark issue stays and is probably unsolvable.

I reported on this some time ago too. And realize that just because these images occur does not mean there is infringement.

author

That’s actually complicated, but the main point is that it makes users vulnerable to infringement claims.

Yes I know since I have been working on it far longer than you have. If you want me to explain let me know. Wait till you see how you destroyed patent too. Uggghhh.

In light of the lengths Disney has gone to in order to protect Mickey Mouse, I can't imagine how far they will go to protect Star Wars! No need to explain to me yet, as it will be done in massive amounts as the legal profession and points of view get going! I will just wait and watch the fun!

Sadly, it’s not gonna be fun for the hundreds of thousands of millions of jobs lost. Here in Los Angeles, and especially West LA, the entire economy is dependent upon real people production and copyright. The destruction of that and the change to our social fabric is going to be incalculable. And that’s why I am so annoyed at these people who are only now discovering that their work is destroying other people's work.

hundreds of thousands of millions of jobs in West LA? you sure?

Perhaps Gary should send his screen shots to the folks at Disney. They might shut this stuff down by next week!

author

Don’t know the timing on it but I am sure they are working on it

Yes, this is a complicated space. I recall, from the music industry, a case involving a song by the great Marvin Gaye, that had eerily similar beats and sounds. I think the family eventually won that. Bring in the lawyers and gum up the works ASAP!

During training the models minimize the difference between their output and the training data, so if the model is well trained it would output stuff very close to the training data under the same conditions that were used during the training, so it could be argued that a very close copy of the training data exists encoded in the model's weights and can be obtained with the right prompting. Isn't that grounds for copyright infringement? It's like a jpegged copy of an image is still a copy even though it's not exact and the image is encoded and to get it one needs a jpeg decoder.

No, because the way AI deconstructs stuff into curves and lines and angles, it avoids any copyright infringement – it’s literally not copying the thing but parts of the thing. And a second part is that the things it copies aren’t protectable: ideas, lines, concepts, etc. aren’t copyrightable. In other words, and I highly recommend looking at my Substack, AI doesn’t make copies that are infringing, and it doesn’t ingest by copying. So copyright is not involved at all.

Dec 30, 2023Liked by Gary Marcus

Don't get us lost in the weeds. I don't think copyright law cares HOW a work was created, only WHETHER it SEEMS to derive from a copyrighted work. Of course there are grey areas, but they're not specific to AI-generated work.

No, that’s not right. There are no grey areas. copyright doesn’t prevent or protect what AI does

author
Dec 30, 2023·edited Dec 30, 2023Author

This is your opinion. Clearly there are others here who disagree. It is logically possible that you are smarter than all the lawyers and experts that NYT engaged but others of us are not persuaded. And the assertion of your controversial view as absolute fact when others of us see room to disagree has become tiresome.

lawyers like money, not expertise. they would argue that the moon is made of cheese if you paid them

Hmm, just for fun, let's ask ChatGPT, using your example (from the December 23rd article's comment thread) of Laion5b and your claim that its usage of data from the internet "doesn’t copy any of the images".

Why do this? Because I pointed out that you are blurring together the processing of images with the loading of them. Because it's irrelevant that databases such as Laion5b that use images from the internet "doesn’t copy any of the images": even if it's just a URL, the image is used as-is at the very beginning of a process. Deconstructing and reconstructing images by a program is a distinct step after they are found and accessed via a link.

Indeed, what ChatGPT 4 “thinks” about it is fairly self-incriminating for these AIs I would say, relative to what you said about them not “copying”! :)

The first prompt:

"How does a dataset that an AI is trained on work? For example with Laion5b, it has links or URLs to images and image-text pairs. Are the image used unaltered at the beginning of the process? That is, are they used as-is and 'copied' in some sense, or are they processed in some way that alters them at the very outset? "

ChatGPT Answer:

"The process of preparing a dataset like LAION-5B for training an AI, especially one focused on image and text understanding, involves several steps. Here's a general overview of how it works:

Collection of Data: The first step is collecting the data. For a dataset like LAION-5B, this often involves scraping the web for images and their associated text. This text could be captions, alt-text, or other relevant descriptions linked to the image.

Filtering and Cleaning: Once the data is collected, it often undergoes a filtering process. This can involve removing inappropriate content, duplicates, or low-quality images. The goal is to ensure that the dataset contains relevant and useful information for the AI to learn from.

Preprocessing: Before being used in training, the data may undergo some preprocessing. For images, this could involve resizing them to a standard size or format, adjusting color balances, or applying other transformations to make the dataset more uniform. Text might be normalized (e.g., converting all text to lowercase, removing punctuation) or tokenized (breaking down into individual words or phrases).

Annotation or Labeling (if needed): In some cases, datasets may require additional labeling or annotation. For instance, if the dataset is intended for object recognition tasks, each image might need labels indicating what objects are present and where they are located.

Data Storage and Organization: The processed data is then organized, often in a format that is easy for machine learning models to access. This could involve pairing each image with its corresponding text and storing them in a database or file system.

Usage in Training: During training, the AI model is exposed to this data. The model learns to understand and generate content by finding patterns in the image-text pairs. This training involves adjusting the model's internal parameters to reduce errors in its predictions or outputs.

In the context of LAION-5B or similar datasets, the images are generally used as-is, in the sense that they are not creatively altered (like adding or removing objects from an image). However, they may undergo technical alterations like resizing or format conversion to make them suitable for the AI training process. The key point is that these alterations are usually about making the data consistent and usable for the model, rather than changing the substantive content of the images."
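To make the "Preprocessing" step in the answer above concrete, here is a minimal sketch of the kind of technical alteration it describes: lowercasing and tokenizing a caption, and computing a resize that preserves aspect ratio. This is an illustration only, not LAION's or any vendor's actual pipeline; the function names and the 224-pixel target are assumptions made for the example.

```python
import string

# Hypothetical helpers for illustration; not taken from any real dataset pipeline.

def normalize_caption(text):
    """Lowercase a caption, strip ASCII punctuation, and split it into word tokens."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

def resize_dims(width, height, target=224):
    """Return new (width, height) with the shorter side scaled to `target`.

    The aspect ratio is preserved, so the picture content is unchanged;
    only its resolution differs. The default of 224 is an assumed value.
    """
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

tokens = normalize_caption("A red Fox, jumping!")
new_size = resize_dims(640, 480)
```

Both operations change the representation (case, punctuation, resolution) while leaving the substantive content recognizable, which is the distinction the answer draws.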

I then asked it (prompt 2):

"You said 'For a dataset like LAION-5B, this often involves scraping the web for images and their associated text.' From this statement it's not clear whether LAION-5B 'scrapes' the web or not, or what it would mean in such a database to not 'scrape'? "

ChatGPT Answer:

"You're right to seek clarification about the use of the term "scraping" in the context of the LAION-5B dataset. Here's a more detailed explanation:

Web Scraping: This is a process where a program or algorithm automatically browses the web and extracts information from websites. This could include downloading images, text, and other data. Web scraping is often used in building large datasets because it can efficiently gather vast amounts of data from diverse sources across the internet.

LAION-5B and Scraping: The LAION-5B dataset specifically consists of a large collection of image-text pairs. These pairs were indeed gathered from the internet, implying that some form of web scraping or automated data extraction was used. However, the term "scraping" can sometimes carry a broad meaning, ranging from highly automated and indiscriminate data collection to more targeted and curated approaches.

Alternative Methods: In theory, a dataset could be compiled without scraping, such as by manually curating images and texts, or by using data contributed or licensed by users. However, for a dataset as large as LAION-5B, which contains billions of image-text pairs, manual compilation is impractical, and scraping is the most feasible method.

Ethical and Legal Considerations: It's important to note that web scraping, especially at the scale of LAION-5B, raises ethical and legal considerations. This includes respecting copyright laws, privacy concerns, and the terms of service of websites. Dataset creators must navigate these issues carefully to ensure that their data collection practices are responsible and lawful.

In summary, the LAION-5B dataset's construction likely involved web scraping to gather its massive collection of image-text pairs. This method is common for creating large-scale datasets needed for training advanced AI models, particularly in the domain of image and text understanding."

My comment: So yes, AIs scrape and copy. That's how they work.

Dec 30, 2023 · Liked by Gary Marcus

I'm not a lawyer. But as other commentators point out, I cannot imagine that a particular method of reproducing copyrighted material or deriving something from copyrighted material matters.

If OpenAI makes a statistical argument about how the NYT text came to be generated, the NYT's lawyers could make a statistical counter-argument (how likely is it that the model would produce that text by accident, etc.). Can OpenAI demonstrate reliably that their LLM does NOT copy text, images etc.? If they cannot do this, shouldn't the simple appearance of copying be sufficient to decide on a violation and make them pay?

Should OpenAI and other LLM providers consider censorship measures, something the Chinese models seem to be required to do (see ChinaTalk, Dec 7, Jordan Schneider and Irene Zhang)? Or should they take trademarks, copyrights and patents into account during the training phases and then either refuse to produce output or make a payment (by the company or the user?)?

Anyway, even though I like playing with LLMs and Dall-E, there seems to be something parasitic about them, reflecting the typical capitalist behavior of privatizing profits and socializing losses. Why should these companies get a Get Out of Jail Free card? Or is IP theft the price to pay for innovation, just an externality? A price to pay in addition to the costs imposed on societies through the projected massive losses of jobs thanks to AI? None of OpenAI, MS, FB, Google etc. will shoulder these costs. So, just make them pay.

Isn't copyright itself part of typical capitalist behaviour? How is it that intellectual property is now such an important asset for anti-capitalists?

I'm not sure I get your inference; you may be jumping to conclusions. I was merely using the term "capitalist" in a descriptive, non-derogatory way.

(Naively) historically speaking, the history of civilization(s), presumably all of them, is a history of learning from each other, freely copying, mixing ideas etc. Even culture depends on it, see Richerson & Boyd, 2005: "Culture is information capable of affecting individuals' behavior that they acquire from other members of their species through teaching, imitation, and other forms of social transmission." That's why, as a digression, I find the cult around "cultural appropriation" so nonsensical. But maybe this cult is also just the process of claiming a patent, copyright, or trademark without granting licenses, and without proof of authorship or any other kind of legitimation beyond mustering digital support or lynch mobs.

That said, maybe focusing on copyright is just focusing on the tip of the iceberg, a way of expressing discontent, queasiness and so on in the face of an unfathomable Leviathan.

Nevertheless, giving proper attribution to sources one has used or was influenced by is common practice in many fields, but it is or should also be standard social behavior. Just to be clear, I don't mean this in a beancounterish way or to imply we all have to speak with footnotes, that would be rather silly. I think it's common and common sense not to be happy about (feeling) being robbed. Copyrights, however imperfect they may be, play a role here.

It is a different question though when the alleged violation occurred: during training, fine-tuning, entering the prompt, production of a response to the prompt, presenting the results, or using the results.

Yes, it is a different question when an alleged violation has occurred during 'training'... like in most schools.

Also... here is a particular method of reproducing copyrighted material: the pencil! What do you think we should do about it?

As for deriving something from copyrighted material... well, you might want to think about this for a while, as this is a lot more common than you might think.

"Good artists copy; great artists steal." - Picasso

So, this means that if I take somebody else's writing and then change a few words, like "it's literally not copying the thing but parts of the thing" to "it is really not reproducing the thing but parts of the thing", I am not plagiarising them? That is great to know for future reference. I hadn't realised it was that easy to get away with it! Always wanted to be a fantasy author, so stay tuned for my original works Master of the Finger-Jewelry and A Play of Monarchy Chairs.

/s, obvs

"no, because the way AI deconstruct stuff into curves and lines and angles" - well, JPEG deconstructs the image into discrete cosine transform patterns (they don't even look like curves or angles, more like meaningless grids), and it's a lossy compression, so the image is not restored exactly. The result is still considered a copy, so I really don't see your point.

author

Exactly!

JPEG doesn't deconstruct anything, and you, Marcus, know this perfectly well, so I'm not sure why you are happy with the way the waters are being muddied here.

JPEG is a format to make sure the original image is encoded as faithfully as possible (given loss constraints). The AIs have no interest in representing the image they viewed at all. Apples and oranges.

Tek Bunny, "The AIs have no interest in representing the image they viewed at all." - well, during training the model minimizes the difference between its output and the training data. That's how training works, so yes, the "AI" model does have an interest in representing the image it has viewed as training data.
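The claim about training can be illustrated with a deliberately tiny sketch. It assumes a toy "model" with one free parameter per pixel and a single 4-pixel training example, both hypothetical; real image models share parameters across billions of examples, and how much of any one example they retain is exactly what is disputed in this thread. The sketch only shows what "minimizing the difference between output and training data" means mechanically.

```python
# Toy illustration, not any real model: gradient descent on a
# reconstruction loss pulls the model's output toward the training example.

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def train(target, steps=200, lr=0.1):
    """Fit one parameter per pixel by gradient descent on squared error."""
    params = [0.0] * len(target)            # the model's output is just its parameters
    for _ in range(steps):
        for i, t in enumerate(target):
            grad = 2.0 * (params[i] - t)    # derivative of (p - t)^2 with respect to p
            params[i] -= lr * grad
    return params

training_image = [0.1, 0.9, 0.4, 0.7]       # hypothetical 4-pixel "image"
loss_before = mse([0.0] * len(training_image), training_image)
loss_after = mse(train(training_image), training_image)
```

With this much capacity per example the loss goes to essentially zero, meaning the output becomes the training image. A model with far fewer parameters than training pixels cannot do that for every example, which is why the degree of memorization is an empirical question rather than something this sketch settles.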

Tek Bunny, and how do you think this "encoded as faithfully as possible" is achieved?!

"JPEG uses a lossy form of compression based on the discrete cosine transform (DCT). This mathematical operation converts each frame/field of the video source from the spatial (2D) domain into the frequency domain (a.k.a. transform domain). A perceptual model based loosely on the human psychovisual system discards high-frequency information, i.e. sharp transitions in intensity, and color hue."

https://en.wikipedia.org/wiki/JPEG
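As a rough illustration of the passage quoted above, here is a minimal one-dimensional sketch of transform-based lossy compression: take a DCT of eight samples, discard the high-frequency coefficients, quantize the rest, and invert. Real JPEG works on 8x8 two-dimensional blocks with standardized quantization tables and entropy coding; the `keep` and `step` parameters and the sample row below are assumptions chosen for the example.

```python
import math

N = 8  # JPEG uses 8x8 blocks; this sketch uses a single row of 8 samples

def _scale(k):
    """Normalization factor that makes the transform orthonormal."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct(x):
    """Orthonormal DCT-II: spatial samples -> frequency coefficients."""
    return [
        _scale(k) * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        for k in range(N)
    ]

def idct(c):
    """Inverse transform: frequency coefficients -> spatial samples."""
    return [
        sum(_scale(k) * c[k] * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N))
        for n in range(N)
    ]

def lossy_roundtrip(x, keep=4, step=10.0):
    """Zero the coefficients at index >= keep and round the rest to multiples of step."""
    coeffs = dct(x)
    quantized = [round(c / step) * step if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct(quantized)

row = [52, 55, 61, 66, 70, 61, 64, 73]      # hypothetical pixel intensities
restored = lossy_roundtrip(row)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

The restored row is close to the original but not identical: information was thrown away, and what is stored is a set of frequency coefficients rather than pixels.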

And trademark really is a much harder nut to crack here. The 'certainty principle' of genAI here is "memorisation equals training data leakage"

Good LLM hacking. This will be really hard to fix: no dataset used in training should contain copyrighted material.

That's impossible since you are automatically granted copyright on anything you publish. Look down... you are reading Gary's copyrighted words.

Copyright is dying thanks to generative AI. Good riddance, I just hope it happens as fast as possible.

Microsoft's cash on hand as of September 2023: $143.94 B. They will be fine.
