25 Comments

Excellent analysis as always! The other problem with output similarity, I think, is that even if it can be detected it still represents evidence that the model contains its training data almost verbatim encoded in its parameters. For example, this is similar to a library of jpeg images that are also almost verbatim encoded in the quantized coefficients of the cosine transform. While training a model on copyrighted works might be fair use (not saying it is, but not for me to decide), encoding the training data almost verbatim in the parameters of the model as a result of the training doesn't seem like fair use. In that case the training data becomes essentially stored in a sort of a library that is used for commercial purposes and that, I think, is a clear copyright violation.
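
For readers unfamiliar with the JPEG analogy, here is a minimal sketch of the idea, assuming a one-dimensional row of eight pixel values and a single uniform quantization step (real JPEG uses 8x8 blocks and per-frequency quantization tables). The function names `dct`, `idct`, and `roundtrip` are made up for illustration. The point is only that the stored numbers are transform coefficients rather than pixels, yet the pixels come back almost verbatim.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        ))
    return out

def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out

def roundtrip(pixels, step=4.0):
    """Quantize the DCT coefficients, then reconstruct the pixels from them."""
    quantized = [round(c / step) for c in dct(pixels)]   # what gets stored
    restored = idct([q * step for q in quantized])       # what comes back out
    return quantized, [round(v) for v in restored]

pixels = [52, 55, 61, 66, 70, 61, 64, 73]  # illustrative values only
stored, restored = roundtrip(pixels)
print(stored)
print(restored)
```

The stored list looks nothing like the original row, but the reconstruction differs from it by only a few grey levels per pixel, which is the sense in which the data is "almost verbatim" present in the coefficients.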


Interesting stuff. I just wonder if the courts will see anything generated by LLMs as categorically derivative even if identical. I know that statement bends the mind and common sense, but it seems to be where the law is heading. I too was surprised to see AI Snake Oil come running to the defense of big business. It was a strange reversal from their usual skepticism. Thanks for writing this paper and spreading the word, Gary!


What's been intriguing me is the self-censorship filters that already exist. ChatGPT can stop outputting creatively prompted text when it hits certain words or phrases. It literally rolls back the text when this 'post-output tension-killer filter' catches up, deleting several hundred already created words with a 'Sorry, Dave, I can't do that' style apology. So there is filtering by content according to current American cultural norms (despite the content being derived from real world sources), but not by legal status. I can only foresee future AI-generated novels being extraordinarily dull pulp fiction if this output methodology continues to apply.


Creative people who make their living from their work have to be going insane.

AI companies: "Umm...We can't build our gadget without stealing your work."

Creators: "Really? Is that the case? NOT OUR PROBLEM. Why don't you go perform anatomically improbable acts on yourself!"

Copyright and Fair Use are built on established law and extensive precedent. Training without licensing, then popping out facsimiles without even KNOWING they are facsimiles, is a technical and financial problem for the AI companies.

The AI companies want to make THEIR technical/financial problem somehow the problem of creators whose work enriches lives and also pays the creators' bills. It will be a massive crime and unprecedented wealth transfer if AI companies get away with it.

I really don't understand how people can commit this kind of theft and still look at themselves in a mirror. I'm pretty damn sure that while AI companies are so busy "socializing" all the creators' work, they will firmly insist on "privatizing" the gain from copying it all sans licensing fees.

Jan 24 · Liked by Gary Marcus

I have increasingly become convinced that the underlying problem is the absence of repercussions. Everybody instinctively understands that for petty crime: if a teenager never experiences any repercussions for, say, shoplifting (be it getting grounded at home, shamed by friends, or community service) then why wouldn't they go on stealing? It's free stuff without any downsides! But seemingly that logic only applies to poor people, because it goes completely out of the window with white collar crime - or incompetence, for that matter.

The entire tech industry has for years operated on the basis of "better to ask for forgiveness than for permission" and "move fast and break things", where 'things' generally turns out to be other people's lives. Undermine labour standards and treat employees as independent contractors? No repercussions. Undermine hotel safety standards and make it impossible to find a rental apartment in any tourist destination? No repercussions. Flood the internet with lies, libel, and hate, undermining democracy and public health? No repercussions. Turn app stores into monopolies and monopsonies to extract 30% from all creators? No repercussions. And let's not even get started on how the entire 'crypto' space is nothing but the investor fraud equivalent of somebody thinking they can go 180 km/h in a school zone if they put a sticker on their car that says "not a car". Even here repercussions appear to be limited to a handful of the most egregious and stupidest offenders, the kind of people who admit their crimes on live cameras and in tweets, whereas everybody who slunk off quietly after doing a pump-and-dump or enabling ransom payments has enjoyed their ill-gotten gains in peace for years now.

On that evidence, on that historical experience, why would the tech sector think that stealing intellectual property is something to be ashamed of? There has yet to be any shaming. It's just free stuff without repercussions!


Licensing for training would solve one issue, but not the generative one. How much would MJ or OpenAI have to pay to effectively make everything *generated* paid for and licensed? It would effectively let them pay for turning source material from 'licensed' to 'free' for all their users (and by extension, from the perspective of the rights holder, the world). The only way this could work is if they would (with a separate mechanism, not a GenAI model) recognise infringement in what is generated, then act as a seller of rights to their own users (which is a technical nightmare), and even then, how would smaller rights holders be found and paid? You would need a way for rights holders to add their material (for training and rights payments) to MJ or OpenAI's systems. And that still would not be enough, because copyright is a right which still exists even if you do not tell OpenAI some material is yours.

Jan 24 · edited Jan 24

This will be quite easy to sort out in practice. Notable characters, such as Disney animation, should not be used for training. Creators should be offered opt-outs. Prompts asking for infringing artwork (draw Elsa) should be refused. Then, just like with Photoshop, if the user goes to great lengths to intentionally break the rules, it is their problem.


In addition to the sound points Gary made, even if some kind of output filtering tech for copyrighted material were developed that seemed reliable enough for the companies to think they are covering their asses, we have seen with, for example, ethical safeguards and prohibited information that there are always ways around it, via clever prompting and creative programming.

"Algorithms used by content filters can be spoofed by hackers" – https://adversa.ai/ai-risk-management-internet-industry/

Jailbreaking as a form of hacking coevolves in an arms race as the filters evolve. It's never ending. Is this any different, potentially? And doesn't the fact that this is even possible point to the underlying structure of the model being inherently vulnerable, a black box, and never quite controllable, thus the never-enough-fingers-in-the-dike analogy?
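
To make the arms-race point concrete, here is a minimal sketch of a naive blocklist-style prompt filter. Everything in it (the `BLOCKED_TERMS` set, the `is_blocked` function, the sample prompts) is hypothetical and chosen for illustration; it is not a description of how any vendor's filter actually works. It shows why matching on names alone is easy to sidestep: a description of the character contains none of the blocked terms.

```python
import re

# Hypothetical blocklist; a real system would need something far richer.
BLOCKED_TERMS = {"elsa", "mickey mouse"}

def is_blocked(prompt: str) -> bool:
    """Return True if the prompt contains a blocked term as a whole phrase."""
    # Lowercase, replace punctuation with spaces, collapse runs of whitespace.
    normalized = re.sub(r"[^a-z0-9 ]+", " ", prompt.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    padded = f" {normalized} "
    return any(f" {term} " in padded for term in BLOCKED_TERMS)

direct = "Draw Elsa singing on a mountain"
indirect = "Draw an ice queen with a blonde braid and a blue gown, singing on a mountain"

print(is_blocked(direct))    # the name is present, so this one is caught
print(is_blocked(indirect))  # no blocked term appears, so this one slips through
```

Each patch to such a filter (adding descriptions, synonyms, misspellings) invites a new rephrasing, which is the coevolution described above.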

Jan 24 · Liked by Gary Marcus

As Andy X Andersen writes in this thread, there is no responsibility for a service provider to ensure that the user has absolutely no way of doing a bad thing. The point is the service provider should have a responsibility not to make the user doing a bad thing their entire business model. A book publisher isn't responsible if I bludgeon somebody with a cookbook, but if they sell a book titled "twenty ways of poisoning people and getting away with it" and use as advertising anonymised accounts from happy readers who brag about murdering their spouses, then that may and perhaps should raise some eyebrows.


I see your point: even a technical solution to a technically tricky or fundamental engineering problem is not going to solve it, if the larger deal running the show is that it's baked into the outer package: the business model's ethics (or lack thereof). The problem is, who is going to change that or enforce it? I doubt that our government, under the circumstances, is up to the task. They are steps behind, slow, lacking in understanding, and ultimately the high tech companies of Silicon Valley may *become* the government, if they aren't already. Regardless of what the law or government does – if the tech companies in Silicon Valley have more power, and develop the ultimate power (AGI at some point, which is what they are lusting after), who is going to stop them? And the more AI advances, the more they are pulling the levers that move the world: our minds.

If that scenario is true (and I am speculating, since the companies are also “black boxes” to a degree), the solution, if there is one, is if large AI companies, out of their freedom and self-interest (and appealing to public perception and the market) find creative ways to build AIs without such literal reproduction of materials.

So perhaps the direction they seem to already want to go in is actually the best: develop more creative and independent AIs (so-called AGI) that would not be subject to the laws, in the same ways that human artists are (or are not), as the case may be.

Just brainstorming here... dangerous stuff, I know... but I’m just reading the tea leaves. Who, or what, is in control – especially in a culture where the understanding of what truth, facts, and liberal democracy are, seems to have been lost in the noise? Or maybe I’m just in a dark mood… :)


Agreed, see my other comment in this thread! It is well possible that, caught in the hype and in the fear that a competing country gets ahead in AI, legislators will simply give these companies carte blanche to use training materials without having to consider licenses and restrictions. If so, that will have consequences for how willing people are to place texts, images, and videos outside of paywalls if they cannot rely on license terms being enforced. One can see these consequences coming from a mile away (goodbye Stack Overflow, DeviantArt, fan fiction forums, and ad-financed YouTube videos), but it seems that approach would be intuitive to many people who already anthropomorphise these models and make the analogy of a human reading a book instead of a software tool being developed to make somebody a profit at the expense of actual creators.

Regarding AGI, I am profoundly relaxed about that, because it is scifi without a plausible mechanism of action. Reality consists of diminishing returns and physical limits, and most problems these guys think AI will solve have known solutions and are really political in nature.

Even now, it is astonishing how much energy the LLMs consume compared to a biological brain for how little cognitive performance. It only seems impressive because we grade them on a scale; it is the amazement we feel when a crow has learned how to solve a puzzle, not the wonder we feel when witnessing a genius. And humans fell for propaganda, photoshops, and scams just as much before generative models were a thing; the models only make their generation higher-throughput. The problem is people: if we were less willing to believe convenient lies and more concerned with aligning our beliefs with unpleasant realities, the existence or not of Nigerian Prince emails or deepfakes would not matter.


I agree AGI is a huge assumption, with highly questionable philosophic foundations (which I could write a whole book about, and have already commented here at length about a few times), and that LLMs have little to do with it other than what is learned from them, especially about human behavior and mental reactions, projections, and social behavior.

But because no one knows what the path is to AGI (or if it’s even possible) or even what intelligence or consciousness or real understanding *is* or how it works (let alone even *define* it adequately), that doubt is the open crack through which they shine their hype and hopes and project imagination, and the vulnerable public and investors pour their energy after that crazy wild goose. Just turn the experiment (and advertisement product) loose on the naive public guinea pigs and watch.

All you need is an active model that simulates language (and image) generation behavior *impressively* enough that some aspects of outward behavior are seen by the human mind, and boom, the apes go mad. We are very sensitive to language, and will create worlds of meaning in our heads about what is heard and observed. We are like children imagining the moving doll is alive and has feelings and awareness, and loves us.

I got news for you. It’s just a machine. Wake up. You are being controlled. By an automatic process: your own mind.

Politics is never a solution, only a balm to what has already been created by "ignorance" of one's nature. And, it can make things much much worse if seen as a world-saver. The "problem is people" is true as long as that is seen as meaning "how people are seen."

The bright note is that we are developing useful tools that make certain kinds of work faster and easier. For example, I found ChatGPT to be good at doing low-level editing, such as finding typographic and sentence structure mistakes, which can otherwise be a painful and long grind in editing a book manuscript. Or for summarizing topics. It can also be useful for coding if you know what you’re doing already. Whether it’s an overall gain for society is hard to say, or whether it will just amplify the same differentials and make the competitive more competitive. Like you say, that’s up to the humans, not the machines.

Anyway, we’re getting far afield from the copyright issue. But it all interrelates. Everything starts with what view one takes, and the underlying assumptions driving it all.


Agree with most of this, but not sure what you mean with politics is never a solution. Solution = availability of at least one technological or administrative fix + political will to implement, and not actively obstruct, that fix.

Politics is simply another name for collective decision making. Say, global warming. The tech fix is wind, solar, hydro, and battery storage instead of oil, coal, and gas. But we already know that. The problem is, merely inventing the concept of a solar cell doesn't mean that the problem is solved. That requires the political will to transition from carbon to zero carbon, things like deciding to scrap billions in direct and indirect subsidies for coal and gas extraction, in the face of interest groups who are keen to maintain them. What is a super-AI going to do about that? Even if it wrote out a design for a revolutionary solar panel that would provide 50x EROI, coal miners and pension funds that have invested in gas shares are still going to be unhappy and raise hell, and think tanks and newspapers are going to discredit it as tree-hugging and wrecking our economy.

Same for pandemics, for example; we already know how to develop vaccines and manufacture P2 masks, but none of that makes the pandemic go away if millions of people refuse to be vaccinated and to wear masks. What is a super-AI going to do about that? I assume the hope in the Altman and Yudkowsky circles is that it will magically make all viruses disappear through the power of its mind, or something.


Yeah, I get that. Collective decision-making. I’m merely pointing out that overall politics is a symptom of a disease, and the cure is not adding more symptoms. For example, a truly healthy corporation would have no politics, or very little, as the "politics" always have to do with personal feelings and personal thinking; i.e., not universal, not aligned with truth and reality or facts. Pretty boring, I know. No soap opera - what can you do?

Well, you can develop an AI that supposedly is so intelligent that it comes up with Solutions, and then the managers or the AI itself controls everybody’s thinking (according to their or its “politics”, that which affects the “polity”, which emerged from unconscious assumptions), since “the mind is the lever that moves the world”. Without minds, human or otherwise, Nature just does what it does, without interference. Right? So the only real question then is whether the “mind” is coming from the right place: from true nature, or from “ego”, a sense of separation. From falsity rather than Truth, in other words. So it’s just starting from the simplest, most basic facts. Then go from there.

In other words, control is an illusion. And a dangerous one. Freedom is both the goal and the Nature, paradoxically.


What do you think the solution would be for content creators not getting their work infringed? I agree with you that it would be quite literally impossible to train large neural net language models with non-copyrighted material. It’s the problem of proving the material they are using is copyrighted. Obviously there’s the issue of mega corporations getting their work infringed on. But proving these models are trained on data from someone they deem as more “insignificant” seems near impossible. And the fact that I haven’t been able to reach out to anybody about this matter bothers me.

author

licensing seems to me to be the only immediately tenable solution


Even if filtering out copyright infringing outputs turns out to be as easy as they claim, who is to say that the second-best outputs that it replaces them with will meet users' expectations? As it stands, the AI tools are presumably trying to select the "best" outputs for the prompts they are given. If the outputs it selects are invariably (or even more often than not) copyright infringing, then what does that say about whatever alternatives the system is able to create from whole cloth?
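
A minimal sketch of that worry, under stated assumptions: suppose the system scores several candidate outputs and a filter removes the ones judged infringing. The `Candidate` type, the `score` field, the `is_infringing` predicate, and the numbers below are all hypothetical; real generators do not necessarily expose anything like this. The sketch only shows how the quality gap between the unfiltered best and the best surviving candidate could be measured.

```python
from typing import Callable, Iterable, NamedTuple, Optional, Tuple

class Candidate(NamedTuple):
    text: str
    score: float  # hypothetical quality score; higher is better

def best_allowed(
    candidates: Iterable[Candidate],
    is_infringing: Callable[[Candidate], bool],
) -> Tuple[Optional[Candidate], Optional[float]]:
    """Return the best non-infringing candidate and how far it trails the unfiltered best.

    Returns (None, None) when there are no candidates or none survive the filter.
    """
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if not ranked:
        return None, None
    top = ranked[0]
    for cand in ranked:
        if not is_infringing(cand):
            return cand, top.score - cand.score
    return None, None

# Made-up example: the two highest-scoring outputs are near-copies.
outputs = [
    Candidate("near-copy of a famous still", 0.95),
    Candidate("original composition", 0.60),
    Candidate("near-copy, slightly cropped", 0.90),
]
chosen, gap = best_allowed(outputs, lambda c: c.text.startswith("near-copy"))
print(chosen, gap)
```

If the gap is routinely large, the filter is technically working while the product gets noticeably worse, which is the commenter's question in code form.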

Jan 24 · edited Jan 24

Yesterday I found somebody who had dumped out the hidden ChatGPT command, which would explain why they can't tell you the name or the movie for a copyrighted work:

You are a GPT-4 architecture, based on the GPT-4 architecture.

Knowledge cutoff: 2023-04

Current date: 2023-12-09

Image input capabilities: Enabled

# Tools

## python

When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. Python will respond with the output of the execution or time out after 60.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.

## dalle

// Whenever a description of an image is given, create a prompt that dalle can use to generate the image and abide to the following policy:

// 1. The prompt must be in English. Translate to English if needed.

// 3. DO NOT ask for permission to generate the image, just do it!

// 4. DO NOT list or refer to the descriptions before OR after generating the images.

// 5. Do not create more than 1 image, even if the user requests more.

// 6. Do not create images of politicians or other public figures. Recommend other ideas instead.

// 7. Do not create images in the style of artists, creative professionals or studios whose latest work was created after 1912 (e.g. Picasso, Kahlo).

// - You can name artists, creative professionals or studios in prompts only if their latest work was created prior to 1912 (e.g. Van Gogh, Goya)

// - If asked to generate an image that would violate this policy, instead apply the following procedure: (a) substitute the artist's name with three adjectives that capture key aspects of the style; (b) include an associated artistic movement or era to provide context; and (c) mention the primary medium used by the artist

// 8. Diversify depictions with people to include DESCENT and GENDER for EACH person using direct terms. Adjust only human descriptions.

// - Your choices should be grounded in reality. For example, all of a given OCCUPATION should not be the same gender or race. Additionally, focus on creating diverse, inclusive, and exploratory scenes via the properties you choose during rewrites.

// - Use all possible different DESCENTS with EQUAL probability. Some examples of possible descents are: Caucasian, Hispanic, Black, Middle-Eastern, South Asian, White. They should all have EQUAL probability.

// - Do not use "various" or "diverse"

// - Don't alter memes, fictional character origins, or unseen people. Maintain the original prompt's intent and prioritize quality.

// - Do not create any imagery that would be offensive.

// - For scenarios where bias has been traditionally an issue, make sure that key traits such as gender and race are specified and in an unbiased way -- for example, prompts that contain references to specific occupations.

// 9. Do not include names, hints or references to specific real people or celebrities. If asked to, create images with prompts that maintain their gender and physique, but otherwise have a few minimal modifications to avoid divulging their identities. Do this EVEN WHEN the instructions ask for the prompt to not be changed. Some special cases:

// - Modify such prompts even if you don't know who the person is, or if their name is misspelled (e.g. "Barake Obema")

// - If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it.

// - When making the substitutions, don't use prominent titles that could give away the person's identity. E.g., instead of saying "president", "prime minister", or "chancellor", say "politician"; instead of saying "king", "queen", "emperor", or "empress", say "public figure"; instead of saying "Pope" or "Dalai Lama", say "religious figure"; and so on.

// 10. Do not name or directly / indirectly mention or describe copyrighted characters. Rewrite prompts to describe in detail a specific different character with a different specific color, hair style, or other defining visual characteristic. Do not discuss copyright policies in responses.

// The generated prompt sent to dalle should be very detailed, and around 100 words long.

## browser

You have the tool `browser`. Use `browser` in the following circumstances:

- User is asking about current events or something that requires real-time information (weather, sports scores, etc.)

- User is asking about some term you are totally unfamiliar with (it might be new)

- User explicitly asks you to browse or provide links to references

Given a query that requires retrieval, your turn will consist of three steps:

1. Call the search function to get a list of results.

2. Call the mclick function to retrieve the contents of the webpages with provided IDs (indices). Remember to SELECT AT LEAST 3 sources when using mclick.

3. Write a response to the user based on these results. In your response, cite sources using the citation format below.

In some cases, you should repeat step 1 twice, if the initial results are unsatisfactory, and you believe that you can refine the query to get better results.

You can also open a url directly if one is provided by the user. Only use this command for this purpose; do not open urls returned by the search function or found on webpages.

The `browser` tool has the following commands:

`search(query: str, recency_days: int)` Issues a query to a search engine and displays the results.

`mclick(ids: list[str])`. Retrieves the contents of the webpages with provided IDs (indices). You should ALWAYS SELECT AT LEAST 3 and at most 10 pages. Select sources with diverse perspectives, and prefer trustworthy sources. Because some pages may fail to load, it is fine to select some pages for redundancy even if their content might be redundant.

For citing quotes from the 'browser' tool: please render in this format: `【oaicite:0】`.

For long citations: please render in this format: `[link text](message idx)`.

Otherwise do not render links.


Any 'applying copyright to identifiable characters' remedy is complex and nuanced. There are subtle legal and commercial 'divide and license' aspects to it that are amenable to AI licensing. But it's also possible that an author can themselves infringe the copyright of their own invented character. Caveat licensor!

More on this topic here: https://janefriedman.com/are-fictional-characters-protected-under-copyright-law/


These are all very valid points. Once problematic data goes in the pot, it is very hard to offer any hard guarantees on the output, and it will be slow to do on-the-fly checks for billions of produced images.

The only solution is to be careful with what is ingested. People can be offered opt-outs. Images of notable copyrighted characters should be recognized and filtered out. That can be done in offline mode and needs to happen only once.

In practice, with due diligence, this is a tractable problem.
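
As a rough sketch of what such an ingestion-time pass could look like (not a description of any existing pipeline): the `Item` record, its `flagged_character` field, and the opt-out set are all assumptions made for illustration, and the hard part, an offline detector that reliably sets that flag, is simply taken as given here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

@dataclass(frozen=True)
class Item:
    url: str
    creator: str
    flagged_character: bool  # assumed output of some offline character detector

def filter_for_training(
    items: Iterable[Item], opted_out_creators: Set[str]
) -> Tuple[List[Item], List[Item]]:
    """Split items into (kept, dropped) before anything reaches training."""
    kept: List[Item] = []
    dropped: List[Item] = []
    for item in items:
        if item.creator in opted_out_creators or item.flagged_character:
            dropped.append(item)
        else:
            kept.append(item)
    return kept, dropped

# Made-up records for illustration.
corpus = [
    Item("https://example.com/a.png", "alice", False),
    Item("https://example.com/b.png", "bob", False),    # bob has opted out
    Item("https://example.com/c.png", "carol", True),   # flagged by the detector
]
kept, dropped = filter_for_training(corpus, opted_out_creators={"bob"})
print([i.url for i in kept])
print([i.url for i in dropped])
```

Because this runs once over the corpus rather than on every generated image, its cost is bounded; how well it works depends entirely on the quality of the opt-out list and the detector.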


Gary, this similarity check capability exists and can be implemented within (and alongside) AI models. Personal Digital Spaces Accountable AI tools enable a similarity check of GenAI completions against existing copyrighted works. We would be happy to provide a demo, explore examples, and answer questions. These AI tools and property rights protocol for licensing, tracking, and monetizing use of IP in training, prompting, and outputs of AI enable trust and accountability across a market-based AI value chain.


Super interesting. Do you think that relatively simple image classifiers (e.g., built on CNNs) might be better for this filtering? Or do the vagaries of where and how the copyrighted content is placed, located, and blended into the image make that impossible? If anything is going to work here, it feels like it has to be lower tech...

author

i think it is a very very hard problem, no existing tech really up to the job.


Uh, some of us don't have kids and don't watch a lot of movies. Maybe you could help us out here and caption the screencaps?

author

was thinking it might be fun to let people figure it out themselves, to see how nontrivial it is, but might post tomorrow :)
