71 Comments

I like and agree with Gary's forecast that without proper compensation/licensing, original content will dwindle and the online world will be all re-consumed vomit. That alone should be a reason to make sure that content creators are incentivized to continue creating original and truthful content. Otherwise, everyone is a loser in this closed system.

"We won't get fabulously rich if you don't let us steal, so please don't make stealing a crime."

Drop the Mic GM. That one line sums up everything! I love the complex simplicity of it.

"nobody is actually suggesting that OpenAI only use public domain works"

Maybe they should!

i am fine if they license other stuff. or they can put the shoplifted stuff back on the shelf. either way is cool

Do you think that everything on the internet should be paywalled by default, and if not, then be considered public domain? Should you sign a license agreement with every web page owner whose page you read?

Most web pages you read DO have a license agreement that restricts how you can use their content. I don't know where this concept came from that "Someone published it on the Internet, therefore they've abandoned their copyright and I can legally use it however I want".

Most? Care to back that up with some evidence? Anyway, this is not relevant. Copyright deals with derivative works; it does not give you the right to dictate anything you like. This has been understood on the web ever since the attempt to remove custom CSS failed.

> Most? Care to back that up with some evidence.

Sure, just look through your own browsing history and then look at the bottom of each page for a "Terms" link or the like. This page has one, for instance.

"Using Substack in any way means that you agree to all of these Terms, and these Terms will remain in effect while you use Substack. These Terms include everything in this document, as well as those in the Privacy Policy, Publisher Agreement, Content Guidelines, and Copyright Dispute Policy. If you don’t agree to all of the following, you may not use or access Substack in any manner."

> Copyright deals with derivative works

Yes, as well as direct copies. OpenAI makes direct copies without permission and then trains derivative works without permission, which is a violation of copyright.

> it does not give you any rights to dictate anything you like.

Yes, it does, unless you can claim Fair Use. OpenAI could claim Fair Use when they were a non-profit training research projects, but that no longer flies after you turn it into a commercial product. Read about the Four Factors that determine whether a Use is Fair.

OpenAI loves to use the term "publicly available", as if being published on the web gives them the right to scrape it and use it for whatever for-profit purpose they want. It does not.

And it is rather unsolvable, too. Users *want* these systems to be able to produce correct information. But the better the model is at producing correct results (based on the correctness of what was in the training material), the more it will memorise, and the bigger the plagiarism problem becomes. https://ea.rna.nl/2023/12/26/memorisation-the-deep-problem-of-midjourney-chatgpt-and-friends/

I was betting money that 2024 would see the start of an AI winter, a bet I made because I thought the limits of LLMs would become insurmountable and development outside of refining existing use cases would plateau. Now I'm just glad nothing in the wager said anything about the cause: AI companies having to grapple with the owners and creators of the vast troves of data they consume will just make LLMs' flaws more relevant.

Your post could be entitled 'the desperate race for massive undue profits from genAI'. The more I read about this plagiarism issue with today's genAI tools, the more amazing I find it that a global-scale robbery of IP is going on without a larger echo from the media and political spheres, and that bad-faith declarations from the major genAI companies concerned are treated as reliable technical expert opinions rather than denounced as fraudulent marketing rhetoric.

Thank you again Gary!! Great information.

To get a little perspective on this period we are in with AI and LLMs, I read two books about the early days of airplanes: John Lancaster's "The Great Air Race" and David McCullough's "The Wright Brothers." What I think can be learned from that period of innovation around a world-changing technology is how much additional work, practice, and innovation was needed before air flight had broad use by the government (military) and by public/private business. Risks were taken in the early phases (like flying without pressurization, seat belts, or navigation systems) that no one in their right mind would take today. Even the Wright Brothers talked about how what they were doing was a demonstration of some core principles upon which further engineering and innovation could take place. That further innovation was part of aviation's integration into broader applications. They knew early on there were serious risks; "The Great Air Race" is almost comical in its descriptions of the accidents (running into mountain ranges).

I think what Gary and others are suggesting for AI/LLMs is this same careful development, with additional creativity and innovation to address problems in actual real-world use. But as he points out, the venture backers don't want to absorb the early losses and want to tilt the law to protect them. Historically this kind of thing has been tried, but it mostly failed, as innovation and creativity were used to address the real-world problems (seat belts, anti-lock braking, materials innovation, wind-tunnel testing, etc.) and better products won out in the market. The equivalent of safety, environmental, and other rules and regulations is needed for AI/LLMs, and we need to get on with it so we have better products and services at fair prices with reasonable returns to investors.

💯

If GenAI is generating so much value for society that these companies deserve to have the rules rewritten to serve its needs, it should be generating enough value to compensate publishers. It's incredibly short-sighted for them to risk putting those creating actual new content out of business. Absolutely maddening.

Separately, I appreciated the comment in the IEEE on whether increases in accuracy accompany increases in the likelihood of creating plagiaristic content, as this is what I've been wondering about looking at the Midjourney images.

> But it would be interesting to think about what copyright law would be like if humans had the ability to memorize entire books and recite them when prompted to do so.

This is super-pedantic and doesn't really undermine the point, but: humans do have this ability. However, becoming a hafiz -- one who has memorized the Koran -- is a serious endeavor, a lifetime task, and that for only a single holy book. The mass data consumption of an AI is a very different matter.

your nice point led me to this, which others might also want to peek at: https://en.wikipedia.org/wiki/Hafiz_(Quran)

My guess is that if you have to buy a book and then memorize it, that should not be a problem. But if you post your memories to lots of users, either for free, but especially if you charge for it, then that will be a problem.

What about patent infringement? Where does my liability lie when I use AI and it provides a patented process, but no reference to the patent? User beware!

How is this any different from you finding a description of the process on Google with no reference to the patent? If you just blindly proceed without checking with the patent office, you will not have done your due diligence and will be legally liable. AI changes nothing here.

More work for patent attorneys.

Love this bang-on post from GM, and for once I read all the comments before posting. Great comments below. They must have seen this coming - why does it feel like they are in damage-control mode? Why were they not out ahead of the issue? I know the answer... =)

good question!

I just randomly asked Copilot to generate some images of space marines doing stuff. All the space marines looked like the ones from "Warhammer 40K." I kind of expected that some would, but I also expected that some might look like the ones from "Aliens" or from the many other science fiction franchises. Nope, all Warhammer. I tried a few prompts, including specifically asking for them not to look like the ones from Warhammer. I also asked for them not to be wearing battle armor; that did not work either. Searching for "space troopers" gave two images of generic battle spacesuits and two of Imperial Stormtroopers from Star Wars.

I kind of assumed that a lot of these allegations of copyright infringement were fishing for it by asking specific prompts. But "space marine" is not a concept unique to the Warhammer franchise. I think certain image-word associations are so ubiquitous online that it's impossible to get them out of the databases of these image generators. I assume it's the same for some text combinations as well.

To be fair, as a software engineer, I mostly use ChatGPT to pull in public-domain knowledge that I lack to accomplish specific tasks. None of the links it comes up with for any of my prompts point to any paywall of any sort. It helps save me gobs of time and makes me more effective and efficient because it's able to take large amounts of information via web plugins, munge it together, and spit it out in the form I need. I feel I'm getting my money's worth.

With that said, I do empathize with copyright holders and feel sad that the same tool has crossed those lines. If all copyrighted content was pulled from ChatGPT I wouldn't shed a tear. Also, if I wanted ChatGPT to access copyrighted data, I wouldn't mind paying an optional fee for the privilege (think Spotify and how one monthly fee gets distributed to the copyright owners whose content you actually use).

In fact, GitHub Copilot has already been dinged for reproducing publicly accessible code in its responses, and it's now possible to set it to block suggestions that match public code verbatim. Anything is possible if there's the political will to do so. We should continue to push back on our AI providers.
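The Spotify-style idea mentioned above (one monthly fee, distributed pro rata to the copyright owners whose content you actually used) is simple to make concrete. A minimal sketch with made-up numbers and hypothetical owner names, nothing here reflects any provider's actual scheme:

```python
# Pro-rata split of one subscriber's monthly fee across the copyright
# holders whose content they actually used (illustrative numbers only).

def split_fee(fee_cents, usage_counts):
    """Divide the fee in proportion to usage; leftover cents go to the top owner."""
    total = sum(usage_counts.values())
    payouts = {owner: fee_cents * n // total for owner, n in usage_counts.items()}
    # Integer division can leave a few cents unallocated; hand them
    # to the most-used owner so the payouts always sum to the fee.
    leftover = fee_cents - sum(payouts.values())
    top = max(usage_counts, key=usage_counts.get)
    payouts[top] += leftover
    return payouts

usage = {"author_a": 30, "author_b": 10, "author_c": 10}  # hypothetical reads
print(split_fee(2000, usage))  # $20.00 -> {'author_a': 1200, 'author_b': 400, 'author_c': 400}
```

Real metering would be far messier (what counts as "use" of a training source?), but the payout arithmetic itself is the easy part.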

LLMs need a business model, and they are only one step away from getting one.

No content creator is going to let LLMs train on their content and build a trillion-dollar company without getting anything in return. Google is a trillion-dollar company because it brings value to content creators as well. Some even choose to pay Google to get their content scanned! (advertisers)

LLMs just need to accurately cite their sources. That's it. A trillion-dollar business is in front of you.
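That "cite your sources" step is roughly what retrieval-augmented pipelines try to bolt on today: keep source metadata attached to the retrieved text so the answer can carry attributions. A toy sketch, assuming a hypothetical two-document corpus and simple word-overlap scoring in place of real embeddings and a real generation model:

```python
# Toy retrieval step that keeps source metadata attached, so any
# generated answer can cite where its supporting text came from.
# CORPUS, the URLs, and the scoring are illustrative assumptions.

CORPUS = [
    {"source": "example.com/llm-economics",
     "text": "content creators want compensation for training data"},
    {"source": "example.com/rag-primer",
     "text": "retrieval augmented generation attaches sources to answers"},
]

def retrieve(query, corpus, k=1):
    """Rank documents by word overlap with the query; return the top k."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d["text"].split())),
                    reverse=True)
    return scored[:k]

def answer_with_citations(query, corpus):
    hits = retrieve(query, corpus)
    # A real system would pass `hits` to an LLM; here we just echo the evidence
    # alongside the sources it came from.
    return {"answer": hits[0]["text"],
            "citations": [h["source"] for h in hits]}

result = answer_with_citations("who compensates content creators for training data", CORPUS)
print(result["citations"])
```

The hard part, of course, is that attribution is easy when the text comes from a retrieval step and much harder when it comes out of the model's weights.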

Can LLMs do that? If I write something or draw a picture, I am probably inspired by other things I've read and other pictures I've seen, but I don't think I could list all of them; I don't even know myself. If machine learning works at all like human learning, it's somewhat opaque.

I think the only argument left for OpenAI is that Google and Facebook were allowed to crawl the web and create an ad business out of it without paying license fees to content creators and publishers.

I’ve covered something similar in my recent post on Google and I think the backend consumption model needs to reflect attribution at the very least.

Imagine someone reading my posts and repeatedly making a business out of them by selling my insight for $20 a month, when it takes me days of effort to research and document my analysis.

There is a long history of people inventing or discovering something new and then not knowing what to do with it. One example I read years ago: early on, people thought the telephone would be used to listen to symphonies. The Wright Brothers did not imagine airplanes being used to bomb people to death in WWII (air mail maybe, but they were open after that).

Well, here is a quote attributed to Mark Zuckerberg in the latest Bloomberg Businessweek article on Meta: "Hey Meta, look and tell me what pants to wear with this shirt," Zuckerberg said, holding one up. "Based on the image, it appears to be a striped shirt," the AI replied, showing off the real-world impact of decades of advancement in computer science. "A pair of dark washed jeans or solid color trousers would complement this shirt well." If there is a need for an indication of how far we have to go in using AI, I can find no better example.

As a society we need to acknowledge that we don't know what we are really going to do with distributed, data-based analytic capability using models that mirror human decision-making. We have no idea all the places to put it, and to do what, when, under what conditions, and with what oversight and containment. So-called experts, as I assume Zuckerberg and his crew must be considered, have no crystal ball, and their values and biases will handicap them just as they have handicapped others in past situations of new discovery. As we experience real life and address real needs and desires, we will move forward in small steps with learning and feedback loops.
