Discussion about this post

Roumen Popov

Excellent analysis as always! The other problem with output similarity, I think, is that even when it can be detected, it still represents evidence that the model encodes its training data almost verbatim in its parameters. This is similar to a library of JPEG images, each encoded almost verbatim in the quantized coefficients of the cosine transform. While training a model on copyrighted works might be fair use (not saying it is, but that's not for me to decide), encoding the training data almost verbatim in the model's parameters as a result of that training doesn't seem like fair use. In that case the training data is essentially stored in a sort of library that is used for commercial purposes, and that, I think, is a clear copyright violation.

Nick Potkalitsky

Interesting stuff. I just wonder whether the courts will treat anything generated by LLMs as categorically derivative, even when the output is identical to the source. I know that statement bends the mind and common sense, but it seems to be where the law is heading. I too was surprised to see AI Snake Oil come running to the defense of big business; it was a strange reversal from their usual skepticism. Thanks for writing this paper and spreading the word, Gary!
