Excellent analysis as always! The other problem with output similarity, I think, is that even when it can be detected, it is still evidence that the model encodes its training data almost verbatim in its parameters. It's similar to a library of JPEG images, which are likewise encoded almost verbatim in the quantized coefficients of the discrete cosine transform. While training a model on copyrighted works might be fair use (not saying it is; that's not for me to decide), encoding the training data almost verbatim in the model's parameters as a result of that training doesn't seem like fair use. At that point the training data is essentially stored in a kind of library used for commercial purposes, and that, I think, is a clear copyright violation.
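To make the JPEG analogy concrete, here's a minimal sketch of the idea: a block of pixel data is transformed with the DCT, the coefficients are quantized (that's the lossy "storage" step), and the original can still be reconstructed almost verbatim from what was stored. The block size, quantization step, and random "image" below are made up for illustration; real JPEG adds per-frequency quantization tables, chroma subsampling, and entropy coding on top of this.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # Orthonormal DCT-II basis matrix (the transform JPEG builds on).
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    mat[0] /= np.sqrt(2.0)  # DC row scaling makes the matrix orthonormal
    return mat

rng = np.random.default_rng(0)
block = rng.uniform(0, 255, size=(8, 8))    # toy 8x8 "image" block

C = dct_matrix(8)
step = 4.0                                  # hypothetical uniform quantization step
coeffs = C @ block @ C.T                    # forward 2-D DCT
stored = np.round(coeffs / step)            # the quantized integers that get stored
reconstructed = C.T @ (stored * step) @ C   # decode purely from the stored coefficients

max_err = np.abs(block - reconstructed).max()
print(f"max per-pixel reconstruction error: {max_err:.2f}")
```

The point of the sketch: nothing resembling the original pixels is "copied" anywhere, yet the stored coefficients let you recover the image to within the quantization error, which is the sense in which the data is "almost verbatim encoded" in a different representation.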