It may be an empirical hypothesis, but having LLMs produce synthetic data that is then used to train more powerful LLMs *seems* like it should violate some fundamental law.
That's too close to alchemy.
There is actually research showing that training LLMs on output from LLMs leads to model collapse after just a few iterations.
One would think that the developers of LLMs would be very concerned about such things.
The problem is not restricted to “synthetic data” produced specifically for training purposes: given that the web is now being flooded with LLM-generated data, simply training on random data scraped from the web going forward will inevitably have the same result.
So, LLMs require more data, but more data generated by LLMs will actually make them worse.
Quite the pickle.
“AI models collapse when trained on recursively generated data” (published in Nature, 2024):
https://www.nature.com/articles/s41586-024-07566-y
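The intuition behind that result can be seen in a toy simulation. This is just my own minimal sketch of the mechanism, not the paper's actual experimental setup; the function name recursive_fit and the parameter values are illustrative assumptions. Fit a Gaussian to a finite sample, draw a new sample from the fitted Gaussian, refit, and repeat: the estimation error compounds across generations and the fitted distribution tends to lose its tails.

```python
import numpy as np

def recursive_fit(generations=50, n_samples=200, seed=0):
    """Fit a Gaussian to samples, resample from the fit, refit, repeat.

    Each generation's "model" (a fitted mean and std) is trained only on
    data generated by the previous generation's model. Finite-sample error
    compounds, and the estimated std tends to shrink over generations,
    i.e. the distribution progressively loses its tails."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0 is "real" data: a standard normal
    history = [(0, mu, sigma)]
    for g in range(1, generations + 1):
        # Draw synthetic data from the current model ...
        samples = rng.normal(mu, sigma, size=n_samples)
        # ... and "train" the next model on nothing but that synthetic data.
        mu, sigma = samples.mean(), samples.std()
        history.append((g, mu, sigma))
    return history

if __name__ == "__main__":
    for g, mu, sigma in recursive_fit():
        if g % 10 == 0:
            print(f"generation {g:3d}: mean={mu:+.3f}  std={sigma:.3f}")
```

A real LLM is obviously far more than a two-parameter Gaussian, but the compounding of finite-sample estimation error across generations is the same basic mechanism the paper describes.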
Wait a minute. I saw something on exactly that a couple years ago. When you do that, it results in a degenerative cycle that produces wave patterns.