It may be an empirical hypothesis, but having LLMs produce synthetic data that is then used to train more powerful LLMs *seems* like it should violate some fundamental law.
That's too close to alchemy.
There is actually research showing that training LLMs on output from LLMs leads to model collapse after just a few iterations.
One would think that the developers of LLMs would be very concerned about such things.
The problem is not restricted to “synthetic data” produced specifically for training purposes: given that the web is now being flooded with LLM-generated data, simply training on random data scraped from the web going forward will inevitably have the same result.
So, LLMs require more data, but more data generated by LLMs will actually make them worse.
Quite the pickle.
“AI models collapse when trained on recursively generated data” (published in Nature, 2024):
https://www.nature.com/articles/s41586-024-07566-y
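The intuition behind that result can be seen in a toy simulation. This is just my own minimal sketch of the mechanism, not the paper's actual experimental setup; the function name recursive_fit and the parameter values are illustrative assumptions. Fit a Gaussian to a finite sample, draw a new sample from the fitted Gaussian, refit, and repeat: the estimation error compounds across generations and the fitted distribution tends to lose its tails.

```python
import numpy as np

def recursive_fit(generations=50, n_samples=200, seed=0):
    """Fit a Gaussian to samples, resample from the fit, refit, repeat.

    Each generation's "model" (a fitted mean and std) is trained only on
    data generated by the previous generation's model. Finite-sample error
    compounds, and the estimated std tends to shrink over generations,
    i.e. the distribution progressively loses its tails."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0 is "real" data: a standard normal
    history = [(0, mu, sigma)]
    for g in range(1, generations + 1):
        # Draw synthetic data from the current model ...
        samples = rng.normal(mu, sigma, size=n_samples)
        # ... and "train" the next model on nothing but that synthetic data.
        mu, sigma = samples.mean(), samples.std()
        history.append((g, mu, sigma))
    return history

if __name__ == "__main__":
    for g, mu, sigma in recursive_fit():
        if g % 10 == 0:
            print(f"generation {g:3d}: mean={mu:+.3f}  std={sigma:.3f}")
```

A real LLM is obviously far more than a two-parameter Gaussian, but the compounding of finite-sample estimation error across generations is the same basic mechanism the paper describes.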
Wait a minute. I saw something on exactly that a couple years ago. When you do that, it results in a degenerative cycle that produces wave patterns.