With more and more content generated by AI, a new peril has researchers buzzing: so-called “model collapse.” This could occur when the content used to train generative AI models is itself AI generated.
Think of this problem as a version of “Telephone,” the popular kids’ game. I’m not sure whether the plugged-in youngest generation still plays Telephone, but as I remember it, one child whispers a sentence into another child’s ear. The sentence then gets passed from child to child until the last person in line says aloud what she heard. It’s usually very different from the sentence the first child started with, and hilarity ensues as everyone compares versions.
That’s what AI model collapse might look like. The more AI generated content there is, the more likely it is that models will be trained on AI generated content. And with that comes a growing likelihood that the content being spit out is gobbledygook, like the last kid at Telephone.
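To make the Telephone analogy concrete, here is a minimal toy simulation (a hypothetical illustration, not any real training pipeline): each “generation” fits a simple statistical model to data produced by the previous generation, then generates fresh data from that fit. Small sampling errors compound from one generation to the next, so the data drifts away from the original distribution, typically losing its rarer, tail values first.

```python
import random
import statistics

def one_generation(data, n_samples, rng):
    """Fit a normal distribution to `data`, then sample new data from the fit.

    This stands in for "training a model on the previous model's output."
    """
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]

rng = random.Random(42)

# Generation 0: "human-written" data drawn from a standard normal.
data = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Each later generation is trained only on the previous generation's output.
for gen in range(10):
    data = one_generation(data, n_samples=200, rng=rng)

# After repeated self-training, the fitted mean and spread have drifted
# from the original values (0.0 and 1.0) purely through compounding
# sampling error -- the statistical version of Telephone.
print(round(statistics.fmean(data), 3), round(statistics.stdev(data), 3))
```

The drift here is driven by sampling error alone; real generative models add further distortions from imperfect fitting, which is why researchers worry the degradation could be much faster in practice.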
Millions of blog posts and web pages are now AI generated. It’s likely that thousands of new books listed on Amazon and other sites are also AI generated.
One report suggests that "as much as 90 percent of online content may be synthetically generated by 2026".
The only reason this book was caught was that its timeliness made it so obvious. I’m sure thousands of other AI generated books are never taken down.
An obvious solution to the problem is to not train new AI models on AI generated content. However, that’s likely impossible, because so much newly created content is AI generated.
Another solution is to stick with the existing AI models created in 2022 or earlier, before the rise of ChatGPT. But this won’t work either, because people need new information to query. That approach would be like consulting an old family encyclopedia published a decade ago. It’s fine if you want to know about sloths or fishing, but worthless for anything timely, like current events, sports results, or politics.