The Washington Post released an analysis of how AI chatbots gather content on the public Web. The report, “Inside the secret list of websites that make AI like ChatGPT sound smart” (subscription required), is a fascinating read.
I was especially interested to see that the analysis includes a tool to check whether your own website is part of Google’s C4 data set (Colossal Clean Crawled Corpus), a large collection of web text used to train large language models, the technology behind chatbots like ChatGPT and Google Bard.
The analysis ranked roughly 10 million websites based on how many “tokens” from each appeared in the data set. Tokens are the small bits of text, typically a word or a piece of a word, that a model uses to process otherwise disorganized information.
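To make the ranking idea concrete, here is a minimal sketch of how you might count tokens per website and sort the results. It is only an illustration, not the method The Washington Post used: the function name `rank_sites_by_tokens` and the sample pages are hypothetical, and splitting text on whitespace is a simplifying assumption (real tokenizers break text into smaller pieces).

```python
from collections import Counter
from urllib.parse import urlparse


def rank_sites_by_tokens(pages):
    """Rank domains by total token count.

    `pages` is an iterable of (url, text) pairs. Tokens are approximated
    by whitespace-separated words, which is an assumption made for this
    sketch; real tokenizers split text differently.
    """
    counts = Counter()
    for url, text in pages:
        domain = urlparse(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same site.
        if domain.startswith("www."):
            domain = domain[4:]
        counts[domain] += len(text.split())
    # most_common() returns (domain, count) pairs, largest count first.
    return counts.most_common()


# Hypothetical sample data, for illustration only.
sample_pages = [
    ("https://www.example.com/a", "one two three four"),
    ("https://example.com/b", "five six"),
    ("https://blog.example.org/post", "alpha beta gamma"),
]

ranking = rank_sites_by_tokens(sample_pages)
for domain, token_count in ranking:
    print(domain, token_count)
```

A site with many long pages rises to the top of a ranking like this, which is why token counts say more about a site's weight in the data set than a simple count of pages would.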
Business and industrial websites made up the biggest category of content in Google’s C4 data set (16 percent of categorized tokens). The data set also includes more than half a million personal blogs (3.8 percent of categorized tokens).
Many people are concerned that these AI models harvest their data. They see it as “stealing” because the content is used without attribution. As a writer, I can certainly understand that.
Let’s dig a little deeper
While the sources of data used in AI-generated results aren’t yet reported, I firmly believe that, over time, AI companies will list where the data in a specific response comes from.
Perhaps governments will require reporting. Perhaps there will eventually be a way for a website owner to opt in or opt out of having their data used. I suspect that soon, AI companies will volunteer the sources of data used in a response.
Chatbots are the new search
No matter how it happens, being part of chat responses will become valuable, just like being at the top of search engine results is valuable today.
Two of my websites are included in the Google C4 data set: DavidMeermanScott.com (where my blog is hosted) and newsjacking.com. I already knew that both sites are also part of the data behind ChatGPT because when I enter specific queries, the resulting answers clearly pull from my content.
Today, companies are investing billions in surfacing content on search engines like Google via paid search ads and optimized content.
In the future, if your content is used to train AI and the chatbots cite the websites they drew on in their responses, that becomes a new way to generate attention for your content.
AI presents a new world with many opportunities! It’s fun to think about what’s coming next and to play around with what’s available now.