Premium publishers had their data scraped more than we thought

by · Android Headlines

A major topic in AI is how AI companies gather data to train their models. Publishers like The New York Times are suing OpenAI and Microsoft for scraping their content to train ChatGPT. While AI companies extract the majority of their data from publicly available sources, it seems that they gather data from more premium publishers than we’d think.

Whether AI companies can use pay-walled content to train their models is still a legal gray area. It’s debated whether this is technically copyright infringement. If the chatbot in question reproduces entire sections of the paid content, then that could be grounds for a lawsuit. This is one reason for the New York Times lawsuit. It’s also why AI companies are looking to cut deals with so many publishers: to avoid legal trouble, among other reasons. The only issue is that these AI companies were most likely scraping pay-walled data long before the publications knew about it.

AI companies scrape more data from premium publishers than many think

A new report from Ziff Davis (via Axios) has just shed some light on how much premium content AI companies have scraped. For the report, co-authors George Wukoson and Joey Fortuna analyzed several LLMs and the content used to train them. What they found was that a large amount of the data used to train some of the largest models came from 15 premium publications.

One major example was GPT-2, which was trained by OpenAI. The researchers took OpenWebText, an open-source replica of the WebText dataset that OpenAI used to train the model. They found that about 10% of the information in that dataset came from premium websites. Other datasets used to train older models also drew heavily on data from premium sites.

This means that some of the older LLMs (probably models that never powered user-facing chatbots) were trained on a significant amount of information from premium sites. Even so, the report found that some of those older datasets are still being used to train newer models. This means that newer models could still be using pay-walled material.

So, while several publications have been making deals with AI companies, the models behind many of the most powerful chatbots on the market may still be using information taken from pay-walled content.