
A new study lends weight to claims that OpenAI trained at least some of its AI models on copyrighted material.

OpenAI is currently facing lawsuits from authors, programmers, and other rights-holders who allege that the company used their work, including books and codebases, without permission to develop its models. OpenAI has consistently maintained a fair use defense, whereas the plaintiffs argue that U.S. copyright law does not provide an exception for training data.

A team of researchers from the University of Washington, the University of Copenhagen, and Stanford collaborated on the study, which proposes a novel approach to identifying training data that has been “memorized” by models accessed through an API, such as OpenAI’s.

Models are essentially prediction engines that learn patterns from large amounts of data, enabling them to generate content like essays, images, and more. Although most generated outputs are not direct copies of the training data, some inevitably are. For instance, image models have been known to reproduce screenshots from movies used in their training, while language models have been found to plagiarize news articles.

The study’s methodology relies on identifying words with high “surprisal” values, which are words that stand out as unusual in the context of a larger body of work. For example, in the sentence “Jack and I sat perfectly still with the radar humming,” the word “radar” would be considered high-surprisal due to its relatively low probability of appearing before “humming” compared to other words like “engine” or “radio.”
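The notion of surprisal can be made concrete with a small sketch. The snippet below is illustrative only: the conditional probabilities are made-up numbers standing in for what a real language model would assign, not figures from the study. Surprisal is the standard information-theoretic quantity, the negative log probability of a word given its context.

```python
import math

# Hypothetical probabilities P(word | "...sat perfectly still with the ___ humming")
# -- illustrative values only, not drawn from any real model or the study.
p_given_context = {
    "engine": 0.05,
    "radio": 0.02,
    "radar": 0.001,
}

def surprisal(word: str) -> float:
    """Surprisal in bits: -log2 P(word | context). Rarer words score higher."""
    return -math.log2(p_given_context[word])

for word in sorted(p_given_context, key=surprisal):
    print(f"{word}: {surprisal(word):.2f} bits")
```

Under these toy numbers, "radar" carries roughly ten bits of surprisal versus about four for "engine," which is what makes it a useful probe word: a model is unlikely to guess it unless it has seen the exact passage.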

The researchers tested several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times articles and having the models attempt to “guess” the missing words. If the models were able to guess correctly, it suggested that they had memorized the snippets during training.
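The probe described above can be sketched in a few lines. This is a simplified reconstruction under stated assumptions, not the researchers' actual harness: `guess` stands in for an API call to a model, the mask token and exact-match scoring are assumptions, and the example sentence is the one quoted earlier.

```python
from typing import Callable, Iterable, Tuple

def probe_snippet(snippet: str, target: str, guess: Callable[[str], str]) -> bool:
    """Mask the high-surprisal `target` word and check whether the model
    reproduces it exactly -- a rough proxy for memorization."""
    masked = snippet.replace(target, "[MASK]", 1)
    return guess(masked).strip().lower() == target.lower()

def memorization_rate(examples: Iterable[Tuple[str, str]],
                      guess: Callable[[str], str]) -> float:
    """Fraction of (snippet, target) pairs the model fills in correctly."""
    examples = list(examples)
    hits = sum(probe_snippet(snippet, target, guess) for snippet, target in examples)
    return hits / len(examples)

# Stand-in "model" that always answers "radar" -- for illustration only;
# in practice `guess` would query a model API with the masked snippet.
examples = [("Jack and I sat perfectly still with the radar humming.", "radar")]
print(memorization_rate(examples, lambda masked: "radar"))  # 1.0
```

A high rate on snippets drawn from a specific corpus suggests those snippets were in the training data, since high-surprisal words are by construction hard to guess from context alone.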

[Image: An example of a model attempting to “guess” a high-surprisal word. Image Credits: OpenAI]

The test results showed that GPT-4 exhibited signs of having memorized portions of popular fiction books, including those from the BookMIA dataset, which contains samples of copyrighted ebooks. Additionally, the results suggested that the model memorized portions of New York Times articles, albeit at a lower rate.

Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study, discussed the findings with TechCrunch, stating that they provide insight into the potentially “contentious data” used to train models.

“To develop trustworthy large language models, we need to be able to probe, audit, and scientifically examine them,” Ravichander explained. “Our research aims to provide a tool for probing large language models, but there is a pressing need for greater data transparency throughout the entire ecosystem.”

OpenAI has long advocated for looser restrictions on developing models with copyrighted data. While the company has struck certain content licensing deals and offers opt-out mechanisms for copyright owners, it has also lobbied governments to codify “fair use” rules for AI training.

