A new study appears to lend credence to the claim that OpenAI trained at least some of its AI models on copyrighted content.
OpenAI faces lawsuits brought by authors, programmers, and other rights holders, who accuse the company of using their works (books, codebases, and so on) to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs argue that US copyright law has no carve-out for training data.
The study, co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data "memorized" by models behind an API, such as OpenAI's.
Models are essentially prediction engines. Trained on large amounts of data, they learn patterns, which is how they are able to generate essays, photos, and more. Most outputs are not verbatim copies of the training data, but because of the way models "learn," some inevitably are. Image models have been found to regurgitate screenshots from films they were trained on, and language models have been observed effectively plagiarizing news articles.
The study's method relies on words the co-authors call "high-surprisal," that is, words that are statistically uncommon in the context of a larger body of work. For example, the word "radar" in the sentence "Jack and I sat perfectly still with the radar humming" would be considered high-surprisal because it is statistically less likely than words such as "engine" or "radio" to appear before "humming."
The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to "guess" the masked words. If a model guessed correctly, it likely memorized the snippet during training, the co-authors concluded.
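The masking-and-guessing step described above can be illustrated with a minimal Python sketch. This is not the authors' actual code; the function names are hypothetical, and a real probe would send the masked passage to a model API (e.g. GPT-4) rather than supply the guess locally, as done here for illustration.

```python
# Hedged sketch of the masking idea from the study (illustrative names,
# not the authors' implementation).

def mask_word(passage: str, target: str, mask: str = "[MASK]") -> str:
    """Replace the chosen high-surprisal word with a mask token."""
    return passage.replace(target, mask, 1)

def memorization_hit(guess: str, target: str) -> bool:
    """An exact (case-insensitive) recovery counts as evidence of memorization."""
    return guess.strip().lower() == target.lower()

passage = "Jack and I sat perfectly still with the radar humming."
masked = mask_word(passage, "radar")
print(masked)  # "Jack and I sat perfectly still with the [MASK] humming."

# In the study, the guess would come from a model behind an API; here we
# only demonstrate the scoring step.
print(memorization_hit("radar", "radar"))   # exact recovery -> hit
print(memorization_hit("engine", "radar"))  # plausible but wrong -> miss
```

The key design point is that only high-surprisal words make informative probes: a model can guess a common word ("engine") from context alone, but recovering a statistically unlikely word suggests it has seen the exact passage before.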

According to the results of the tests, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset of copyrighted ebook samples called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.
Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the "contentious data" models may have been trained on.
"In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Ravichander said. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."
OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has certain content licensing deals in place and offers opt-out mechanisms that let copyright owners flag content they would prefer it not use for training purposes, it has also lobbied several governments to codify "fair use" rules around AI training.