
Lawyers for The New York Times and Daily News, which are suing OpenAI for allegedly scraping their work to train AI models without permission, say OpenAI engineers accidentally deleted data that may be relevant to the case.
Earlier this fall, OpenAI agreed to provide two virtual machines so that lawyers for The Times and Daily News could search for their copyrighted content in its AI training sets. (A virtual machine is a software-based computer that exists within the operating system of another computer, often used for testing, data backup, and running apps.) In a letter, the publishers' lawyers say that they and the experts they hired have spent more than 150 hours since November 1 searching OpenAI's training data.
However, on November 14, OpenAI engineers deleted all of the publishers' search data stored on one of the virtual machines, according to the aforementioned letter, which was filed late Wednesday in the U.S. District Court for the Southern District of New York.
OpenAI attempted to recover the data and was largely successful. But because the folder structure and file names have been “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs' copied articles were used to build [OpenAI's] models,” according to the letter.
“News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time,” lawyers for The Times and Daily News wrote. “The news plaintiffs learned only yesterday that the recovered data is unusable and that an entire week's worth of its experts' and lawyers' work must be redone, which is why this supplemental letter is being filed today.”
The plaintiffs' attorneys made clear that they have no reason to believe the deletion was intentional. But they said the incident highlights that OpenAI is “best positioned to scan its own data sets” for potentially infringing content using its own tools.
An OpenAI spokesperson declined to provide a statement.
But late Friday, Nov. 22, OpenAI's lawyers filed a response to the letter filed Wednesday by lawyers for The Times and Daily News. In their response, OpenAI's lawyers unequivocally denied that OpenAI deleted any evidence and instead suggested that the plaintiffs were to blame for a system misconfiguration that led to the technical issue.
“Plaintiffs requested a configuration change to one of several machines provided by OpenAI to search training datasets,” OpenAI's attorneys wrote. “However, implementing the changes requested by the plaintiffs removed the folder structure and some file names from one hard drive, a drive that was supposed to be used as a temporary cache. In any event, there is no reason to think that any files were actually lost.”
In this and other cases, OpenAI has argued that training models using publicly available data, including articles from The Times and Daily News, is fair use. In other words, OpenAI believes it is not required to license or otherwise pay for the examples it uses to create models like GPT-4o, which generate human-sounding text by “learning” from billions of examples of e-books, essays, and more, even if it makes money from those models.
That said, OpenAI has signed licensing deals with a growing number of new publishers, including the Associated Press, Business Insider owner Axel Springer, the Financial Times, People parent company Dotdash Meredith and News Corp. OpenAI has declined to make the terms of these deals public, but one content partner, Dotdash, reportedly receives at least $16 million per year.
OpenAI has neither confirmed nor denied that it trained its AI systems on specific copyrighted works without permission.
Update: Added OpenAI’s response to the claims.









