Researchers suggest OpenAI trained AI models on paywalled O'Reilly books

OpenAI has been accused of training its AI on copyrighted content without permission. Now a new paper from an AI watchdog organization makes a serious accusation: that the company relied on non-public books it had no license for in order to train more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on large amounts of data (books, movies, TV shows, and so on), they learn patterns and new ways to extrapolate from a simple prompt. When a model "writes" an essay on a Greek tragedy or "draws" a Ghibli-style image, it is simply drawing on its vast knowledge to produce an approximation. It is not arriving at anything new.

Many AI labs, including OpenAI, have begun to embrace AI-generated data for training as they exhaust real-world sources (mainly the public web), but few have replaced real-world data entirely. That is likely because training on purely synthetic data carries risks, such as worsening a model's performance.

The new paper comes from the AI Disclosures Project, a nonprofit co-founded by media mogul Tim O'Reilly and economist Ilan Strauss. It concludes that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)

GPT-4o is the default model in ChatGPT. O'Reilly has no licensing agreement with OpenAI.

The paper's co-authors wrote that GPT-4o, OpenAI's more recent and more capable model, shows strong recognition of paywalled O'Reilly book content compared with OpenAI's earlier model, GPT-3.5 Turbo. By contrast, they wrote, GPT-3.5 Turbo shows relatively greater recognition of publicly accessible O'Reilly book samples.

The paper used a method called DE-COP, first introduced in an academic paper in 2024 and designed to detect copyrighted content in language models' training data. Also known as a "membership inference attack," the method tests whether a model can reliably distinguish human-authored text from AI-generated paraphrases of the same text. If it can, that suggests the model may have prior knowledge of the text from its training data.
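
The idea behind this kind of test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the `choose` callable stands in for a real model query, and the names `guess_rate` and `chance_rate` and the sample passages are hypothetical. Each quiz mixes one verbatim passage with paraphrases, and the score is how often the verbatim passage is picked.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical interface: given the shuffled options, return the index the
# model believes is the verbatim (human-authored) passage.
Chooser = Callable[[Sequence[str]], int]
Quiz = Tuple[str, Sequence[str]]  # (verbatim passage, paraphrased versions)


def guess_rate(quizzes: Sequence[Quiz], choose: Chooser, seed: int = 0) -> float:
    """Return the fraction of quizzes in which the verbatim passage is picked."""
    if not quizzes:
        raise ValueError("at least one quiz is required")
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in quizzes:
        options = [original, *paraphrases]
        rng.shuffle(options)  # shuffle so position gives no hint
        if options[choose(options)] == original:
            correct += 1
    return correct / len(quizzes)


def chance_rate(num_paraphrases: int) -> float:
    """Accuracy expected from blind guessing among all options."""
    return 1 / (num_paraphrases + 1)


# Stub "model" that has memorized the originals (illustrative data only).
memorized = {"orig A", "orig B"}
quizzes = [
    ("orig A", ["para A1", "para A2", "para A3"]),
    ("orig B", ["para B1", "para B2", "para B3"]),
]


def knows_originals(options: Sequence[str]) -> int:
    return next(i for i, text in enumerate(options) if text in memorized)


print(guess_rate(quizzes, knows_originals), "vs chance", chance_rate(3))
```

A guess rate well above the chance rate is what hints at prior exposure; a rate near chance does not.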

The paper's co-authors, O'Reilly, Strauss, and AI researcher Sruly Rosenblat, say they probed the knowledge that GPT-4o, GPT-3.5 Turbo, and other OpenAI models have of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.
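
One common way to summarize a before/after-cutoff comparison like this is an AUROC score: the probability that a randomly chosen pre-cutoff book gets a higher recognition score than a randomly chosen post-cutoff book. The article does not say which statistic the authors report, so treat the sketch below as an assumption; the function name `auroc` and the scores are made up for illustration.

```python
from typing import Sequence


def auroc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member score beats a random non-member score.

    Ties count as half a win. 0.5 means the two groups are indistinguishable;
    1.0 means every member outscores every non-member.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Made-up per-book recognition rates, NOT results from the paper.
before_cutoff = [0.60, 0.50, 0.70]  # books the model could have seen
after_cutoff = [0.30, 0.25, 0.50]   # books published too late to be seen

print(round(auroc(before_cutoff, after_cutoff), 3))
```

Books published after the cutoff serve as the control group, since the model cannot have been trained on them.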

According to the paper's results, GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. The authors said this holds even after accounting for potential confounding factors, such as newer models' improved ability to tell whether text was written by a human.

The co-authors wrote that GPT-4o likely recognizes, and so has prior knowledge of, many non-public O'Reilly books published before its training cutoff date.

It is not a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method is not foolproof and that OpenAI may have collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors did not evaluate OpenAI's most recent models, which include GPT-4.5 and "reasoning" models such as o3-mini and o1. It is possible that these models were not trained on paywalled O'Reilly book data, or were trained on less of it than GPT-4o.

That said, it is no secret that OpenAI, which has advocated for looser restrictions on developing models with copyrighted data, has been seeking higher-quality training data for some time. The company has even hired journalists to help fine-tune its models' outputs. This is a trend across the broader industry: AI companies are recruiting experts in domains such as science and physics so that those experts can effectively feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that let copyright owners flag content they would prefer the company not use for training.

Still, as OpenAI fights several lawsuits in U.S. courts over its training data practices and its treatment of copyright law, the O'Reilly paper is not the most flattering look.

OpenAI did not respond to a request for comment.
