OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied.

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of the questions on FrontierMath, a challenging set of math problems. That score blew away the competition; the next-best model managed to answer only around 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” said Mark Chen, chief research officer at OpenAI. “We’re seeing [internally], with o3, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly released last week.

Epoch AI, the research institute behind FrontierMath, published the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.

That doesn’t mean OpenAI lied, per se. The benchmark results the company published in December include a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 versus the 290 problems in frontiermath-2025-02-28-private),” Epoch wrote.

According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model [...] tuned for chat/product use,” corroborating Epoch’s report.

“All released o3 compute tiers are smaller than the version we [benchmarked],” ARC Prize wrote. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

Wenda Zhou, a member of OpenAI’s technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases.” As a result, it may show benchmark “disparities,” he added.

“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “We still think [...] this is a much better model [...] which is a real thing with these [types of] models.”

Granted, the fact that o3’s public release falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

Still, it’s another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding it received from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one it made available to developers.

Updated 4:21 p.m. Pacific: Added comments that Wenda Zhou, an OpenAI technical staff member, made during a livestream last week.