
Debates over AI benchmarks — and how AI labs report them — are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. xAI co-founder Igor Babuschkin insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI's blog, the company recently published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3 — Grok 3 Reasoning Beta and Grok 3 mini Reasoning — beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph omitted o3-mini-high's AIME 2025 score at "cons@64."
What is cons@64? It's short for "consensus@64," and it basically gives a model 64 attempts at each problem in a benchmark, taking the most frequently generated answers as the final answers. As you can imagine, cons@64 tends to boost a model's benchmark score considerably, and omitting it from a graph can make one model appear to surpass another when in reality it doesn't.
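The cons@64 aggregation described above amounts to a majority vote over repeated samples. A minimal sketch, assuming hypothetical sampled answers (the sampling itself is not shown):

```python
from collections import Counter

def cons_at_k(samples: list[str]) -> str:
    """Return the most frequently generated answer among k attempts
    (the 'consensus' answer used for scoring)."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical set of 64 sampled answers for a single AIME problem:
samples = ["42"] * 30 + ["41"] * 20 + ["7"] * 14

print(cons_at_k(samples))  # consensus answer: "42"
```

By contrast, a "@1" score grades only the model's first sampled answer, which is why pass@1 numbers are typically lower than cons@64 numbers for the same model.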
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" — that is, the first score the models achieved on the benchmark — fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as "the world's smartest AI."
Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64.
Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok, while in reality it's DeepSeek propaganda. (I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@"1" deserves more scrutiny.)

— Teortaxes▶️ (DeepSeek Twitter 🐋 die-hard fan, 2023–∞) (@teortaxesTex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took each model to achieve its best score. That just goes to show how little most AI benchmarks reveal about models' limitations — and their strengths.
