The debate over AI benchmarks, and how AI labs report them, has spilled into public view.
Recently, an OpenAI employee alleged that Elon Musk’s AI company, xAI, had published misleading benchmark results for its latest AI model, Grok 3. However, Igor Babushkin, one of the co-founders of xAI, defended the company’s actions.
The truth likely lies somewhere in between.
xAI published a graph on its blog showcasing Grok 3’s performance on AIME 2025, a set of challenging math questions from a recent invitational mathematics exam. Although some experts have questioned the validity of AIME as an AI benchmark, it is still commonly used to test a model’s math abilities.
The graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperforming OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. However, OpenAI employees on X were quick to point out that xAI’s graph did not include o3-mini-high’s AIME 2025 score at “cons@64.”
To clarify, “cons@64” refers to “consensus@64,” a method that gives a model 64 attempts to answer each problem in a benchmark and takes the most frequently generated answers as the final answers. Omitting this from a graph can make it seem like one model surpasses another when that may not be the case.
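The majority-voting idea behind cons@64 is straightforward to sketch. The snippet below is an illustrative implementation, not any lab's actual evaluation code; the names `consensus_answer`, `score_cons_at_k`, and the `sample_fn` callback are assumptions for the example.

```python
from collections import Counter

def consensus_answer(attempts):
    """Return the most frequently generated answer among the attempts."""
    return Counter(attempts).most_common(1)[0][0]

def score_cons_at_k(problems, sample_fn, k=64):
    """Score a benchmark at cons@k: sample k answers per problem,
    take the majority answer, and count it correct if it matches
    the reference answer."""
    correct = 0
    for prompt, reference in problems:
        attempts = [sample_fn(prompt) for _ in range(k)]
        if consensus_answer(attempts) == reference:
            correct += 1
    return correct / len(problems)
```

Because the model gets 64 tries per problem and only the consensus counts, a model can answer correctly at cons@64 even when any single attempt (pass@1) would often be wrong, which is why mixing the two metrics in one chart is misleading.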
The scores for Grok 3 Reasoning Beta and Grok 3 mini Reasoning on AIME 2025 at “@1” — that is, with only a single attempt per problem — are actually lower than o3-mini-high’s score. Grok 3 Reasoning Beta also trails slightly behind OpenAI’s o1 model at its “medium” compute setting. Nevertheless, xAI is marketing Grok 3 as the “world’s smartest AI.”
Babushkin argued on X that OpenAI has also published misleading benchmark charts in the past, albeit comparing the performance of its own models. A neutral party in the debate created a more “accurate” graph showing nearly every model’s performance at cons@64:
It’s amusing to see some people viewing my plot as an attack on OpenAI and others as an attack on Grok, when in reality it’s about DeepSeek propaganda. (I actually think Grok looks good, and OpenAI’s TTC tactics behind o3-mini-*high*-pass@"""1""" deserve more scrutiny.)
— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
However, as AI researcher Nathan Lambert noted in a post, the most crucial metric remains unknown: the computational and monetary cost required for each model to achieve its best score. This highlights how little AI benchmarks reveal about models’ limitations and strengths.