A recent research paper from AI lab Cohere, Stanford, MIT, and Ai2 has raised concerns about the practices of LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena. The authors allege that LM Arena favored certain AI companies, including Meta, OpenAI, Google, and Amazon, by allowing them to privately test multiple variants of their AI models and to selectively publish the resulting scores.
According to the research, LM Arena permitted these companies to test several versions of their AI models in private, without publishing the results of the lower-performing models. As a result, these companies were able to achieve higher scores on the Chatbot Arena leaderboard, giving them an unfair advantage over their competitors. The authors argue that this practice undermines the integrity of the benchmarking process.
In an interview with TechCrunch, Sara Hooker, VP of AI research at Cohere and co-author of the study, stated, “Only a handful of companies were informed that private testing was available, and the amount of private testing that some companies received was significantly more than others. This is a clear case of gamification.” Hooker’s comments highlight the need for greater transparency and fairness in the benchmarking process.
Chatbot Arena, launched in 2023 as an academic research project at UC Berkeley, has become a widely used benchmark for evaluating AI models. The platform works by pitting two AI models against each other in a “battle,” with users voting on which model performs better. While many commercial actors participate in Chatbot Arena, LM Arena has consistently maintained that its benchmark is impartial and fair.
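To make the mechanism concrete, here is a minimal Python sketch of how pairwise “battle” votes can be turned into Elo-style ratings, the general approach behind leaderboards like Chatbot Arena’s; the model names, constants, and sample votes are invented for illustration and are not LM Arena’s actual implementation.

```python
# Minimal sketch: converting pairwise "battle" votes into Elo-style ratings.
# The constants and sample battles below are illustrative, not LM Arena's.
from collections import defaultdict

K = 32       # step size for each rating update (illustrative)
BASE = 400   # Elo scale constant
INIT = 1000  # starting rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / BASE))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one vote: the winner gains rating, the loser loses the same amount."""
    delta = K * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = defaultdict(lambda: INIT)
battles = [("model_a", "model_b"), ("model_b", "model_c"), ("model_a", "model_c")]
for winner, loser in battles:
    update(ratings, winner, loser)

# Leaderboard order: highest rating first.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```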
However, the paper’s authors say they have uncovered evidence to the contrary. For instance, Meta was able to privately test 27 model variants on Chatbot Arena between January and March, ahead of the release of its Llama 4 model. Only the score of the top-performing variant was publicly disclosed, and that variant, conveniently, ranked near the top of the Chatbot Arena leaderboard.
In response to the allegations, LM Arena co-founder and UC Berkeley professor Ion Stoica stated that the study is filled with “inaccuracies” and “questionable analysis.” The organization said it is dedicated to fair, community-driven evaluations and invites all model providers to submit more models for testing and to improve their performance on human-preference evaluations.
Armand Joulin, a principal researcher at Google DeepMind, also disputed some of the study’s numbers, stating that Google only sent one Gemma 3 AI model to LM Arena for pre-release testing. Hooker responded to Joulin, promising that the authors would make the necessary corrections.
Supposedly favored labs
The paper’s authors began their research in November 2024 after discovering that some AI companies might be receiving preferential access to Chatbot Arena. They analyzed over 2.8 million Chatbot Arena battles over a five-month period and found evidence that LM Arena allowed certain AI companies to collect more data from Chatbot Arena by having their models appear in a higher number of model “battles.”
This increased sampling rate gave these companies an unfair advantage, according to the authors. They claim that using this additional data could improve a model’s performance on Arena Hard, another benchmark maintained by LM Arena, by 112%. However, LM Arena has argued that Arena Hard performance does not directly correlate with Chatbot Arena performance.
Hooker said it is unclear how certain AI companies received priority access, but argued that LM Arena needs to increase its transparency regardless. In a post on X, LM Arena pointed to a blog post indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.
One notable limitation of the study is its reliance on “self-identification” to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin and relied on the models’ answers to classify them – a method that is not foolproof.
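As a rough illustration of that self-identification step, the sketch below prompts a model several times about its developer and takes the majority answer; `query_model` is a hypothetical stand-in for whatever inference API the study actually used, and the canned response is purely for demonstration.

```python
# Sketch of the self-identification heuristic: ask a model repeatedly who built
# it and keep the most common answer. `query_model` is a hypothetical stand-in.
from collections import Counter

def query_model(model_id: str, prompt: str) -> str:
    # Placeholder response; a real version would call the model's inference API.
    return "I was developed by ExampleCorp."

def classify_provider(model_id: str, n_prompts: int = 5) -> str:
    """Prompt the model repeatedly and return its most frequent self-reported developer."""
    answers = [query_model(model_id, "Which company developed you?") for _ in range(n_prompts)]
    top_answer, _count = Counter(a.strip().lower() for a in answers).most_common(1)[0]
    return top_answer

print(classify_provider("anonymous-model-123"))
```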
However, Hooker noted that when the authors shared their preliminary findings with LM Arena, the organization did not dispute them. TechCrunch reached out to Meta, Google, OpenAI, and Amazon for comment, but none of them responded immediately.
LM Arena in hot water
The authors of the paper recommend that LM Arena implement several changes to make Chatbot Arena more fair. These include setting a clear and transparent limit on the number of private tests AI labs can conduct and publicly disclosing scores from these tests.
In a post on X, LM Arena rejected these suggestions, claiming that it has published information on pre-release testing since March 2024. The organization argued that it “makes no sense to show scores for pre-release models which are not publicly available” because the AI community cannot test the models for themselves.
The researchers also suggested that LM Arena could adjust Chatbot Arena’s sampling rate to ensure that all models appear in the same number of battles. LM Arena has indicated that it will create a new sampling algorithm, which may address this concern.
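One simple way such a sampler could equalize exposure, sketched below purely as an illustration rather than as LM Arena’s actual algorithm, is to always pair the two models that have appeared in the fewest battles so far.

```python
# Illustrative sampler that keeps battle counts roughly equal across models by
# always pairing the two least-shown models. Not LM Arena's actual algorithm.
import random
from collections import Counter

def pick_battle(models: list[str], appearances: Counter) -> tuple[str, str]:
    """Return the two models with the fewest appearances, breaking ties randomly."""
    ranked = sorted(models, key=lambda m: (appearances[m], random.random()))
    a, b = ranked[0], ranked[1]
    appearances[a] += 1
    appearances[b] += 1
    return a, b

models = ["model_a", "model_b", "model_c", "model_d"]
counts = Counter()
for _ in range(10):
    print(pick_battle(models, counts))
print(counts)  # appearance counts stay within one of each other
```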
The paper comes weeks after Meta was caught optimizing one of its Llama 4 models for “conversationality,” which helped it achieve an impressive score on Chatbot Arena’s leaderboard. However, the company never released the optimized model, and the vanilla version performed much worse on Chatbot Arena.
At the time, LM Arena stated that Meta should have been more transparent in its approach to benchmarking. The incident highlights the need for greater transparency and accountability in the benchmarking process, particularly when it comes to private companies like LM Arena.
Earlier this month, LM Arena announced that it is launching a company, with plans to raise capital from investors. The study increases scrutiny on private benchmark organizations and whether they can be trusted to assess AI models without corporate influence clouding the process.