Maverick, one of the flagship AI models Meta recently released, ranks second on LM Arena, a platform where human evaluators compare model outputs and choose the one they prefer. However, the version of Maverick used for this evaluation appears to differ from the one available to the general developer community.
Several prominent AI researchers and experts on X highlighted that Meta’s announcement described the LM Arena version of Maverick as an “experimental chat version.” Furthermore, a chart on the official Llama website reveals that the LM Arena testing utilized “Llama 4 Maverick optimized for conversationality,” indicating a specialized configuration.
As we have written before, LM Arena has limitations as a reliable measure of an AI model’s performance. AI companies may customize their models, but they typically neither disclose such modifications openly nor create specialized versions for benchmarking purposes. Tailoring a model to excel on a specific benchmark and then releasing a different version can be misleading, and it makes it hard for developers to predict how the model will perform in other contexts.
Benchmarks, despite their inherent shortcomings, should ideally provide a comprehensive view of a model’s capabilities and weaknesses across a broad range of tasks. Researchers on X have observed significant differences in behavior between the LM Arena version and the publicly available Maverick model: the LM Arena variant uses emojis extensively and gives lengthy responses.
Llama 4 seems a bit overcooked, what’s with all the verbosity? pic.twitter.com/y3GvhbVz65
— Nathan Lambert (@natolambert) April 6, 2025
Notably, the Llama 4 model on Arena uses a substantial number of emojis.
In contrast, the version on together.ai appears more refined: pic.twitter.com/f74ODX4zTt
— Tech Dev Notes (@techdevnotes) April 6, 2025
We have reached out to Meta and Chatbot Arena, the organization responsible for maintaining LM Arena, for further comment on this matter.