While some may think that Pokémon is a challenging benchmark for AI, a group of researchers suggests that Super Mario Bros. is actually a more difficult test.
On Friday, Hao AI Lab, a research organization at the University of California San Diego, ran AI models through live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5, while Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled to keep up.
It’s worth noting that the version of Super Mario Bros. used in the experiment was not the original 1985 release, but a modified build running in an emulator and integrated with GamingAgent, a framework the lab developed in-house. GamingAgent feeds each model basic instructions, such as “dodge obstacles or enemies by moving or jumping left,” along with screenshots of the game.

From those inputs, the models generate actions in the form of Python code to control Mario’s movements. According to the lab, the game requires each model to “learn” complex maneuvers and develop gameplay strategies, which is what makes it a challenging benchmark for AI.
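The lab hasn’t published its exact loop here, but a setup like this typically boils down to a capture, prompt, execute cycle. The sketch below is a minimal illustration of that pattern, not GamingAgent’s actual code: the `emulator` module, its `capture_frame` and `press_button` hooks, and the `model.complete` call are all hypothetical stand-ins.

```python
import base64

# Hypothetical emulator hooks -- GamingAgent's real interface may differ.
from emulator import capture_frame, press_button

INSTRUCTIONS = (
    "You are playing Super Mario Bros. Dodge obstacles or enemies by "
    "moving or jumping left. Reply with one line of Python that calls "
    "press_button(name)."
)

def play_step(model):
    """One capture -> prompt -> execute cycle of the agent loop."""
    # 1. A screenshot is the model's only view of the game state.
    frame = base64.b64encode(capture_frame()).decode()  # PNG bytes -> base64

    # 2. Send the fixed instructions plus the screenshot to the model.
    #    `model.complete` stands in for whichever LLM API is in use.
    code = model.complete(prompt=INSTRUCTIONS, image=frame)

    # 3. Run the returned snippet with only the input function exposed.
    #    (A production harness would validate or sandbox this first.)
    exec(code, {"press_button": press_button})

def play(model, steps=1000):
    for _ in range(steps):
        play_step(model)
```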
Interestingly, the lab found that so-called reasoning models, such as OpenAI’s o1, which work through problems step by step, performed worse than non-reasoning models, despite generally beating them on other benchmarks. The likely culprit, according to the researchers, is latency: reasoning models can take seconds to decide on each action.
In a fast-paced game like Super Mario Bros., timing is crucial, and a pause of even a second can be the difference between a cleared jump and a lost life. That makes real-time games an awkward fit for models built for deliberation, however well they do elsewhere.
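To put that in perspective, the NES runs at roughly 60 frames per second, so even a brief pause to “think” lets much of the game slip past. A quick back-of-the-envelope calculation (the latencies below are illustrative assumptions, not figures reported by the lab):

```python
FPS = 60  # the NES renders roughly 60 frames per second (NTSC)

# Illustrative decision latencies -- not measurements from Hao AI Lab.
latencies_s = {
    "fast non-reasoning model": 0.5,
    "step-by-step reasoning model": 5.0,
}

for model, latency in latencies_s.items():
    print(f"{model}: ~{int(latency * FPS)} frames elapse per decision")

# Output:
# fast non-reasoning model: ~30 frames elapse per decision
# step-by-step reasoning model: ~300 frames elapse per decision
```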
While games have been used to benchmark AI for decades, some experts question how strongly gaming performance tracks broader technological progress. Games tend to be abstract and relatively simple, and they offer an effectively unlimited supply of training data, conditions that rarely hold in real-world scenarios.
The recent surge in flashy gaming benchmarks points to what Andrej Karpathy, a research scientist and founding member at OpenAI, has called an “evaluation crisis.”
“I don’t really know what [AI] metrics to look at right now,” he wrote in a post on X. “TLDR my reaction is I don’t really know how good these models are right now.”
At least we can enjoy the spectacle of AI playing Super Mario Bros., even if it’s not a perfect measure of its abilities.