On Monday, a Meta executive disputed a rumor claiming that the company’s new AI models were trained to perform well on specific benchmarks while hiding their weaknesses.
Ahmad Al-Dahle, Meta’s VP of generative AI, stated on X that it is “completely false” that Meta’s Llama 4 Maverick and Llama 4 Scout models were trained on “test sets,” which are collections of data used to evaluate a model’s performance after training. Training on a test set could artificially inflate a model’s benchmark scores, giving a misleading impression of its capabilities.
The rumor, which began circulating on X and Reddit over the weekend, claimed that Meta had artificially inflated its new models’ benchmark results. The rumor appears to have originated from a post on a Chinese social media site by a user who claimed to have resigned from Meta in protest over the company’s benchmarking practices.
The rumor was fueled by reports of Maverick and Scout struggling with certain tasks, as well as Meta’s decision to use an unreleased version of Maverick to achieve better scores on the LM Arena benchmark. Researchers on X have noted significant differences in the behavior of the publicly downloadable Maverick model compared to the version hosted on LM Arena.
Al-Dahle acknowledged that some users are experiencing “inconsistent quality” with Maverick and Scout across different cloud providers hosting the models.
“Since we released the models as soon as they were ready, we anticipate it will take several days for all public implementations to be optimized,” Al-Dahle said. “We will continue to work on bug fixes and onboarding partners.”