
Maverick, one of Meta's new flagship AI models announced on Saturday, ranked second on LM Arena, a test in which human raters compare model outputs and choose which they prefer. However, the version of Maverick that Meta deployed to LM Arena appears to differ from the version that is widely available to developers.
As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an "experimental chat version." Meanwhile, a chart on the official Llama website discloses that Meta's LM Arena testing was performed using "Llama 4 Maverick optimized for conversationality."
As we've written before, LM Arena has never been the most reliable measure of an AI model's performance. Still, AI companies generally have not customized or fine-tuned their models specifically to score better on LM Arena.
The problem with tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model is that it makes it hard for developers to predict exactly how the model will perform in particular contexts. It's also misleading. Ideally, benchmarks provide a snapshot of a single model's strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in the behavior of the openly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version seems to use far more emojis and give incredibly long-winded answers.
Okay Llama 4 is a little cooked lol, what is this yap city? pic.twitter.com/Y3GVHBVZ65
-Nathan Lambert (@natolambert) April 6, 2025
For some reason, the Llama 4 model in Arena uses a lot more emojis. On Together AI, it looks better: pic.twitter.com/F74ODX4ZTT
-Tech Dev Notes (@techDEVNOTES) April 6, 2025
We have reached out to Meta and to Chatbot Arena, the organization that maintains LM Arena.