Meta’s vanilla Maverick AI model ranks below rivals on a popular chat benchmark

Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on the crowdsourced benchmark LM Arena. The incident prompted LM Arena’s maintainers to apologize, change their policies, and score the unmodified, vanilla Maverick instead.

It turns out the vanilla model isn’t very competitive.

The unmodified Maverick, “Llama-4-Maverick-17B-128E-Instruct,” ranked below models including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro as of Friday. Many of these models are months old.

Why the worse performance? Meta’s experimental Maverick, Llama-4-Maverick-03-26-Experimental, was “optimized for conversationality,” the company explained in a chart published last Saturday. Those optimizations evidently played well on LM Arena, which has human raters compare the outputs of models and choose which they prefer.

As we’ve written before, LM Arena has never been the most reliable measure of an AI model’s performance. Still, tailoring a model to a benchmark, besides being misleading, makes it harder for developers to predict how well the model will perform in other contexts.

In a statement, a Meta spokesperson told TechCrunch that the company experiments with “all types of custom variants.”

“‘Llama-4-Maverick-03-26-Experimental’ is a chat-optimized version that performs well on LM Arena,” the spokesperson said. “We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We’re excited to see what they will build and look forward to their ongoing feedback.”