BridgeBench Shows Top AI Models at 10% Accuracy Despite Strong Reasoning
Casablanca – BridgeBench, a new benchmarking project focused on AI reasoning, has released a ranking that exposes a gap between how confidently models explain answers and how often those answers are correct.
The benchmark tests models on reasoning-heavy tasks and scores them across three metrics. Accuracy measures whether the final answer is correct. Evidence evaluates how well the model supports its reasoning with verifiable steps or sources. The overall score combines both, aiming to reward systems that not only answer, but also justify.
In the latest results, xAI’s Grok 4.20 Reasoning model ranks first with a score of 41.8. It records 10.0% accuracy and 89.7% on evidence. OpenAI’s GPT-5.4 follows closely with a score of 40.6, posting the same 10.0% accuracy and a slightly stronger evidence score of 90.6%.
Anthropic’s Claude Opus 4.7 comes third at 40.3, but with lower accuracy at 6.7%, offset by the highest evidence score among the top models at 91.3%.
In fourth place is Grok 4.20, the non-reasoning version, scoring 40.0 with 6.7% accuracy and 89.9% evidence. Claude Opus 4.6 rounds out the top five with a score of 39.6, posting 10.0% accuracy and 86.1% evidence.
Further down, Google’s Gemini 3.1 Pro ranks 15th with a score of 34.3. Its accuracy drops sharply to 3.3%, despite an evidence score of 89.1%.
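The article does not say how accuracy and evidence are combined into the overall score. As a rough check, the sketch below assumes a hypothetical two-term weighted average, score = w × accuracy + (1 − w) × evidence, and solves for the accuracy weight w that each reported row would imply. The formula and the function name are illustrative assumptions, not BridgeBench's published method. Only the figures come from the ranking.

```python
# Figures as reported in the ranking: (model, overall score, accuracy %, evidence %)
REPORTED = [
    ("Grok 4.20 Reasoning", 41.8, 10.0, 89.7),
    ("GPT-5.4", 40.6, 10.0, 90.6),
    ("Claude Opus 4.7", 40.3, 6.7, 91.3),
    ("Grok 4.20", 40.0, 6.7, 89.9),
    ("Claude Opus 4.6", 39.6, 10.0, 86.1),
    ("Gemini 3.1 Pro", 34.3, 3.3, 89.1),
]


def implied_accuracy_weight(score, accuracy, evidence):
    """Solve score = w*accuracy + (1 - w)*evidence for w.

    The weighted-average form is an assumption made for illustration;
    the benchmark's actual scoring rule is not given in the article.
    """
    return (evidence - score) / (evidence - accuracy)


weights = {
    name: implied_accuracy_weight(score, acc, ev)
    for name, score, acc, ev in REPORTED
}

for name, w in weights.items():
    print(f"{name}: implied accuracy weight = {w:.3f}")

# If one fixed weight explained every row, this spread would be close to zero.
spread = max(weights.values()) - min(weights.values())
print(f"spread across models = {spread:.3f}")
```

Under that assumption, the implied weights fall between roughly 0.60 and 0.64 rather than at a single value, so a plain two-term average does not reproduce every reported score. That is consistent with GPT-5.4 matching Grok 4.20 Reasoning on accuracy and beating it on evidence while still scoring lower overall, which suggests the overall score may reflect something beyond these two published figures.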
What makes the ranking striking is not who leads, but how low accuracy remains across the board. Even the top systems answer correctly only about one time in ten.
At the same time, their evidence scores are consistently high, raising questions about what exactly is being measured. If models can produce convincing chains of reasoning while still being wrong most of the time, the benchmark may be capturing fluency more than reliability.





