VentureBeat AI

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

June 10, 2026 • 5 min read

Level: Advanced

For: AI Engineers

✦ TL;DR

The Agents' Last Exam (ALE) benchmark has launched to measure how well AI agents execute economically valuable, long-horizon professional workflows. OpenAI's GPT-5.5 took the top spot on the ALE Leaderboard with a 24.0% pass rate, ahead of Anthropic's Claude Fable 5. ALE evaluates models across five functional layers (Brain, Eyes, Body, Hands, and Feet) and uses deterministic, code-based evaluation to compare an agent's artifact against an expert's ground-truth reference. The benchmark currently consists of 1,490 task instances covering 55 non-physical industry sub-domains, and it is scaling toward a 5,000-task target. With even the top-ranked model failing roughly three out of four tasks, the results highlight how limited current AI models remain at executing real-world work.

⚡ Key Takeaways

GPT-5.5 achieved a 24.0% pass rate on the ALE Leaderboard, beating Claude Fable 5's 22.0% score.
The ALE benchmark evaluates models across five functional layers: Brain, Eyes, Body, Hands, and Feet.
The benchmark uses deterministic, code-based evaluation for 93.2% of its workflows, relying on LLM-as-a-judge grading for only 6.8%.
The ALE benchmark consists of 1,490 task instances, covering 55 non-physical industry sub-domains.
The benchmark is scaling toward a 5,000-task target, more than three times its current size.
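
To make the idea of deterministic, code-based evaluation concrete, here is a minimal Python sketch of a checker that compares an agent's output artifact against an expert reference and reports a pass rate. It is not ALE's actual grader: the JSON artifact format, the normalization rules, and the function names `normalize`, `passes`, and `pass_rate` are all assumptions made for illustration.

```python
import json


def normalize(value):
    """Canonicalize a JSON-like value so trivially different artifacts compare equal.

    Assumed rules (not from ALE): strip whitespace around strings and
    round floats to 6 decimal places.
    """
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize(item) for item in value]
    if isinstance(value, float):
        return round(value, 6)
    if isinstance(value, str):
        return value.strip()
    return value


def passes(agent_artifact: str, reference_artifact: str) -> bool:
    """Return True only if the agent's artifact matches the expert reference."""
    try:
        agent = json.loads(agent_artifact)
    except json.JSONDecodeError:
        # An unparseable artifact is a deterministic failure, not a judgment call.
        return False
    reference = json.loads(reference_artifact)
    return normalize(agent) == normalize(reference)


def pass_rate(results) -> float:
    """Percentage of task instances that passed (0.0 for an empty run)."""
    results = list(results)
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```

Because the same inputs always produce the same verdict, a check like this can be rerun and audited, which is the property that separates it from LLM-as-a-judge grading.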

💡 Why It Matters

The ALE benchmark provides a more realistic evaluation of AI models' ability to execute economically valuable, long-horizon professional workflows, highlighting the limitations of current models and the need for further development. The benchmark's focus on deterministic, code-based evaluation and its coverage of a wide range of industry sub-domains make it a valuable tool for assessing the capabilities of AI models.

✅ Practical Steps

Evaluate AI models using the ALE benchmark to assess their ability to execute real-world tasks.
Use the ALE benchmark's five functional layers to identify areas for improvement in AI models.
Apply the ALE benchmark's deterministic, code-based evaluation approach to other areas of AI research.
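
One way to act on the second step is to break pass rates down by functional layer. The Python sketch below assumes each task result is tagged with the layers it exercises; the summary does not say how ALE maps tasks to layers, so that tagging scheme and the function name `pass_rate_by_layer` are hypothetical.

```python
from collections import defaultdict

# The five functional layers named by the benchmark.
LAYERS = ("Brain", "Eyes", "Body", "Hands", "Feet")


def pass_rate_by_layer(results):
    """Compute the pass rate (in percent) per layer.

    `results` is an iterable of (layers, passed) pairs, where `layers` is a
    collection of layer names a task exercises (an assumed tagging scheme)
    and `passed` is a bool. Layers with no tagged tasks are omitted.
    """
    attempts = defaultdict(int)
    passed_count = defaultdict(int)
    for layers, passed in results:
        for layer in layers:
            if layer not in LAYERS:
                raise ValueError(f"unknown layer: {layer!r}")
            attempts[layer] += 1
            passed_count[layer] += int(passed)
    return {
        layer: 100.0 * passed_count[layer] / attempts[layer]
        for layer in LAYERS
        if attempts[layer] > 0
    }


# Example with made-up results, purely to show the call shape.
example = [
    ({"Brain", "Hands"}, True),
    ({"Brain"}, False),
    ({"Eyes"}, False),
]
rates = pass_rate_by_layer(example)
```

A layer with a markedly lower rate than the others is a reasonable first place to look for improvements, though a task that touches several layers can fail for reasons this breakdown cannot attribute.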
