Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark
The newly launched Agents' Last Exam (ALE) benchmark measures how well AI agents execute economically valuable, long-horizon professional workflows. OpenAI's GPT-5.5 took the top spot with a 24.0% pass rate, ahead of Anthropic's Claude Fable 5 at 22.0%. ALE evaluates models across five functional layers (Brain, Eyes, Body, Hands, and Feet) and grades most workflows with deterministic, code-based checks that compare an agent's artifact against an expert's ground-truth reference. The benchmark currently consists of 1,490 task instances covering 55 non-physical industry sub-domains and is scaling toward a 5,000-task target. With even the top-ranked model failing roughly three out of four tasks, the results show how limited current AI models remain at executing real-world work.
⚡ Key Takeaways
- GPT-5.5 achieved a 24.0% pass rate on the ALE Leaderboard, beating Claude Fable 5's 22.0% score.
- The ALE benchmark evaluates models across five functional layers: Brain, Eyes, Body, Hands, and Feet.
- The benchmark uses deterministic, code-based evaluation for 93.2% of its workflows, relying on LLM-as-a-judge grading for only 6.8%.
- The ALE benchmark consists of 1,490 task instances, covering 55 non-physical industry sub-domains.
- The benchmark is scaling toward a massive 5,000-task target.
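To make the idea of deterministic, code-based evaluation concrete, here is a minimal sketch of a checker that compares an agent's artifact against a ground-truth reference and then computes a pass rate. It is not ALE's actual grader: the JSON artifact format, the six-decimal float tolerance, and the names `normalize`, `passes`, and `pass_rate` are all assumptions made for illustration.

```python
import json


def normalize(value):
    """Canonicalize a JSON-like value so trivially different artifacts compare equal.

    Hypothetical rules: strip whitespace around strings, round floats to 6 places.
    """
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize(item) for item in value]
    if isinstance(value, float):
        return round(value, 6)
    if isinstance(value, str):
        return value.strip()
    return value


def passes(agent_artifact: str, reference_artifact: str) -> bool:
    """Return True if the agent's JSON artifact matches the expert reference."""
    try:
        agent = json.loads(agent_artifact)
    except json.JSONDecodeError:
        return False  # an unparseable artifact is a deterministic failure
    return normalize(agent) == normalize(json.loads(reference_artifact))


def pass_rate(outcomes) -> float:
    """Percentage of tasks passed; 0.0 for an empty run."""
    outcomes = list(outcomes)
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy artifacts, invented for this example only.
reference = '{"total": 1250.5, "currency": "USD", "items": ["a", "b"]}'
good = '{"currency": "USD ", "items": ["a", "b"], "total": 1250.5000001}'
bad = '{"total": 1200.0, "currency": "USD", "items": ["a", "b"]}'

outcomes = [passes(good, reference), passes(bad, reference)]
print(outcomes, pass_rate(outcomes))
```

The same inputs always produce the same verdict, which is the property that separates this style of grading from LLM-as-a-judge scoring.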
The ALE benchmark offers a more realistic measure of how well AI models execute economically valuable, long-horizon professional workflows, and its low top scores highlight the limitations of current models and the need for further development. Its reliance on deterministic, code-based evaluation and its coverage of a wide range of industry sub-domains make it a valuable tool for assessing model capabilities.
✅ Practical Steps
- Evaluate AI models using the ALE benchmark to assess their ability to execute real-world tasks.
- Use the ALE benchmark's five functional layers to identify areas for improvement in AI models.
- Apply the ALE benchmark's deterministic, code-based evaluation approach to other areas of AI research.
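As one way to act on the second step, the sketch below groups pass/fail outcomes by functional layer and reports the weakest one. It assumes each task result can be tagged with a single layer, which the source does not confirm. The record format and the function names `pass_rate_by_layer` and `weakest_layer` are hypothetical, and the records are toy data, not ALE results.

```python
from collections import defaultdict

# The five functional layers named by the benchmark.
LAYERS = ("Brain", "Eyes", "Body", "Hands", "Feet")


def pass_rate_by_layer(records):
    """Map each layer seen in `records` to its pass rate in percent.

    `records` is an iterable of (layer, passed) pairs (hypothetical format).
    """
    totals = defaultdict(lambda: [0, 0])  # layer -> [passed, attempted]
    for layer, passed in records:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        totals[layer][0] += int(passed)
        totals[layer][1] += 1
    return {layer: 100.0 * p / n for layer, (p, n) in totals.items()}


def weakest_layer(rates):
    """Return the layer with the lowest pass rate."""
    return min(rates, key=rates.get)


# Toy outcomes, invented for this example only.
toy_records = [
    ("Brain", True), ("Brain", True),
    ("Eyes", True), ("Eyes", False),
    ("Hands", False), ("Hands", False),
]

rates = pass_rate_by_layer(toy_records)
print(rates, weakest_layer(rates))
```

A per-layer breakdown like this points to where a model struggles, which a single aggregate pass rate cannot show.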
Read the original article on VentureBeat AI.