VentureBeat AI

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

Level: Advanced
For: AI Engineers
TL;DR

Agents' Last Exam (ALE) is a new benchmark that measures how well AI agents execute economically valuable, long-horizon professional workflows. OpenAI's GPT-5.5 took the top spot with a 24.0% pass rate, ahead of Anthropic's Claude Fable 5 at 22.0%. ALE evaluates models across five functional layers (Brain, Eyes, Body, Hands, and Feet) and grades most workflows with deterministic, code-based checks that compare an agent's artifact against an expert's ground-truth reference. The benchmark currently has 1,490 task instances covering 55 non-physical industry sub-domains and is scaling toward a 5,000-task target. With even the top model failing roughly three out of four tasks, the results show how limited current models remain at executing real-world work.

⚡ Key Takeaways

  • GPT-5.5 achieved a 24.0% pass rate on the ALE Leaderboard, beating Claude Fable 5's 22.0% score.
  • The ALE benchmark evaluates models across five functional layers: Brain, Eyes, Body, Hands, and Feet.
  • The benchmark uses deterministic, code-based evaluation for 93.2% of its workflows, relying on LLM-as-a-judge grading for only 6.8%.
  • The ALE benchmark consists of 1,490 task instances, covering 55 non-physical industry sub-domains.
  • The benchmark is scaling toward a 5,000-task target.
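
To make "deterministic, code-based evaluation" concrete, here is a minimal Python sketch of how an agent's artifact could be checked against an expert reference and how a pass rate follows from the per-task results. It is not ALE's actual harness: the function names (`artifacts_match`, `pass_rate`), the JSON-like artifact shape, and the numeric tolerance are all assumptions made for illustration.

```python
import math
from typing import Any


def artifacts_match(agent: Any, reference: Any, rel_tol: float = 1e-6) -> bool:
    """Compare two JSON-like artifacts deterministically (hypothetical checker).

    Numbers match within a relative tolerance; dicts and lists are compared
    recursively; everything else must be exactly equal.
    """
    # bool is a subclass of int in Python, so handle it before the numeric case.
    if isinstance(agent, bool) or isinstance(reference, bool):
        return type(agent) is type(reference) and agent == reference
    if isinstance(agent, (int, float)) and isinstance(reference, (int, float)):
        return math.isclose(agent, reference, rel_tol=rel_tol)
    if isinstance(agent, dict) and isinstance(reference, dict):
        return agent.keys() == reference.keys() and all(
            artifacts_match(agent[key], reference[key], rel_tol) for key in reference
        )
    if isinstance(agent, list) and isinstance(reference, list):
        return len(agent) == len(reference) and all(
            artifacts_match(a, r, rel_tol) for a, r in zip(agent, reference)
        )
    return agent == reference


def pass_rate(results: list[bool]) -> float:
    """Return the percentage of task instances that passed."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Illustrative data only: two made-up tasks, one passing and one failing.
reference = {"total": 1250.0, "line_items": ["audit", "filing"]}
good = {"total": 1250.0, "line_items": ["audit", "filing"]}
bad = {"total": 1200.0, "line_items": ["audit"]}
results = [artifacts_match(good, reference), artifacts_match(bad, reference)]
print(f"pass rate: {pass_rate(results):.1f}%")
```

Because the same inputs always yield the same verdict, a checker of this kind avoids the run-to-run variance that LLM-as-a-judge grading can introduce, which is presumably why the benchmark reserves judge-based grading for a small share of workflows.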
💡 Why It Matters

The ALE benchmark provides a more realistic evaluation of AI models' ability to execute economically valuable, long-horizon professional workflows, highlighting the limitations of current models and the need for further development. The benchmark's focus on deterministic, code-based evaluation and its coverage of a wide range of industry sub-domains make it a valuable tool for assessing the capabi

✅ Practical Steps

  1. Evaluate AI models using the ALE benchmark to assess their ability to execute real-world tasks.
  2. Use the ALE benchmark's five functional layers to identify areas for improvement in AI models.
  3. Apply the ALE benchmark's deterministic, code-based evaluation approach to other areas of AI research.
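
As a rough illustration of step 2, the sketch below groups task results by functional layer and reports a per-layer pass rate, so the weakest layer stands out. It assumes each task result is tagged with exactly one of the five layers, which the summary does not confirm; the function names and sample data are hypothetical.

```python
from collections import defaultdict

# The five functional layers named in the ALE summary.
LAYERS = ("Brain", "Eyes", "Body", "Hands", "Feet")


def pass_rate_by_layer(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Return the pass rate (in percent) for each layer that has results.

    `results` is a list of (layer, passed) pairs. Tagging each task with a
    single layer is an assumption of this sketch.
    """
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for layer, passed in results:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        totals[layer] += 1
        passes[layer] += int(passed)
    return {
        layer: 100.0 * passes[layer] / totals[layer]
        for layer in LAYERS
        if totals[layer]
    }


def weakest_layer(rates: dict[str, float]) -> str:
    """Return the layer with the lowest pass rate."""
    return min(rates, key=rates.get)


# Made-up results for illustration only.
sample = [("Brain", True), ("Brain", True), ("Hands", False), ("Hands", True)]
rates = pass_rate_by_layer(sample)
print(rates, "weakest:", weakest_layer(rates))
```

A breakdown like this turns a single headline score into a prioritized list of where an agent stack needs work.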
