LLM Evaluation

How to Evaluate Your LLM Agent Without Lying to Yourself

May 24, 2026 · 12 min read

AI Agents · Agent Engineering · MLOps · Production AI

In 2025, LangChain surveyed practitioners who already had AI agents running in production. More than one in five, 22.8%, ran no evaluation at all.

Not bad evals. No evals.

That number sounds shocking until you watch a team ship their first agent. The pressure to demo something is enormous. Evals feel slow, hard to set up, and unclear in what they're measuring. So they get skipped — or replaced by benchmark scores that look decisive but measure almost nothing useful about your actual use case.

This guide is for engineers who know they need to evaluate their agents but aren't sure they're doing it right. We'll cover why the benchmarks you're reading lie to you, how to build offline evals that match your specific task, and what to measure once you're in production.

Key Takeaways

In 2025, 22.8% of teams with production agents run zero evals (LangChain State of Agent Engineering). The bottleneck isn't skill — it's confusion about what to measure.

Frontier models now score 88–93% on MMLU. The gap between them is smaller than the benchmark's measurement error — useless for model selection on your specific task.

LLM-as-judge hits ~85% human agreement on general tasks but drops to 60–68% in expert domains, wide enough to mask serious regressions.

Production evals need four signals beyond accuracy: tool call success rate, latency distribution, cost per completed task, and recovery rate from bad intermediate steps.

Why Do Benchmark Scores Tell You So Little?

In 2025, every frontier model scores between 88% and 93% on MMLU, according to analysis by benchmarkingagents.com. That 5-point spread sounds meaningful. It isn't. The benchmark's own measurement error sits around 6.5% — so the models are statistically tied. You can't use MMLU to pick between GPT-4.1 and Claude Opus today because the noise floor is higher than the signal.
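
To make the noise-floor argument concrete, here is a minimal sketch of the comparison. The 88–93% range and the 6.5 figure come from the paragraph above. The function name, the reading of 6.5% as percentage points, and the rule "a gap only counts if it exceeds the error" are illustrative assumptions, not a formal significance test.

```python
def gap_exceeds_noise(score_a: float, score_b: float, measurement_error: float) -> bool:
    """Return True only if the score gap is larger than the benchmark's measurement error.

    All three arguments are in percentage points (an assumption for this sketch).
    """
    return abs(score_a - score_b) > measurement_error


# The two ends of the 88-93% range quoted above, against a 6.5-point error.
print(gap_exceeds_noise(93.0, 88.0, 6.5))  # False: the 5-point gap sits inside the noise
```

Under these assumptions, even the best and worst frontier scores cannot be told apart, which is the sense in which the models are tied.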

The saturation problem is well-documented, but there's a less-discussed failure mode that's worse: benchmark gaming. In February 2026, analysis by dasroot.net — drawing on a Berkeley study of eight major agent benchmarks — found that several could be exploited to near-perfect scores without solving the underlying tasks. Leaked reference answers, unsanitized eval() calls, and scoring functions that skip correctness checks all played a role.

None of this means benchmarks are worthless. They're useful for tracking a model's general capability trajectory. They're nearly useless for predicting whether your specific agent will work on your specific task.

The mental model that helps: benchmarks measure potential, not readiness. A model that scores 91% on MMLU can still hallucinate 34% of the time on a legal retrieval task. That is the rate Stanford HAI reported for Westlaw's AI-Assisted Research tool, as cited in the dasroot.net analysis. High MMLU, high hallucination rate. That gap is exactly what evals are supposed to close.

The most dangerous benchmark is the one you ran last month on your own test set — because that set almost certainly shares the same distribution as your training data, meaning you're measuring memorization, not generalization. The eval feels rigorous; the signal isn't.

What Does "Eval Theater" Look Like in Practice?

Eval theater is what happens when a team has an eval pipeline, runs it regularly, and still ships broken agents. It's more common than skipping evals entirely — and harder to diagnose.

Here's what it looks like. A team builds a 50-question golden dataset. They run it before every release. Pass rate stays above 85%, so they ship. Three weeks after launch, users report that the agent consistently fails on questions about recent events, gets confused when tools return empty results, and sometimes loops indefinitely on ambiguous inputs. None of those failure modes appear in the 50-question set, because the set was built from happy-path examples.

In 2025, only 52.4% of organizations run offline evaluations on test sets at all, and just 37.3% run any form of online evaluation, according to LangChain's State of Agent Engineering. The teams that do eval often concentrate on accuracy and ignore what actually bites them in production: latency degradation under load, tool call reliability, cost blowup when the agent gets confused and retries, and cascade failures in multi-step pipelines.

Nearly a quarter of teams skip evaluation entirely — even after their agents are in production.

The tell that you're in eval theater: your pass rate has been stable for three releases in a row but your user-reported failure rate keeps climbing. Stable eval scores with worsening real-world performance means your test set isn't capturing what users actually do. The eval is giving you a false sense of control.
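
The tell described above can be turned into a simple automated check. This is a rough sketch: the function name, the 2-point stability band, and the definition of "climbing" as a strict increase at every release are all assumptions chosen for illustration.

```python
from typing import Sequence


def looks_like_eval_theater(
    pass_rates: Sequence[float],
    user_failure_rates: Sequence[float],
    stable_band: float = 0.02,  # assumed: "stable" means within 2 points
) -> bool:
    """Flag a flat eval pass rate combined with rising user-reported failures.

    Both sequences are ordered oldest to newest, one entry per release.
    Only the three most recent releases are inspected.
    """
    if len(pass_rates) < 3 or len(pass_rates) != len(user_failure_rates):
        raise ValueError("need matching histories covering at least three releases")
    recent_pass = pass_rates[-3:]
    recent_fail = user_failure_rates[-3:]
    pass_is_stable = max(recent_pass) - min(recent_pass) <= stable_band
    failures_climbing = all(b > a for a, b in zip(recent_fail, recent_fail[1:]))
    return pass_is_stable and failures_climbing


# Pass rate hovers around 86% while user-reported failures rise each release.
print(looks_like_eval_theater([0.86, 0.87, 0.86], [0.03, 0.05, 0.08]))  # True
```

A True result does not say what is missing from the test set, only that the eval has stopped tracking what users experience.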

How Do You Build Offline Evals That Match Your Task?

The best offline eval set has nothing to do with published benchmarks. It comes from your actual usage data — or, if you're pre-launch, from a systematic exercise in imagining every way your agent can fail.

Start with four categories of test cases, each targeting a distinct failure mode:

Happy path cases (30%) — The thing you're most confident works. Include these not to feel good about your pass rate, but to catch regressions when you swap models or change your system prompt.

Edge cases (40%) — Inputs near the boundary of what your agent handles. For a customer support agent: questions that are half in-scope and half out-of-scope, inputs with typos or missing context, and requests that should produce "I don't know" rather than a confident wrong answer.

Adversarial cases (20%) — Inputs designed to trigger the failure modes you most fear. Prompt injection attempts if your agent reads external content. Inputs that cause tool call loops. Requests that should be refused but might not be.

Regression cases (10%) — Every bug your users reported, added as a permanent fixture. These grow over time and become the most valuable part of the set.

On sizing: 100–200 test cases is enough to catch the common failure modes without making your suite too slow to run pre-deployment. The composition matters more than the count. In 2025, Cleanlab's survey of 95 production-stage ML practitioners found that 70% of regulated enterprises rebuild their AI agent stack every three months — that velocity means your eval set needs to be fast, not exhaustive.
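
Since composition matters more than count, it helps to check the mix automatically whenever cases are added. The sketch below assumes a hypothetical `EvalCase` record and category labels. The target shares are the four percentages listed above, and the 5-point tolerance is an arbitrary choice.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Sequence

# Target mix taken from the four categories above.
TARGET_MIX = {"happy_path": 0.30, "edge": 0.40, "adversarial": 0.20, "regression": 0.10}


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical test-case record; the field names are assumptions."""
    case_id: str
    category: str
    prompt: str


def composition_gaps(cases: Sequence[EvalCase], tolerance: float = 0.05) -> Dict[str, float]:
    """Return {category: actual_share - target_share} for categories that are off target.

    A positive value means the category is over-represented, negative means
    under-represented. Categories within the tolerance are omitted.
    """
    if not cases:
        raise ValueError("eval set is empty")
    counts = Counter(case.category for case in cases)
    unknown = set(counts) - set(TARGET_MIX)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = len(cases)
    gaps = {}
    for category, target in TARGET_MIX.items():
        diff = counts[category] / total - target
        if abs(diff) > tolerance:
            gaps[category] = round(diff, 3)
    return gaps
```

An empty result means the set is within tolerance of the target mix. A set built mostly from happy-path examples shows up as a large positive gap for `happy_path` and negative gaps everywhere else.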

Keep a human-reviewed subset of 20–30 cases and re-review it monthly. Automated scoring drifts. A regular human pass catches the drift before it becomes invisible.

What Should You Measure in Production Beyond Accuracy?

Accuracy is what you can measure easily. It isn't what kills production agents. Here are the four metrics that actually predict user-facing failure.

Tool call success rate: Every time your agent calls a tool and gets an error, there's a chance it hallucinates a fake result rather than surfacing the error cleanly. Track the ratio of successful tool calls to total calls, broken down by tool. If a specific tool has a high failure rate, your agent will develop workarounds that look plausible in logs but produce wrong answers for users.

Latency distribution (p50, p95, p99): Average latency is a lie. The p95 is what your worst regular users experience. Multi-step agent latency distributions are wide and right-skewed, with a very long tail. A p95 of 45 seconds feels acceptable in testing and completely unacceptable once real users hit it.

Cost per completed task: Not cost per API call. Cost per completed task. An agent that needs three attempts to succeed costs roughly 3× what you expect. An agent that loops indefinitely costs until a hard limit fires. In 2025, the MIT NANDA initiative found that 95% of enterprise GenAI pilots failed to achieve rapid revenue acceleration, and cost control is a significant part of that failure pattern.

Recovery rate from bad intermediate steps: When your agent calls a tool and gets garbage back, does it recover gracefully or spiral? Track what percentage of multi-step runs experience at least one bad intermediate result, and of those, how many produce a correct final answer. High recovery rate means your orchestration is robust. Low recovery rate means you're depending on every tool call succeeding, which is a fragile assumption.
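
The four metrics above can all be derived from one log record per agent run. The sketch below assumes a hypothetical `AgentRun` schema; the field names, and the premise that your logging already marks bad intermediate steps and correct final answers, are assumptions rather than any particular tracing library's format.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ToolCall:
    tool: str
    ok: bool  # False when the tool returned an error


@dataclass(frozen=True)
class AgentRun:
    tool_calls: Tuple[ToolCall, ...]
    latency_s: float
    cost_usd: float
    completed: bool     # run ended with a correct final answer
    had_bad_step: bool  # at least one bad intermediate result


def tool_success_rates(runs: Sequence[AgentRun]) -> Dict[str, float]:
    """Successful calls divided by total calls, broken down by tool."""
    totals: Dict[str, int] = {}
    successes: Dict[str, int] = {}
    for run in runs:
        for call in run.tool_calls:
            totals[call.tool] = totals.get(call.tool, 0) + 1
            successes[call.tool] = successes.get(call.tool, 0) + int(call.ok)
    return {tool: successes[tool] / totals[tool] for tool in totals}


def latency_percentiles(runs: Sequence[AgentRun]) -> Dict[str, float]:
    """p50, p95 and p99 of run latency; needs at least two runs."""
    cuts = quantiles([run.latency_s for run in runs], n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def cost_per_completed_task(runs: Sequence[AgentRun]) -> float:
    """Total spend, including failed runs, divided by completed tasks."""
    completed = sum(run.completed for run in runs)
    if completed == 0:
        raise ValueError("no completed tasks")
    return sum(run.cost_usd for run in runs) / completed


def recovery_rate(runs: Sequence[AgentRun]) -> Optional[float]:
    """Share of runs with a bad intermediate step that still completed correctly."""
    affected = [run for run in runs if run.had_bad_step]
    if not affected:
        return None  # nothing to recover from
    return sum(run.completed for run in affected) / len(affected)
```

Note that `cost_per_completed_task` divides total spend by completed tasks only, so failed and looping runs push the number up. That is the point of measuring per task rather than per API call.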

Can You Trust LLM-as-Judge to Grade Your Agent?

LLM-as-judge is the most popular automated evaluation approach right now, and for good reason: it scales, it doesn't require labeled data for every new task, and it handles open-ended outputs that rule-based scoring can't. But it has a specific failure mode that's easy to miss.

In October 2025, Han et al. published a study on LLM-as-judge accuracy across task types (arXiv:2510.09738, "Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement"). GPT-4 as a judge achieves roughly 85% agreement with human annotators on general Q&A and code tasks. In expert domains — dietetics, mental health, specialized legal reasoning — that agreement drops to 60–68%.

That 17–25 point drop isn't a rounding error. If you're building an agent for a domain with real expertise requirements, your automated grades disagree with human experts roughly a third of the time or more, which means regressions can slip through your eval gate without triggering any alerts.

LLM judges perform well on general tasks but degrade significantly in specialized domains — masking regressions where accuracy matters most.

The fix isn't to abandon LLM-as-judge. It's to calibrate it. Run your judge against 50–100 human-labeled examples from your specific domain before you trust it as a release gate. Measure the agreement rate. If it's below 75%, your automated eval is too noisy to rely on.
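
The calibration step is a small computation once you have the human labels. A minimal sketch, assuming the judge and the human annotators use the same label vocabulary (for example "pass" and "fail"); the function names are illustrative, and the 0.75 default mirrors the threshold above.

```python
from typing import Hashable, Sequence


def agreement_rate(judge_labels: Sequence[Hashable], human_labels: Sequence[Hashable]) -> float:
    """Fraction of examples where the judge's label equals the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_passes_calibration(
    judge_labels: Sequence[Hashable],
    human_labels: Sequence[Hashable],
    min_agreement: float = 0.75,
) -> bool:
    """True when agreement is at or above the threshold for use as a release gate."""
    return agreement_rate(judge_labels, human_labels) >= min_agreement


human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 6 + ["fail"] * 2 + ["pass"] * 2  # disagrees on the last two
print(agreement_rate(judge, human))  # 0.8
```

Raw agreement does not correct for agreement by chance. If your labels are heavily skewed toward one class, a chance-corrected statistic such as Cohen's kappa is a stricter alternative.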

Three adjustments that consistently improve judge accuracy: give the judge a detailed rubric instead of asking for binary pass/fail; use a model family different from the one you're evaluating so the judge's blind spots don't overlap with your agent's; ask the judge to explain its reasoning before giving a score — chain-of-thought grading reliably reduces false positives.

The most overlooked LLM-as-judge failure mode isn't accuracy — it's consistency. The same judge, on the same response, often scores differently across runs due to sampling temperature. Run each case at least twice and flag disagreements for human review rather than averaging the scores. An inconsistent grade is more dangerous than a wrong one, because it looks like signal when it's noise.
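
The run-twice-and-flag rule can be sketched as follows. The `judge` argument is a placeholder for whatever callable wraps your LLM judge; its signature here (response text in, integer score out) is an assumption, and no real provider API is implied.

```python
from typing import Callable, Dict, List


def find_inconsistent_cases(
    responses: Dict[str, str],
    judge: Callable[[str], int],
    runs: int = 2,
) -> List[str]:
    """Score each response several times and return the IDs whose scores differ.

    Flagged cases go to human review; their scores are deliberately not averaged.
    """
    if runs < 2:
        raise ValueError("need at least two runs to detect inconsistency")
    flagged = []
    for case_id, response in responses.items():
        scores = {judge(response) for _ in range(runs)}
        if len(scores) > 1:
            flagged.append(case_id)
    return flagged
```

Returning case IDs instead of a blended score keeps an unstable grade visible as a disagreement, so it cannot pass for signal.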

Ready to ship your first agent? Make sure the architecture is solid before you start worrying about evals. Our Building AI Agents: The Engineer's Guide covers the structural decisions that determine whether your agent is even worth evaluating.

Frequently Asked Questions

How many test cases do I need in my offline eval set?

For most agent tasks, 100–200 test cases is enough to catch common failure modes without making your suite too slow to run pre-deployment. Composition matters more than count: aim for 40% edge cases and 20% adversarial inputs, not just happy-path examples. Add every production bug as a permanent regression case — that list is the most valuable part of the set over time.

What's the difference between offline and online evaluation?

Offline evaluation runs your agent against a fixed test set before deployment — you control the inputs and measure against known-good outputs. Online evaluation tracks real user interactions in production, usually through sampling and logging. In 2025, only 37.3% of teams with production agents run any form of online evaluation (LangChain, State of Agent Engineering). Both are necessary; they catch different failure modes and neither replaces the other.

Can I use the same eval framework for multi-agent systems?

Multi-agent systems need eval at two levels: individual agent capability (does each agent do its job?) and system-level behavior (do agents coordinate without cascading failures?). Most existing eval frameworks only address the first. For system-level evals, build scenario-based tests that exercise the handoff points between agents and measure recovery when one agent fails mid-task.

How do I prevent my eval set from becoming contaminated?

Keep your eval set completely separate from any data used in fine-tuning or few-shot examples. If you're using a third-party dataset, assume it may already be in the model's training data and supplement with cases generated from your own real user logs. For high-stakes applications, commission a third party to generate eval cases, which makes overlap with existing training data far less likely.
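
A basic separation check is easy to automate. The sketch below flags eval prompts that also appear in your fine-tuning or few-shot data after normalizing case and whitespace. It is a deliberately crude assumption-laden check: it only catches exact matches after normalization, so paraphrased or partially overlapping cases will pass through.

```python
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't hide a match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def contaminated_prompts(eval_prompts: Iterable[str], training_texts: Iterable[str]) -> List[str]:
    """Return eval prompts that also occur in the fine-tuning or few-shot data."""
    seen = {normalize(text) for text in training_texts}
    return [prompt for prompt in eval_prompts if normalize(prompt) in seen]


eval_set = ["How do I reset my password?", "What is the refund window?"]
few_shot = ["how do I  reset my password? "]
print(contaminated_prompts(eval_set, few_shot))  # ['How do I reset my password?']
```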

What's the best first metric to start tracking in production?

Tool call success rate. It's easy to instrument, it's a leading indicator of agent confusion, and it directly predicts user-facing failure. If a specific tool has a failure rate above 5%, your agent is already building compensatory behavior around it — the kind of fragile workaround that breaks silently when the tool changes its response format.

Conclusion

Evals are the part of agent engineering that teams skip when they're moving fast and return to when something breaks in production. In 2025, the MIT NANDA initiative found that 95% of enterprise GenAI pilots failed to achieve rapid revenue acceleration. Missing the quality bar is consistently among the top reasons.

Benchmark scores don't tell you whether your agent is ready to ship. They tell you whether the model has general capability. Your eval set — built from your actual failure modes, scored against your specific success criteria, and monitored in production with the signals that actually predict user experience — is what tells you whether you're ready.

You don't need a perfect framework on day one. You need four things: 100–200 test cases weighted toward edge cases, a calibrated LLM judge with a measured agreement rate, a production dashboard tracking tool call success rate and cost per task, and a commitment to adding every user-reported bug as a permanent regression case.

Start there. The agents that hold up in production aren't evaluated more than others. They're evaluated more honestly.

Sources

LangChain, State of Agent Engineering, 2025, retrieved 2026-05-24, https://www.langchain.com/state-of-agent-engineering
MIT NANDA Initiative, The GenAI Divide: State of AI in Business 2025 (via Fortune), retrieved 2026-05-24, https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
Gartner, Agentic AI Project Failure Forecast (via MarTech), retrieved 2026-05-24, https://martech.org/gartner-40-of-agentic-ai-projects-will-fail-making-humans-indispensable/
benchmarkingagents.com, What LLM Benchmarks Don't Measure, retrieved 2026-05-24, https://benchmarkingagents.com/what-these-benchmarks-miss/
dasroot.net, Why Most LLM Benchmarks Are Misleading (citing Berkeley benchmark study), retrieved 2026-05-24, https://dasroot.net/posts/2026/02/llm-benchmark-misleading-accurate-evaluation/
Han et al., Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement, arXiv:2510.09738, October 2025, https://arxiv.org/pdf/2510.09738
Cleanlab, AI Agents in Production 2025, retrieved 2026-05-24, https://cleanlab.ai/ai-agents-in-production-2025/
Stanford HAI, Legal AI Hallucination Study (cited in dasroot.net analysis), February 2026