AWS ML Blog

Evaluating Deep Agents using LangSmith on AWS

May 28, 2026 • 20 min read

Level: Intermediate

For: ML Engineers

✦TL;DR

This article is a practical guide to evaluating deep agents using LangSmith on AWS, drawing on lessons from LangChain and Anthropic. It covers five evaluation patterns and a method for building offline evaluations with pytest and LangSmith, which lets ML engineers assess and tune agent performance in a controlled environment. Offline evaluations may not capture real-world complexities and edge cases, so their results should be read with that limit in mind.

⚡ Key Takeaways

The guide provides five evaluation patterns for deep agents.
Offline evaluations are built with LangSmith, through the LangSmith API.
pytest drives the tests and evaluations.
Offline evaluations trade realism for control, so engineers need to weigh their results against real-world performance.
Prerequisite: the guide assumes familiarity with LangChain, Anthropic, and pytest.
Tags: LLM, RAG, LANGCHAIN

🔧 Tools & Libraries

LangSmith, pytest, LangChain, Anthropic

💡 Why It Matters

Evaluating deep agents is crucial for optimizing their performance and ensuring they meet production requirements. This guide provides a practical approach to evaluating deep agents using LangSmith on AWS.

✅ Practical Steps

Install LangSmith and pytest using pip.
Import LangSmith and pytest in your Python script.
Define evaluation patterns and use LangSmith to build offline evaluations.
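
The steps above can be sketched roughly as follows. This is a minimal illustration, not the original article's code: the stub agent `run_agent`, the `AgentRun` record, the `CASES` dataset, and the two checks (final answer and tool-call trajectory) are all hypothetical stand-ins, and the summary does not say whether they match any of the article's five patterns. The sketch assumes LangSmith's pytest plugin is what provides the `langsmith` marker; the import is guarded so the file still runs without pytest installed.

```python
# Install first (PyPI package names): pip install -U langsmith pytest
from dataclasses import dataclass, field

try:
    import pytest
    # Assumption: LangSmith's pytest plugin recognizes this marker and logs the test.
    # Outside a pytest run it only tags the function.
    langsmith_case = pytest.mark.langsmith
except ImportError:  # keep the sketch runnable without pytest
    def langsmith_case(fn):
        return fn


@dataclass
class AgentRun:
    """Hypothetical record of one agent run: final answer plus tool calls made."""
    answer: str
    tool_calls: list = field(default_factory=list)


def run_agent(question: str) -> AgentRun:
    """Stand-in for a real deep agent; replace with your own agent invocation."""
    if "capital of France" in question:
        return AgentRun(answer="Paris", tool_calls=["search"])
    return AgentRun(answer="I don't know", tool_calls=[])


# Hypothetical offline dataset: (question, expected substring, expected tool calls)
CASES = [
    ("What is the capital of France?", "Paris", ["search"]),
    ("What is the airspeed of a unicorn?", "don't know", []),
]


def score_case(question, expected, expected_tools):
    """Score one case on final-answer correctness and tool-call trajectory."""
    run = run_agent(question)
    return {
        "answer_correct": expected.lower() in run.answer.lower(),
        "trajectory_match": run.tool_calls == expected_tools,
    }


@langsmith_case
def test_agent_on_offline_cases():
    # Fail the test if any case misses either check.
    for question, expected, tools in CASES:
        scores = score_case(question, expected, tools)
        assert all(scores.values()), f"{question!r} failed: {scores}"
```

Saved as a file whose name starts with `test_`, this can be collected with a plain `pytest` run. Logging the results to LangSmith would additionally require LangSmith credentials to be configured, which the sketch does not cover.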


