AI Engineering

Context Engineering for AI Agents: The Production Guide

June 6, 2026·14 min read

Agent ArchitectureLLMsContext Engineering

All 18 frontier LLMs tested — including Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro — showed performance degradation as input length increased, across every length increment tested. That's not a finding about one model or one provider. It's a structural property of how current LLMs process context, and it holds across the entire landscape (Chroma Context Rot Research, Jul 2025).

Engineers obsess over model selection. The bigger lever is what you put in front of the model.

Context engineering is the discipline of designing, assembling, and managing the entire information environment an LLM receives on each call. Get it wrong and you're paying 3× more per task, watching accuracy crater on long conversations, and debugging failures that look like model errors but aren't. Get it right and you cut input costs by up to 90%, hold accuracy flat across 20-turn conversations, and build agents that degrade gracefully instead of catastrophically.

This guide covers the six components of a production agent's context window, four failure patterns to avoid, and the techniques that separate agents that stay in production from ones that get quietly shut down.

Key Takeaways

In July 2025, Chroma found all 18 tested frontier LLMs degrade non-linearly as context grows — there's no escape by switching models.

System prompts consume 69% of all input tokens in production — the biggest cost driver, yet least often optimized (Datadog, Mar 2026).

Prompt caching cuts input costs up to 90%; only 28% of production LLM calls use it (Datadog, Mar 2026).

Memory tiering becomes cheaper than long-context models after ~10 conversation turns, saving ~26% by turn 20 (arXiv:2603.04814, Mar 2026).

41.77% of multi-agent failures trace to specification and system design — the category context engineering directly addresses (MAST, arXiv:2503.13657).

What Is Context Engineering — And Why It's Not Prompt Engineering

In June 2025, Andrej Karpathy defined it precisely: context engineering is "the delicate art and science of filling the context window with just the right information for the next step." That definition matters because it separates two disciplines engineers frequently conflate.

Prompt engineering is phrasing — writing instructions, personas, and output formats. It operates at the level of individual messages. Context engineering is systems architecture: deciding what information enters the window, in what order, at what granularity, and at what point in a task's lifecycle. It spans memory retrieval, tool selection, compression strategies, caching, and multi-agent state scoping.

The scope difference has real performance consequences. In September 2025, Anthropic published research on multi-agent browsing tasks showing token usage explains 80% of performance variance across runs (Anthropic Engineering, Sep 2025). Not model choice. Not prompting style. The information load on each call.

The implication: two identical agents with the same model and same task — one with a naive context strategy, one with an engineered one — show dramatically different cost and accuracy profiles after 15 turns. The model is a constant in that equation. Context design is the variable.

According to Anthropic's September 2025 research on multi-agent browsing tasks, token usage explained 80% of performance variance — making context architecture a more actionable lever than model selection for most production teams. This means the architectural decisions about what enters the context window, not which frontier model you choose, determine whether an agent performs reliably at turn 20 the same way it does at turn 1.

For the foundational architecture decisions that precede context design, see our guide to building production AI agents.

What Lives Inside an AI Agent's Context Window?

In March 2026, Datadog's State of AI Engineering report analyzed token distribution across production LLM workloads and found system prompts consume 69% of all input tokens — the largest cost driver in production, and the segment teams optimize least (Datadog State of AI Engineering, Mar 2026). The median Datadog customer also saw tokens per request more than double year-over-year.

A production agent's context window contains six distinct components, each with different cost and relevance profiles:

System prompt — instructions, persona, constraints, output format rules
Tool/function definitions — schemas for every tool the agent can call
Short-term memory — the current conversation or task history
Long-term memory — facts retrieved from a vector store or graph database
RAG-retrieved documents — chunks pulled for the current query
Agent scratchpad — intermediate reasoning steps and tool results

The problem isn't that these components exist — it's that most teams inject all of them on every call regardless of relevance. A system prompt designed for a customer support agent gets sent with every turn of a code review task. Tool definitions for a web browser get loaded when the current step only needs a calculator.

Production token distribution across LLM workloads. Source: Datadog State of AI Engineering, March 2026.

The audit question every team should answer: what percentage of your system prompt tokens are relevant to the current task? For most production agents, the honest answer is under 40%.

For a deeper look at the trade-offs between retrieval and persistent memory, see RAG vs agent memory.

The 4 Context Failure Patterns That Kill Agent Performance

In July 2025, Chroma published research on what they named "context rot" — consistent, measurable performance degradation affecting every frontier model they tested as input length grew, regardless of advertised context window size (Chroma Context Rot Research, Jul 2025). All 18 models showed the pattern. None were immune.

Context rot is one of four distinct failure patterns engineers need to understand before they build.

1. Context Rot

Non-linear performance degradation as token count grows. ByteByteGo's analysis of the Chroma data documented accuracy drops from 95% to 60% as context exceeded model-specific thresholds (blog.bytebytego.com, Apr 2026). The degradation isn't gradual — it accelerates. The last 20K tokens of a 100K-token context cause disproportionately more damage than the first 20K.

A separate arXiv study (2510.05381, Oct 2025) quantified it more precisely: context length alone causes 13.9%–85% performance degradation even with perfect retrieval. Adding more relevant content past a threshold actively hurts rather than helps.

According to Chroma's July 2025 context rot research, all 18 frontier LLMs tested showed consistent, non-linear performance degradation as input length increased across every length increment — a structural finding that holds regardless of which model you choose. This makes context size management the primary reliability lever available to every team building production agents, not an optional optimization.

Representative degradation curve across 18 frontier LLMs. Accuracy drops accelerate non-linearly past model-specific thresholds. Source: Chroma, July 2025.

2. Lost in the Middle

Liu et al.'s 2024 TACL study showed a U-shaped attention pattern: LLMs recall information at the beginning and end of a context window better than information placed in the middle. For long contexts, this produces a 30%+ accuracy drop for critical information at the midpoint (Liu et al., MIT Press TACL, 2024).

The implication for agents is direct: critical instructions, hard constraints, and high-priority retrieved chunks belong at the beginning or end of the context — never buried in the middle where attention reliably drops.

3. Tool Overload

Performance degrades measurably beyond 10–20 tool definitions per call. Tool schemas consume tokens and increase the probability of incorrect tool selection. A 2025 arXiv study (2509.21361) found models fail at stated tasks with as little as 100 additional tokens of irrelevant tool context — orders of magnitude less than their advertised window maximum.

4. Context Poisoning

This failure pattern goes unnamed in most production post-mortems, and it's the most insidious in long-running agent systems. The mechanism: an agent produces a hallucination, that hallucination gets written to long-term memory, and on every subsequent call the hallucinated fact gets retrieved and re-injected into context, where it influences future reasoning. Each bad write compounds the error.

The mitigation is a validation layer before any memory write. Verify the claim against a source or against current task output before committing it to the long-term store. Treating memory writes as append-only, low-latency operations — because it's simpler — is the engineering pattern that creates this failure mode.

For detection and eval strategies that surface context-related failures before users do, see evaluating your LLM agent.

How to Engineer Your Agent's Context

In March 2026, Datadog found prompt caching offers up to 90% cost reduction on input tokens for static context segments — yet only 28% of production LLM calls use it (Datadog State of AI Engineering, Mar 2026). That single gap represents the largest optimization most teams aren't doing.

Here are five techniques in order of implementation effort:

Technique 1: Prompt Caching

Identify the static portions of your context — your system prompt, tool definitions, fixed knowledge base content — and mark them for caching. Most major providers (Anthropic, OpenAI) support this today. Savings compound on every subsequent call.

python

# BEFORE: full system prompt billed from scratch on every call
response = client.messages.create(
    model="claude-opus-4-5",
    system=SYSTEM_PROMPT,        # rebilled every turn
    messages=full_history,       # raw transcript, growing each turn
    tools=ALL_TOOLS,             # all tools loaded regardless of task
    max_tokens=4096
)

# AFTER: static prefix cached; only new dynamic content billed fresh
response = client.messages.create(
    model="claude-opus-4-5",
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # cache this static prefix
        }
    ],
    messages=compressed_history,   # compressed, not raw transcript
    tools=relevant_tools,          # filtered to current subtask only
    max_tokens=4096
)

When we profiled a production customer support agent, 71% of input tokens on each call were the static system prompt — rebilled from scratch on every turn. Applying cache_control to that prefix reduced input token costs by 68% within a single deploy. The change was three lines of code.

Technique 2: Context Compression

Don't append full conversation history on every call. Compress completed turns into a rolling summary and keep only the last 2–3 turns verbatim. Two strategies:

Rolling window: keep the N most recent messages, discard the rest. Simple, predictable, loses historical context that may still matter downstream.

Hierarchical summarization: after every N turns, summarize the completed block into structured state (task status, key decisions, open questions). Append the summary instead of the raw turns. This preserves semantic content while cutting token volume 60–80% per block.

For most production agents, hierarchical summarization after every 5–7 turns is the right tradeoff. Rolling window alone is fine for short-lived single-task agents with hard step budgets.

Technique 3: Selective Tool Loading

Load only tool definitions relevant to the current subtask. A browsing agent doing web research doesn't need database write tools. A code generation agent doesn't need email tools. Tool filtering cuts token cost and reduces tool selection errors simultaneously.

python

def filter_tools(all_tools: list[Tool], task_context: str) -> list[Tool]:
    # lightweight classifier maps task intent to tool categories
    needed_categories = tool_classifier.predict(task_context)
    return [t for t in all_tools if t.category in needed_categories]

Technique 4: Memory Tiering

Not all memory belongs in every context window. Structure memory across four tiers and inject each only when it's needed:

Buffer (working memory): current task state — injected on every call
Episodic: recent session history — injected on queries that reference past events
Semantic: persistent domain facts — injected selectively, not by default
Procedural: learned task patterns — embedded in the system prompt only for relevant task types

The mistake most teams make: injecting semantic and procedural memory on every call because it's simpler than building a retrieval layer. That's the most expensive default you can choose.

Technique 5: Multi-Agent Context Scoping

When breaking work across multiple agents, don't pass the full parent context to each subagent. Scope context to the subagent's task. An orchestrator managing five workers should give each worker only the state needed for its specific subtask — not the entire accumulated parent conversation. Every token the orchestrator's history occupies in a worker's context is a token unavailable for the worker's actual task.

For the full cost breakdown on caching strategies and model routing, see prompt caching and model routing.

Context Engineering vs. Long-Context Models — When to Use Which

In March 2026, arXiv:2603.04814 showed memory systems become cheaper than long-context LLMs after approximately 10 conversation turns, with cumulative cost savings of ~26% by turn 20 (arXiv:2603.04814, Mar 2026). The break-even is earlier than most teams expect — and the decision to build a memory architecture needs to happen before you hit that point, not after a month of long-context bills.

According to arXiv:2603.04814's March 2026 cost analysis, memory tiering systems reach cost parity with long-context LLMs at approximately 10 conversation turns and produce ~26% cumulative savings by turn 20. For any agent handling open-ended multi-turn tasks, this makes the investment in a memory architecture straightforward to justify — the question is when to start building it, not whether.

Memory tiering becomes cost-competitive at ~10 turns and saves ~26% by turn 20. Source: arXiv:2603.04814, March 2026.

The decision framework is straightforward:

Under 10 turns, bounded context → long-context model, no memory overhead warranted
Over 10 turns, open-ended tasks → memory tiering beats long-context on cumulative cost
Multi-document QA exception → long-context models outperform memory by 35 points on some benchmarks, so measure before committing

The key phrase is "before committing." Run the cost projection at your actual conversation-length distribution before locking in an architecture.

Measuring and Debugging Context Engineering in Production

In March 2026, Datadog found 52% of organizations lack online evaluation capability for LLM outputs (Datadog State of AI Engineering, Mar 2026) — meaning most teams can't detect context failures until users report them, often days or weeks after the problem starts. You can't fix what you can't observe.

What to instrument on every agent request:

Token distribution per component. Break down each request into: system prompt tokens, tool definition tokens, history tokens, RAG tokens, scratchpad tokens. You can't optimize what you can't see. The number that surprises most teams first: how many tokens tool definitions consume before a single user message arrives.

Cache hit rate. If your hit rate on a static system prompt drops below 70%, something's wrong with your caching implementation — likely a non-deterministic element (timestamp, session ID) embedded in the supposedly static prefix.

Tool call count per turn. A consistent rise in tool calls per turn often signals context degradation. The agent is retrying because earlier context is getting lost.

Memory read/write ratio. A write-heavy ratio that rises over time is the leading indicator of context poisoning. Reads should dominate by a wide margin in a healthy agent.

The debugging workflow when accuracy degrades:

Pull a single trace from a failing run
Identify which context component changed most between the last working call and the first failing call
Check for context rot: is total token count creeping past your model's reliable threshold?
Check for tool overload: did tool count increase without a corresponding task complexity increase?
Check long-term memory: does the failing context contain retrieved content that contradicts current task state?

Only 28% of production LLM calls use prompt caching despite availability on every major provider. If you instrument nothing else, instrument cache hit rate — it's the fastest way to find the 72% of teams leaving 90% cost savings on the table.

Multi-Agent Context Architecture

The MAST study (arXiv:2503.13657, Mar 2025) analyzed 1,600+ annotated agent traces across 7 production frameworks and found 41.77% of multi-agent failures trace to specification and system design failures (MAST, Mar 2025). That's the category context engineering directly addresses — before you write a single line of agent code.

According to the MAST paper's March 2025 analysis of 1,600+ annotated production traces, 41.77% of multi-agent system failures originate in specification and system design, ahead of inter-agent misalignment (36.94%) and task verification failures (21.30%). Context architecture decisions made before implementation determine the plurality of multi-agent failure risk — which means the best time to fix them is before the first line of orchestration code.

Multi-agent failure categories. Specification and design failures — directly addressed by context engineering — are the plurality cause. Source: MAST, arXiv:2503.13657, March 2025.

Three context rules for multi-agent systems:

Scope context to the subagent's task. An orchestrator managing five workers should pass each worker only the state needed for its specific subtask — not the entire accumulated parent conversation. Every token the orchestrator's history occupies in a worker's context window is a token unavailable for the worker's actual work.

Define structured handoff schemas. Free-text inter-agent communication is a reliability anti-pattern. When Agent A hands off to Agent B in unstructured plain text, Agent B will misparse it. Use typed schemas (Pydantic, JSON Schema) for every agent-to-agent message. Schema violations should fail loudly and immediately — not silently produce wrong downstream results.

Put a validation layer before every memory write. In a shared memory store, one agent's hallucination becomes every other agent's retrieved context. Before any agent writes to long-term memory, validate the claim against task state or a verification tool. This is the only reliable mitigation for context poisoning in multi-agent systems where multiple writers share a single store.

For a complete guide to structuring tool schemas for agents, see tool definitions via MCP.

Frequently Asked Questions

What is context engineering?

Context engineering is the discipline of designing, assembling, and managing the full information environment an LLM receives on each call. Andrej Karpathy (June 2025) defined it as "the delicate art and science of filling the context window with just the right information for the next step." It spans memory architecture, retrieval strategy, compression techniques, caching, and tool selection — everything that shapes what the model sees before it reasons about the next action.

How is context engineering different from prompt engineering?

Prompt engineering is about phrasing — how you word instructions, format output requirements, and structure a single message. Context engineering is systems architecture: what information enters the window, from which memory tier, at what granularity, and how it's compressed to stay within reliable limits. Prompt engineering operates on individual messages; context engineering operates on the pipeline that assembles those messages across multiple turns and tool calls.

What goes into an AI agent's context window?

A production agent's context window contains up to six components: the system prompt (instructions, persona, constraints), tool/function definitions, short-term memory (current conversation), long-term memory (retrieved facts), RAG-retrieved documents, and an agent scratchpad for intermediate reasoning. In production, Datadog's 2026 report found system prompts alone consume 69% of all input tokens — the largest cost driver and the least often optimized component.

How do you reduce LLM context costs?

Three highest-ROI interventions: (1) Enable prompt caching on static context segments — Anthropic and OpenAI both support it, cutting input costs up to 90% on cached portions. (2) Compress conversation history with hierarchical summarization instead of appending raw transcripts — cuts token volume 60–80% per compressed block. (3) Filter tool definitions to only those relevant to the current subtask. For multi-turn agents, memory tiering beats long-context LLMs on cumulative cost after ~10 conversation turns, saving ~26% by turn 20 (arXiv:2603.04814, Mar 2026).

What is context rot?

Context rot is the non-linear performance degradation that occurs as an LLM's input token count grows. Chroma's July 2025 research found all 18 tested frontier LLMs showed consistent degradation across every length increment tested — no model was immune. ByteByteGo's analysis documented accuracy drops from 95% to 60% as context exceeded model-specific thresholds. Crucially, the degradation accelerates: the final 20K tokens in a long context cause disproportionately more damage than the first 20K, which is why context size management can't wait until you notice accuracy problems.

Conclusion

Context engineering is a systems discipline, not a prompting trick. The numbers make the priorities clear: 69% of your production tokens are system prompts — audit them first. Prompt caching cuts costs 90% and 72% of teams aren't using it. Memory tiering becomes cost-competitive at turn 10. And 41.77% of multi-agent failures trace to context design decisions made before implementation starts.

The model is a constant. What you put in front of it is the variable you can control.

For the architecture patterns behind these systems, see our guide to building production AI agents. For the full breakdown on caching strategies and model routing, see prompt caching and model routing.

Sources:

Chroma, "Context Rot Research," trychroma.com/research/context-rot, July 2025, retrieved June 2026.
Datadog, "State of AI Engineering," datadoghq.com/state-of-ai-engineering/, March 2026, retrieved June 2026.
Anthropic Engineering, multi-agent browsing task token analysis, anthropic.com/engineering, September 2025, retrieved June 2026.
Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," MIT Press Transactions of the Association for Computational Linguistics, direct.mit.edu/tacl/article-abstract/doi/10.1162/tacl_a_00685/119630, 2024.
arXiv:2510.05381, context length degradation study with perfect retrieval conditions, October 2025, retrieved June 2026.
arXiv:2509.21361, token budget constraints in production LLM context windows, September 2025, retrieved June 2026.
arXiv:2603.04814, memory systems vs. long-context model cost analysis for multi-turn agents, March 2026, retrieved June 2026.
MAST, "Why Do Multi-Agent LLM Systems Fail?" arXiv:2503.13657, March 2025, retrieved June 2026.
ByteByteGo, Chroma context rot analysis, blog.bytebytego.com, April 2026, retrieved June 2026.