Blog · AI-assisted

RAG vs. Agent Memory: When to Use Which

11 min read · AI Engineering · RAG · Agent Architecture · LLMs · Memory Management

You've shipped a RAG system. It works. Now you're adding agents — and you've hit a design question that most writeups skip: when does the agent retrieve, when does it remember, and when does it need both?

The confusion is understandable. Both RAG and agent memory give an LLM access to information it wasn't trained on. But they solve different problems. Conflating them produces architectures that are overbuilt (everything goes into memory, token costs spiral), underbuilt (pure retrieval with no persistence, the agent can't learn from its own history), or just wrong in ways that only surface under real load.

This isn't a feature comparison. It's a decision guide — for engineers who are past "hello world" with both RAG and agents.

Key Takeaways

  • RAG retrieves knowledge from external sources at query time. Agent memory persists state across steps and sessions. These are orthogonal mechanisms — not competing options.
  • There are four agent memory types (in-context, episodic, procedural, semantic), each with different persistence, write patterns, and cost profiles.
  • The single most useful question: "Was this information generated by the agent, or did it exist before the agent ran?" That split drives most of the architecture decision.
  • Most production systems use all four memory types simultaneously. The question isn't which one — it's which layer owns which data.

Why Engineers Conflate RAG and Agent Memory

When you built your first RAG pipeline, the mental model was clean. Documents go into a vector store. Queries come in. Relevant chunks come out. The LLM synthesizes a response. Information flows in one direction.

Agents break that model. An agent doesn't just query — it acts, observes, revises, and acts again. Across a 20-step autonomous task, the agent accumulates its own artifacts: intermediate results, tool call outputs, decisions made, paths tried and abandoned. None of that existed in your vector store before the agent ran.

This is the gap that causes engineers to reach for the wrong tool. RAG was designed to efficiently access external knowledge that's too large or too dynamic for the context window. It wasn't designed to answer "what did this agent already try in step 4?" or "what does this user prefer based on the last five sessions?" Those are memory problems.


What RAG Actually Solves — and Where It Ends

RAG was introduced in a 2020 NeurIPS paper from Facebook AI Research (now Meta AI) and collaborators at University College London and New York University, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020). The core idea: instead of encoding all knowledge into model weights, retrieve relevant passages at inference time from an external index. The knowledge base changes independently of the model.

RAG is the right mechanism when the information:

  • Existed before the agent ran — product docs, a policy library, a codebase, a knowledge base
  • Is too large or dynamic for the context window — more than a few dozen pages, or updated frequently
  • Can be expressed as a query — you can formulate what you're looking for in language
  • Needs source attribution — the caller needs to know where the answer came from

RAG has hard limits. It's stateless by design — each retrieval is independent, with no memory of previous queries. It can only return what's already indexed. It can't track what the agent decided in step 3. And when engineers try to shoehorn it into agent-state roles — using a vector store as a scratchpad for intermediate results — retrieval quality degrades quickly, because fuzzy semantic search is the wrong tool for structured state lookup.

Where this breaks in practice: Teams commonly use RAG for user preference storage — embedding past conversations and retrieving "relevant" history via cosine similarity. The problem is that vector similarity doesn't preserve temporal ordering or logical relationships between events. A structured key-value store retrieves "user's preferred output format" more reliably than semantic search over conversation history.
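A minimal sketch of the key-value alternative. `PreferenceStore` is a hypothetical in-memory class invented for illustration, not a library API; a real system would likely back it with a database.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PreferenceStore:
    """Hypothetical in-memory key-value store for per-user preferences."""

    _data: dict = field(default_factory=dict)

    def set(self, user_id: str, key: str, value: str) -> None:
        # A later write overwrites the earlier one, so the newest value wins.
        self._data[(user_id, key)] = value

    def get(self, user_id: str, key: str) -> Optional[str]:
        # Exact-match lookup: no embedding, no similarity threshold.
        return self._data.get((user_id, key))


store = PreferenceStore()
store.set("user-42", "output_format", "markdown")
store.set("user-42", "output_format", "json")  # the user changed their mind
current_format = store.get("user-42", "output_format")
```

The lookup returns the most recent value by construction, which is the temporal guarantee that similarity search over embedded conversations does not give you.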


The Four Agent Memory Types

Agent memory isn't one thing. Well-architected systems rely on four distinct types, each with different persistence characteristics and write patterns.

MEMORY TYPE | LIVES IN | PERSISTS | WRITTEN BY | LATENCY | BEST FOR
In-Context | Active prompt window | Session only | LLM + tools | Zero | Current task state, active tool outputs
Episodic / External | Vector store or KV store | Cross-session | Agent writes | 10–50 ms | User history, past decisions, preferences
Procedural | System prompt + tools | Design-time | Engineer | Zero | Reasoning patterns, behavioral constraints
Semantic / RAG | External vector index | Pre-existing | Humans / systems | 20–100 ms | Docs, knowledge bases, external reference data
The four agent memory types and their key characteristics

In-Context Memory

Everything currently in the active prompt window. There is no retrieval step, so it adds no lookup latency, no retrieval errors, and no additional infrastructure. But it resets when the session ends, and cost scales with every step of the agent loop, because every token in context is paid for again at every iteration. Claude 3.5 Sonnet supports up to 200k tokens; GPT-4o supports up to 128k. For bounded, single-session tasks under ~30 steps with small tool outputs, you may not need anything else.
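To make the per-step cost concrete, here is a rough worked example. It assumes a simplified billing model in which the full context is re-sent as input at every step and each step appends a fixed number of tokens; the function name and the numbers are illustrative, not measurements.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Total input tokens billed across an agent loop that never trims context.

    At step i (0-indexed) the prompt holds the base prompt plus the output
    of every earlier step, so that step pays for base + i * tokens_per_step.
    """
    return sum(base_tokens + i * tokens_per_step for i in range(steps))


# Hypothetical numbers: 2,000-token base prompt, 500 tokens added per step.
total_10 = cumulative_input_tokens(2_000, 500, 10)  # 42,500
total_30 = cumulative_input_tokens(2_000, 500, 30)  # 277,500
```

Under these assumptions, tripling the step count from 10 to 30 raises total input tokens by roughly 6.5 times, because the total grows quadratically with the number of steps.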

Episodic Memory

A structured store — vector database or key-value store — that the agent writes to and reads from. Persists across sessions. Captures what the agent did, what worked, what the user said, decisions already made. This is where engineers commonly confuse RAG with memory: episodic memory does involve retrieval, but what's retrieved is the agent's own past state, not external documents.
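A minimal sketch of that write-then-read pattern. `Episode` and `EpisodicStore` are hypothetical names, and the list-backed storage is a stand-in for whatever database you actually use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Episode:
    user_id: str
    session_id: int
    summary: str


class EpisodicStore:
    """Hypothetical append-only store the agent writes to after each session."""

    def __init__(self) -> None:
        self._episodes: List[Episode] = []

    def write(self, episode: Episode) -> None:
        self._episodes.append(episode)

    def last_sessions(self, user_id: str, n: int) -> List[Episode]:
        # Structured read: filter by user, order by session, newest first.
        mine = [e for e in self._episodes if e.user_id == user_id]
        return sorted(mine, key=lambda e: e.session_id, reverse=True)[:n]


memory = EpisodicStore()
memory.write(Episode("user-42", 1, "Asked about refund policy; resolved."))
memory.write(Episode("user-7", 1, "Reported login failure; escalated."))
memory.write(Episode("user-42", 2, "Asked for CSV export; prefers short answers."))
recent = memory.last_sessions("user-42", n=1)
```

Everything in the store was written by the agent itself, which is what separates it from a document index.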

Procedural Memory

The agent's tools, system prompt, and few-shot examples. Baked in at design time and unchanged at runtime. It encodes how the agent does things — the reasoning patterns and behavioral constraints you've engineered in. Most engineers don't think of system prompts as memory, but they are.

Semantic Memory (Where RAG Lives)

External knowledge the agent reads but doesn't write. Your knowledge base, documentation, a database, an API. The agent queries it, doesn't own it, and doesn't modify it. Each retrieval is independent. This is the correct home for RAG.
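A toy illustration of the read-only, stateless shape. It swaps in bag-of-words cosine similarity for real embeddings so the example stays self-contained; `DOCS` and `retrieve` are hypothetical names, and a production system would use an embedding model and a vector index instead.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical pre-existing corpus; the agent reads it but never writes to it.
DOCS = (
    "Refunds are issued within 14 days of purchase.",
    "Password resets require a verified email address.",
    "Enterprise plans include priority support.",
)


def _vector(text: str) -> Counter:
    # Word counts stand in for an embedding vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 1) -> List[Tuple[float, str]]:
    """Stateless lookup: each call scores the corpus from scratch."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(doc)), doc) for doc in DOCS]
    return sorted(scored, reverse=True)[:k]


top = retrieve("how do refunds work")
```

Calling `retrieve` twice with the same query returns the same result, and nothing about the first call is remembered by the second.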


Five Questions That Pick the Right Mechanism

Most architecture decisions reduce to five questions asked in order. You rarely need to reach Q5.

Q1. Was this information generated by the agent?
    Tool call outputs, intermediate results, decisions made during this run.
      • YES → Episodic or In-Context Memory
      • NO → Continue to Q2

Q2. Does this information need to survive after this session ends?
    User preferences, task history, cross-session state.
      • YES → External / Episodic Memory
      • NO → In-Context is sufficient; stop here if it fits the window

Q3. Is it external knowledge that existed before the agent ran?
    Documents, databases, wikis, APIs: things you didn't build the agent to create.
      • YES → RAG / Semantic Memory
      • NO → Continue to Q4

Q4. Does the full relevant state fit in the context window comfortably?
    Under ~70% of window limit, with room for tool outputs at each step.
      • YES → In-Context Memory (simplest; don't add retrieval you don't need)
      • NO → Continue to Q5

Q5. Is the data structured (user state, task status) or unstructured (documents)?
      • STRUCTURED → KV / relational store with direct lookup
      • UNSTRUCTURED → Vector store with semantic search
Answer in order — most decisions resolve before Q5

Q1 is the most important split. Agent-generated state — tool outputs, intermediate results, decisions made during a run — belongs in memory, never in RAG. RAG can't retrieve something that didn't exist when the index was last built.

Q3 is where RAG earns its place. If the information is external knowledge — docs, databases, a policy library — that's RAG territory. The agent should retrieve, not maintain a local copy.

Q4 is the most underused check. Retrieval adds latency and retrieval errors. If the state fits comfortably in-context, keep it there. Don't add complexity you don't need.
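The five questions can be sketched as a small routing function. This is one possible reading of the flow, not a definitive encoding: it assumes Q2 is what splits Q1's "episodic or in-context" answer, and the `Info` fields and `pick_mechanism` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Info:
    agent_generated: bool        # Q1
    must_survive_session: bool   # Q2
    pre_existing_external: bool  # Q3
    fits_in_context: bool        # Q4
    structured: bool             # Q5


def pick_mechanism(info: Info) -> str:
    """Walk the questions in order and stop at the first one that decides."""
    if info.agent_generated:               # Q1
        if info.must_survive_session:      # Q2
            return "episodic memory"
        return "in-context memory"
    if info.pre_existing_external:         # Q3
        return "RAG / semantic memory"
    if info.fits_in_context:               # Q4
        return "in-context memory"
    if info.structured:                    # Q5
        return "KV / relational store"
    return "vector store"


tool_output = Info(True, False, False, True, True)
product_docs = Info(False, True, True, False, False)
```

Here `pick_mechanism(tool_output)` resolves at Q2 and `pick_mechanism(product_docs)` resolves at Q3, so neither reaches Q5.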


When You Need Both: Production Patterns

Most non-trivial agentic systems end up using all four memory types simultaneously. The question isn't which one — it's which layer owns which data.

PRODUCTION AGENT: MEMORY LAYERS

PROCEDURAL
  • System prompt · Tool definitions · Reasoning patterns · Behavioral rules
  • Set at design time · Never changes at runtime · Zero latency

IN-CONTEXT
  • Current task state · Active tool outputs · Working memory for this step
  • Volatile, resets after session · Zero latency · Paid at every agent step

EPISODIC
  • User preferences · Past sessions · What was tried · Decisions history
  • Persists across sessions · Agent writes · KV or vector retrieval · 10–50 ms

SEMANTIC / RAG
  • Product docs · Knowledge base · Policies · Codebase · External APIs
  • Pre-existing · Agent reads only · Vector search · 20–100 ms
The four layers in a production agent. Each layer has a clear owner and clear write patterns.

Here's what this looks like for a concrete system — a customer support agent:

  • Procedural: Escalation rules, tone guidelines, response templates — baked into the system prompt at design time
  • In-context: The current conversation, tool call outputs from this session, the user's current query
  • Episodic: What this user asked last week, their tier, their past resolved issues, preferred response style
  • Semantic / RAG: Product documentation, policy library, known issue database — anything that existed before the agent ran

The signal that you've got it right: each layer has a clear owner (who writes to it?), clear readers (who reads it?), and a clear retention policy (when does it expire or get pruned?).
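One way to keep those three answers explicit is to write them down as data. The sketch below does that for the support agent; the `MemoryLayer` type, the writer labels, and the retention numbers are all illustrative assumptions rather than recommendations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MemoryLayer:
    name: str
    writer: str                    # who writes to it
    readers: Tuple[str, ...]       # who reads it
    retention_days: Optional[int]  # None means no automatic expiry


# Hypothetical values; 0 days is used here to mean "session only".
LAYERS = (
    MemoryLayer("procedural", writer="engineer", readers=("agent",), retention_days=None),
    MemoryLayer("in-context", writer="LLM + tools", readers=("agent",), retention_days=0),
    MemoryLayer("episodic", writer="agent", readers=("agent",), retention_days=90),
    MemoryLayer("semantic", writer="docs team", readers=("agent",), retention_days=None),
)


def layers_written_by(writer: str) -> List[str]:
    """Answer "who owns what?" by listing the layers a given writer owns."""
    return [layer.name for layer in LAYERS if layer.writer == writer]


agent_owned = layers_written_by("agent")
```

If a layer cannot be given a single writer or a retention value, that is usually a hint that two kinds of data are sharing one store.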

A five-step agentic loop already runs at roughly 3.2× the token cost of the same task in direct chat mode, according to usage data from teams running Caveman-style token compression experiments. Without compression or memory segmentation, that cost compounds with every step. The layer architecture isn't just cleaner — it's how you control cost growth.


The Failure Modes Worth Knowing

Pure RAG without session state. The agent retrieves correctly but has no memory of what it's already tried. Symptom: it re-fetches the same documents across steps, re-attempts failed tool calls, and contradicts decisions it made three steps earlier. The fix is almost always in-context state management, not a better retrieval model.
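A minimal sketch of that kind of in-context state, assuming a hypothetical `SessionState` helper that records failed tool calls and renders them back into the prompt. The class and method names are invented for illustration.

```python
import json
from typing import Any, Dict, Set


class SessionState:
    """Hypothetical per-session record of tool calls that already failed."""

    def __init__(self) -> None:
        self._failed: Set[str] = set()

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> str:
        # Canonical JSON so the same call with reordered args maps to one key.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def record_failure(self, tool: str, args: Dict[str, Any]) -> None:
        self._failed.add(self._key(tool, args))

    def already_failed(self, tool: str, args: Dict[str, Any]) -> bool:
        return self._key(tool, args) in self._failed

    def as_prompt_note(self) -> str:
        # Rendered into the prompt so the model can see what not to repeat.
        if not self._failed:
            return "No failed tool calls so far."
        return "Previously failed calls:\n" + "\n".join(sorted(self._failed))


state = SessionState()
state.record_failure("search_orders", {"user": "42", "page": 1})
retry_blocked = state.already_failed("search_orders", {"page": 1, "user": "42"})
```

The check can run in the orchestration code before a tool call is dispatched, and the note can be appended to the next prompt so the model stops proposing the same call.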

Unbounded in-context growth. Works fine for short tasks. For long agentic loops — 50+ steps, large tool outputs — the context window fills and the agent degrades. Either you truncate (losing early context) or costs spiral. Without a memory compression or summarization step, you're paying for a full transcript at every step.
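A minimal sketch of a compaction step, assuming the history is a list of step transcripts. `compact_history` is a hypothetical helper, and `naive_summary` is a placeholder for what would more likely be an LLM summarization call.

```python
from typing import Callable, List


def compact_history(
    steps: List[str],
    keep_last: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace all but the most recent steps with a single summary entry."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(steps) <= keep_last:
        return list(steps)
    older, recent = steps[:-keep_last], steps[-keep_last:]
    return [summarize(older)] + recent


def naive_summary(older: List[str]) -> str:
    # Placeholder only: it records how much was dropped, not what was said.
    return f"[summary of {len(older)} earlier steps]"


history = [f"step {i}" for i in range(1, 6)]
compacted = compact_history(history, keep_last=2, summarize=naive_summary)
```

The recent steps stay verbatim because the agent is still reasoning over them; only the older ones are collapsed.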

Episodic memory without pruning. The agent writes everything to external memory. After weeks of operation the store grows without bound, and retrieval quality degrades as signal gets buried in noise. Define a maximum age or relevance threshold and prune regularly — treat the episodic store like a log, not a permanent archive.
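A maximum-age policy can be sketched in a few lines. The `Record` shape and the 30-day window are assumptions for illustration; a relevance threshold would follow the same filter pattern with a score in place of the timestamp.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Record = Tuple[datetime, str]  # (written_at, content)


def prune(records: List[Record], max_age: timedelta, now: datetime) -> List[Record]:
    """Keep only records written within max_age of now."""
    cutoff = now - max_age
    return [r for r in records if r[0] >= cutoff]


# Fixed "now" so the example is deterministic.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
records = [
    (datetime(2024, 10, 1, tzinfo=timezone.utc), "Old troubleshooting attempt"),
    (datetime(2025, 1, 30, tzinfo=timezone.utc), "User prefers short answers"),
]
kept = prune(records, max_age=timedelta(days=30), now=now)
```

Passing `now` in explicitly keeps the function easy to test and makes the retention window visible at the call site.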

Vector search for structured state. Trying to retrieve a user's timezone, account tier, or notification preferences through cosine similarity is slower, less reliable, and more expensive than a direct database lookup. Match the retrieval mechanism to the data structure. Fuzzy semantic search is for unstructured language, not enumerated values.


Read Next

If you're building the agent layer that sits on top of this architecture, How to Build AI Agents That Don't Fall Apart in Production covers ReAct vs. Reflexion pattern selection, tool boundary design, and termination conditions in detail.

For the token cost math behind the memory decision — specifically why unbounded in-context growth gets expensive quickly — Caveman: How Stone-Age Grammar Cuts AI Agent Token Costs by 65% walks through the compression strategies that keep agentic costs predictable.


Frequently Asked Questions

Is RAG a type of agent memory?

RAG operates as semantic memory in the broader agent memory taxonomy — it provides read access to external knowledge at query time. But RAG is stateless: it doesn't track what the agent has retrieved before or what decisions were made. Episodic and in-context memory handle the agent's own state, which RAG is not designed to provide.

Should I always implement all four memory types?

Not necessarily. Simple, bounded tasks often only need in-context memory plus RAG for knowledge lookup. The four-layer stack is warranted when tasks span multiple sessions, accumulate complex state, or require learning from past interactions. Each layer adds operational complexity — only add what the task genuinely requires.

Can I use episodic memory to replace in-context memory and cut token costs?

Partially. You can summarize completed steps into episodic memory and drop raw context, which is the core of most memory compression strategies. But the agent still needs an active in-context window for current-step reasoning. Episodic memory compresses and persists what's done; in-context holds what's happening now. The two work together, not in place of each other.

What's the right vector store for episodic memory versus RAG?

They can share infrastructure, but they have different access patterns. RAG queries are semantic and varied, so cosine similarity search over document embeddings is the right fit. Episodic memory retrieval is often more structured: "get the last N sessions for user X" or "find tasks where the agent used tool Y." Vector stores with metadata filter support (Weaviate, Qdrant, Pinecone) can handle both. Alternatively, use a dedicated key-value database for episodic state and a vector store for RAG.


Conclusion

RAG and agent memory aren't competing options — they're different layers of the same problem. RAG handles external knowledge that the agent reads. Memory handles state that the agent generates, accumulates, and needs to carry forward.

Get the split right and the architecture scales predictably: new knowledge goes in the RAG layer, agent state goes in the memory layer, and each part is independently optimizable. Get it wrong and you end up debugging retrieval failures that are actually memory failures, or watching memory stores grow unbounded because agent-generated state got routed to the wrong layer.

The decision framework reduces to a single question asked first: was this information generated by the agent, or did it exist before the agent ran? Everything else follows from that split.