RAG vs. Agent Memory: When to Use Which
You've shipped a RAG system. It works. Now you're adding agents — and you've hit a design question that most writeups skip: when does the agent retrieve, when does it remember, and when does it need both?
The confusion is understandable. Both RAG and agent memory give an LLM access to information it wasn't trained on. But they solve different problems. Conflating them produces architectures that are overbuilt (everything goes into memory, token costs spiral), underbuilt (pure retrieval with no persistence, the agent can't learn from its own history), or just wrong in ways that only surface under real load.
This isn't a feature comparison. It's a decision guide — for engineers who are past "hello world" with both RAG and agents.
Key Takeaways
- RAG retrieves knowledge from external sources at query time. Agent memory persists state across steps and sessions. These are orthogonal mechanisms — not competing options.
- There are four agent memory types (in-context, episodic, procedural, semantic), each with different persistence, write patterns, and cost profiles.
- The single most useful question: "Was this information generated by the agent, or did it exist before the agent ran?" That split settles most of the architecture decision.
- Most non-trivial production systems end up using all four memory types at once. The question isn't which one; it's which layer owns which data.
Why Engineers Conflate RAG and Agent Memory
When you built your first RAG pipeline, the mental model was clean. Documents go into a vector store. Queries come in. Relevant chunks come out. The LLM synthesizes a response. Information flows in one direction.
Agents break that model. An agent doesn't just query — it acts, observes, revises, and acts again. Across a 20-step autonomous task, the agent accumulates its own artifacts: intermediate results, tool call outputs, decisions made, paths tried and abandoned. None of that existed in your vector store before the agent ran.
This is the gap that causes engineers to reach for the wrong tool. RAG was designed to efficiently access external knowledge that's too large or too dynamic for the context window. It wasn't designed to answer "what did this agent already try in step 4?" or "what does this user prefer based on the last five sessions?" Those are memory problems.
What RAG Actually Solves — and Where It Ends
RAG was introduced in a 2020 NeurIPS paper, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020), from researchers at Facebook AI Research (now Meta AI), University College London, and New York University. The core idea: instead of encoding all knowledge into model weights, retrieve relevant passages at inference time from an external index. The knowledge base can then change independently of the model.
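To make the retrieve-then-generate flow concrete, here is a minimal sketch. It assumes a toy bag-of-words similarity in place of a learned embedding model, and the documents, file names, and function names (`DOCS`, `embed`, `retrieve`, `build_prompt`) are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base: (source, passage) pairs that existed
# before the agent ran. A real index would hold many more chunks.
DOCS = [
    ("refund-policy.md", "Refunds are issued within 14 days of purchase."),
    ("shipping.md", "Standard shipping takes 3 to 5 business days."),
    ("warranty.md", "Hardware is covered by a two year warranty."),
]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, k: int = 1) -> list:
    """Stateless retrieval: ranks the index against this query only."""
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d[1])), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Put retrieved passages, tagged with their source, ahead of the question."""
    context = "\n".join(f"[{src}] {text}" for src, text in retrieve(query))
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {query}"
```

Note that `retrieve` keeps no record of earlier calls, and updating `DOCS` changes the answers without touching the model.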
RAG is the right mechanism when the information:
- Existed before the agent ran — product docs, a policy library, a codebase, a knowledge base
- Is too large or dynamic for the context window — more than a few dozen pages, or updated frequently
- Can be expressed as a query — you can formulate what you're looking for in language
- Needs source attribution — the caller needs to know where the answer came from
RAG has hard limits. It's stateless by design — each retrieval is independent, with no memory of previous queries. It can only return what's already indexed. It can't track what the agent decided in step 3. And when engineers try to shoehorn it into agent-state roles — using a vector store as a scratchpad for intermediate results — retrieval quality degrades quickly, because fuzzy semantic search is the wrong tool for structured state lookup.
Where this breaks in practice: Teams commonly use RAG for user preference storage — embedding past conversations and retrieving "relevant" history via cosine similarity. The problem is that vector similarity doesn't preserve temporal ordering or logical relationships between events. A structured key-value store retrieves "user's preferred output format" more reliably than semantic search over conversation history.
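The structured alternative can be very small. The sketch below is one possible shape, assuming an in-memory dictionary; `PreferenceStore` and its keys are hypothetical names, and a production version would sit on a real database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PreferenceStore:
    """Hypothetical per-user key-value store; the latest write for a key wins."""
    _data: dict = field(default_factory=dict)

    def set(self, user_id: str, key: str, value: str) -> None:
        # Store the value with a write timestamp so ordering is explicit.
        self._data[(user_id, key)] = (value, datetime.now(timezone.utc))

    def get(self, user_id: str, key: str, default=None):
        # Exact-match lookup: no ranking, no similarity threshold.
        entry = self._data.get((user_id, key))
        return entry[0] if entry else default

prefs = PreferenceStore()
prefs.set("user-42", "output_format", "markdown")
prefs.set("user-42", "output_format", "json")  # the user changed their mind later
```

A lookup of `output_format` returns the most recent value deterministically, which is exactly the temporal guarantee that similarity search over conversation history does not give you.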
The Four Agent Memory Types
Agent memory isn't one thing. Well-architected systems rely on four distinct types, each with different persistence characteristics and write patterns.
In-Context Memory
Everything currently in the active prompt window. No retrieval latency, no retrieval misses, no additional infrastructure. But it resets when the session ends, and cost scales with every step of the agent loop, because every token in context is paid for again at every iteration. Claude 3.5 Sonnet supports up to 200k tokens; GPT-4o supports up to 128k. For bounded, single-session tasks under ~30 steps with small tool outputs, you may not need anything else.
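The per-iteration cost is easy to underestimate, so a quick calculation helps. This sketch assumes the full transcript is resent at every step and that each step appends a fixed number of tokens; the function name and the numbers are illustrative, not measurements.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Total input tokens billed when the whole context is resent at every step.

    Equivalent closed form: steps * base_tokens
                            + tokens_per_step * steps * (steps - 1) / 2
    """
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context            # the entire context is paid for again
        context += tokens_per_step  # this step's output and tool results are appended
    return total

ten_steps = cumulative_input_tokens(base_tokens=1000, tokens_per_step=500, steps=10)
twenty_steps = cumulative_input_tokens(base_tokens=1000, tokens_per_step=500, steps=20)
```

With these illustrative numbers, 10 steps bill 32,500 input tokens and 20 steps bill 115,000. Doubling the step count more than triples the cost, because growth is quadratic in the number of steps.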
Episodic Memory
A structured store — vector database or key-value store — that the agent writes to and reads from. Persists across sessions. Captures what the agent did, what worked, what the user said, decisions already made. This is where engineers commonly confuse RAG with memory: episodic memory does involve retrieval, but what's retrieved is the agent's own past state, not external documents.
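A minimal sketch of such a store, assuming an in-memory list and a structured "most recent N for this user" lookup. `Episode`, `EpisodicStore`, and the `kind` values are hypothetical; a production version would be backed by a database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class Episode:
    user_id: str
    kind: str      # hypothetical categories: "decision", "tool_result", "user_feedback"
    content: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class EpisodicStore:
    """Hypothetical in-memory store that the agent both writes to and reads from."""

    def __init__(self) -> None:
        self._episodes: List[Episode] = []

    def write(self, episode: Episode) -> None:
        self._episodes.append(episode)

    def recent(self, user_id: str, n: int = 5, kind: Optional[str] = None) -> List[Episode]:
        """Structured lookup: filter by user (and optionally kind), newest first."""
        matches = [
            e for e in self._episodes
            if e.user_id == user_id and (kind is None or e.kind == kind)
        ]
        return sorted(matches, key=lambda e: e.created_at, reverse=True)[:n]
```

The difference from RAG is visible in the `write` method: the agent is the author of this data, not just a reader.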
Procedural Memory
The agent's tools, system prompt, and few-shot examples. Baked in at design time and unchanged at runtime. It encodes how the agent does things — the reasoning patterns and behavioral constraints you've engineered in. Most engineers don't think of system prompts as memory, but they are.
Semantic Memory (Where RAG Lives)
External knowledge the agent reads but doesn't write. Your knowledge base, documentation, a database, an API. The agent queries it, doesn't own it, and doesn't modify it. Each retrieval is independent. This is the correct home for RAG.
Five Questions That Pick the Right Mechanism
Most architecture decisions reduce to five questions asked in order. You rarely need to reach Q5.
Q1 is the most important split. Agent-generated state — tool outputs, intermediate results, decisions made during a run — belongs in memory, never in RAG. RAG can't retrieve something that didn't exist when the index was last built.
Q3 is where RAG earns its place. If the information is external knowledge — docs, databases, a policy library — that's RAG territory. The agent should retrieve, not maintain a local copy.
Q4 is the most underused check. Retrieval adds latency and retrieval errors. If the state fits comfortably in-context, keep it there. Don't add complexity you don't need.
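The three checks described above can be sketched as a small routing function. This is only an illustration: it models Q1, Q3, and Q4 as this section describes them, the `Info` fields and return labels are hypothetical, and the questions not described here are deliberately left unmodeled.

```python
from dataclasses import dataclass

@dataclass
class Info:
    agent_generated: bool     # Q1: produced during a run, or did it already exist?
    external_knowledge: bool  # Q3: docs, databases, a policy library?
    fits_in_context: bool     # Q4: small enough to keep comfortably in the prompt?

def pick_mechanism(info: Info) -> str:
    """Route a piece of information using only Q1, Q3, and Q4 (assumed ordering)."""
    if info.agent_generated:
        # Q1: agent state belongs in memory, never in RAG.
        # Q4: keep it in-context when it fits; otherwise persist it.
        return "in-context memory" if info.fits_in_context else "episodic memory"
    if info.external_knowledge:
        # Q3: pre-existing external knowledge is retrieved, not copied locally.
        return "RAG (semantic memory)"
    # The remaining questions are not covered by this sketch.
    return "undecided"
```

For example, a large tool output from step 4 routes to episodic memory, while a policy document routes to RAG.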
When You Need Both: Production Patterns
Most non-trivial agentic systems end up using all four memory types simultaneously. The question isn't which one — it's which layer owns which data.
Here's what this looks like for a concrete system — a customer support agent:
- Procedural: Escalation rules, tone guidelines, response templates — baked into the system prompt at design time
- In-context: The current conversation, tool call outputs from this session, the user's current query
- Episodic: What this user asked last week, their tier, their past resolved issues, preferred response style
- Semantic / RAG: Product documentation, policy library, known issue database — anything that existed before the agent ran
The signal that you've got it right: each layer has a clear owner (who writes to it?), clear readers (who reads it?), and a clear retention policy (when does it expire or get pruned?).
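One way to keep that check honest is to write the layer map down as data and validate it. The sketch below assumes a simple dataclass; the owners, readers, and retention values are hypothetical placeholders for the support-agent example, not recommendations.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class MemoryLayer:
    name: str
    owner: str                 # who writes to it
    readers: Tuple[str, ...]   # who reads it
    retention: str             # when it expires or gets pruned

# Hypothetical layer map for the customer support agent described above.
LAYERS = [
    MemoryLayer("procedural", "engineering team", ("agent",), "until next deploy"),
    MemoryLayer("in-context", "agent", ("agent",), "end of session"),
    MemoryLayer("episodic", "agent", ("agent",), "90 days, pruned weekly"),
    MemoryLayer("semantic", "docs pipeline", ("agent",), "re-indexed on document change"),
]

def layers_missing_policy(layers: List[MemoryLayer]) -> List[str]:
    """Return the names of layers with no owner, no readers, or no retention policy."""
    return [
        layer.name for layer in layers
        if not layer.owner or not layer.readers or not layer.retention
    ]
```

An empty result from `layers_missing_policy(LAYERS)` means every layer has an answer to all three questions.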
A five-step agentic loop already runs at roughly 3.2× the token cost of the same task in direct chat mode, according to usage data from teams running Caveman-style token compression experiments. Without compression or memory segmentation, that cost compounds with every step. The layer architecture isn't just cleaner — it's how you control cost growth.
The Failure Modes Worth Knowing
Pure RAG without session state. The agent retrieves correctly but has no memory of what it's already tried. Symptom: it re-fetches the same documents across steps, re-attempts failed tool calls, and contradicts decisions it made three steps earlier. The fix is almost always in-context state management, not a better retrieval model.
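A minimal sketch of that session state, assuming the agent loop consults it before each retrieval or tool call. `SessionState` and its method names are hypothetical, and the failure key assumes tool arguments are hashable values.

```python
from typing import Callable, Dict, List, Set, Tuple

class SessionState:
    """Hypothetical per-session scratchpad tracking what the agent already did."""

    def __init__(self) -> None:
        self.fetched: Dict[str, List[str]] = {}  # query -> passages already retrieved
        self.failed_calls: Set[Tuple] = set()    # (tool, sorted args) that already failed

    def retrieve_once(self, query: str, retriever: Callable[[str], List[str]]) -> List[str]:
        """Call the retriever only the first time a query is seen in this session."""
        if query not in self.fetched:
            self.fetched[query] = retriever(query)
        return self.fetched[query]

    @staticmethod
    def _key(tool: str, args: dict) -> Tuple:
        # Sort the items so argument order does not change the key.
        return (tool, tuple(sorted(args.items())))

    def record_failure(self, tool: str, args: dict) -> None:
        self.failed_calls.add(self._key(tool, args))

    def should_attempt(self, tool: str, args: dict) -> bool:
        """False if this exact tool call already failed in this session."""
        return self._key(tool, args) not in self.failed_calls
```

Nothing here touches the retriever itself, which is the point: the symptoms above are fixed by remembering, not by retrieving better.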
Unbounded in-context growth. Works fine for short tasks. For long agentic loops — 50+ steps, large tool outputs — the context window fills and the agent degrades. Either you truncate (losing early context) or costs spiral. Without a memory compression or summarization step, you're paying for a full transcript at every step.
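One common shape for that compression step is a rolling summary: keep the last few steps verbatim and collapse everything older into a single entry. The sketch below assumes the summarizer is supplied by the caller (in practice, likely an LLM call); `compact_history` and the `SUMMARY:` prefix are hypothetical.

```python
from typing import Callable, List

def compact_history(
    steps: List[str],
    keep_last: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace all but the most recent `keep_last` steps with one summary entry."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(steps) <= keep_last:
        return list(steps)  # nothing old enough to compress
    older, recent = steps[:-keep_last], steps[-keep_last:]
    return [f"SUMMARY: {summarize(older)}"] + recent

# Stand-in summarizer for illustration; a real one would condense the content.
def count_summary(older: List[str]) -> str:
    return f"{len(older)} earlier steps"

compacted = compact_history(["s1", "s2", "s3", "s4"], keep_last=2, summarize=count_summary)
```

The trade-off is explicit: detail from early steps is lost to the summary, but the context stays bounded instead of growing with every iteration.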
Episodic memory without pruning. The agent writes everything to external memory. After weeks of operation the store grows without bound, and retrieval quality degrades as signal gets buried in noise. Define a maximum age or relevance threshold and prune regularly — treat the episodic store like a log, not a permanent archive.
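A pruning pass can be a few lines. This sketch assumes each record carries a `created_at` timestamp and a precomputed `relevance` score (both hypothetical field names), and it applies the age limit and the relevance threshold together; applying only one of them is an equally valid policy.

```python
from datetime import datetime, timedelta, timezone
from typing import List

def prune(
    records: List[dict],
    now: datetime,
    max_age: timedelta,
    min_relevance: float,
) -> List[dict]:
    """Keep records that are both recent enough and relevant enough."""
    cutoff = now - max_age
    return [
        r for r in records
        if r["created_at"] >= cutoff and r["relevance"] >= min_relevance
    ]
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the pruning policy easy to test and to run on a schedule.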
Vector search for structured state. Trying to retrieve a user's timezone, account tier, or notification preferences through cosine similarity is slower, less reliable, and more expensive than a direct database lookup. Match the retrieval mechanism to the data structure. Fuzzy semantic search is for unstructured language, not enumerated values.
Read Next
If you're building the agent layer that sits on top of this architecture, How to Build AI Agents That Don't Fall Apart in Production covers ReAct vs. Reflexion pattern selection, tool boundary design, and termination conditions in detail.
For the token cost math behind the memory decision — specifically why unbounded in-context growth gets expensive quickly — Caveman: How Stone-Age Grammar Cuts AI Agent Token Costs by 65% walks through the compression strategies that keep agentic costs predictable.
Frequently Asked Questions
Is RAG a type of agent memory?
In this taxonomy, yes: RAG is the semantic memory layer, providing read access to external knowledge at query time. But RAG is stateless: it doesn't track what the agent has retrieved before or what decisions were made. The agent's own state belongs in episodic and in-context memory, which RAG is not designed to replace.
Should I always implement all four memory types?
Not necessarily. Simple, bounded tasks often only need in-context memory plus RAG for knowledge lookup. The four-layer stack is warranted when tasks span multiple sessions, accumulate complex state, or require learning from past interactions. Each layer adds operational complexity — only add what the task genuinely requires.
Can I use episodic memory to replace in-context memory and cut token costs?
Partially. You can summarize completed steps into episodic memory and drop raw context, which is the core of most memory compression strategies. But the agent still needs an active in-context window for current-step reasoning. Episodic memory compresses and persists what's done; in-context holds what's happening now. The two work together, not in place of each other.
What's the right vector store for episodic memory versus RAG?
They can share infrastructure, but they have different access patterns. RAG queries are semantic and varied, so cosine similarity search over document embeddings is the right fit. Episodic memory retrieval is often more structured: "get the last N sessions for user X" or "find tasks where the agent used tool Y." Vector stores with metadata filtering (Weaviate, Qdrant, Pinecone) can handle both. Alternatively, use a dedicated key-value database for episodic state and a vector store for RAG.
Conclusion
RAG and agent memory aren't competing options — they're different layers of the same problem. RAG handles external knowledge that the agent reads. Memory handles state that the agent generates, accumulates, and needs to carry forward.
Get the split right and the architecture scales predictably: new knowledge goes in the RAG layer, agent state goes in the memory layer, and each part is independently optimizable. Get it wrong and you end up debugging retrieval failures that are actually memory failures, or watching memory stores grow unbounded because agent-generated state got routed to the wrong layer.
The decision framework reduces to a single question asked first: was this information generated by the agent, or did it exist before the agent ran? Everything else follows from that split.