You built something with an LLM. It works. Users love it. Then your API bill arrives.
If you're routing everything through Claude Sonnet or GPT-4o and not caching a token, you're almost certainly overpaying by 60–80%. Two techniques fix most of it: prompt caching and model routing. Neither requires a rewrite. Both are production-ready today.
The Two Levers
Before the code: a quick mental model.
Prompt caching exploits the fact that most of your API calls repeat large chunks of text — system prompts, RAG context, few-shot examples. Every time you resend those tokens, you pay full price. With caching, you pay 90% less on the re-reads.
Model routing exploits the fact that 70–80% of your queries don't need a frontier model. "Summarize this paragraph" doesn't need Opus. Routing easy tasks to cheap models (Haiku, GPT-4o-mini) while reserving expensive models for hard ones cuts your average cost per call dramatically.
Stack both. The savings compound.
Prompt Caching: The Real Numbers
Here's what the providers actually charge on cached reads vs. standard input tokens:
| Provider | Model | Standard Input | Cached Read | Discount |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4.6 | $3.00 / 1M | $0.30 / 1M | 90% |
| Anthropic | Claude Haiku 4.5 | $1.00 / 1M | $0.10 / 1M | 90% |
| OpenAI | GPT-4o | $2.50 / 1M | $1.25 / 1M | 50% |
| OpenAI | GPT-4o-mini | $0.15 / 1M | $0.075 / 1M | 50% |
| Gemini 2.5 Flash | $0.30 / 1M | $0.03 / 1M | 90% |
The 90% discount on Anthropic is not a rounding artifact — it's the actual price. Cache writes cost 1.25x the base rate (5-min TTL) and pay off after just one re-read.
A developer on Reddit went from $720 to $72/month by adding a single parameter. That's not a typo.
The math only works if your static content is large enough to cache. If your system prompt is 50 tokens, this doesn't move the needle. If it's a 10,000-token RAG document or tool manifest, this is your biggest lever.
Implementing on Anthropic (Claude)
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = """
You are a customer support assistant for Acme Corp.
[...10,000 tokens of product docs, policies, FAQ...]
"""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"} # <-- this is it
}
],
messages=[{"role": "user", "content": user_message}]
)
# Check cache hit in response
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
The first call writes to the cache (1.25x cost). Every subsequent call within 5 minutes reads from it (0.1x cost). With 10+ calls per minute, you're saving 90% on those tokens from the second call onward.
For a 1-hour TTL (higher cache write cost, better for low-frequency but large contexts):
"cache_control": {"type": "ephemeral", "ttl": 3600}
Implementing on OpenAI
OpenAI's prompt caching is automatic — no code changes needed. Any prompt longer than 1,024 tokens is eligible. The first ~1,024 tokens of your messages array get cached transparently.
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": LONG_SYSTEM_PROMPT}, # auto-cached if >1024 tokens
{"role": "user", "content": user_message}
]
)
# Check cache hits
usage = response.usage
print(f"Cached tokens: {usage.prompt_tokens_details.cached_tokens}")
You don't enable it — you just benefit from it when your prompts qualify.
Model Routing: The Real Numbers
Here's the price gap between tiers that makes routing so powerful:
| Model | Input Cost / 1M tokens |
|---|---|
| GPT-4o-mini | $0.15 |
| Claude Haiku 4.5 | $1.00 |
| Gemini 2.5 Flash | $0.30 |
| GPT-4o | $2.50 |
| Claude Sonnet 4.6 | $3.00 |
| Claude Opus 4.7 | $5.00 |
GPT-4o is 16x more expensive than GPT-4o-mini. For classification, summarization, extraction, and simple Q&A — the cheap models perform near-identically.
A real-world example: 50,000 daily requests routed entirely through GPT-4o cost ~$397/day. The same workload, routed by task type, cost $25/day — a 93% reduction (Kalvium Labs, 2025).
Most teams make one mistake: they pick one model for the entire product. The right mental model is a fleet — each tier handles the queries it's actually sized for.
A Simple Router Pattern
The simplest version: classify intent first (cheap), then route to the appropriate model.
import anthropic
client = anthropic.Anthropic()
CLASSIFIER_PROMPT = """Classify this query into one of:
- simple: basic Q&A, extraction, classification, summarization
- complex: multi-step reasoning, code generation, analysis
Reply with just the category label."""
def route_query(user_message: str) -> str:
# Step 1: Cheap classification call
classification = client.messages.create(
model="claude-haiku-4-5", # cheapest, fastest
max_tokens=10,
system=CLASSIFIER_PROMPT,
messages=[{"role": "user", "content": user_message}]
).content[0].text.strip()
# Step 2: Route based on classification
target_model = (
"claude-haiku-4-5" if classification == "simple"
else "claude-sonnet-4-6"
)
response = client.messages.create(
model=target_model,
max_tokens=1024,
messages=[{"role": "user", "content": user_message}]
)
return response.content[0].text
The classification call is so cheap (< 100 tokens on Haiku) that it pays for itself within the first routing decision.
For production, add a third tier:
MODEL_TIERS = {
"simple": "claude-haiku-4-5", # $1/MTok — classification, extraction, summary
"moderate": "claude-sonnet-4-6", # $3/MTok — reasoning, code review, drafting
"complex": "claude-opus-4-7", # $5/MTok — architecture, long-form analysis
}
Target distribution: 70% simple, 20% moderate, 10% complex. That blend reduces average cost per token by ~65% vs routing everything through Sonnet.
Stacking Both: The Combined Impact
Here's what happens when you run prompt caching and model routing together on a customer support bot with a large knowledge base:
Before optimization:
- All queries → Claude Sonnet 4.6 at full price
- 5,000 tokens system prompt resent on every call
- 10,000 calls/day
- Cost: ~$150/day
After optimization:
- 70% of queries → Haiku (simple lookups)
- 30% of queries → Sonnet (nuanced questions)
- 5,000-token system prompt cached → 90% discount on those tokens
- Cost: ~$30–40/day
Savings: 73–80%.
The important nuance: prompt caching savings scale with your static context size. Model routing savings scale with query volume. Together they hit different cost drivers simultaneously — which is why the combined effect is larger than either alone.
What to Do First
Audit your token distribution. Log
usagefrom every API call for a week. What % is system prompt / static context? What's the range of query complexity?Add cache_control to your system prompt. If you're on Anthropic, this is a one-line change. If you're on OpenAI, just make sure your system prompt exceeds 1,024 tokens — it's automatic.
Start routing at one split point. Don't build a three-tier classifier on day one. Add one rule: "if the query is under N words and doesn't mention [complex domain], use the cheap model." Measure. Tune.
Monitor cache hit rates. Anthropic surfaces
cache_read_input_tokensin every response. OpenAI surfacescached_tokensinprompt_tokens_details. Track these in your logging. Below 50% hit rate means your cache window is too short or your prompts aren't structured to reuse the prefix.
FAQ
Does prompt caching affect response quality?
No. The model receives the exact same tokens — they're just loaded from cache instead of being reprocessed. Output is identical.
What's the minimum context size worth caching?
On Anthropic, caching pays off if the cached content exceeds ~500 tokens (given the 1.25x write cost). In practice, anything under 300 tokens isn't worth the complexity.
Can I use both OpenAI and Anthropic in the same router?
Yes. LiteLLM is the standard abstraction layer for multi-provider routing. It normalizes the API surface and exposes a single interface for routing decisions.
How do I pick the classification threshold?
Start conservative — only route clearly simple queries to the cheap tier. Measure accuracy degradation. Widen the threshold until you see quality slip, then pull back one notch. Most teams end up routing 60–75% of queries to the cheap tier without any user-visible quality drop.
Is there a risk of cache misses spiking costs?
Cache write tokens cost 1.25x (5-min TTL) or 2x (1-hr TTL) on Anthropic. If your hit rate collapses, you pay slightly more than uncached for the write. Design your prompts to maximize prefix reuse — put all static content first, dynamic content last.
Related Posts
Weekly Digest
Get the best AI engineering posts, weekly
No hype. Curated signal every Sunday.