← Blog
BlogAI Engineering

How to Cut LLM API Costs by 70%+ Using Prompt Caching and Model Routing

·7 min read
Share:Share on XShare on LinkedIn
LLMsCost OptimizationDeveloper Tools

You built something with an LLM. It works. Users love it. Then your API bill arrives.

If you're routing everything through Claude Sonnet or GPT-4o and not caching a token, you're almost certainly overpaying by 60–80%. Two techniques fix most of it: prompt caching and model routing. Neither requires a rewrite. Both are production-ready today.


The Two Levers

Before the code: a quick mental model.

Prompt caching exploits the fact that most of your API calls repeat large chunks of text — system prompts, RAG context, few-shot examples. Every time you resend those tokens, you pay full price. With caching, you pay 90% less on the re-reads.

Model routing exploits the fact that 70–80% of your queries don't need a frontier model. "Summarize this paragraph" doesn't need Opus. Routing easy tasks to cheap models (Haiku, GPT-4o-mini) while reserving expensive models for hard ones cuts your average cost per call dramatically.

Stack both. The savings compound.


Prompt Caching: The Real Numbers

Here's what the providers actually charge on cached reads vs. standard input tokens:

Provider Model Standard Input Cached Read Discount
Anthropic Claude Sonnet 4.6 $3.00 / 1M $0.30 / 1M 90%
Anthropic Claude Haiku 4.5 $1.00 / 1M $0.10 / 1M 90%
OpenAI GPT-4o $2.50 / 1M $1.25 / 1M 50%
OpenAI GPT-4o-mini $0.15 / 1M $0.075 / 1M 50%
Google Gemini 2.5 Flash $0.30 / 1M $0.03 / 1M 90%

The 90% discount on Anthropic is not a rounding artifact — it's the actual price. Cache writes cost 1.25x the base rate (5-min TTL) and pay off after just one re-read.

A developer on Reddit went from $720 to $72/month by adding a single parameter. That's not a typo.

The math only works if your static content is large enough to cache. If your system prompt is 50 tokens, this doesn't move the needle. If it's a 10,000-token RAG document or tool manifest, this is your biggest lever.

Implementing on Anthropic (Claude)

python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """
You are a customer support assistant for Acme Corp.
[...10,000 tokens of product docs, policies, FAQ...]
"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # <-- this is it
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

# Check cache hit in response
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")

The first call writes to the cache (1.25x cost). Every subsequent call within 5 minutes reads from it (0.1x cost). With 10+ calls per minute, you're saving 90% on those tokens from the second call onward.

For a 1-hour TTL (higher cache write cost, better for low-frequency but large contexts):

python
"cache_control": {"type": "ephemeral", "ttl": 3600}

Implementing on OpenAI

OpenAI's prompt caching is automatic — no code changes needed. Any prompt longer than 1,024 tokens is eligible. The first ~1,024 tokens of your messages array get cached transparently.

python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # auto-cached if >1024 tokens
        {"role": "user", "content": user_message}
    ]
)

# Check cache hits
usage = response.usage
print(f"Cached tokens: {usage.prompt_tokens_details.cached_tokens}")

You don't enable it — you just benefit from it when your prompts qualify.


Model Routing: The Real Numbers

Here's the price gap between tiers that makes routing so powerful:

Model Input Cost / 1M tokens
GPT-4o-mini $0.15
Claude Haiku 4.5 $1.00
Gemini 2.5 Flash $0.30
GPT-4o $2.50
Claude Sonnet 4.6 $3.00
Claude Opus 4.7 $5.00

GPT-4o is 16x more expensive than GPT-4o-mini. For classification, summarization, extraction, and simple Q&A — the cheap models perform near-identically.

A real-world example: 50,000 daily requests routed entirely through GPT-4o cost ~$397/day. The same workload, routed by task type, cost $25/day — a 93% reduction (Kalvium Labs, 2025).

Most teams make one mistake: they pick one model for the entire product. The right mental model is a fleet — each tier handles the queries it's actually sized for.

A Simple Router Pattern

The simplest version: classify intent first (cheap), then route to the appropriate model.

python
import anthropic

client = anthropic.Anthropic()

CLASSIFIER_PROMPT = """Classify this query into one of:
- simple: basic Q&A, extraction, classification, summarization
- complex: multi-step reasoning, code generation, analysis

Reply with just the category label."""

def route_query(user_message: str) -> str:
    # Step 1: Cheap classification call
    classification = client.messages.create(
        model="claude-haiku-4-5",  # cheapest, fastest
        max_tokens=10,
        system=CLASSIFIER_PROMPT,
        messages=[{"role": "user", "content": user_message}]
    ).content[0].text.strip()

    # Step 2: Route based on classification
    target_model = (
        "claude-haiku-4-5" if classification == "simple"
        else "claude-sonnet-4-6"
    )

    response = client.messages.create(
        model=target_model,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}]
    )

    return response.content[0].text

The classification call is so cheap (< 100 tokens on Haiku) that it pays for itself within the first routing decision.

For production, add a third tier:

python
MODEL_TIERS = {
    "simple": "claude-haiku-4-5",       # $1/MTok — classification, extraction, summary
    "moderate": "claude-sonnet-4-6",    # $3/MTok — reasoning, code review, drafting
    "complex": "claude-opus-4-7",       # $5/MTok — architecture, long-form analysis
}

Target distribution: 70% simple, 20% moderate, 10% complex. That blend reduces average cost per token by ~65% vs routing everything through Sonnet.


Stacking Both: The Combined Impact

Here's what happens when you run prompt caching and model routing together on a customer support bot with a large knowledge base:

Before optimization:

  • All queries → Claude Sonnet 4.6 at full price
  • 5,000 tokens system prompt resent on every call
  • 10,000 calls/day
  • Cost: ~$150/day

After optimization:

  • 70% of queries → Haiku (simple lookups)
  • 30% of queries → Sonnet (nuanced questions)
  • 5,000-token system prompt cached → 90% discount on those tokens
  • Cost: ~$30–40/day

Savings: 73–80%.

The important nuance: prompt caching savings scale with your static context size. Model routing savings scale with query volume. Together they hit different cost drivers simultaneously — which is why the combined effect is larger than either alone.


What to Do First

  1. Audit your token distribution. Log usage from every API call for a week. What % is system prompt / static context? What's the range of query complexity?

  2. Add cache_control to your system prompt. If you're on Anthropic, this is a one-line change. If you're on OpenAI, just make sure your system prompt exceeds 1,024 tokens — it's automatic.

  3. Start routing at one split point. Don't build a three-tier classifier on day one. Add one rule: "if the query is under N words and doesn't mention [complex domain], use the cheap model." Measure. Tune.

  4. Monitor cache hit rates. Anthropic surfaces cache_read_input_tokens in every response. OpenAI surfaces cached_tokens in prompt_tokens_details. Track these in your logging. Below 50% hit rate means your cache window is too short or your prompts aren't structured to reuse the prefix.


FAQ

Does prompt caching affect response quality?

No. The model receives the exact same tokens — they're just loaded from cache instead of being reprocessed. Output is identical.

What's the minimum context size worth caching?

On Anthropic, caching pays off if the cached content exceeds ~500 tokens (given the 1.25x write cost). In practice, anything under 300 tokens isn't worth the complexity.

Can I use both OpenAI and Anthropic in the same router?

Yes. LiteLLM is the standard abstraction layer for multi-provider routing. It normalizes the API surface and exposes a single interface for routing decisions.

How do I pick the classification threshold?

Start conservative — only route clearly simple queries to the cheap tier. Measure accuracy degradation. Widen the threshold until you see quality slip, then pull back one notch. Most teams end up routing 60–75% of queries to the cheap tier without any user-visible quality drop.

Is there a risk of cache misses spiking costs?

Cache write tokens cost 1.25x (5-min TTL) or 2x (1-hr TTL) on Anthropic. If your hit rate collapses, you pay slightly more than uncached for the write. Design your prompts to maximize prefix reuse — put all static content first, dynamic content last.

Related Posts

Weekly Digest

Get the best AI engineering posts, weekly

No hype. Curated signal every Sunday.

← All posts