AI Engineering

LLM Gateway Architecture: Token Routing, Caching, and Failover in Production

June 14, 2026·12 min read

LLM Infrastructure · System Design · Cost Optimization

Foundation model API spend hit $12.5 billion in 2025 — up from $3.5 billion in late 2024 — and 72% of organizations plan to grow that number further in 2026. Most of that budget is leaking through three holes: routing every request to a frontier model regardless of complexity, paying token costs for responses you've already computed, and running a single provider dependency that goes down and takes your entire app with it.

An LLM gateway closes all three. It's the proxy layer between your application and every model provider: a unified API surface that handles routing decisions, serves cached responses, and reroutes traffic when a provider degrades. In 2026, running production AI without one is the equivalent of deploying a web service without a load balancer or health check.

This guide covers how token routing, caching, and failover work inside a gateway — the mechanisms, the architecture decisions, and the failure modes that determine whether it's a cost lever or an incident waiting to happen.

Key Takeaways

In 2025, foundation model API spend reached $12.5 billion, roughly 3.5× the $3.5 billion recorded in late 2024. Intelligent routing and caching are now required to stay solvent (Maxim AI, 2026).

Semantic caching returns responses in under 5ms vs. 2-5 seconds for live inference, with 30-40% typical hit rates in production (Digital Applied, 2026).

Cost-based model routing alone cuts 40-70% of token spend by routing simple requests to cheaper models (Lushbinary, 2026).

Combining prompt caching, semantic caching, and model routing delivers 47-80% total cost reduction.

In a July 2025 benchmark at identical CPU allocation, Kong's data plane processed 859% more requests per second than LiteLLM and 228% more than Portkey (Kong Inc., 2025).

What Is an LLM Gateway — and Why Is It Now Table Stakes?

An LLM gateway is the proxy between your application and one or more model providers. It abstracts provider-specific authentication, request shapes, and error codes behind a unified API, while adding a control plane for routing, caching, rate limiting, budget enforcement, and observability.

The shift from convenience to required infrastructure happens as soon as three compounding problems appear at scale:

Provider concentration risk. Hard-coding a single provider means a degraded API is your outage. OpenAI, Anthropic, and Google each publish SLAs allowing roughly 44 minutes of downtime per month, which is what a 99.9% uptime commitment works out to. Without a failover path, that window is your window.

Routing inefficiency. Production LLM workloads are almost never uniform. A typical application sends classification calls, summarization tasks, and complex multi-step chains to the same frontier model — paying GPT-4.1 or Claude Opus pricing for tasks a model at one-tenth the cost handles identically. Routing doesn't require a sophisticated classifier; it requires a deliberate policy.

No spend control. Provider APIs enforce rate limits, not budget guardrails. Without a gateway, there's no mechanism to reject overages before they hit your invoice, attribute spend to the team or feature generating it, or kill a runaway prompt loop before it drains a monthly budget in an hour.
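
As a minimal sketch of the budget guardrail described above, the check below rejects a request when a team's estimated spend would exceed its monthly cap. The class name `BudgetGuard`, the team name, and the dollar figures are all hypothetical; a real gateway would persist spend in shared storage and estimate cost from token counts.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetGuard:
    """Hypothetical per-team budget check a gateway could run before each call."""

    monthly_limit_usd: dict[str, float]
    spent_usd: dict[str, float] = field(default_factory=dict)

    def authorize(self, team: str, estimated_cost_usd: float) -> bool:
        """Allow the request only if it fits in the team's remaining budget."""
        limit = self.monthly_limit_usd.get(team, 0.0)  # unknown teams get no budget
        spent = self.spent_usd.get(team, 0.0)
        return spent + estimated_cost_usd <= limit

    def record(self, team: str, actual_cost_usd: float) -> None:
        """Attribute the actual cost of a completed call to the team."""
        self.spent_usd[team] = self.spent_usd.get(team, 0.0) + actual_cost_usd


# Illustrative values only: a $100 monthly cap for one team.
guard = BudgetGuard(monthly_limit_usd={"search": 100.0})
first_allowed = guard.authorize("search", 60.0)   # 60 <= 100, so allowed
guard.record("search", 60.0)
second_allowed = guard.authorize("search", 60.0)  # 120 > 100, so rejected
```

Authorizing before the call and recording after it keeps attribution accurate even when the estimate and the actual cost differ.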

For a deeper look at the rate-limiting side of this, see LLM Rate Limiting Strategies at Scale.

The leading gateways in 2026 — LiteLLM (100+ providers, ~40k GitHub stars), Portkey (250+ providers), Kong AI Gateway, Cloudflare AI Gateway (330 global edge nodes), and OpenRouter — all solve the same three problems. Where they diverge is throughput ceiling, operational complexity, and which of the three problems they solve best.

How Does Token Routing Work in Production?

In 2026, at 100,000 queries per day, routing simple requests to cheaper models while reserving frontier capacity for complex tasks yields annualized savings exceeding $150,000 (Maxim AI, 2026). Token routing is the mechanism that makes those savings systematic rather than dependent on individual engineering discipline.
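
To see how savings of that order can arise, here is a back-of-the-envelope sketch. Only the 100,000 queries per day comes from the text; the share of traffic routed down and both per-query costs are assumed values chosen for illustration, not reported figures.

```python
def annual_routing_savings(
    queries_per_day: int,
    share_routed_down: float,
    frontier_cost_usd: float,
    cheap_cost_usd: float,
    days: int = 365,
) -> float:
    """Annual savings from sending a share of traffic to a cheaper model."""
    saved_per_query = frontier_cost_usd - cheap_cost_usd
    return queries_per_day * share_routed_down * saved_per_query * days


# Assumptions (hypothetical): 60% of traffic is simple enough to route down,
# a frontier call averages $0.010 and a small-model call averages $0.001.
savings = annual_routing_savings(100_000, 0.60, 0.010, 0.001)
print(f"${savings:,.0f} per year")  # roughly $197,000 under these assumed inputs
```

The result is linear in every input, so halving the routable share or the price gap halves the savings.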

A gateway-level router intercepts every outbound request and applies a routing policy before any provider call happens. Four strategies dominate production:

Cost-based routing assigns each model a cost score and selects the cheapest option that meets a minimum capability threshold you define. OpenRouter's implementation uses an inverse-square cost algorithm (1/p²) that aggressively weights cheaper providers within a quality band. You set the floor; the router finds the cheapest path that clears it.

Latency-based routing tracks rolling p95 latency per model and shifts traffic toward the fastest provider at request time. This matters for interactive applications where time-to-first-token is user-visible. LiteLLM ships six routing strategies out of the box, including latency-based and usage-aware variants.

Complexity-based routing classifies each request — by prompt length, task type, or a lightweight model — and routes accordingly. Simple requests (classification, intent detection, short Q&A) go to smaller, cheaper models. Long-context reasoning, code generation, and multi-step agentic tasks go to frontier models. Hybrid systems using this approach achieve 37-46% reduction in LLM usage by diverting basic requests to rule-based or smaller-model paths entirely.

Load-balanced routing distributes traffic across multiple API keys or provider instances to stay under per-key rate limits. Portkey's implementation uses weighted key distribution with automatic circuit-breaking when a key hits its ceiling.
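
The inverse-square weighting mentioned under cost-based routing can be sketched as follows. This illustrates the 1/p² idea only and is not OpenRouter's code; the provider names and prices are made up.

```python
import random


def inverse_square_weights(prices: dict[str, float]) -> dict[str, float]:
    """Weight each provider by 1 / price^2, normalized to sum to 1."""
    raw = {name: 1.0 / (price * price) for name, price in prices.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def pick_provider(prices: dict[str, float], rng: random.Random) -> str:
    """Sample one provider; cheaper providers are chosen far more often."""
    weights = inverse_square_weights(prices)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical prices per million tokens for providers inside one quality band.
prices = {"provider_a": 1.0, "provider_b": 2.0, "provider_c": 4.0}
weights = inverse_square_weights(prices)
choice = pick_provider(prices, random.Random(0))
```

Doubling a provider's price cuts its weight to a quarter, which is what makes the selection lean so hard toward the cheapest option while still sending some traffic elsewhere.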

The practical problem with complexity-based routing is classification overhead. If your classifier is a separate LLM call, you've added latency and token cost to save latency and token cost. In production, this works best with a fine-tuned SLM running locally: Phi-4 or Gemma 3 1B handle routing classification in under 10ms on CPU. The routing model adds no per-call API charge since you're not making an API request, though it still consumes local compute. See Fine-Tuning vs RAG vs Prompting for how to scope a fine-tune of this kind.
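
Before investing in a fine-tuned classifier, a rule-based policy is often enough to get started. The sketch below classifies by prompt length and a few keyword hints; the keyword list, the 2,000-character cutoff, and the model tier names are assumptions for illustration, not recommended values.

```python
import re

# Hypothetical hints that a prompt needs a frontier model.
COMPLEX_HINTS = re.compile(
    r"\b(refactor|prove|step[- ]by[- ]step|debug|analy[sz]e|implement)\b",
    re.IGNORECASE,
)

# Placeholder tier names; substitute the models your gateway exposes.
ROUTES = {"simple": "small-model", "complex": "frontier-model"}


def classify(prompt: str, long_prompt_chars: int = 2000) -> str:
    """Label a prompt 'complex' if it is long or contains a complexity hint."""
    if len(prompt) > long_prompt_chars or COMPLEX_HINTS.search(prompt):
        return "complex"
    return "simple"


def route(prompt: str) -> str:
    """Map a prompt to a model tier using the rule-based classifier."""
    return ROUTES[classify(prompt)]


short_route = route("Is this email spam? Answer yes or no.")
hard_route = route("Debug this function and explain the fix step by step.")
```

A policy like this costs microseconds per request, and its misroutes show up in logs as a labeled dataset for training a better classifier later.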

Cost reduction per strategy at the midpoint of reported production ranges. Combining prompt caching, semantic caching, and model routing delivers 47-80% total reduction.

What Caching Strategy Cuts Your LLM Bill the Most?

In 2026, semantic cache hits return responses in under 5ms compared to 2-5 seconds for live inference, and production deployments typically hit 30-40% cache hit rates (Digital Applied, 2026). Each hit eliminates 100% of the token cost for that call. Three distinct caching layers exist, and they stack independently.

Exact-match caching stores the full prompt-response pair and returns the cached result when an identical prompt arrives. This works well for templated requests — health checks, repeated system-prompt-only calls, or any application with a narrow and predictable input distribution. Hit rates reach 70-80% for those patterns and collapse to near-zero for open-ended user inputs.

Semantic caching uses vector similarity to match semantically equivalent prompts that aren't textually identical. "What's your return policy?" and "Can I return this item?" map to the same cache slot when their embeddings fall within a cosine similarity threshold — typically tunable between 0.90 and 0.98 depending on how much drift you'll tolerate. Organizations with high question repetition report 15-30% cost reductions from semantic caching alone; those using a layered approach report 40-60% (Maxim AI, 2026).

Provider-level prompt caching operates inside the model provider's infrastructure on repeated token prefixes — system prompts, long context, attached documents. Anthropic's prefix caching delivers 90% cost reduction and 85% latency reduction for prompts with frequently reused prefixes (Introl, December 2025). OpenAI's automatic caching, enabled by default, achieves 50% reduction on cached tokens. See How to Cut LLM API Costs with Prompt Caching and Model Routing for the full mechanics.

The implementation pipeline in most gateways follows a cache-then-route order:

Request
  → Exact cache check (Redis / in-memory)
  → Semantic cache check (vector DB, similarity > threshold)
  → Provider prompt cache (prefix match at the provider layer)
  → Live inference → on a miss, write the exact and semantic caches (the provider manages its own prefix cache)
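
The gateway-side part of this lookup order can be sketched as below. It is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the in-memory dict and list stand in for Redis and a vector database, and `fake_model` is a placeholder for a provider call. Provider prefix caching is not modeled because it happens on the provider side.

```python
import hashlib
import math
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class LayeredCache:
    def __init__(self, threshold: float = 0.9) -> None:
        self.exact: dict[str, str] = {}                  # hash of prompt -> response
        self.semantic: list[tuple[Counter, str]] = []    # (embedding, response)
        self.threshold = threshold

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def lookup(self, prompt: str) -> tuple[Optional[str], str]:
        """Check the exact layer first, then the semantic layer."""
        key = self._key(prompt)
        if key in self.exact:
            return self.exact[key], "exact"
        query = toy_embed(prompt)
        best = max(self.semantic, key=lambda entry: cosine(query, entry[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"

    def store(self, prompt: str, response: str) -> None:
        """Write both gateway-side layers after a live call."""
        self.exact[self._key(prompt)] = response
        self.semantic.append((toy_embed(prompt), response))


def answer(prompt: str, cache: LayeredCache, call_model: Callable[[str], str]) -> tuple[str, str]:
    """Serve from cache when possible; otherwise call the model and cache the result."""
    cached, layer = cache.lookup(prompt)
    if cached is not None:
        return cached, layer
    response = call_model(prompt)
    cache.store(prompt, response)
    return response, "miss"


calls: list[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)  # count live calls so cache hits are visible
    return "Returns are accepted within 30 days."


cache = LayeredCache(threshold=0.8)  # threshold tuned for the toy embedding only
r1 = answer("what is your return policy", cache, fake_model)
r2 = answer("what is your return policy", cache, fake_model)
r3 = answer("what is your return policy please", cache, fake_model)
```

The linear scan over `self.semantic` is fine for a demonstration; a production cache would use an approximate nearest-neighbor index and expire entries.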

The right cache investment depends entirely on your request distribution. Inspect your prompt logs for 30 days and plot the top-N most-repeated semantic clusters. If more than 50% of traffic falls into 20-30 intent buckets, semantic caching pays for itself in week one. If the distribution is long-tailed — most prompts unique — concentrate on prompt caching for the system prompt prefix and accept that semantic caching will underperform until volume scales.
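
The concentration check described above reduces to a single ratio once each logged prompt carries an intent label. The sketch assumes that labeling has already been done by whatever clustering step you use; the labels and counts below are invented.

```python
from collections import Counter


def top_n_share(intent_labels: list[str], n: int) -> float:
    """Fraction of all requests that fall into the n most common intent buckets."""
    if not intent_labels:
        return 0.0
    counts = Counter(intent_labels)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(intent_labels)


# Hypothetical log: three dominant intents plus a tail of one-off prompts.
labels = (
    ["returns"] * 50
    + ["shipping"] * 30
    + ["billing"] * 10
    + [f"unique-{i}" for i in range(10)]
)
share = top_n_share(labels, n=3)
semantic_cache_likely_pays_off = share > 0.5  # the 50% rule of thumb from the text
```

Running this for several values of `n` shows how quickly coverage saturates, which is the shape that separates a clustered distribution from a long-tailed one.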

How Do You Build a Failover Chain That Actually Holds?

A failover chain is only as reliable as its health detection and its fallback sequencing. Most gateway implementations get the mechanics right and the edge cases wrong.

The three categories of fallback that matter in production:

General failover handles provider downtime, rate-limit exhaustion, and network errors. You define a priority-ordered list of providers — primary → secondary → tertiary. When the primary returns a 5xx or times out, the gateway retries the next provider in the chain without the caller seeing any error. Portkey's implementation allows up to five retry attempts with exponential backoff before escalating to the next fallback provider.

Content-policy fallback handles the case where a provider refuses a request with a 400 content moderation rejection. This is distinct from infrastructure failure — the provider is healthy, it just won't serve this specific input. You need a separate fallback configured with a provider that has different policy boundaries for your use case.

Context-window fallback handles requests that exceed a provider's maximum token limit. Rather than returning an error, the gateway routes to a provider with a larger context window automatically, without requiring your application to handle context_length_exceeded errors.
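
One way to express the three fallback categories is to normalize provider errors into a small set of kinds and switch chains based on the kind. Everything here is a hypothetical sketch: `ProviderError`, the kind strings, and the provider names are not taken from any gateway's API, and retries with backoff are left out for brevity.

```python
from typing import Callable, Optional


class ProviderError(Exception):
    """Hypothetical normalized error: 'unavailable', 'content_policy', or 'context_length'."""

    def __init__(self, kind: str) -> None:
        super().__init__(kind)
        self.kind = kind


def call_with_fallback(
    prompt: str,
    providers: dict[str, Callable[[str], str]],
    general_chain: list[str],
    content_policy_chain: list[str],
    context_window_chain: list[str],
) -> tuple[str, str]:
    """Try providers in order, switching chains based on why the last call failed."""
    queue = list(general_chain)
    tried: set[str] = set()
    last_error: Optional[ProviderError] = None
    while queue:
        name = queue.pop(0)
        if name in tried:
            continue  # never call the same provider twice for one request
        tried.add(name)
        try:
            return name, providers[name](prompt)
        except ProviderError as err:
            last_error = err
            if err.kind == "content_policy":
                queue = list(content_policy_chain)   # providers with different policies
            elif err.kind == "context_length":
                queue = list(context_window_chain)   # providers with larger windows
            # "unavailable" (5xx, timeout, rate limit): keep walking the current chain
    raise RuntimeError("all fallback providers failed") from last_error


def down(prompt: str) -> str:
    raise ProviderError("unavailable")


def too_long(prompt: str) -> str:
    raise ProviderError("context_length")


def long_context(prompt: str) -> str:
    return "answer"


providers = {"primary": down, "secondary": too_long, "long_context": long_context}
served_by, text = call_with_fallback(
    "a very long prompt", providers,
    general_chain=["primary", "secondary"],
    content_policy_chain=[],
    context_window_chain=["long_context"],
)
```

Keeping the chains separate matters because retrying a content-policy refusal or an oversized prompt against the next general-purpose provider usually fails for the same reason.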

According to the Multi-Agent System Failure Taxonomy (MAST, arXiv:2503.13657), 41.77% of multi-agent failures trace to specification and system design, the failure category that a well-configured failover chain addresses at the infrastructure layer before it becomes an agent-level error.

The failure mode most teams miss is cascading retry amplification. When a provider starts responding slowly (p95 climbing to 10 seconds rather than failing outright), requests pile up waiting for responses. Retries spawn duplicate in-flight requests against an already-struggling provider. If the gateway doesn't implement circuit breaking, one slow provider creates a queue that eventually overwhelms healthy providers too. The circuit breaker pattern is the architectural piece that separates gateways that hold from ones that make outages worse: once a provider crosses a small failure threshold, stop sending it new requests at once instead of letting each request wait out its own timeout, then reopen only after a health check confirms recovery.

LiteLLM, Kong, and Portkey all implement circuit breakers, but they require explicit configuration. The defaults are too permissive for high-traffic production. Set a short timeout (3-5s for interactive requests, 30s for batch), circuit-trip after three consecutive failures, and require a successful health probe before reopening.
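
The policy in the previous paragraph (trip after three consecutive failures, reopen only after a successful probe) can be sketched as a small state machine. This is an illustration, not the configuration surface of LiteLLM, Kong, or Portkey; the 30-second cooldown before probing is an assumed value.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,  # assumed wait before probing again
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        """Normal traffic flows only while the circuit is closed."""
        return self.opened_at is None

    def ready_for_probe(self) -> bool:
        """After the cooldown, the gateway may send one health probe."""
        return self.opened_at is not None and self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        """Count consecutive failures and open the circuit at the threshold."""
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and self.opened_at is None:
            self.opened_at = self.clock()

    def record_probe(self, healthy: bool) -> None:
        """Close the circuit on a healthy probe; otherwise restart the cooldown."""
        if healthy:
            self.opened_at = None
            self.consecutive_failures = 0
        else:
            self.opened_at = self.clock()


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
tripped = not breaker.allow_request()
now[0] = 31.0
can_probe = breaker.ready_for_probe()
breaker.record_probe(healthy=True)
recovered = breaker.allow_request()
```

Injecting the clock keeps the breaker testable without sleeping, and gating recovery on an explicit probe prevents full traffic from hitting a provider that has only partly recovered.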

Three-provider failover chain with circuit breaker. The gateway routes to the next provider on 5xx, timeout, or circuit trip.

Which LLM Gateway Should You Use?

Spend level is the clearest decision axis. A July 2025 benchmark from Kong, measuring requests per second at identical CPU allocation, found Kong's data plane processed 859% more RPS than LiteLLM and 228% more than Portkey, with 86% lower p95 latency than LiteLLM and 65% lower than Portkey (Kong Inc., 2025).

Raw throughput isn't the only dimension:

Under ~$10K/month: Start with LiteLLM. It covers 100+ providers, ships six routing strategies, implements semantic caching, and carries the largest OSS community. The throughput ceiling only becomes relevant above ~2,000 RPS — most early-stage products don't approach that.
$10K–$50K/month: LiteLLM if you're self-hosting and have the maintenance bandwidth. Cloudflare AI Gateway wins on latency for geographically distributed users by deploying to 330 global edge nodes without additional operational overhead.
Above $50K/month: Portkey or Kong. At this spend, audit trails, prompt injection guardrails, team-level budget enforcement, and SLA-backed support pay for themselves. LiteLLM's memory behavior under sustained load — 8GB+ RAM with cascading timeouts at 2,000 RPS in the 2025 benchmark — becomes a real production risk.
Enterprise/regulated: Kong AI Gateway, which runs inside your own infrastructure and integrates with existing Kong-based API management. This matters when customer data can't route through a third-party managed service.

The hidden inflection point is $2,000/month in token spend: at that level, a managed gateway's fee (~$110/month) is cheaper than the engineering time required to operate a self-hosted instance (~10 hours/month at typical rates). Run the calculation before defaulting to open-source.
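
The calculation is short enough to write down. The $110 fee and 10 hours per month come from the paragraph above; the $100 hourly rate is an assumed figure, so substitute your own.

```python
def self_hosting_cost_usd(hours_per_month: float, hourly_rate_usd: float) -> float:
    """Monthly engineering cost of operating a self-hosted gateway."""
    return hours_per_month * hourly_rate_usd


managed_fee_usd = 110.0        # from the text
ops_hours_per_month = 10.0     # from the text
assumed_hourly_rate_usd = 100.0  # assumption for illustration

self_hosted = self_hosting_cost_usd(ops_hours_per_month, assumed_hourly_rate_usd)
managed_is_cheaper = managed_fee_usd < self_hosted

# Hourly rate at which the two options cost the same.
break_even_rate_usd = managed_fee_usd / ops_hours_per_month
```

With these inputs the break-even hourly rate is $11, so self-hosting only wins on cost if the operational burden is far below 10 hours per month or the managed fee grows with usage.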

For a framework on evaluating any production AI system, see How to Evaluate Your LLM Agent.

LLM gateway comparison by provider coverage, caching capability, target spend tier, and cost model.

What Should You Instrument at the Gateway Layer?

A gateway without observability is a black box that saves money until it silently breaks in a way you can't diagnose. Four metrics worth instrumenting from day one:

Token cost per request by model. Aggregate this per team, per feature, and per user type. Without this breakdown, a cost spike could come from a routing misconfiguration, a prompt length regression, or a user segment generating unusually expensive requests — and you won't know which.

Cache hit rate by request type. A 30% overall hit rate can mask a 70% hit rate on FAQ traffic and 5% on everything else. Segmenting by request category tells you where to invest in better caching versus where to accept cache misses.

Provider error rates and latency by percentile. Track p50, p95, and p99 per provider. P50 looks fine right until p99 shows a provider degrading — that's when the circuit breaker should have already tripped.

Fallback activation rate. If more than 2% of requests are hitting secondary providers, the primary is struggling. Above 10%, investigate before users notice. Most gateways expose this as a native metric; if yours doesn't, it's a gap worth filing against.
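
A rough sketch of how these four metrics fall out of per-request records follows. The record fields (`model`, `provider`, `request_type`, `cost_usd`, `latency_s`, `cache_hit`, `used_fallback`) are a hypothetical schema, and the percentile uses the simple nearest-rank method rather than whatever your metrics backend implements.

```python
from collections import defaultdict


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100), at least 1
    return ordered[rank - 1]


def summarize(records: list[dict]) -> dict:
    """Aggregate per-request records into cost, hit rate, latency, and fallback rate."""
    cost_by_model: dict[str, float] = defaultdict(float)
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    latency_by_provider: dict[str, list[float]] = defaultdict(list)
    fallbacks = 0
    for r in records:
        cost_by_model[r["model"]] += r["cost_usd"]
        totals[r["request_type"]] += 1
        hits[r["request_type"]] += 1 if r["cache_hit"] else 0
        latency_by_provider[r["provider"]].append(r["latency_s"])
        fallbacks += 1 if r["used_fallback"] else 0
    return {
        "cost_by_model": dict(cost_by_model),
        "cache_hit_rate": {t: hits[t] / totals[t] for t in totals},
        "latency": {
            p: {q: percentile(v, q) for q in (50, 95, 99)}
            for p, v in latency_by_provider.items()
        },
        "fallback_rate": fallbacks / len(records) if records else 0.0,
    }


# Invented sample records for illustration.
records = [
    {"model": "small", "provider": "a", "request_type": "faq", "cost_usd": 0.0,
     "latency_s": 0.005, "cache_hit": True, "used_fallback": False},
    {"model": "small", "provider": "a", "request_type": "faq", "cost_usd": 0.0,
     "latency_s": 0.004, "cache_hit": True, "used_fallback": False},
    {"model": "frontier", "provider": "b", "request_type": "chat", "cost_usd": 0.02,
     "latency_s": 2.5, "cache_hit": False, "used_fallback": False},
    {"model": "frontier", "provider": "b", "request_type": "chat", "cost_usd": 0.02,
     "latency_s": 4.0, "cache_hit": False, "used_fallback": True},
]
summary = summarize(records)
```

Segmenting the hit rate by `request_type` is what exposes the gap between FAQ-style traffic and everything else that a single overall number hides.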

Most gateways export OpenTelemetry spans by default. Pipe them to your existing stack — Datadog, Grafana, Honeycomb — rather than adopting a gateway-specific dashboard that fragments your observability surface.

Frequently Asked Questions

What's the difference between an LLM gateway and an API proxy?

An API proxy forwards requests and handles authentication. An LLM gateway adds a token-level control plane: routing decisions based on cost and latency, semantic caching that matches similar-but-not-identical prompts, and failover chains that recover from provider outages without any error reaching the caller. The distinction is the intelligence sitting above the forwarding layer.

How much can a gateway realistically save?

At 100,000 queries per day, cost-based routing alone yields $150,000+ in annualized savings by routing simple requests to cheaper models (Maxim AI, 2026). Combined with semantic caching (30-40% production hit rate) and prompt caching (50-90% reduction on repeated prefixes), total cost reduction typically lands at 47-80%. The exact number depends on how repetitive and how tier-sortable your request distribution is.

Does adding a gateway increase latency?

For cache hits, the opposite: semantic and exact cache hits return in under 5ms versus 2-5 seconds for live inference. For cache misses, a self-hosted gateway adds 1-5ms of forwarding overhead. Cloudflare's edge deployment eliminates most of that by placing gateway nodes within 50ms of any global user.
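
The net effect on average latency is a weighted sum of the hit and miss paths. The sketch below plugs in midpoints of the ranges quoted in this article (35% hit rate, 3.5 seconds for live inference) and the upper bound of 5ms for forwarding overhead; treating those midpoints as typical is an assumption.

```python
def expected_latency_ms(
    hit_rate: float, hit_ms: float, inference_ms: float, overhead_ms: float
) -> float:
    """Average latency when hits are served from cache and misses pay gateway overhead."""
    miss_ms = inference_ms + overhead_ms
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


without_gateway_ms = 3500.0
with_gateway_ms = expected_latency_ms(
    hit_rate=0.35, hit_ms=5.0, inference_ms=3500.0, overhead_ms=5.0
)
print(f"{with_gateway_ms:.0f} ms vs {without_gateway_ms:.0f} ms")
```

Under these inputs the average drops by roughly a third, and the few milliseconds of overhead on misses are negligible next to the inference time saved on hits.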

When should I avoid semantic caching?

When small differences in user input produce meaningfully different correct outputs — medical advice, legal guidance, personalized financial data. Semantic caching assumes semantically similar queries have interchangeable answers. That assumption breaks for high-stakes personalized responses. Use exact-match caching or no caching for those patterns, and reserve semantic caching for lookup-style and FAQ workloads.

Is LiteLLM production-ready at high traffic?

At moderate traffic under ~2,000 RPS, yes. At sustained high load, the July 2025 benchmark shows LiteLLM consuming 8GB+ RAM with cascading timeouts before the upstream model layer saturates (Kong Inc., 2025). If you're approaching that ceiling, benchmark your workload on all gateway candidates before committing.

A gateway is the control plane your production AI stack is missing. Without it, you're paying frontier model prices for classification tasks, regenerating answers you've already computed, and one provider SLA away from a user-visible outage.

The entry point is lower than most teams assume: LiteLLM deploys in under an hour and covers routing, caching, and failover for free. Start there. Add semantic caching once you have 30 days of prompt logs to analyze your request distribution. Wire up a failover chain before users hit the product. Instrument everything from day one — cost per model, cache hit rate, provider latency at p99, fallback activation rate.

The ceiling is equally high: in Kong's own benchmark, its data plane handled 859% more throughput than LiteLLM, with full audit trails and governance for teams where those things matter. The architecture is the same. What changes is the operational context you drop it into.

For the broader picture of how a gateway fits into a production agent stack, see Context Engineering for AI Agents.

Sources