
LLM Rate Limiting Strategies at Scale — Patterns That Work

May 29, 2026·9 min read

Rate Limiting · Production AI · Backend Engineering · MLOps

A single agent loop hits a 429. Your retry logic fires. So does every other concurrent request. In four minutes you've burned $800 in retried tokens, your P99 latency is 22 seconds, and your on-call engineer is looking at a spike that looks like a DDoS — but it's your own code.

In February 2026, Datadog's State of AI Engineering report found that 60% of all LLM errors in production were rate-limit errors. Not model errors. Not timeouts. Rate limits, a problem that's largely avoidable once you know how to handle it.

This post covers five patterns that work at scale. No theory, no vendor pitches — just the approaches that survive contact with real traffic.

If you're not yet caching prompts, read how to cut LLM API costs with prompt caching and model routing first — it directly affects your effective TPM headroom.

Key Takeaways

In Feb 2026, 60% of all LLM production errors were rate-limit errors (Datadog, State of AI Engineering 2026).

Standard RPM-based rate limiters fail for LLMs. You need token-aware buckets because a 10-token request and a 4,000-token request hit the same RPM counter but consume wildly different quota.

Prompt caching is a throughput multiplier: Anthropic's prompt caching cuts input token costs by up to 90% and latency by up to 85%, and cache hits don't drain your TPM quota the way fresh input does.

Naive exponential backoff without jitter makes the thundering herd worse. Full jitter is not optional.

Why Standard Rate Limiters Break on LLMs

Most backend engineers reach for a request-per-minute counter. It's the right tool for REST APIs. It's the wrong tool for LLMs.

LLM providers don't just cap RPM. They also cap tokens per minute (TPM). A request with a 50-token prompt and a 200-token response barely moves the needle. A request with a 16,000-token prompt and a 4,000-token response consumes 20,000 tokens of your per-minute budget in one shot. Your RPM counter doesn't distinguish between the two.


The gap between providers makes this worse. As of May 2026, OpenAI's Tier 4 allows 30 million input TPM. Anthropic's Tier 4 allows 400,000 input TPM — a 75x difference at the same tier level (DevTk.AI, AI API Rate Limits 2026). If you've load-tested against OpenAI and assumed similar headroom on Anthropic, you're in for a surprise.

Enterprise LLM spend more than doubled, from $3.5B to $8.4B, in six months during 2024–2025 (Menlo Ventures, 2025 Mid-Year LLM Market Update). That spend growth maps directly to traffic growth, and traffic growth is when your rate-limiting strategy either holds or collapses.

Pattern 1: Token-Aware Rate Limiting

Replace your RPM counter with a token bucket that tracks tokens, not requests.

python

import time
from threading import Lock

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float):
        # capacity = max tokens, refill_rate = tokens per second
        self.capacity = capacity
        self.tokens = float(capacity)  # start full; becomes fractional after refills
        self.refill_rate = refill_rate
        self.last_refill = time.monotonic()
        self._lock = Lock()

    def consume(self, tokens: int) -> bool:
        with self._lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False  # reject or queue

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now

_refill() calculates tokens earned since the last call: elapsed × refill_rate. The min(capacity, ...) is the safety valve. Without it, a bucket that sits idle overnight keeps accumulating tokens, and the morning's first burst would be admitted in full even though it far exceeds what the provider allows in a single minute. consume() takes the lock first, then refills, then checks: that order matters because refilling after checking would reject requests that could have been served.

Before submitting a request, estimate the token cost. Use tiktoken for OpenAI models or the Anthropic SDK's messages.count_tokens() for Claude models. Reject or queue requests that would overdraw the bucket.

Two buckets work better than one: a fast bucket (per-second) that prevents burst spikes, and a slow bucket (per-minute) that enforces the provider's TPM ceiling. A request must pass both. This matches how providers actually throttle — they have burst headroom but a hard minute-level ceiling.
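A minimal sketch of the two-bucket idea, assuming a fast bucket that refills its full burst capacity every second and a slow bucket that refills the TPM limit evenly over 60 seconds. The class name, the parameter names, and the injectable clock (there only to make the behavior easy to test) are illustrative, not taken from any provider SDK.

```python
import time
from threading import Lock
from typing import Callable


class DualTokenBucket:
    """Admit a request only if both the burst window and the minute window have room."""

    def __init__(self, burst_capacity: float, tpm_limit: float,
                 clock: Callable[[], float] = time.monotonic):
        # Hypothetical parameters: burst_capacity caps short spikes,
        # tpm_limit mirrors the provider's tokens-per-minute ceiling.
        self.caps = {"fast": burst_capacity, "slow": tpm_limit}
        # Refill rates in tokens per second.
        self.rates = {"fast": burst_capacity, "slow": tpm_limit / 60.0}
        self.levels = dict(self.caps)  # both buckets start full
        self._clock = clock
        self._last_refill = clock()
        self._lock = Lock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last_refill
        for name in self.levels:
            self.levels[name] = min(
                self.caps[name], self.levels[name] + elapsed * self.rates[name]
            )
        self._last_refill = now

    def consume(self, tokens: float) -> bool:
        with self._lock:
            self._refill()
            # Check both buckets before deducting from either, so a request
            # rejected by one bucket never leaks tokens from the other.
            if any(level < tokens for level in self.levels.values()):
                return False
            for name in self.levels:
                self.levels[name] -= tokens
            return True
```

Checking both levels before deducting is the important design choice here: chaining two independent consume() calls would spend burst tokens on requests that the minute-level bucket then rejects.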

Pattern 2: Priority-Lane Queuing for Multi-Tenant Systems

If you're serving multiple tenants or request types, a flat queue is fair but wrong. Interactive user requests and background batch jobs should not compete equally for capacity.

python

import heapq
import asyncio
from dataclasses import dataclass, field
from enum import IntEnum

class Priority(IntEnum):
    INTERACTIVE = 0   # user-facing, low latency required
    STANDARD    = 1   # internal, moderate SLO
    BATCH       = 2   # background jobs, best-effort

@dataclass(order=True)
class QueueItem:
    priority: Priority
    seq: int  # tie-breaker: keeps FIFO order within a priority lane
    request: object = field(compare=False)

class PriorityLLMQueue:
    def __init__(self):
        self._queue = []
        self._seq = 0
        self._ready = asyncio.Event()

    async def put(self, request, priority: Priority):
        heapq.heappush(self._queue, QueueItem(priority, self._seq, request))
        self._seq += 1
        self._ready.set()

    async def get(self) -> object:
        while not self._queue:
            self._ready.clear()
            await self._ready.wait()
        return heapq.heappop(self._queue).request

heapq is Python's built-in min-heap: heappush inserts by priority value, heappop always removes the lowest value first. Since INTERACTIVE = 0, it always wins over STANDARD = 1 and BATCH = 2. The asyncio.Event lets get() yield the event loop when the queue is empty instead of spinning — _ready.wait() suspends until something calls _ready.set() in put().

A 2026 paper on multi-tenant inference platforms (arxiv:2603.00356, Cunningham et al.) showed that token-pool admission control with priority scheduling maintained sub-1.2 second P99 TTFT during overload conditions. Without it, P99 degraded to 19+ seconds under the same load. The difference isn't the hardware — it's the scheduler.

In practice, the failure mode is always the same: batch jobs flood the queue during off-hours, a morning traffic spike hits, and interactive users get 15-second latencies. Separate queues with strict priority preemption fix this before it becomes a page.
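To make the combination of priority scheduling and token-based admission concrete, here is a small sketch. It is an illustration only, not the algorithm from the paper cited above: the admit function, its tuple layout, and the request IDs are all hypothetical. It admits pending requests in strict priority order until a token budget for the current window runs out.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    INTERACTIVE = 0
    STANDARD = 1
    BATCH = 2


def admit(pending, token_budget):
    """Admit queued requests in strict priority order until the budget runs out.

    pending: iterable of (priority, estimated_tokens, request_id) tuples.
    Returns (admitted_ids, deferred_ids).
    """
    counter = itertools.count()  # tie-breaker keeps FIFO order within a lane
    heap = [(prio, next(counter), tokens, rid) for prio, tokens, rid in pending]
    heapq.heapify(heap)

    admitted, deferred = [], []
    while heap:
        _prio, _seq, tokens, rid = heapq.heappop(heap)
        if tokens <= token_budget:
            token_budget -= tokens
            admitted.append(rid)
        else:
            # Does not fit in this window; it waits for the next refill.
            deferred.append(rid)
    return admitted, deferred


pending = [
    (Priority.BATCH, 6000, "nightly-summary"),
    (Priority.INTERACTIVE, 1500, "chat-1"),
    (Priority.STANDARD, 3000, "report"),
    (Priority.INTERACTIVE, 2000, "chat-2"),
]
admitted, deferred = admit(pending, token_budget=5000)
```

With a 5,000-token window, both interactive requests are admitted and the standard and batch requests are deferred, even though the batch job was queued first.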

For the quality side of the equation, see how to evaluate your LLM agent without lying to yourself.

Pattern 3: Backoff With Full Jitter (Not Exponential Alone)

When you hit a 429, naive exponential backoff looks like this: retry after 1s, 2s, 4s, 8s. It feels right. It's actually dangerous at scale.

If 500 concurrent requests all hit a rate limit at the same moment, naive exponential backoff retries all 500 at roughly the same time one second later. Then again two seconds after that. You've turned one spike into a recurring thundering herd that hammers the provider in lockstep.

Full jitter breaks the synchronization:

python

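import random
import asyncio

# RateLimitError is whatever your provider SDK raises on HTTP 429,
# e.g. `from openai import RateLimitError` or `from anthropic import RateLimitError`.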

async def call_with_backoff(fn, max_retries=6, base_delay=1.0, cap=60.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Full jitter: uniform random between 0 and capped exponential
            sleep = random.uniform(0, min(cap, base_delay * (2 ** attempt)))
            await asyncio.sleep(sleep)

base_delay * (2 ** attempt) doubles the max-wait ceiling each retry: attempt 0 → ceiling 1s, attempt 1 → 2s, attempt 4 → 16s. min(cap, ...) stops that growth. Without it, attempt 9 would calculate a 512s ceiling, meaning one unlucky request could hang for nearly 9 minutes. With cap=60, the ceiling stops growing at 60s regardless of retry count. random.uniform(0, ceiling) then picks any value in that range, so 500 simultaneous failures scatter their retries across the window instead of all retrying after 1s, then 2s, then 4s in lockstep, which is exactly the thundering herd you're trying to prevent.
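The ceiling arithmetic is easy to check in isolation. The sketch below pulls the same formula into two helper functions (the names backoff_ceiling and full_jitter_delay are mine, not part of any library) and simulates 500 clients choosing their first retry delay. The random generator is seeded only so the sketch is reproducible.

```python
import random


def backoff_ceiling(attempt: int, base_delay: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound of the full-jitter sleep window for a given retry attempt."""
    return min(cap, base_delay * (2 ** attempt))


def full_jitter_delay(attempt: int, rng: random.Random) -> float:
    """Pick a uniformly random delay between 0 and the capped ceiling."""
    return rng.uniform(0, backoff_ceiling(attempt))


# Ceilings for attempts 0..9 with the defaults: 1, 2, 4, 8, 16, 32, then 60 from there on.
ceilings = [backoff_ceiling(a) for a in range(10)]

# 500 clients failing at once: naive backoff retries all of them at t = 1s,
# while full jitter spreads the first retry across the whole 0-1s window.
rng = random.Random(42)
first_retries = [full_jitter_delay(0, rng) for _ in range(500)]
spread = max(first_retries) - min(first_retries)
```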

Pattern 4: Model Fallback Routing on Sustained Rate Limits

A single retry after a 429 is fine. Retrying the same model five times is not. After 2–3 consecutive 429s, route to a different model.

python

import logging

log = logging.getLogger(__name__)

MODEL_PRIORITY = [
    "claude-opus-4-5",        # primary
    "claude-sonnet-4-5",      # fallback 1: cheaper, higher TPM
    "gpt-4o-mini",            # fallback 2: cross-provider
]

# llm_call is your own provider-agnostic wrapper; RateLimitError is the
# exception it raises on HTTP 429.
async def call_with_fallback(messages, **kwargs):
    for model in MODEL_PRIORITY:
        try:
            return await llm_call(model=model, messages=messages, **kwargs)
        except RateLimitError:
            if model == MODEL_PRIORITY[-1]:
                raise
            log.warning("Rate limited on %s, trying next", model)
            continue

The list is tried in order. On RateLimitError, continue moves to the next model. The if model == MODEL_PRIORITY[-1]: raise re-raises on the last fallback so the caller gets an explicit exception rather than a silent None. Note that this version moves on after the first 429 from each model; to allow a couple of retries per model first, wrap each attempt in the backoff helper from Pattern 3. One thing to check before adding fallbacks: verify that gpt-4o-mini and your primary model produce equivalent output for your specific task. A silent quality downgrade is harder to debug in production than an explicit RateLimitError.
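The rule of thumb at the top of this section is to switch models after 2–3 consecutive 429s, not after the first one. Here is a minimal sketch of that threshold, assuming the caller passes in its own llm_call coroutine. The RateLimitError class is a local stand-in for the SDK exception, and max_429s_per_model is a hypothetical parameter name.

```python
import asyncio


class RateLimitError(Exception):
    """Stand-in for the provider SDK's HTTP 429 exception."""


async def call_with_threshold_fallback(llm_call, models, messages, max_429s_per_model=2):
    """Try each model up to max_429s_per_model times before moving down the list."""
    last_error = None
    for model in models:
        for _ in range(max_429s_per_model):
            try:
                return await llm_call(model=model, messages=messages)
            except RateLimitError as err:
                last_error = err
                # A production version would sleep with full jitter here
                # before the next attempt on the same model.
    # Every model exhausted its attempts: surface the last 429 to the caller.
    raise last_error or RateLimitError("no models configured")
```

Passing llm_call in as an argument keeps the routing logic independent of any one SDK, which also makes it straightforward to exercise with a fake client in tests.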

As of 2026, 40% of production LLM teams have multi-provider routing in place, up from 23% ten months earlier (TianPan.co, March 2026). The jump was driven by multi-hour provider outages — but rate limit diversity is equally valuable.

The architectural implication of that 75x TPM gap between OpenAI Tier 4 and Anthropic Tier 4 is this: if you're under sustained rate pressure on Anthropic, routing to OpenAI isn't a quality degradation — it's relief valve access to 75x more headroom.


Pattern 5: Prompt Caching as a Throughput Multiplier

This one's underused. Cache hits don't count against your TPM quota the same way fresh requests do — and for workloads with shared system prompts or repeated context, the throughput impact is substantial.

In January 2026, the paper Don't Break the Cache (arxiv:2601.06007, Shi et al.) reported that 31% of LLM queries exhibit semantic similarity across sessions, meaning nearly a third of queries are potential cache candidates. Their experiments showed caching cuts costs by 41–80% and TTFT by 13–31% for long-horizon tasks.

Provider-level numbers are sharper. Anthropic's prompt caching (in public beta since August 2024) delivers up to 90% cost reduction and 85% latency reduction for long-system-prompt workloads (Anthropic, prompt caching docs). Cache reads cost $0.30/M tokens vs $3.00/M for fresh processing.

python

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Anthropic: mark the static portion of your prompt as cacheable
response = client.messages.create(
    model="claude-opus-4-5",
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # 2,000+ tokens of static instructions
            "cache_control": {"type": "ephemeral"}  # <-- cache this block
        }
    ],
    messages=conversation_history,
    max_tokens=1024,
)

"cache_control": {"type": "ephemeral"} marks the block for a server-side cache with a roughly 5-minute lifetime. The first request pays the full input rate plus a small cache-write premium; subsequent requests in that window pay the cache read rate, one tenth of the base input price. The $3.00/M and $0.30/M figures above correspond to Sonnet-class pricing, so check the current rates for the model you actually call. The block must also meet a minimum length to qualify (1,024 tokens on many models, higher on some). Shorter prompts are processed fresh without any error or warning, so if you're not seeing cache savings, check your token count first.

The effective throughput math: if your system prompt is 4,000 tokens and you make 1,000 requests/hour with a 70% cache hit rate, only 30% of those prompt tokens are processed fresh: 1,200,000 instead of 4,000,000. That's a 70% reduction in effective TPM consumption for the system prompt, which pushes your actual rate-limit ceiling well above what your provider tier implies.

Most teams treat caching and rate limiting as separate concerns. They aren't. Your effective TPM ceiling is (provider TPM limit) / (1 - cache_hit_rate). At 70% hit rate and a 400K TPM Anthropic limit, your effective ceiling is 1.33M TPM-equivalent throughput. Get the cache working before you beg for a higher tier.
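The ceiling formula above can be sketched in a few lines. It assumes, as the post does, that cache reads are exempt from the TPM counter. The cacheable_fraction parameter is my own addition: only the cached prefix of a request benefits, so it lets you model requests where part of the input is always fresh. At the default of 1.0 it reduces to the formula in the text.

```python
def effective_tpm_ceiling(provider_tpm: float, cache_hit_rate: float,
                          cacheable_fraction: float = 1.0) -> float:
    """Throughput in TPM-equivalents once cache hits stop consuming quota.

    cache_hit_rate: share of requests whose cached prefix is served from cache.
    cacheable_fraction: share of each request's input tokens inside that prefix
    (hypothetical refinement; 1.0 reproduces limit / (1 - hit_rate)).
    """
    if not (0.0 <= cache_hit_rate <= 1.0 and 0.0 <= cacheable_fraction <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    saved = cache_hit_rate * cacheable_fraction
    if saved >= 1.0:
        raise ValueError("every token is a cache hit; the ceiling is unbounded")
    return provider_tpm / (1.0 - saved)


# The example from the text: 400K TPM limit, 70% hit rate -> about 1.33M.
whole_prompt_cached = effective_tpm_ceiling(400_000, 0.70)

# If only 80% of each request's input is the cached prefix -> about 909K.
partly_cached = effective_tpm_ceiling(400_000, 0.70, cacheable_fraction=0.8)
```

The second number is the more realistic one for chat workloads, where the conversation history grows and is never fully cacheable.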

The cache hit rate metric belongs on your LLM observability dashboard — track it alongside token cost per session.

What About Denial-of-Wallet?

If you're building a multi-tenant platform, rate limiting isn't just about protecting provider quotas. It's about protecting your bill.

A single runaway agent loop — misconfigured max iterations, no output token cap, recursive tool calls — can burn $500 in minutes. Add per-tenant cost budgets alongside TPM limits:

python

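async def check_tenant_budget(tenant_id: str, estimated_cost: float) -> bool:
    key = f"spend:{tenant_id}:{today()}"  # build once so every call hits the same day's key
    limit = await get_tenant_daily_limit(tenant_id)  # e.g. $10/day
    # Reserve first, then check: INCRBYFLOAT returns the new total atomically,
    # so two concurrent requests cannot both pass against a stale read.
    new_total = float(await redis.incrbyfloat(key, estimated_cost))
    await redis.expire(key, 86400)
    if new_total > limit:
        await redis.incrbyfloat(key, -estimated_cost)  # release the reservation
        raise BudgetExceededError(f"Tenant {tenant_id} daily limit reached")
    return True

redis.incrbyfloat is atomic and returns the new total, so the function reserves the spend first and checks afterwards. A separate read followed by an increment would let two concurrent requests both pass the check against the same stale total. If the reservation pushes the tenant over the limit, it is released before raising. The date in the key is what resets the counter each day; redis.expire(key, 86400) just cleans up old keys without a cron job. To estimate estimated_cost before the call: multiply your input token estimate by the model's input price per token, add expected output tokens multiplied by the output price, and buffer by 20%. Models rarely hit exact max_tokens, but output variance is real.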
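The cost estimate described above is simple arithmetic. A sketch follows, with prices expressed per million tokens; the function name and the example prices are placeholders, so substitute your provider's current rates.

```python
def estimate_cost_usd(input_tokens: int, expected_output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float,
                      buffer: float = 0.20) -> float:
    """Pre-call cost estimate: input cost plus expected output cost, plus a safety buffer."""
    base = (
        input_tokens * input_price_per_m
        + expected_output_tokens * output_price_per_m
    ) / 1_000_000
    return base * (1.0 + buffer)


# Placeholder prices: $3.00/M input, $15.00/M output.
# 4,000 input tokens and 1,024 expected output tokens come to roughly $0.033.
estimated_cost = estimate_cost_usd(4_000, 1_024, 3.00, 15.00)
```

Output tokens usually cost several times more than input tokens, so leaving them out of the estimate tends to understate spend badly for generation-heavy workloads.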

This isn't exotic. Gartner noted in March 2026 that agentic models consume 5–30x more tokens per task than standard chatbot interactions (Gartner, LLM Inference Cost Forecast). Per-token costs are falling fast, but the bill for a runaway agentic loop scales with tokens consumed, so cheaper tokens alone won't contain it.

Frequently Asked Questions

Should I rate-limit at the gateway or the application layer?

Both — for different reasons. The gateway enforces tenant quotas and provider-level budgets before requests hit your LLM client. The application layer handles retry logic, fallback routing, and request-level token estimation. Doing only one means either your tenants can bypass limits or your gateway doesn't have enough context to make good decisions.

What's the right queue depth before I start shedding load?

Calculate it from your latency SLO. If your SLO is 2-second P95 TTFT and average processing time is 1.2 seconds, your max queue depth is roughly (2.0 - 1.2) / 1.2 × concurrency. Beyond that, returning a 503 immediately is better than queuing — a request that waits 4 seconds and then succeeds is worse UX than an instant "try again in 10 seconds."
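As a worked example of that formula, here is a small helper. The function name is mine, and the concurrency of 32 is an arbitrary illustration: it stands for how many requests you process in parallel.

```python
def max_queue_depth(slo_seconds: float, avg_processing_seconds: float,
                    concurrency: int) -> int:
    """Queue depth beyond which a new request would likely miss the latency SLO."""
    wait_budget = slo_seconds - avg_processing_seconds
    if wait_budget <= 0:
        return 0  # no slack at all: shed load instead of queuing
    # Each worker slot clears one request per avg_processing_seconds, so the
    # queue may hold wait_budget / avg_processing_seconds requests per slot.
    return int(wait_budget / avg_processing_seconds * concurrency)


# 2.0s SLO, 1.2s average processing, 32 concurrent slots:
# (2.0 - 1.2) / 1.2 * 32 is about 21.3, so shed load beyond 21 queued requests.
depth = max_queue_depth(2.0, 1.2, 32)
```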

How do I handle rate limits from streaming responses mid-stream?

You can't retry a streaming response mid-flight. Design your streaming architecture to detect the 429 at connection time (before the stream starts) and fail fast. If the provider reports a rate-limit or overload error after the stream has started (rare, but it happens), close the connection, log the partial response, and retry from the beginning with the same prompt, not a continuation.
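A sketch of the restart-from-scratch rule, under two assumptions: open_stream is a hypothetical zero-argument callable that starts a fresh request with the original prompt and returns an async iterator of text chunks, and RateLimitError is a local stand-in for the SDK exception. This version buffers the output rather than forwarding chunks to the user, which keeps the example short; a user-facing stream would also need to tell the client to discard what it has already rendered.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""


async def collect_stream_with_restart(open_stream, max_restarts: int = 1) -> str:
    """Consume a text stream; on a mid-stream rate limit, discard and start over."""
    for attempt in range(max_restarts + 1):
        partial = []
        try:
            async for chunk in open_stream():
                partial.append(chunk)
            return "".join(partial)
        except RateLimitError:
            # Keep the partial output for debugging, but never resume from it.
            log.warning("stream rate-limited after %d chunks", len(partial))
            if attempt == max_restarts:
                raise
    raise RuntimeError("max_restarts must be >= 0")
```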

Is token-aware rate limiting worth the complexity for small-scale apps?

Not until you have sustained multi-user traffic or long-context workloads (16K+ tokens per request). Below that, RPM counting is fine. The token-aware bucket earns its complexity when your request variance is high — mixed short/long prompts, agentic chains with variable tool outputs, RAG with unpredictable retrieved context lengths.

The Pattern Stack

These five patterns work in layers, not isolation:

Token-aware bucket — correct unit of measurement
Priority-lane queue — isolate interactive from batch
Full-jitter backoff — break thundering herd
Model fallback routing — escape rate limits via capacity diversity
Prompt caching — shrink effective TPM consumption before hitting limits

The teams that get this right don't treat rate limiting as an afterthought added when the first 429s appear. They build the token bucket first, wire in caching before launch, and add fallback routing after their first provider incident. That ordering matters.

For the broader production picture, see building AI agents: the engineer's guide.

Sources:
Datadog, State of AI Engineering 2026, datadoghq.com
Menlo Ventures, 2025 Mid-Year LLM Market Update, menlovc.com
DevTk.AI, AI API Rate Limits 2026, devtk.ai
Anthropic, Prompt Caching docs, platform.claude.com
OpenAI, Prompt Caching, openai.com
Cunningham et al., arxiv:2603.00356, March 2026
Shi et al., arxiv:2601.06007, January 2026
TianPan.co, LLM API Resilience in Production, March 2026
Gartner, LLM Inference Cost Forecast, March 2026
