Rate limiting for LLM-backed APIs is not a nice-to-have. One runaway script or malicious user can rack up a five-figure bill in hours. One enterprise customer triggering a batch pipeline without coordination can starve your other users. The patterns are well understood from classic API rate limiting, with LLM-specific twists around token-level costs and tenant economics. This post describes the layered approach we deploy for SaaS clients running multi-tenant AI.
Three bucket shapes
Token bucket: refill N tokens per second up to a capacity; allow bursts up to the full bucket. Good for UX where users occasionally send multiple quick requests. Users hit the limit only under sustained load.
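A minimal in-memory sketch of the token bucket described above. The class name, the `cost` parameter, and the injectable `clock` are illustrative assumptions, not part of any particular library.

```python
import time


class TokenBucket:
    """Token bucket: refills continuously and allows bursts up to capacity."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so tests can control time
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Credit the tokens earned since the last call, capped at capacity.
        earned = (now - self.last) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=3`, three back-to-back requests pass and the fourth is rejected until time has refilled the bucket.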
Leaky bucket: constant drain rate regardless of arrival; no bursts above the rate. Strict throttling. Better for systems where predictable upstream load matters more than UX.
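One way to sketch a leaky bucket is as a scheduler: each request is assigned the next free slot at the drain rate, and requests that would wait too long are rejected. The `queue_size` parameter and the `schedule` method are assumptions made for this illustration.

```python
import time


class LeakyBucket:
    """Leaky bucket: requests leave at a fixed rate, however they arrive."""

    def __init__(self, rate, queue_size, clock=time.monotonic):
        self.interval = 1.0 / rate                   # seconds between forwarded requests
        self.max_wait = queue_size * self.interval   # longest tolerated queueing delay
        self.clock = clock
        self.next_slot = clock()

    def schedule(self):
        """Return seconds to wait before forwarding, or None if the queue is full."""
        now = self.clock()
        slot = max(now, self.next_slot)
        wait = slot - now
        if wait > self.max_wait:
            return None
        # Reserve this slot; the next request gets the one after it.
        self.next_slot = slot + self.interval
        return wait
```

The caller sleeps for the returned delay before calling upstream, so the upstream never sees more than `rate` requests per second.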
Sliding window: at most N requests in any rolling time window (e.g., the last 60 seconds). No refill spikes at window boundaries and no burst allowance beyond N. The fairest UX, but slightly more complex to implement efficiently.
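The simplest sliding-window variant keeps a log of accepted timestamps and drops the ones that have aged out. The sketch below assumes a single process and one limiter per key; the names are illustrative.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests in any rolling window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.events = deque()   # timestamps of accepted requests, oldest first

    def allow(self):
        now = self.clock()
        cutoff = now - self.window
        # Drop timestamps that have fallen out of the rolling window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

Memory grows with the limit, since every accepted request in the window is stored. That storage cost is the main reason this shape is harder to implement efficiently.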
Layered tenant-aware limits
Production systems need multiple layers of limits, checked in order with first-hit-wins:
Layer 1 — Per-tenant global cap. Matches contract terms and pricing tiers. Team plan: 100k tokens/day. Enforced regardless of which users within the tenant are driving traffic.
Layer 2 — Per-user cap within tenant. Prevents one user from consuming the tenant's entire allocation. Fair sharing across the team.
Layer 3 — Per-endpoint cap. Some endpoints are much more expensive. A vision-LLM endpoint might need tighter limits than a text classification endpoint. Protect expensive endpoints from disproportionate consumption.
Layer 4 — Global emergency brake. System-wide cap for catastrophic scenarios (bug causing infinite loop of requests, DDoS). Explicit kill-switch that disables AI endpoints entirely for a period.
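A sketch of the four layers checked in the order listed, with first-hit-wins. Quotas here are plain in-memory counters, the per-endpoint quota is assumed to be scoped per tenant, and all names are hypothetical. Usage is charged only when every layer passes, so a rejection at one layer does not consume another layer's budget.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    limit: int
    used: int = 0

    def has_room(self, cost):
        return self.used + cost <= self.limit


class LayeredLimiter:
    def __init__(self):
        self.kill_switch = False   # Layer 4: global emergency brake
        self.quotas = {}           # (layer, key) -> Quota

    def set_quota(self, layer, key, limit):
        self.quotas[(layer, key)] = Quota(limit)

    def check(self, tenant, user, endpoint, cost):
        """Return the name of the first layer that rejects, or None if allowed."""
        keys = [
            ("tenant", tenant),                 # Layer 1
            ("user", (tenant, user)),           # Layer 2
            ("endpoint", (tenant, endpoint)),   # Layer 3 (assumed per-tenant scope)
        ]
        applicable = [(layer, self.quotas[(layer, key)])
                      for layer, key in keys if (layer, key) in self.quotas]
        for layer, quota in applicable:
            if not quota.has_room(cost):
                return layer
        if self.kill_switch:
            return "global"
        # Every layer passed: charge all of them together.
        for _, quota in applicable:
            quota.used += cost
        return None
```

Returning the layer name lets the caller tell the user which limit they hit, for example a per-user cap versus the tenant's daily allocation.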
Tokens vs requests
Traditional rate limiting caps requests per unit time. LLM systems should cap tokens (or effective cost) because requests vary wildly in cost. One document-summary request could be 30,000 input tokens; one chat turn could be 200. A request-level cap counts both as one: it lets the expensive request through and throttles the cheap ones.
Hybrid: enforce a request limit AND a token limit. The request limit catches abuse (bot spam). The token limit catches cost (bulk processing). Each trips independently.
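A sketch of the hybrid check, using a fixed window for brevity (any of the bucket shapes would work in its place). The class and the returned reason strings are assumptions for illustration.

```python
import time


class HybridLimiter:
    """Separate request and token budgets over the same fixed window."""

    def __init__(self, max_requests, max_tokens, window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window
        self.clock = clock
        self.window_start = clock()
        self.requests = 0
        self.tokens = 0

    def check(self, token_cost):
        """Return None if allowed, else the name of the limit that tripped."""
        now = self.clock()
        if now - self.window_start >= self.window:
            # Start a fresh window with both budgets reset.
            self.window_start, self.requests, self.tokens = now, 0, 0
        if self.requests + 1 > self.max_requests:
            return "request_limit"
        if self.tokens + token_cost > self.max_tokens:
            return "token_limit"
        self.requests += 1
        self.tokens += token_cost
        return None
```

With `max_requests=3` and `max_tokens=30000`, a 200-token chat turn followed by a 30,000-token summary trips the token limit, while a burst of small requests trips the request limit.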
Surfacing limits to users
Return a 429 response with a Retry-After header (standard HTTP). Additional headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset (a widely used convention rather than a formal standard). These let client apps back off gracefully.
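A framework-agnostic sketch that builds the status code and headers. It assumes X-RateLimit-Reset carries a Unix timestamp in seconds; conventions differ between APIs, so document whichever you choose. The function name and parameters are hypothetical.

```python
import math


def rate_limit_response(allowed, limit, remaining, reset_epoch, now_epoch):
    """Return (status_code, headers) describing the caller's rate-limit state."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),   # assumption: Unix seconds
    }
    if allowed:
        return 200, headers
    # Retry-After is a delay in whole seconds; round up so clients never retry early.
    headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now_epoch)))
    return 429, headers
```

Sending the X-RateLimit headers on successful responses as well lets clients slow down before they hit the limit.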
In UI: communicate limits clearly. Users who understand their usage don't experience limits as surprises.
Edge cases
Streaming responses: the token count is only known after generation. Pre-count input tokens to pre-check before making the expensive call; post-count output tokens after streaming completes and update usage.
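The pre-check and post-count flow can be sketched as a generator wrapper. `TokenUsage`, `stream_with_accounting`, and the `count_tokens` callback are hypothetical names; in practice `count_tokens` would be your model's tokenizer.

```python
class TokenUsage:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def precheck(self, input_tokens):
        """Charge input tokens up front; refuse if they would exceed the limit."""
        if self.used + input_tokens > self.limit:
            return False
        self.used += input_tokens
        return True

    def record_output(self, output_tokens):
        self.used += output_tokens


def stream_with_accounting(usage, input_tokens, chunks, count_tokens):
    """Yield chunks to the client, then charge output tokens when the stream ends."""
    if not usage.precheck(input_tokens):
        raise PermissionError("token limit exceeded")
    produced = 0
    try:
        for chunk in chunks:
            produced += count_tokens(chunk)
            yield chunk
    finally:
        # Also runs if the consumer closes the generator early.
        usage.record_output(produced)
```

Because output is charged after the fact, usage can overshoot the limit by up to one response. A stricter variation reserves the request's maximum output tokens during the pre-check and refunds the unused part afterwards.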
Retry storms: a user hitting 429 should back off, not retry immediately. Server-side: send Retry-After with increasing values on repeated hits, so that a misbehaving client is pushed into backing off.
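A sketch of escalating Retry-After values per client. Doubling on each repeated hit, the cap, and the forgiveness period are all assumptions; the only requirement from the text above is that values increase on repeated hits.

```python
import time


class BackoffTracker:
    """Tracks repeated 429s per client and returns a growing Retry-After value."""

    def __init__(self, base=1, cap=300, forgive_after=600, clock=time.monotonic):
        self.base = base                     # first Retry-After, in seconds
        self.cap = cap                       # upper bound on Retry-After
        self.forgive_after = forgive_after   # quiet period that resets the count
        self.clock = clock
        self.strikes = {}                    # client_id -> (count, last_hit_time)

    def retry_after(self, client_id):
        now = self.clock()
        count, last = self.strikes.get(client_id, (0, now))
        if now - last > self.forgive_after:
            count = 0   # the client stayed quiet long enough; start over
        count += 1
        self.strikes[client_id] = (count, now)
        # Double the delay on each consecutive hit, up to the cap.
        return min(self.cap, self.base * 2 ** (count - 1))
```

Call `retry_after` each time you are about to send a 429 and put the result in the Retry-After header.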
Burst allowance: users sometimes have legitimate reasons for bursts (uploading a batch of docs). Consider allowing a small credit above the sustained rate, which replenishes only while the user stays below that rate.
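One possible reading of this idea, sketched with explicit windows: overage in a window spends credit, and unused headroom earns it back. The class, its parameters, and the assumption that `close_window` is called by a timer at each window boundary are illustrative.

```python
class BurstCredit:
    """Sustained per-window budget plus a small credit pool for bursts."""

    def __init__(self, sustained_per_window, max_credit):
        self.sustained = sustained_per_window
        self.max_credit = max_credit
        self.credit = max_credit
        self.used = 0   # usage in the current window

    def allow(self, cost=1):
        if self.used + cost > self.sustained + self.credit:
            return False
        self.used += cost
        return True

    def close_window(self):
        """Call at each window boundary (assumed to be driven by a timer)."""
        overage = self.used - self.sustained
        # Positive overage spends credit; unused headroom earns it back.
        self.credit = max(0, min(self.max_credit, self.credit - overage))
        self.used = 0
```

With a sustained budget of 10 and a credit of 5, a user can spend 15 in one window, but must then spend below 10 in later windows before another burst is possible.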
Implementation options
A Redis-backed distributed token bucket is the most common choice. Gateway middleware (Kong, Envoy, or application-layer middleware) implements the check. AI gateways (Portkey, Helicone, LiteLLM) include rate limiting. See the AI gateway post.
For multi-tenancy specifics, see multi-tenancy post.