Rate limiting for LLM-backed APIs is not a nice-to-have. One runaway script or malicious user can rack up five-figure bills in hours. One enterprise customer triggering a batch pipeline without coordination can starve your other users. The patterns are well understood from classic API rate limiting, with LLM-specific twists around token-level costs and tenant economics. This post describes the layered approach we deploy for SaaS clients running multi-tenant AI.

Rate limiting patterns

Token bucket for bursty UX, leaky bucket for strict throttling, sliding window for fair sharing. Tenant-aware layering stacks multiple limits in production.

Three bucket shapes

Token bucket: refill N tokens per second up to a capacity; allow bursts up to the full bucket. Good for UX where users occasionally send several quick requests. Users hit the limit only when their sustained rate exceeds the refill rate.
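A minimal in-memory sketch of the token bucket described above. The class name, the `cost` parameter, and the injectable `now` argument are illustrative assumptions, not part of any particular library; a production version would keep this state in a shared store.

```python
import time


class TokenBucket:
    """In-memory token bucket (illustrative sketch, single process only)."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full so the first burst is allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1.0` and `capacity=3`, three back-to-back requests pass, the fourth is rejected, and one more request is admitted for each second that elapses.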

Leaky bucket: requests drain at a constant rate regardless of how fast they arrive; no bursts above that rate. Strict throttling. Better for systems where predictable upstream load matters more than UX.
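One way to sketch the leaky bucket is as a pacer: each admitted request is given a forwarding slot, and slots are spaced at a fixed interval. The `queue_size` parameter and the `schedule` method are assumptions made for this example; other formulations (a drained counter, for instance) are equally valid.

```python
import time


class LeakyBucket:
    """Leaky bucket as a pacing queue (illustrative sketch)."""

    def __init__(self, rate, queue_size):
        self.interval = 1.0 / rate                    # seconds between forwarded requests
        self.max_wait = queue_size * self.interval    # longest delay we will queue for
        self.next_free = 0.0                          # earliest time the next slot opens

    def schedule(self, now=None):
        """Return seconds to wait before forwarding, or None if the queue is full."""
        now = time.monotonic() if now is None else now
        start = max(now, self.next_free)
        wait = start - now
        if wait > self.max_wait:
            return None          # queue full: reject rather than delay further
        self.next_free = start + self.interval
        return wait
```

The upstream never sees more than `rate` requests per second, no matter how bursty the arrivals are; the cost is added latency for queued requests.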

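Sliding window: allow N requests within a rolling time window (e.g., the last 60 seconds). Because the window moves with each request, there is no reset boundary where a client can double up, and no burst credit beyond N. The fairest UX, but slightly more complex to implement efficiently.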
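A simple way to sketch a sliding window is a log of recent request timestamps. This version is in-memory and stores one entry per admitted request, so it is an illustration of the idea rather than the efficient implementation the paragraph alludes to; the class and parameter names are assumptions.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests in any `window_seconds` span."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()   # timestamps of admitted requests, oldest first

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        cutoff = now - self.window
        # Drop timestamps that have aged out of the rolling window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```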

Layered tenant-aware limits

Production systems need multiple layers of limits, checked in order; the first limit that would be exceeded rejects the request:

Layer 1 — Per-tenant global cap. Matches contract terms and pricing tiers. Team plan: 100k tokens/day. Enforced regardless of which users within the tenant are driving traffic.

Layer 2 — Per-user cap within tenant. Prevents one user from consuming the tenant's entire allocation. Fair sharing across the team.

Layer 3 — Per-endpoint cap. Some endpoints are much more expensive. A vision-LLM endpoint might need tighter limits than a text classification endpoint. Protect expensive endpoints from disproportionate consumption.

Layer 4 — Global emergency brake. System-wide cap for catastrophic scenarios (bug causing infinite loop of requests, DDoS). Explicit kill-switch that disables AI endpoints entirely for a period.
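The four layers can be sketched as an ordered list of counters. The `Limit` dataclass, the `check_layers` function, and the cap values are hypothetical, and the emergency brake is modelled here as one more cap rather than a separate kill-switch. Usage is charged only when every layer passes, so a rejected request does not consume quota at the layers above the one that tripped.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    name: str
    cap: int        # allowed tokens in the current period
    used: int = 0   # tokens consumed so far in the period


def check_layers(layers, cost):
    """Return the name of the first layer that would be exceeded, or None if admitted."""
    for layer in layers:
        if layer.used + cost > layer.cap:
            return layer.name        # first hit wins; nothing is charged
    for layer in layers:
        layer.used += cost           # all layers passed: charge every one
    return None


# Order mirrors the layers above: tenant, user, endpoint, global brake.
layers = [
    Limit("tenant", 1000),
    Limit("user", 300),
    Limit("endpoint", 500),
    Limit("global", 10000),
]
```

Returning the layer name rather than a bare boolean lets the caller tell the user which limit they hit.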

Tokens vs requests

Traditional rate limiting caps requests per unit time. LLM systems should cap tokens (or effective cost) because requests vary wildly in cost. One document-summary request could be 30,000 input tokens; one chat turn could be 200. A request-level cap lets the expensive request through while rate-limiting the cheap ones.

Hybrid: a request limit AND a token limit. The request limit catches abuse (bot spam). The token limit catches cost (bulk processing). Each trips independently of the other.
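A minimal sketch of the hybrid check over a single accounting window. The class name, the caps, and the choice to count only admitted requests are assumptions for illustration; resetting the counters at the window boundary is left to the caller.

```python
class HybridWindow:
    """Tracks both request count and token count for one period (sketch)."""

    def __init__(self, max_requests, max_tokens):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.requests = 0
        self.tokens = 0

    def admit(self, tokens):
        """Return None if admitted, else the name of the limit that tripped."""
        if self.requests + 1 > self.max_requests:
            return "requests"    # abuse pattern: too many calls
        if self.tokens + tokens > self.max_tokens:
            return "tokens"      # cost pattern: too much volume
        self.requests += 1
        self.tokens += tokens
        return None

    def reset(self):
        """Call at the start of each new period."""
        self.requests = 0
        self.tokens = 0
```

With caps of 60 requests and 40,000 tokens, a second 30,000-token summary trips the token limit after a single admitted request, while a bot sending 10-token requests trips the request limit on its 61st call having used only 600 tokens.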

Surfacing limits to users

Return a 429 response with a Retry-After header (standard HTTP). Additional, widely used headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. These let client apps back off gracefully.
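A framework-agnostic sketch of building these headers. The function name and arguments are hypothetical. The X-RateLimit-* headers are a convention rather than a formal standard, and their semantics vary between APIs; this example assumes X-RateLimit-Reset carries a Unix timestamp and Retry-After carries a delay in whole seconds.

```python
import math


def limit_response(allowed, limit, remaining, reset_epoch, now_epoch):
    """Return (status_code, headers) describing the caller's current quota."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),   # assumed: Unix seconds
    }
    if allowed:
        return 200, headers
    # Round up so a client that waits exactly this long is past the reset.
    headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now_epoch)))
    return 429, headers
```

Sending the quota headers on successful responses too lets well-behaved clients slow down before they ever see a 429.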

In UI: communicate limits clearly. Users who understand their usage don't experience limits as surprises.

Edge cases

Streaming responses: the output token count is only known after generation. Count input tokens up front and pre-check them before making the expensive call; count output tokens after streaming completes and update usage.
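The pre-check and post-count steps might look like the sketch below. `count_tokens` is a stand-in that counts whitespace-separated words, not real model tokens; a real system would use the provider's tokenizer or the usage figures returned with the response. The class and method names are assumptions.

```python
def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


class StreamingUsage:
    """Charge input tokens before the call and output tokens after the stream."""

    def __init__(self, cap):
        self.cap = cap
        self.used = 0

    def precheck(self, prompt):
        """Reject before calling the model if the input alone would exceed the cap."""
        cost = count_tokens(prompt)
        if self.used + cost > self.cap:
            return False
        self.used += cost
        return True

    def record_stream(self, chunks):
        """Pass chunks through unchanged, then charge what was actually produced."""
        produced = 0
        try:
            for chunk in chunks:
                produced += count_tokens(chunk)
                yield chunk
        finally:
            # Runs on normal completion and when the consumer closes the stream early.
            self.used += produced
```

Because output is charged after the fact, a single response can push usage past the cap; the overshoot is then caught by the pre-check on the next request.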

Retry storms: a user hitting 429 should back off, not retry immediately. Server-side: send Retry-After with increasing values on repeated hits to enforce backoff if the client misbehaves.
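One way to produce increasing Retry-After values is to count consecutive rejections per client and grow the delay exponentially up to a ceiling. The class name, defaults, and the reset-on-success policy below are assumptions for this sketch.

```python
class BackoffTracker:
    """Escalating Retry-After per client, reset on the next admitted request."""

    def __init__(self, base=1.0, factor=2.0, max_delay=300.0):
        self.base = base
        self.factor = factor
        self.max_delay = max_delay
        self.strikes = {}   # client_id -> consecutive rejections so far

    def on_rejected(self, client_id):
        """Return the Retry-After delay in seconds for this rejection."""
        n = self.strikes.get(client_id, 0)
        self.strikes[client_id] = n + 1
        return min(self.max_delay, self.base * self.factor ** n)

    def on_allowed(self, client_id):
        """Clear the strike count once the client gets a request through."""
        self.strikes.pop(client_id, None)
```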

Burst allowance: users sometimes have legitimate reasons for bursts (uploading a batch of docs). Consider allowing a small credit above the sustained rate, which replenishes only after the user has spent time below their average rate.

Implementation options

A Redis-based distributed token bucket is a common choice. Gateway middleware (Kong, Envoy, or application-layer middleware) implements the check. AI gateways (Portkey, Helicone, LiteLLM) include rate limiting. See the AI gateway post.

For multi-tenancy specifics, see the multi-tenancy post.

Rate limiting for LLM APIs: fair sharing and cost control
