eazyware
Engineering·October 14, 2024·10 min read

Rate limiting for LLM APIs: fair sharing and cost control

Token bucket, leaky bucket, per-user quotas, tenant-aware limits. Patterns that prevent one heavy user from ruining everyone else's experience.

Kushal R.
Engineering lead

Rate limiting for LLM-backed APIs is not a nice-to-have. One runaway script or malicious user can rack up five-figure bills in hours. One enterprise customer triggering a batch pipeline without coordination can starve your other users. The patterns are well understood from classic API rate limiting, with LLM-specific twists around token-level costs and tenant economics. This post describes the layered approach we deploy for SaaS clients running multi-tenant AI.

[Figure: Rate limiting patterns for LLM APIs. Panels: token bucket, leaky bucket, sliding window, and tenant-aware layering (four layers, first-hit wins, 429 with Retry-After).]
Token bucket for bursty UX, leaky bucket for strict throttling, sliding window for fair sharing. Tenant-aware layering stacks multiple limits in production.

Three bucket shapes

Token bucket: refill N tokens per second up to a capacity; allow bursts up to the full bucket. Good for UX where users occasionally send multiple quick requests. Users hit the limit only under sustained load.
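A minimal in-memory sketch of the token bucket described above. The class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are illustrative assumptions, not part of any particular library; a production limiter would keep this state in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    """Token bucket: refills continuously, allows bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so tests can control time
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost=1.0):
        """Spend `cost` tokens if available; return True when the request may proceed."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=3`, three back-to-back requests pass, the fourth is rejected, and two more pass after two seconds of quiet. The `cost` argument lets the same bucket meter LLM tokens instead of requests.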

Leaky bucket: constant drain rate regardless of arrival; no bursts above the rate. Strict throttling. Better for systems where predictable upstream load matters more than UX.
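One way to sketch this is the queue-style leaky bucket: each request is assigned a release time so that traffic leaves at a fixed interval, and requests are rejected once too many are waiting. The names `LeakyBucket`, `schedule`, and `max_queue` are hypothetical, chosen for this example only.

```python
import time


class LeakyBucket:
    """Queue-style leaky bucket: requests leave at a constant rate, no bursts."""

    def __init__(self, rate, max_queue, clock=time.monotonic):
        self.interval = 1.0 / rate   # seconds between released requests
        self.max_queue = max_queue   # how many requests may wait in line
        self.clock = clock           # injectable so tests can control time
        self.next_free = clock()     # earliest time the next request may leave

    def schedule(self):
        """Return seconds to wait before forwarding, or None if the queue is full."""
        now = self.clock()
        start = max(now, self.next_free)
        delay = start - now
        if delay > self.max_queue * self.interval:
            return None              # too many requests already waiting
        self.next_free = start + self.interval
        return delay
```

The caller sleeps for the returned delay (or rejects on `None`), so the upstream model sees at most `rate` requests per second regardless of how arrivals cluster.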

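Sliding window: at most N requests in a rolling time window (e.g., the last 60 seconds). Because the window moves with each request, there are no reset spikes at window boundaries and no burst allowance beyond the quota. The fairest UX, but slightly more complex to implement efficiently.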
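A sliding window can be sketched as a log of recent request timestamps. This is the simplest variant and stores one entry per request, which is part of why efficient implementations are more involved. `SlidingWindowLimiter` and its parameters are assumed names for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests in any rolling window."""

    def __init__(self, limit, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock           # injectable so tests can control time
        self.events = deque()        # timestamps of accepted requests, oldest first

    def allow(self):
        now = self.clock()
        cutoff = now - self.window
        # Drop timestamps that have aged out of the rolling window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

Memory grows with `limit` per key, so high-volume systems often approximate this with weighted counters over two adjacent fixed windows.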

Layered tenant-aware limits

Production systems need multiple layers of limits, checked in order with first-hit-wins:

Layer 1 — Per-tenant global cap. Matches contract terms and pricing tiers. Team plan: 100k tokens/day. Enforced regardless of which users within the tenant are driving traffic.

Layer 2 — Per-user cap within tenant. Prevents one user from consuming the tenant's entire allocation. Fair sharing across the team.

Layer 3 — Per-endpoint cap. Some endpoints are much more expensive. A vision-LLM endpoint might need tighter limits than a text classification endpoint. Protect expensive endpoints from disproportionate consumption.

Layer 4 — Global emergency brake. System-wide cap for catastrophic scenarios (bug causing infinite loop of requests, DDoS). Explicit kill-switch that disables AI endpoints entirely for a period.
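The four layers can be sketched as an ordered list of quotas checked with first-hit-wins. This example uses simple counters rather than buckets to keep the layering logic visible. `Quota` and `check_layers` are hypothetical names, and the limits in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    name: str
    limit: int
    used: int = 0

    def has_room(self, cost):
        return self.used + cost <= self.limit


def check_layers(layers, cost):
    """Check layers in order; return (allowed, name_of_blocking_layer)."""
    # Phase 1: the first layer without room blocks the request (first-hit-wins).
    for quota in layers:
        if not quota.has_room(cost):
            return False, quota.name
    # Phase 2: charge every layer only after all of them have room,
    # so a rejected request never consumes quota in an earlier layer.
    for quota in layers:
        quota.used += cost
    return True, None
```

For example, with `[Quota("tenant", 100), Quota("user", 40), Quota("endpoint", 1000), Quota("global", 10000)]`, a first request costing 30 passes, and a second request costing 30 is blocked by the `user` layer while the tenant's usage stays at 30. The name of the blocking layer is useful for logs and for telling the client which limit was hit.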

Tokens vs requests

Traditional rate limiting caps requests per unit time. LLM systems should cap tokens (or effective cost) because requests vary wildly in cost. One document-summary request could be 30,000 input tokens; one chat turn could be 200. A request-level cap counts both as one request, so it lets the expensive one through while throttling the cheap ones.

Hybrid: a request limit AND a token limit. The request limit catches abuse (bot spam). The token limit catches cost (bulk processing). Each trips independently.
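A minimal sketch of the hybrid check for a single quota window, assuming counters that are reset elsewhere when the window rolls over. `HybridUsage` and the reason strings are illustrative, not a real API.

```python
from dataclasses import dataclass


@dataclass
class HybridUsage:
    max_requests: int
    max_tokens: int
    requests: int = 0
    tokens: int = 0

    def check(self, request_tokens):
        """Return None if allowed, else the name of the limit that tripped."""
        if self.requests + 1 > self.max_requests:
            return "request_limit"   # too many calls, regardless of size
        if self.tokens + request_tokens > self.max_tokens:
            return "token_limit"     # too much cost, regardless of call count
        self.requests += 1
        self.tokens += request_tokens
        return None
```

With `max_requests=3` and `max_tokens=40000`, one 30,000-token summary passes, a second trips `token_limit`, and 200-token chat turns keep passing until the request count trips `request_limit`.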

Surfacing limits to users

Return a 429 response with a Retry-After header (standard HTTP). Add the informational headers X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset. Together these let client apps back off gracefully.
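A framework-agnostic sketch of building that response. The function name and the JSON body shape are assumptions. The X-RateLimit-* headers are a convention rather than a standard, and this example assumes X-RateLimit-Reset carries epoch seconds (some APIs send seconds remaining instead).

```python
import math
import time


def rate_limit_response(limit, remaining, reset_epoch, now=None):
    """Return (status, headers, body) for a rate-limited request."""
    now = time.time() if now is None else now
    # Retry-After as whole seconds, rounded up so clients never retry early.
    retry_after = max(0, math.ceil(reset_epoch - now))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),  # assumed: epoch seconds
    }
    body = {"error": "rate_limited", "retry_after_seconds": retry_after}
    return 429, headers, body
```

The same three X-RateLimit-* headers can also be attached to successful responses so clients can see how much quota remains before they hit the limit.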

In UI: communicate limits clearly. Users who understand their usage don't experience limits as surprises.

Edge cases

Streaming responses: the output token count is only known after generation. Count input tokens up front and check them against the limit before making the expensive call; count output tokens after streaming completes and add them to recorded usage.
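A sketch of that two-phase accounting, under loud assumptions: `TokenBudget`, `count_tokens`, and `stream_with_accounting` are hypothetical names, `generate` stands in for whatever streaming client is in use, and whitespace splitting is only a placeholder for the model's real tokenizer.

```python
class TokenBudget:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def precheck(self, input_tokens):
        """Charge input tokens up front; refuse before the expensive call if over."""
        if self.used + input_tokens > self.limit:
            return False
        self.used += input_tokens
        return True

    def settle(self, output_tokens):
        """Record output tokens once the stream has finished."""
        self.used += output_tokens


def count_tokens(text):
    # Crude stand-in: a real system would use the model's tokenizer.
    return len(text.split())


def stream_with_accounting(budget, prompt, generate):
    """Yield chunks from generate(prompt), updating the budget before and after."""
    if not budget.precheck(count_tokens(prompt)):
        raise RuntimeError("rate limited")
    output_tokens = 0
    try:
        for chunk in generate(prompt):
            output_tokens += count_tokens(chunk)
            yield chunk
    finally:
        # Settle even if the client disconnects mid-stream.
        budget.settle(output_tokens)
```

Because output is charged after the fact, usage can overshoot the limit by one response; the next pre-check then rejects. Since this is a generator, the pre-check runs when the caller first iterates, not when the function is called.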

Retry storms: a client hitting 429 should back off, not retry immediately. Server-side: send Retry-After with increasing values on repeated hits to enforce backoff if the client misbehaves.
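One way to produce increasing Retry-After values is to double the delay for each consecutive rejection, up to a cap. This is a sketch under assumed names (`RetryAfterTracker`, `on_limited`, `on_allowed`) and assumed defaults; the doubling schedule is one reasonable choice, not something the pattern mandates.

```python
class RetryAfterTracker:
    """Doubles Retry-After for each consecutive 429 a client receives."""

    def __init__(self, base_seconds=1, cap_seconds=300):
        self.base = base_seconds
        self.cap = cap_seconds
        self.hits = {}               # client_id -> consecutive rejections

    def on_limited(self, client_id):
        """Record a rejection and return the Retry-After value in seconds."""
        count = self.hits.get(client_id, 0) + 1
        self.hits[client_id] = count
        return min(self.cap, self.base * 2 ** (count - 1))

    def on_allowed(self, client_id):
        """A successful request resets the client's penalty."""
        self.hits.pop(client_id, None)
```

With `base_seconds=2` and `cap_seconds=10`, a client that keeps hammering the API sees 2, 4, 8, then 10 seconds, and drops back to 2 after its next accepted request.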

Burst allowance: users sometimes have legitimate reasons for bursts (uploading a batch of docs). Consider allowing a small credit above the sustained rate, one that replenishes only after the user has spent time below their average rate.

Implementation options

A Redis-based distributed token bucket is the most common approach. Gateway middleware (Kong, Envoy, or application-layer middleware) implements the check. Dedicated AI gateways (Portkey, Helicone, LiteLLM) include rate limiting. See the AI gateway post.

For multi-tenancy specifics, see the multi-tenancy post.
