AI capacity planning in 2026 is a different exercise from traditional compute planning. Token budgets, provider commitments, GPU fleet sizing, bursty traffic patterns, and model routing decisions all factor in. The math matters: under-provisioning costs quality, over-provisioning costs money. This post is the framework we use for AI capacity.
Inputs
Traffic forecasts. Requests per hour, day, month. Projections based on growth rates, seasonality, planned launches. Three scenarios: best, base, worst case.
Token budgets per request. Input tokens (system prompt + user context + retrieval results) plus output tokens (typical response length) give tokens per request. Multiplied by request volume = total tokens. Track input and output separately, since providers usually price them differently.
Latency requirements. Which features need p95 <500ms? Which tolerate p95 <5s? The answers determine model choice and serving strategy.
Cost targets. Total AI cost budget; cost per user / per request targets; gross margin targets.
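The token and cost arithmetic above can be sketched in a few lines. This is a minimal illustration, not our production model: the request profile, the scenario volumes, and the per-million-token prices are placeholder assumptions, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestProfile:
    input_tokens: int   # system prompt + user context + retrieval results
    output_tokens: int  # typical response length


def monthly_tokens(profile: RequestProfile, requests: int) -> tuple[int, int]:
    """Return (total input tokens, total output tokens) for the month."""
    return profile.input_tokens * requests, profile.output_tokens * requests


def monthly_cost(profile: RequestProfile, requests: int,
                 input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost in dollars, with prices quoted per million tokens."""
    in_tok, out_tok = monthly_tokens(profile, requests)
    return in_tok / 1e6 * input_price_per_mtok + out_tok / 1e6 * output_price_per_mtok


# Placeholder numbers for illustration only.
profile = RequestProfile(input_tokens=3000, output_tokens=500)
scenarios = {"low": 500_000, "base": 1_000_000, "high": 2_000_000}

for name, requests in scenarios.items():
    cost = monthly_cost(profile, requests, input_price_per_mtok=3.0,
                        output_price_per_mtok=15.0)
    print(f"{name}: ${cost:,.0f}/month, ${cost / requests:.4f}/request")
```

Running the same profile through all three traffic scenarios gives the range to compare against the cost-per-request and total budget targets.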
Token per request creep
Features add context over time. RAG retrieval grows; system prompts add tool definitions; examples multiply.
Tokens per request drift up quarter over quarter. Unmonitored, this causes cost surprises.
Monitor and manage. Put tokens per request on a dashboard, broken out by endpoint; investigate growth; optimize prompts. See the prompt compression post.
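A creep check can be as simple as comparing mean tokens per request between two periods for each endpoint. The sketch below assumes you can export per-request token counts by endpoint; the endpoint names, sample values, and the 10% threshold are illustrative assumptions.

```python
from statistics import mean


def growth(previous: list[int], current: list[int]) -> float:
    """Fractional change in mean tokens/request between two periods."""
    prev_mean = mean(previous)
    return (mean(current) - prev_mean) / prev_mean


def flag_creep(by_endpoint: dict[str, tuple[list[int], list[int]]],
               threshold: float = 0.10) -> dict[str, float]:
    """Return endpoints whose tokens/request grew by more than `threshold`."""
    flagged = {}
    for endpoint, (previous, current) in by_endpoint.items():
        g = growth(previous, current)
        if g > threshold:
            flagged[endpoint] = g
    return flagged


# Hypothetical per-request token counts: (last quarter, this quarter).
samples = {
    "/chat": ([2000, 2200, 2100], [2600, 2700, 2500]),
    "/summarize": ([1000, 1000], [1020, 1040]),
}

for endpoint, g in flag_creep(samples).items():
    print(f"{endpoint}: tokens/request up {g:.0%}")
```

In practice you would likely compare medians or percentiles as well, since a few very long requests can skew the mean.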
Provider commitments
Anthropic, OpenAI, Google offer volume discounts and capacity guarantees against commitments. Large monthly commits translate to 15-40% discount plus reserved capacity.
Commit vs burst. Base load on commitment pricing; burst capacity at on-demand pricing. Right mix depends on traffic variability.
Commit risk. Miss the commitment and you pay anyway. Over-committing ties up capital in capacity you never use. Under-committing leaves savings on the table.
Negotiate carefully. At sufficient volume ($50K+/month typical starting point), providers negotiate. Expect legal and procurement to be involved.
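The commit-versus-burst trade-off can be explored with a toy billing model. The contract structure here is an assumption for illustration (real contracts vary): the commitment is expressed in list-price dollars, it is paid at a discount whether or not it is used, and usage above it is billed at list price. The usage scenarios and the 25% discount are placeholders.

```python
def monthly_bill(usage: float, commit: float, discount: float) -> float:
    """Bill under the assumed contract: the committed amount is paid at a
    discount regardless of use; usage above it is billed at list price."""
    committed_payment = commit * (1 - discount)
    overage = max(0.0, usage - commit)
    return committed_payment + overage


def best_commit(usage_scenarios: list[float], candidates: list[float],
                discount: float) -> float:
    """Pick the candidate commit with the lowest average bill across scenarios."""
    def average_bill(commit: float) -> float:
        bills = [monthly_bill(u, commit, discount) for u in usage_scenarios]
        return sum(bills) / len(bills)
    return min(candidates, key=average_bill)


# Hypothetical monthly usage at list price: low, typical, and bursty months.
usage_scenarios = [80_000, 100_000, 140_000]
candidates = [0, 80_000, 100_000, 140_000]

print(best_commit(usage_scenarios, candidates, discount=0.25))
```

With these placeholder numbers the lowest average bill comes from committing at the base load and paying on-demand for the rest, which matches the commit-versus-burst reasoning above. A wider spread between low and high months pushes the best commit lower.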
Self-hosted capacity
GPU fleet sizing. Tokens per GPU per second × GPUs = total capacity. Peak demand is usually 2-3x average; size for the peak.
Reserved vs on-demand GPU pricing. Reserved saves 30-60% but locks you in. On-demand is expensive at scale.
Auto-scaling. GPU auto-scaling takes time; it is not like web server auto-scaling. Plan for minutes, not seconds, of delay.
Capacity buffer. Plan for 20-30% buffer above forecast peak. Avoids emergencies when forecasts miss.
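Put together, the sizing rule is: average throughput × peak-to-average ratio × (1 + buffer), divided by per-GPU throughput, rounded up. A minimal sketch follows; the throughput figures are placeholders, and real per-GPU throughput depends on the model, batch size, and serving stack, so measure it rather than assume it.

```python
import math


def gpus_needed(avg_tokens_per_sec: float, peak_to_avg: float,
                buffer: float, tokens_per_gpu_per_sec: float) -> int:
    """GPUs required to serve forecast peak demand plus a safety buffer."""
    peak = avg_tokens_per_sec * peak_to_avg
    required = peak * (1 + buffer)
    return math.ceil(required / tokens_per_gpu_per_sec)


# Placeholder inputs: 20k tokens/s average, 2.5x peak, 25% buffer,
# and a measured 2,500 tokens/s per GPU.
print(gpus_needed(20_000, peak_to_avg=2.5, buffer=0.25,
                  tokens_per_gpu_per_sec=2_500))
```

Because auto-scaling lags by minutes, the fleet that is already running needs to cover the start of a burst; the buffer term is what absorbs forecast error.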
Model routing as capacity lever
Route simple queries to smaller models (cheap, fast). Route complex queries to larger models (expensive, slower). Capacity and cost optimized together.
Quality gates. If the smaller model's quality is sufficient, use it. If not, escalate to the larger model. A classifier or rules decide.
Dynamic routing based on load. In high-traffic periods, route more aggressively to smaller models; in low-traffic periods, send everything to the larger model.
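One way to express load-aware routing is a complexity threshold that rises with load. The sketch below assumes a hypothetical upstream classifier that scores each query's complexity in [0, 1] and a load signal normalized to [0, 1]; the `Router` class and its threshold values are illustrative, not a description of a specific system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Router:
    # Complexity score a query must reach to get the large model at full load.
    max_threshold: float = 0.8

    def threshold(self, load: float) -> float:
        """Threshold scales linearly from 0 (idle) to max_threshold (full load)."""
        load = min(max(load, 0.0), 1.0)  # clamp to [0, 1]
        return self.max_threshold * load

    def route(self, complexity: float, load: float) -> str:
        """Return which model tier should serve the query."""
        return "large" if complexity >= self.threshold(load) else "small"


router = Router()
print(router.route(complexity=0.1, load=0.0))  # idle: everything goes large
print(router.route(complexity=0.5, load=1.0))  # full load: mid queries go small
print(router.route(complexity=0.9, load=1.0))  # hard queries still go large
```

A quality gate would sit after this step: if the small model's answer fails a check, retry on the large model and count the escalation, since escalations consume capacity twice.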
Bursty traffic patterns
B2B patterns. Monday morning burst (users return to work). End-of-month for finance workflows. Quarterly for sales tools.
Consumer patterns. Evening peaks. Weekend variations.
Global patterns. Follow-the-sun; the peak shifts by time zone. Multi-region serving smooths the aggregate load.
Handling bursts. Auto-scaling, but lagged. Reserved capacity sized for burst, not average. Provider burst allowances on commitment tiers.
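Before sizing reserved capacity, it helps to measure how bursty the traffic actually is. A minimal sketch, assuming hourly request counts are available from your metrics system; the sample day and the reserved level are made up for illustration.

```python
def peak_to_average(hourly_requests: list[float]) -> float:
    """Ratio of the busiest hour to the mean hour."""
    average = sum(hourly_requests) / len(hourly_requests)
    return max(hourly_requests) / average


def overflow(hourly_requests: list[float],
             reserved_per_hour: float) -> tuple[int, float]:
    """Return (hours above reserved capacity, total requests above it)."""
    excess = [r - reserved_per_hour for r in hourly_requests
              if r > reserved_per_hour]
    return len(excess), sum(excess)


# Hypothetical day: 20 quiet hours and a 4-hour burst.
hourly = [100] * 20 + [400] * 4

print(f"peak/average: {peak_to_average(hourly):.2f}")
hours_over, requests_over = overflow(hourly, reserved_per_hour=150)
print(f"{hours_over} hours over reserved, {requests_over} requests to burst capacity")
```

The overflow figure is what has to be absorbed by auto-scaling, provider burst allowances, or on-demand pricing, so it is worth computing for each candidate reserved level.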
Common failure modes
Token creep uncaught. Costs rise; the capacity plan still says capacity is adequate, while effective capacity in requests served shrinks.
Peak-to-average ratio mismodeled. Reserved capacity is the wrong size; you are either over-paying or under capacity at peak.
Model update changes cost profile. A new model is faster or cheaper per token; the capacity math changes.
Feature launch surprises. A new feature ships; tokens per request triple; the capacity plan is wrong.
Practices
Quarterly capacity review. Rebaseline on actuals vs forecast; adjust for next quarter.
Pre-launch capacity check. New features estimated and capacity verified before launch.
Contingency plans. What happens if traffic is 3x forecast? If a model provider has an outage? Burst scenarios should be documented and testable.
Cost guardrails. Alert when spending runs 20% above forecast. Investigate before it becomes a month-end surprise.
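A cost guardrail of this kind can be sketched as a comparison of month-to-date spend against a prorated forecast. Linear proration is an assumption here; if your traffic has strong month-end peaks, prorate against the expected spend curve instead. The function name and figures are illustrative.

```python
def spend_alert(month_to_date_spend: float, monthly_forecast: float,
                day_of_month: int, days_in_month: int,
                tolerance: float = 0.20) -> bool:
    """True when spend so far exceeds the prorated forecast by more than
    `tolerance` (0.20 means 20% above forecast)."""
    expected_so_far = monthly_forecast * day_of_month / days_in_month
    return month_to_date_spend > expected_so_far * (1 + tolerance)


# Placeholder numbers: $30k monthly forecast, checked on day 10 of 30.
print(spend_alert(12_500, 30_000, day_of_month=10, days_in_month=30))  # over
print(spend_alert(11_000, 30_000, day_of_month=10, days_in_month=30))  # within
```

Running the check daily catches a drift in the first week rather than at invoice time.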