eazyware
Engineering·October 16, 2023·11 min read

AI capacity planning: GPUs, tokens, and burst traffic

Forecasting AI capacity: GPU fleet sizing, token budgets, reserved vs on-demand, bursty traffic patterns. Capacity planning for 2026 AI workloads.

Kushal R.
Engineering lead

AI capacity planning in 2026 is a different exercise from traditional compute planning. Token budgets, provider commitments, GPU fleet sizing, bursty traffic patterns, and model routing decisions all factor in. The math matters: under-provisioning costs quality, over-provisioning costs money. This post is the framework we use for AI capacity.

Inputs, knobs, outputs
[Figure: AI capacity planning, inputs and outputs]
Inputs: traffic forecasts, token budgets, latency requirements. Knobs: provider commits, GPU fleet, model routing. Outputs: budget, reserved %, scale thresholds.

Inputs

Traffic forecasts. Requests per hour, day, month. Projections based on growth rates, seasonality, planned launches. Three scenarios: best, base, worst case.

Token budgets per request. Input tokens (system prompt + user context + retrieval results) plus output tokens (typical response length). Multiply the per-request total by request volume to get total tokens.
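A minimal sketch of the token arithmetic, in Python. The token counts, request volumes, and per-million-token prices are illustrative assumptions, not figures from any provider. Input and output tokens are priced separately here because providers typically bill them at different rates. The three volume scenarios are labeled low, base, and high.

```python
from dataclasses import dataclass


@dataclass
class RequestProfile:
    """Hypothetical per-request token profile for one feature."""
    input_tokens: int   # system prompt + user context + retrieval results
    output_tokens: int  # typical response length


def monthly_tokens(profile: RequestProfile, requests: int) -> int:
    """Total tokens = (input + output) per request x request volume."""
    return (profile.input_tokens + profile.output_tokens) * requests


def monthly_cost(profile: RequestProfile, requests: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars, pricing input and output tokens separately."""
    input_cost = profile.input_tokens * requests / 1_000_000 * input_price_per_m
    output_cost = profile.output_tokens * requests / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Assumed numbers, for illustration only.
profile = RequestProfile(input_tokens=3_000, output_tokens=500)
scenarios = {"low": 600_000, "base": 1_000_000, "high": 1_500_000}  # requests/month

for name, requests in scenarios.items():
    tokens = monthly_tokens(profile, requests)
    cost = monthly_cost(profile, requests,
                        input_price_per_m=3.0, output_price_per_m=15.0)
    print(f"{name}: {tokens:,} tokens, ${cost:,.0f}")
```

Running the same profile through all three scenarios gives the range the budget has to cover, not just the base case.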

Latency requirements. Which features need p95 <500ms? Which tolerate p95 <5s? The answers determine model choice and serving strategy.

Cost targets. Total AI cost budget; cost per user / per request targets; gross margin targets.

Token-per-request creep

Features add context over time. RAG retrieval grows; system prompts add tool definitions; examples multiply.

Tokens per request drift up quarter over quarter. Unmonitored, this causes cost surprises.

Monitor and manage. Dashboard tokens/request by endpoint; investigate growth; optimize prompts. See prompt compression post.
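One way to turn that dashboard into an alert is to compare each endpoint's latest quarter against the previous one. The sketch below assumes you already export average tokens per request per endpoint; the function name, endpoint names, and the 10% threshold are hypothetical.

```python
def flag_token_creep(history, max_growth=0.10):
    """Return {endpoint: growth} where the latest quarter-over-quarter
    growth in average tokens/request exceeds max_growth.

    history maps an endpoint name to its average tokens per request,
    one value per quarter, oldest first.
    """
    flagged = {}
    for endpoint, series in history.items():
        if len(series) < 2 or series[-2] <= 0:
            continue  # not enough data to compute growth
        growth = (series[-1] - series[-2]) / series[-2]
        if growth > max_growth:
            flagged[endpoint] = growth
    return flagged


# Made-up history: three quarters of average tokens/request per endpoint.
history = {
    "/chat": [2_000, 2_150, 2_600],
    "/search": [1_000, 1_020, 1_050],
}
for endpoint, growth in flag_token_creep(history).items():
    print(f"{endpoint}: tokens/request up {growth:.0%} quarter over quarter")
```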

Provider commitments

Anthropic, OpenAI, and Google offer volume discounts and capacity guarantees in exchange for spend commitments. Large monthly commits typically translate to a 15-40% discount plus reserved capacity.

Commit vs burst. Base load on commitment pricing; burst capacity at on-demand pricing. Right mix depends on traffic variability.

Commit risk. Miss the committed volume and you pay for it anyway. Over-committing ties up capital; under-committing leaves discounts on the table.
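The commit-versus-burst trade-off can be sketched as a small cost model. This assumes a simplified contract: a commit buys a fixed amount of list-price usage at a discount, it is paid whether or not it is used, and anything above it is billed at list price. Real contracts differ, and the usage scenarios and the 25% discount are made-up inputs.

```python
def monthly_cost(usage, commit, discount):
    """Dollars paid in a month under the simplified contract above.

    usage and commit are both measured in list-price dollars of usage.
    """
    committed_spend = commit * (1.0 - discount)  # paid even if unused
    overage = max(0.0, usage - commit)           # burst at on-demand pricing
    return committed_spend + overage


def best_commit(usage_scenarios, candidates, discount):
    """Pick the candidate commit with the lowest average cost across scenarios."""
    def average_cost(commit):
        total = sum(monthly_cost(u, commit, discount) for u in usage_scenarios)
        return total / len(usage_scenarios)
    return min(candidates, key=average_cost)


# Assumed monthly usage in a quiet, typical, and busy month.
usage_scenarios = [80_000, 100_000, 150_000]
candidates = [0, 80_000, 100_000, 150_000]
print(best_commit(usage_scenarios, candidates, discount=0.25))
```

With these inputs the cheapest commit sits at the quiet-month level rather than the busy one, which matches the idea of putting base load on commitment pricing and burst on on-demand.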

Negotiate carefully. At sufficient volume ($50K+/month is a typical starting point), providers negotiate. Expect legal and procurement to be involved.

Self-hosted capacity

GPU fleet sizing. Tokens per GPU per second × GPUs = total capacity. Peak demand is usually 2-3x average; size for the peak.

Reserved vs on-demand GPU pricing. Reserved saves 30-60% but locks in. On-demand expensive at scale.
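A quick way to reason about reserved versus on-demand is the break-even utilization. The sketch assumes a reserved GPU is billed for every hour at a discounted rate, while an on-demand GPU is billed only for the hours it runs and can be released when idle. Under that assumption the two cost the same when utilization equals one minus the discount. The rates below are placeholders.

```python
def break_even_utilization(reserved_discount):
    """Utilization above which a reserved GPU is cheaper than on-demand."""
    if not 0.0 <= reserved_discount < 1.0:
        raise ValueError("reserved_discount must be in [0, 1)")
    return 1.0 - reserved_discount


def hourly_cost(utilization, on_demand_rate, reserved_discount):
    """Return (reserved, on_demand) cost per wall-clock hour."""
    reserved = on_demand_rate * (1.0 - reserved_discount)  # billed every hour
    on_demand = on_demand_rate * utilization               # billed when running
    return reserved, on_demand


for discount in (0.30, 0.60):
    print(f"{discount:.0%} discount: reserved wins above "
          f"{break_even_utilization(discount):.0%} utilization")
```

At the 30-60% savings range, that puts the break-even somewhere between 40% and 70% utilization.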

Auto-scaling. GPU auto-scaling takes time; not like web server auto-scaling. Plan for minutes, not seconds, of delay.
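Because new GPUs take minutes to come online, the scale-up trigger has to fire while there is still headroom. A rough sketch, assuming load ramps roughly linearly during a burst; the function name, ramp rate, and delay are hypothetical and would come from your own traffic data and provisioning measurements.

```python
def scale_up_threshold(capacity_tps, ramp_tps_per_min, provision_delay_min):
    """Load (tokens/sec) at which to start adding GPUs so that new capacity
    arrives before the current fleet saturates.
    """
    headroom = ramp_tps_per_min * provision_delay_min  # load added while waiting
    return max(0.0, capacity_tps - headroom)


# Assumed: 50k tokens/sec fleet, load climbing 2k tokens/sec each minute,
# 5 minutes to get a new GPU serving traffic.
threshold = scale_up_threshold(50_000, 2_000, 5)
print(f"scale up at {threshold:,.0f} tokens/sec "
      f"({threshold / 50_000:.0%} of capacity)")
```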

Capacity buffer. Plan for 20-30% buffer above forecast peak. Avoids emergencies when forecasts miss.
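Putting the sizing formula, the peak multiplier, and the buffer together gives a fleet-size estimate. This is a sketch with assumed inputs; per-GPU throughput in particular depends heavily on model, batch size, and serving stack, so measure it rather than trusting a placeholder.

```python
import math


def gpus_needed(avg_tokens_per_sec, peak_to_avg, tokens_per_gpu_per_sec,
                buffer=0.25):
    """GPUs required to serve forecast peak load plus a safety buffer."""
    if tokens_per_gpu_per_sec <= 0:
        raise ValueError("tokens_per_gpu_per_sec must be positive")
    peak = avg_tokens_per_sec * peak_to_avg   # forecast peak demand
    target = peak * (1.0 + buffer)            # buffer above forecast peak
    return math.ceil(target / tokens_per_gpu_per_sec)


# Assumed: 20k tokens/sec average, peak 2.5x average, 2.5k tokens/sec per GPU.
print(gpus_needed(20_000, 2.5, 2_500, buffer=0.25))
```

Sizing for the average alone would give 8 GPUs here; the peak multiplier and buffer account for the rest.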

Model routing as capacity lever

Route simple queries to smaller models (cheap, fast). Route complex queries to larger models (expensive, slower). Capacity and cost optimized together.

Quality gates. If smaller model quality sufficient, use it. If not, escalate to larger. Classifier or rules decide.

Dynamic routing based on load. During high-traffic periods, route more aggressively to smaller models; during low-traffic periods, send everything to the larger model.
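A rule-based version of load-aware routing can be sketched in a few lines. It assumes a complexity score in [0, 1] from a classifier or rules (not shown) and a fleet utilization figure in [0, 1]. The function name, the model labels "small" and "large", and the 0.8 ceiling are hypothetical.

```python
def choose_model(complexity, load, max_threshold=0.8):
    """Route a request to "small" or "large".

    The complexity bar for reaching the large model rises with load:
    at zero load everything goes to the large model, and at full load
    only requests scoring at least max_threshold do.
    """
    clamped_load = min(max(load, 0.0), 1.0)
    threshold = max_threshold * clamped_load
    return "large" if complexity >= threshold else "small"


for load in (0.0, 0.5, 1.0):
    print(load, choose_model(complexity=0.3, load=load))
```

A quality gate still belongs on top of this: if the small model's answer fails the gate, escalate regardless of load.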

Bursty traffic patterns

B2B patterns. Monday morning burst (users return to work). End-of-month for finance workflows. Quarterly for sales tools.

Consumer patterns. Evening peaks. Weekend variations.

Global patterns. Follow-the-sun; peak shifts by time zone. Multi-region serving smooths the aggregate load.

Handling bursts. Auto-scaling, but lagged. Reserved capacity sized for burst, not average. Provider burst allowances on commitment tiers.
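Sizing for burst rather than average starts with measuring the peak-to-average ratio from real traffic. A minimal sketch, assuming you have request counts bucketed by hour; the sample numbers are invented.

```python
from statistics import mean


def peak_to_average(hourly_requests):
    """Ratio of the busiest hour to the mean hour."""
    if not hourly_requests:
        raise ValueError("need at least one hourly bucket")
    average = mean(hourly_requests)
    if average == 0:
        raise ValueError("average traffic is zero")
    return max(hourly_requests) / average


# Invented sample: quiet overnight hours and one Monday-morning spike.
hourly_requests = [400, 300, 300, 500, 2_400, 1_800, 1_200, 1_100]
print(f"peak/average = {peak_to_average(hourly_requests):.1f}x")
```

Computing this per product line is worth the effort, since a B2B finance workflow and a consumer feature will have very different ratios.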

Common failure modes

Token creep uncaught. Costs rise, and the capacity plan still shows adequate headroom while effective capacity (requests served per GPU or per committed token) shrinks.

Peak-to-average ratio mismodeled. Reserved capacity is the wrong size: either over-paying off-peak or under capacity at peak.

Model update changes cost profile. A new model version has different speed or cost per token; the capacity math changes with it.

Feature launch surprises. A new feature ships, tokens per request triple, and the capacity plan is wrong.

Practices

Quarterly capacity review. Rebaseline on actuals vs forecast; adjust for next quarter.

Pre-launch capacity check. New features estimated and capacity verified before launch.

Contingency plans. What happens if traffic is 3x forecast? If a model provider has an outage? Burst scenarios should be documented and testable.

Cost guardrails. Alert when spending runs 20% above forecast. Investigate before it becomes a month-end surprise.
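The cost guardrail can be sketched as a run-rate projection. This assumes spend accrues roughly evenly through the month, which bursty workloads will violate, so treat it as a first-pass alert rather than a forecast. Function names and dollar figures are illustrative.

```python
def projected_month_end(spend_to_date, day_of_month, days_in_month):
    """Linear run-rate projection of month-end spend."""
    if day_of_month < 1:
        raise ValueError("day_of_month must be at least 1")
    return spend_to_date / day_of_month * days_in_month


def over_guardrail(spend_to_date, day_of_month, days_in_month,
                   forecast, tolerance=0.20):
    """True when projected spend exceeds forecast by more than tolerance."""
    projected = projected_month_end(spend_to_date, day_of_month, days_in_month)
    return projected > forecast * (1.0 + tolerance)


# Assumed: $50k spent by day 10 of a 30-day month against a $120k forecast.
if over_guardrail(50_000, 10, 30, forecast=120_000):
    print("AI spend is tracking more than 20% above forecast")
```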
