Latency is the metric users experience but teams rarely budget for. Most projects measure end-to-end response time after the fact, realize it's too slow, and scramble to optimize. The alternative is to set a latency budget at the start: each stage has a target, measurement is continuous, and regressions get caught at PR time. This post lays out the budgeting framework and the cost curves that inform how we allocate.
Why budget before building
Latency is emergent. Add 50ms of auth, 100ms of retrieval, 50ms of prompt-building, 600ms of LLM TTFB, 1000ms of token generation, 40ms of output guards — you're at 1840ms before you've done anything special. That's the optimistic case. In production, each stage has tail latency: p95 on retrieval might be 400ms instead of 100ms on a cold shard; LLM TTFB varies by 200ms depending on provider load.
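The arithmetic above can be written down as a small sketch. The stage names are illustrative, and the tail case assumes the "varies by 200ms" figure shows up as 200ms of extra TTFB on a bad day.

```python
# Nominal per-stage latencies in milliseconds, taken from the paragraph above.
NOMINAL_MS = {
    "auth": 50,
    "retrieval": 100,
    "prompt_building": 50,
    "llm_ttfb": 600,
    "token_generation": 1000,
    "output_guards": 40,
}

# Bad-day case: retrieval hits a cold shard (400ms instead of 100ms) and
# TTFB runs 200ms slow (assumption: the variance lands as added latency).
TAIL_MS = {**NOMINAL_MS, "retrieval": 400, "llm_ttfb": 800}


def total_ms(stages):
    """Sum serialized stage latencies into an end-to-end figure."""
    return sum(stages.values())


nominal_total = total_ms(NOMINAL_MS)  # 1840
tail_total = total_ms(TAIL_MS)        # 2340
```

Two stages having a bad day at once adds 500ms to the optimistic figure, which is why the budget needs to account for tails and not only for typical values.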
Without a budget, teams discover this by shipping a prototype that's fine for a demo and 4 seconds slow in production. With a budget, each stage is designed to fit its allocation or negotiate explicitly with other stages.
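One way to make "fit its allocation or negotiate explicitly" concrete is to keep the allocations in a single structure, so that giving one stage more time visibly takes it from another. This is a minimal sketch, not a prescribed design: the `LatencyBudget` class, its method names, and the 2000ms target are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LatencyBudget:
    """Per-stage allocations that must fit inside one end-to-end target."""
    target_ms: int
    allocations_ms: dict = field(default_factory=dict)

    def allocated_ms(self) -> int:
        return sum(self.allocations_ms.values())

    def headroom_ms(self) -> int:
        # Negative headroom means the stages no longer fit the target.
        return self.target_ms - self.allocated_ms()

    def transfer(self, src: str, dst: str, ms: int) -> None:
        # An explicit negotiation: dst gains exactly what src gives up,
        # so the total allocation is unchanged.
        if ms < 0 or ms > self.allocations_ms[src]:
            raise ValueError(f"cannot move {ms}ms out of {src}")
        self.allocations_ms[src] -= ms
        self.allocations_ms[dst] = self.allocations_ms.get(dst, 0) + ms


budget = LatencyBudget(
    target_ms=2000,  # hypothetical end-to-end target
    allocations_ms={
        "auth": 50,
        "retrieval": 100,
        "prompt_building": 50,
        "llm_ttfb": 600,
        "token_generation": 1000,
        "output_guards": 40,
    },
)
```

With these numbers `budget.headroom_ms()` is 160. Calling `budget.transfer("token_generation", "retrieval", 100)` gives retrieval 200ms while leaving the total at 1840ms.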
Latency targets that matter
Different surfaces have different thresholds. From our deployments:
Conversational chat feels fast under 1.5s TTFB, acceptable to 3s, broken past 5s.
Voice AI has a hard 1000ms TTFB wall and feels natural under 500ms (see our voice post).
Autocomplete / copilot suggestions need <200ms TTFB to not interrupt typing.
Async batch jobs can be seconds or minutes; the budget is throughput, not latency.
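These thresholds can be encoded as a lookup table so that measured TTFB is rated the same way everywhere. The sketch below is one possible encoding. The labels "slow", "usable", and "ok" are assumptions for bands the text does not name, and treating each bound as exclusive is also an assumption.

```python
# Upper bounds in milliseconds, checked in order. Anything at or past the
# last bound is rated "broken". Batch jobs are omitted because their budget
# is throughput, not TTFB.
TTFB_BANDS_MS = {
    "chat": [(1500, "fast"), (3000, "acceptable"), (5000, "slow")],
    "voice": [(500, "natural"), (1000, "usable")],
    "autocomplete": [(200, "ok")],
}


def rate_ttfb(surface: str, ttfb_ms: float) -> str:
    """Return the label of the first band the measurement falls under."""
    for upper_ms, label in TTFB_BANDS_MS[surface]:
        if ttfb_ms < upper_ms:
            return label
    return "broken"
```

For example, `rate_ttfb("chat", 1200)` returns `"fast"`, while the same 1200ms on the voice surface returns `"broken"`.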
The stages and what each typically costs
Auth and rate limiting: 20-50ms. If this is >50ms you have a caching problem on your session store.
Retrieval: 50-300ms depending on index size, hybrid search, and metadata filters. Most teams overspend here because they're over-retrieving: 50 candidates when 20 would do.
Rerank: 50-200ms for cross-encoder on top-50 candidates.
Prompt assembly: 10-50ms. If this is measurably slow, you're doing something weird.
LLM TTFB: 300-800ms for the first token. Mostly out of your control but varies 2x between providers.
LLM token generation: ~50-80 tokens/sec for frontier models, so a 500-token response is 6-10 seconds.
Output guards: 30-100ms for the validation stack from our guardrails post.
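Adding up the low and high ends of these ranges gives a quick sanity check on what the fixed-cost stages contribute before any tokens are generated. This sketch assumes the stages run serially and leaves token generation out, since it scales with output length; the stage keys are illustrative.

```python
# (low, high) cost ranges in milliseconds for the fixed-cost stages above.
STAGE_RANGES_MS = {
    "auth_rate_limit": (20, 50),
    "retrieval": (50, 300),
    "rerank": (50, 200),
    "prompt_assembly": (10, 50),
    "llm_ttfb": (300, 800),
    "output_guards": (30, 100),
}


def range_total_ms(ranges):
    """Return (best_case, worst_case) totals, assuming serialized stages."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


best_case_ms, worst_case_ms = range_total_ms(STAGE_RANGES_MS)  # (460, 1500)
```

The spread from 460ms to 1500ms comes mostly from retrieval, rerank, and TTFB, which is where the allocation decisions matter.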
Where to actually optimize
LLM TTFB is the single biggest lever and the hardest to move. Routing to a smaller model when the task permits saves 200-400ms on TTFB. Prompt caching (Anthropic's prompt caching, OpenAI's cache) saves 300ms+ on repeat context. See our posts on routing and semantic caching.
Total output time (tokens × speed) often dominates. Shorter outputs are faster outputs. System-prompt instructions to be concise, max_tokens caps that enforce discipline, and UI patterns that let users ask for "more" instead of generating long responses by default — all help.
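The relationship is simple enough to compute directly: generation time is output tokens divided by generation speed. The sketch below uses hypothetical helper names and illustrates how a token cap bounds the worst case. It models only the time effect, not any particular provider's `max_tokens` behavior.

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time spent streaming the response once the first token has arrived."""
    return output_tokens / tokens_per_sec


def capped_generation_seconds(
    wanted_tokens: int, max_tokens: int, tokens_per_sec: float
) -> float:
    """Same as above, but the response stops at the cap."""
    return generation_seconds(min(wanted_tokens, max_tokens), tokens_per_sec)


uncapped_s = generation_seconds(500, 50)            # 10.0 seconds
capped_s = capped_generation_seconds(500, 200, 50)  # 4.0 seconds
```

At 50 tokens/sec, capping a would-be 500-token answer at 200 tokens removes six seconds, which is more than every other stage in the pipeline combined.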
Parallelize what you can. Retrieval can often run concurrently with other pre-processing on the path to the user's first token. Don't serialize calls that don't need to be serialized.
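As an illustration, here is a minimal sketch using `asyncio.gather` to overlap two independent stages. The `authenticate` and `retrieve` coroutines are hypothetical stand-ins that only sleep. Whether your stages are actually independent (for example, whether retrieval needs the authenticated user) is an assumption you would have to check.

```python
import asyncio
import time


async def authenticate(query: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for a 50ms auth call
    return "user-1"


async def retrieve(query: str) -> list:
    await asyncio.sleep(0.10)  # stand-in for a 100ms retrieval call
    return ["doc-a", "doc-b"]


async def run_serial(query: str):
    # Each stage waits for the previous one: roughly 50ms + 100ms.
    user = await authenticate(query)
    docs = await retrieve(query)
    return user, docs


async def run_parallel(query: str):
    # Both stages start together: roughly max(50ms, 100ms).
    user, docs = await asyncio.gather(authenticate(query), retrieve(query))
    return user, docs


def timed(pipeline, query: str):
    """Run a pipeline coroutine and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(pipeline(query))
    return result, time.perf_counter() - start
```

Calling `timed(run_serial, "q")` and `timed(run_parallel, "q")` should return the same result, with the parallel version finishing in about the time of the slower stage.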
Measuring in production
Instrument each stage. Your observability stack should show p50/p95/p99 for auth, retrieval, rerank, LLM TTFB, output guards — not just total latency. When total latency regresses, you need to know which stage caused it. See our observability post for the tooling.
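If you do not already have per-stage timing, a small in-process recorder is enough to start with. The sketch below is an assumption-laden stand-in for a real observability stack: `StageTimer` is a hypothetical class, it keeps samples in memory, and it computes percentiles with the standard library's `statistics.quantiles`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles


class StageTimer:
    """Collects per-stage latency samples and reports p50/p95/p99."""

    def __init__(self):
        self.samples_ms = defaultdict(list)

    def record(self, name: str, ms: float) -> None:
        self.samples_ms[name].append(ms)

    @contextmanager
    def stage(self, name: str):
        # Times the wrapped block, recording even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, (time.perf_counter() - start) * 1000)

    def percentiles(self, name: str) -> dict:
        # quantiles(n=100) returns 99 cut points; index k-1 is the kth
        # percentile. Needs at least two samples for the stage.
        cuts = quantiles(self.samples_ms[name], n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Usage is one `with timer.stage("retrieval"):` block around each stage of the request handler, followed by `timer.percentiles("retrieval")` when reporting.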
Alert per stage on p95, with thresholds at 1.5-2x the budget, and on p50 too: p50 tells you the normal case, p95 tells you how bad the bad day is.
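The threshold rule can be expressed as a small check that runs against whatever per-stage p95 values your monitoring exports. This is a sketch with hypothetical names; the default multiplier of 1.5 is the low end of the range given above.

```python
def breached_stages(budget_ms: dict, p95_ms: dict, multiplier: float = 1.5) -> dict:
    """Return {stage: observed_p95} for stages whose p95 exceeds
    multiplier x budget. Stages without a budget entry are ignored."""
    return {
        stage: observed
        for stage, observed in p95_ms.items()
        if stage in budget_ms and observed > multiplier * budget_ms[stage]
    }


# Illustrative numbers: retrieval is budgeted at 100ms but hits 400ms at p95.
budget_ms = {"auth": 50, "retrieval": 100, "llm_ttfb": 600}
observed_p95_ms = {"auth": 60, "retrieval": 400, "llm_ttfb": 850}

alerts = breached_stages(budget_ms, observed_p95_ms)
```

Here only retrieval fires: auth at 60ms is under its 75ms threshold and TTFB at 850ms is under its 900ms threshold.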