Load testing AI systems differs from load testing traditional APIs. Tokens per second matters more than RPS; streaming TTFT is a first-class metric; provider rate limits need probing; costs accumulate during tests. This post covers patterns for AI load testing that produce useful results without surprise bills or production incidents.
What to measure
Tokens per second system-wide. Not just RPS. A few long-context requests can consume more tokens than many short, simple ones, so RPS alone understates load.
TTFT (time to first token). For streaming responses, how quickly does the first token arrive? Often more user-relevant than total duration.
Tokens per second per stream. For streaming, throughput during the stream. Affects perceived responsiveness.
End-to-end latency. Request sent to last token. Total time to complete the response.
Error rates at various loads. Provider rate limiting? Queue overflow? Timeout?
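The per-request metrics above can be computed from token arrival timestamps. This is a minimal sketch, assuming the client records one timestamp per arriving token; in practice a streamed chunk may carry several tokens, so counts would come from the provider's usage data or a tokenizer. `StreamMetrics`, `stream_metrics` and `system_tokens_per_s` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float               # request sent -> first token
    total_s: float              # request sent -> last token (end-to-end latency)
    stream_tokens_per_s: float  # throughput between first and last token


def stream_metrics(request_sent: float, token_times: Sequence[float]) -> StreamMetrics:
    """Compute per-request metrics from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    first, last = token_times[0], token_times[-1]
    generation_s = last - first
    # Tokens after the first arrive during generation_s; a single-token
    # reply has no measurable stream rate, so report 0.0 for it.
    rate = (len(token_times) - 1) / generation_s if generation_s > 0 else 0.0
    return StreamMetrics(
        ttft_s=first - request_sent,
        total_s=last - request_sent,
        stream_tokens_per_s=rate,
    )


def system_tokens_per_s(total_tokens: int, window_s: float) -> float:
    """System-wide throughput: all tokens from all requests over the test window."""
    return total_tokens / window_s
```

Keeping TTFT and stream rate separate matters because they degrade for different reasons: TTFT tends to grow with queueing and input size, while stream rate reflects generation speed once the response has started.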
Realistic input generation
Don't use random strings. Produce realistic prompts representative of production traffic.
Sample from production data (sanitized). Use real patterns without real user data.
Distribution matters. Production has a mix of short and long inputs, simple and complex queries. The load test should match that mix.
Context size variation. Test with 1K, 10K, 100K token inputs. Performance can differ dramatically by context size.
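One way to match the production distribution is to bucket sanitized prompts by size and sample buckets by weight. A minimal sketch: the bucket names and the weights in `BUCKET_WEIGHTS` are placeholder assumptions and should be replaced with proportions measured from your own traffic.

```python
import random
from typing import Dict, List

# Hypothetical mix; replace with proportions measured from sanitized production logs.
BUCKET_WEIGHTS = {"short": 0.70, "medium": 0.25, "long": 0.05}


def sample_prompts(
    pool: Dict[str, List[str]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[str]:
    """Draw n prompts so that bucket frequencies follow `weights`."""
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    buckets = list(weights)
    chosen = rng.choices(buckets, weights=[weights[b] for b in buckets], k=n)
    return [rng.choice(pool[bucket]) for bucket in chosen]
```

Seeding the generator means two runs send the same prompt sequence, which makes before/after comparisons less noisy.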
Provider rate limits
Every provider has rate limits. Tokens per minute, requests per minute, concurrent requests. Limits vary by tier.
Probe limits carefully. Test your actual limits; don't assume. Each account may be different.
Test in an isolated environment. Triggering rate limits in production affects real users. Use separate accounts or staging.
Coordination with provider. For large tests, notify the provider in advance so the traffic is not flagged by abuse detection.
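Careful probing can be sketched as a stepwise ramp that stops as soon as throttling appears. This is an illustration under assumptions: the provider signals throttling with HTTP 429, `send` is a hypothetical callable that issues one request against an isolated account and returns the status code, and the 1% tolerance is arbitrary.

```python
import time
from typing import Callable, Iterable, Optional


def probe_request_limit(
    send: Callable[[], int],
    rates_per_min: Iterable[int],
    step_s: float = 60.0,
    max_429_ratio: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Ramp through increasing request rates; return the highest rate that
    stayed at or under max_429_ratio, or None if even the first step throttled."""
    last_ok = None
    for rate in rates_per_min:
        n = max(1, int(rate * step_s / 60))  # requests to send in this step
        interval = step_s / n                # even pacing within the step
        throttled = 0
        for _ in range(n):
            if send() == 429:
                throttled += 1
            sleep(interval)
        if throttled / n > max_429_ratio:
            break  # stop ramping as soon as the limit shows up
        last_ok = rate
    return last_ok
```

The loop is sequential, so it probes requests-per-minute limits only; token-per-minute and concurrency limits would need their own probes. The `sleep` parameter exists so the pacing can be replaced in tests.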
Cost management during testing
Token costs accumulate. An hour-long load test at high throughput can cost thousands of dollars.
Budget per test. Set one and enforce it. Alert if spend exceeds it.
Consider smaller models. Test with Haiku or a similarly small model and extrapolate. Cheaper, and often sufficient for exercising load patterns. Latency and rate limits can differ by model, so confirm final numbers on the production model.
Mock for non-functional tests. If you are testing application infrastructure under load (not the model itself), use mocked responses to avoid AI costs.
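Enforcing a budget can be as simple as a running total that the load generator updates after every response. A minimal sketch: `BudgetGuard` and `BudgetExceeded` are hypothetical names, and the per-million-token prices are parameters you would fill in from your provider's current pricing.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when recorded spend passes the test budget."""


@dataclass
class BudgetGuard:
    limit_usd: float
    input_usd_per_mtok: float    # assumed price per million input tokens
    output_usd_per_mtok: float   # assumed price per million output tokens
    alert_ratio: float = 0.8     # warn at 80% of budget (arbitrary default)
    spent_usd: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> bool:
        """Add one request's cost. Returns True once spend has reached the
        alert threshold; raises BudgetExceeded once it passes the limit."""
        self.spent_usd += (
            input_tokens / 1e6 * self.input_usd_per_mtok
            + output_tokens / 1e6 * self.output_usd_per_mtok
        )
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f}"
            )
        return self.spent_usd >= self.limit_usd * self.alert_ratio
```

The load generator would catch `BudgetExceeded` at the top level and stop issuing requests, so a misconfigured test fails fast instead of running up the bill.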
Load test scenarios
Baseline. Normal production traffic profile. Establishes reference point.
Spike. Sudden 3x, 5x, 10x traffic. Tests auto-scaling, rate limit behavior, graceful degradation.
Sustained peak. Extended duration at peak traffic. Tests sustained capacity, memory leaks, long-running stability.
Burst pattern. Alternating high and low traffic. Tests scale-up/scale-down responsiveness.
Provider outage. Simulate primary provider down; traffic shifts to secondary. Tests failover capacity.
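The traffic-shaped scenarios above can be expressed as a function from elapsed time to target request rate, which a load generator then follows. A sketch with made-up multipliers and window lengths; the provider-outage scenario is a routing change rather than a rate profile, so it is not covered here.

```python
def target_rps(
    scenario: str,
    t_s: float,
    baseline_rps: float,
    spike_factor: float = 5.0,   # hypothetical; try 3x, 5x, 10x
    peak_factor: float = 3.0,    # hypothetical peak-to-baseline ratio
    period_s: float = 60.0,      # hypothetical spike/burst window length
) -> float:
    """Target requests per second at t_s seconds into the test."""
    if scenario == "baseline":
        return baseline_rps
    if scenario == "spike":
        # One sudden jump during the second window, baseline otherwise.
        in_spike = period_s <= t_s < 2 * period_s
        return baseline_rps * spike_factor if in_spike else baseline_rps
    if scenario == "sustained_peak":
        return baseline_rps * peak_factor
    if scenario == "burst":
        # Alternate high and low windows, starting high.
        high = int(t_s // period_s) % 2 == 0
        return baseline_rps * peak_factor if high else baseline_rps
    raise ValueError(f"unknown scenario: {scenario}")
```

Defining profiles as plain functions keeps them independent of the tool that drives the traffic, so the same shapes can be reused across k6, Locust or a custom script.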
Tools
k6. General-purpose load testing. Adapts well to AI workloads with custom scripts.
Locust. Python-based, flexible. Popular for custom AI load testing.
Artillery. Similar positioning to k6: a general-purpose tool driven by scripted scenarios. Good for HTTP-based AI API testing.
Custom scripts. Many teams write custom scripts for specific AI workflows. Flexibility wins over general tools for complex scenarios.
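For a sense of what a custom script involves, here is a minimal asyncio driver that caps concurrency and records per-request outcomes. It is a sketch, not a replacement for the tools above: `call` is a hypothetical coroutine wrapping your own client that returns the number of output tokens, and `run_load` is an illustrative name.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List


async def run_load(
    call: Callable[[str], Awaitable[int]],
    prompts: List[str],
    concurrency: int,
) -> List[Dict]:
    """Send every prompt through `call`, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def one(prompt: str) -> Dict:
        async with sem:
            start = time.perf_counter()
            try:
                output_tokens = await call(prompt)
                return {
                    "ok": True,
                    "latency_s": time.perf_counter() - start,
                    "output_tokens": output_tokens,
                }
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                return {
                    "ok": False,
                    "latency_s": time.perf_counter() - start,
                    "error": type(exc).__name__,
                }

    # gather preserves input order, so results[i] belongs to prompts[i].
    return await asyncio.gather(*(one(p) for p in prompts))
```

A run would look like `asyncio.run(run_load(my_call, prompts, concurrency=20))`. Latency is measured after the semaphore is acquired, so it excludes time spent waiting in the client-side queue.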
Interpreting results
p95, p99, not just average. Tail latency matters enormously for user experience.
By request type. Short queries and long queries behave differently. Segment the analysis accordingly.
By scenario. Baseline vs spike vs sustained behave differently. Compare within scenario, not across.
Threshold identification. At what load does quality of service degrade meaningfully? Informs capacity planning.
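The analysis steps above can be sketched in a few functions: tail percentiles, segmentation by request type, and the lowest load at which a latency target is missed. The nearest-rank percentile definition and the function names are choices made for this illustration; other percentile definitions give slightly different values on small samples.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_by_type(
    records: Iterable[Tuple[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Group (request_type, latency_s) records and summarize each group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for request_type, latency_s in records:
        groups[request_type].append(latency_s)
    return {
        request_type: {
            "mean": sum(v) / len(v),
            "p95": percentile(v, 95),
            "p99": percentile(v, 99),
        }
        for request_type, v in groups.items()
    }


def degradation_threshold(
    p95_by_load: Dict[float, float], slo_s: float
) -> Optional[float]:
    """Lowest tested load whose p95 exceeds the target, or None if none does."""
    for load in sorted(p95_by_load):
        if p95_by_load[load] > slo_s:
            return load
    return None
```

With 98 requests at 0.1 s and 2 at 5.0 s, the mean is about 0.2 s and p95 is 0.1 s, but p99 is 5.0 s: the tail is only visible in the high percentiles.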
Integration with CI/CD
Pre-deploy load tests. Catch performance regressions before production.
Scheduled periodic tests. Weekly or monthly sustained peak tests. Catch drift.
Gate on results. Fail CI if load tests fail. Deploy only if performance is maintained.
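A gate can be a small comparison between the current run and a stored baseline. A minimal sketch, assuming every metric is lower-is-better (latency, error rate) and that a 10% regression tolerance is acceptable; both the tolerance and the names `check_regressions` and `exit_code` are illustrative.

```python
from typing import Dict, List


def check_regressions(
    current: Dict[str, float],
    baseline: Dict[str, float],
    max_regression: float = 0.10,
) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes.

    Every metric is treated as lower-is-better.
    """
    failures = []
    for name, base_value in baseline.items():
        if name not in current:
            failures.append(f"{name}: missing from current results")
            continue
        allowed = base_value * (1 + max_regression)
        if current[name] > allowed:
            failures.append(
                f"{name}: {current[name]:.3f} exceeds allowed {allowed:.3f}"
            )
    return failures


def exit_code(failures: List[str]) -> int:
    """Non-zero exit status makes the CI step fail."""
    return 1 if failures else 0
```

In a pipeline, the script would print the failures and finish with `sys.exit(exit_code(failures))`. Note that a baseline of exactly zero allows no regression at all, which may be too strict for error rates.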