eazyware
Engineering·August 28, 2023·10 min read

Load testing AI systems: patterns and pitfalls

Token-aware load generators, rate limit probing, concurrency exploration. Load testing AI differs meaningfully from load testing traditional APIs.

Kushal R.
Engineering lead

Load testing AI systems differs from load testing traditional APIs. Tokens per second matters more than RPS; streaming TTFT is a first-class metric; provider rate limits need probing; costs accumulate during tests. This post covers patterns for AI load testing that produce useful results without surprise bills or production incidents.

Traditional vs AI
[Figure: Load testing AI, differences. Traditional load testing: RPS targets, latency distributions, error rates under load, connection limits. AI load testing: tokens per second (not just RPS), TTFT and tokens/sec for streaming, provider rate limit probing, cost per test run. Practices: generate realistic prompts and contexts, since random strings don't represent prod; test in an isolated env or with carefully chosen accounts, so prod rate limits are not triggered.]
Traditional: RPS, latency, errors under load. AI: tokens/sec, TTFT + streaming, provider rate limits, cost per test.

What to measure

Tokens per second system-wide. Not just RPS. A handful of long-context requests at low RPS can consume more tokens than a flood of short requests at high RPS.

TTFT (time to first token). For streaming responses, how quickly does the first token arrive? Often more user-relevant than total duration.

Tokens per second per stream. For streaming, throughput during the stream. Affects perceived responsiveness.

End-to-end latency. Request sent to last token. Total time to complete the response, TTFT included.

Error rates at various loads. Break failures down by cause: provider rate limiting, queue overflow, timeouts.
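The per-request metrics above can be derived from the arrival timestamps of a streamed response. The following is a minimal sketch, not a reference implementation: `StreamMetrics`, `stream_metrics`, and `system_tokens_per_s` are hypothetical names, and it assumes one timestamp per token. Real providers often deliver several tokens per chunk, in which case the per-stream rate needs the actual token counts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float               # request sent -> first token
    e2e_latency_s: float        # request sent -> last token
    stream_tokens_per_s: float  # throughput between first and last token


def stream_metrics(sent_at: float, token_times: Sequence[float]) -> StreamMetrics:
    """Derive per-request metrics from the arrival time of each token.

    sent_at and token_times are seconds on the same clock (for example
    time.monotonic()). token_times must be non-empty and sorted ascending.
    """
    if not token_times:
        raise ValueError("no tokens received")
    first, last = token_times[0], token_times[-1]
    stream_duration = last - first
    # With a single token there is no inter-token interval to measure.
    rate = (len(token_times) - 1) / stream_duration if stream_duration > 0 else 0.0
    return StreamMetrics(
        ttft_s=first - sent_at,
        e2e_latency_s=last - sent_at,
        stream_tokens_per_s=rate,
    )


def system_tokens_per_s(total_tokens: int, window_s: float) -> float:
    """System-wide throughput: tokens from all requests in a time window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return total_tokens / window_s
```

Keeping TTFT and per-stream rate as separate fields matters because they tend to degrade for different reasons: TTFT is sensitive to queueing and prompt length, while the stream rate reflects generation throughput.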

Realistic input generation

Don't use random strings. Produce realistic prompts representative of production traffic.

Sample from production data (sanitized). Use real patterns without real user data.

Distribution matters. Production has a mix of short and long inputs, simple and complex queries. The load test should match that mix.

Context size variation. Test with 1K, 10K, 100K token inputs. Performance can differ dramatically by context size.
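One way to match a production mix is to group sanitized prompts into buckets and sample each bucket in proportion to its share of traffic. This is a sketch under assumptions: `PromptBucket` and `sample_prompts` are hypothetical names, and the weights in the usage note are placeholders that should come from your own traffic analysis.

```python
import random
from typing import NamedTuple


class PromptBucket(NamedTuple):
    name: str            # e.g. "short", "long_context" (labels are illustrative)
    weight: float        # share of production traffic for this bucket
    prompts: list[str]   # sanitized sample prompts belonging to this bucket


def sample_prompts(
    buckets: list[PromptBucket], n: int, seed: int = 0
) -> list[tuple[str, str]]:
    """Draw n (bucket_name, prompt) pairs so bucket shares follow the weights."""
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    chosen = rng.choices(buckets, weights=[b.weight for b in buckets], k=n)
    return [(b.name, rng.choice(b.prompts)) for b in chosen]
```

For example, three buckets weighted 0.70, 0.25, and 0.05 for roughly 1K, 10K, and 100K token inputs would keep long-context requests rare but present. Returning the bucket name with each prompt makes it possible to segment results by request type later.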

Provider rate limits

Every provider has rate limits. Tokens per minute, requests per minute, concurrent requests. Vary by tier.

Probe limits carefully. Test your actual limits; don't assume. Each account may be different.

Test in isolated env. Triggering rate limits in production affects real users. Use separate accounts or staging.

Coordination with provider. For large tests, notify the provider in advance so the traffic is not flagged by abuse detection.
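Careful probing can be sketched as a step ramp that stops at the first sign of throttling. The code below is illustrative only: `probe_rate_limit` is a hypothetical helper, and `run_step` stands in for whatever runs a short test at a given request rate and reports the fraction of requests rejected by the rate limiter (typically HTTP 429 responses).

```python
from typing import Callable, Optional


def probe_rate_limit(
    run_step: Callable[[int], float],
    start_rpm: int,
    max_rpm: int,
    step_rpm: int,
    max_throttled: float = 0.01,
) -> Optional[int]:
    """Ramp the request rate in steps and return the highest rate that
    stayed at or under the throttling threshold, or None if even the
    first step was throttled.

    run_step(rpm) runs a short test at that rate and returns the fraction
    of requests that were rejected with a rate-limit error.
    """
    last_ok = None
    for rpm in range(start_rpm, max_rpm + 1, step_rpm):
        if run_step(rpm) > max_throttled:
            break  # stop at once; do not keep pushing a throttled account
        last_ok = rpm
    return last_ok
```

The same ramp can be repeated with tokens per minute as the stepped quantity, since request and token limits are often enforced separately. Capping the ramp with `max_rpm` keeps a misconfigured probe from running away.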

Cost management during testing

Token costs accumulate. An hour-long load test at high throughput can consume thousands of dollars.

Budget per test. Set a budget and enforce it. Alert when spend exceeds it.

Consider smaller models. Test with Haiku or a smaller equivalent and extrapolate. Cheaper, and often sufficient for exercising load patterns, though latency and throughput numbers will not transfer directly to a larger model.

Mock for non-functional tests. If testing application infrastructure under load (not actual AI), use mocked responses to avoid AI costs.
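The cost of a run can be estimated before it starts and enforced while it runs. The sketch below uses hypothetical names (`Price`, `estimate_cost`, `BudgetGuard`), and the prices in the usage note are placeholders, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass
class Price:
    input_per_mtok: float   # USD per million input tokens (placeholder value)
    output_per_mtok: float  # USD per million output tokens (placeholder value)


def estimate_cost(
    rps: float, duration_s: float, avg_in: float, avg_out: float, price: Price
) -> float:
    """Estimated USD cost of a run at a steady request rate."""
    requests = rps * duration_s
    per_request = avg_in * price.input_per_mtok + avg_out * price.output_per_mtok
    return requests * per_request / 1_000_000


class BudgetGuard:
    """Tracks spend during a test and reports when the budget is exceeded."""

    def __init__(self, limit_usd: float, price: Price) -> None:
        self.limit_usd = limit_usd
        self.price = price
        self.spent_usd = 0.0

    def record(self, in_tokens: int, out_tokens: int) -> bool:
        """Add one response's usage; return False once the budget is exceeded."""
        self.spent_usd += (
            in_tokens * self.price.input_per_mtok
            + out_tokens * self.price.output_per_mtok
        ) / 1_000_000
        return self.spent_usd <= self.limit_usd
```

With assumed prices of $3 and $15 per million input and output tokens, one hour at 50 RPS with 2,000 input and 500 output tokens per request comes to `estimate_cost(50, 3600, 2000, 500, Price(3.0, 15.0))`, which is 2430.0 USD. A load generator would call `record` after every response and stop issuing requests as soon as it returns False.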

Load test scenarios

Baseline. Normal production traffic profile. Establishes reference point.

Spike. Sudden 3x, 5x, 10x traffic. Tests auto-scaling, rate limit behavior, graceful degradation.

Sustained peak. Extended duration at peak traffic. Tests sustained capacity, memory leaks, long-running stability.

Burst pattern. Alternating high and low traffic. Tests scale-up/scale-down responsiveness.

Provider outage. Simulate primary provider down; traffic shifts to secondary. Tests failover capacity.
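The rate-shaped scenarios above can be expressed as a function from elapsed time to target request rate, which a load generator polls as it runs. This is a sketch with made-up timings and multipliers (a 5x spike for one minute, two-minute burst phases); `target_rps` is a hypothetical name. The provider outage scenario is left out because it is a routing change rather than a rate shape.

```python
def target_rps(scenario: str, t: float, base_rps: float) -> float:
    """Target request rate t seconds into the run for a named scenario.

    All multipliers and timings are illustrative and should be tuned to
    your own traffic profile.
    """
    if scenario == "baseline":
        return base_rps
    if scenario == "spike":
        # 5x traffic for one minute, starting five minutes into the run.
        return base_rps * 5 if 300 <= t < 360 else base_rps
    if scenario == "sustained_peak":
        # Hold an assumed peak of 2x baseline for the whole run.
        return base_rps * 2
    if scenario == "burst":
        # Alternate two minutes high, two minutes low.
        high_phase = int(t // 120) % 2 == 0
        return base_rps * 3 if high_phase else base_rps * 0.5
    raise ValueError(f"unknown scenario: {scenario}")
```

Keeping the profile as a pure function makes a scenario easy to plot and review before any paid traffic is sent.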

Tools

k6. General-purpose load testing. Adapts well to AI workloads with custom scripts.

Locust. Python-based, flexible. Popular for custom AI load testing.

Artillery. Similar positioning. Good for HTTP-based AI API testing.

Custom scripts. Many teams write custom scripts for specific AI workflows. Flexibility wins over general tools for complex scenarios.
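As an illustration of the custom-script route, the following sketch drives a model call at bounded concurrency using only the Python standard library. `run_load` and `fake_model` are hypothetical names; `fake_model` is a stub that stands in for a real client call so the script can be exercised without spending tokens.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_load(
    call: Callable[[str], Awaitable[int]],
    prompts: list[str],
    concurrency: int,
) -> list[dict]:
    """Send every prompt through call() with at most `concurrency` in flight.

    call(prompt) must return the number of tokens generated. Each result
    records success, token count or error type, and wall-clock latency.
    """
    sem = asyncio.Semaphore(concurrency)
    results: list[dict] = []

    async def one(prompt: str) -> None:
        async with sem:
            start = time.monotonic()
            try:
                tokens = await call(prompt)
                outcome = {"ok": True, "tokens": tokens}
            except Exception as exc:  # record the failure instead of aborting the run
                outcome = {"ok": False, "error": type(exc).__name__}
            outcome["latency_s"] = time.monotonic() - start
            results.append(outcome)

    await asyncio.gather(*(one(p) for p in prompts))
    return results


async def fake_model(prompt: str) -> int:
    """Stub for a real provider call: sleeps briefly, returns a token count."""
    await asyncio.sleep(0.01)
    return len(prompt.split())
```

A run would look like `asyncio.run(run_load(fake_model, prompts, concurrency=5))`. Swapping `fake_model` for a real client call is the only change needed to move from infrastructure testing to testing against the provider.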

Interpreting results

p95, p99, not just average. Tail latency matters enormously for user experience.

By request type. Short queries and long queries behave differently. Segment the analysis.

By scenario. Baseline vs spike vs sustained behave differently. Compare within scenario, not across.

Threshold identification. At what load does quality of service degrade meaningfully? Informs capacity planning.
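Tail percentiles per segment can be computed with a few lines of standard-library Python. This sketch uses the nearest-rank definition of a percentile, one of several common definitions; `percentile` and `summarize` are hypothetical names.

```python
import math
from collections import defaultdict
from typing import Iterable


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: Iterable[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group (segment, latency_s) samples and report mean, p95, p99 per segment."""
    by_segment: dict[str, list[float]] = defaultdict(list)
    for segment, latency in samples:
        by_segment[segment].append(latency)
    return {
        seg: {
            "mean": sum(v) / len(v),
            "p95": percentile(v, 95),
            "p99": percentile(v, 99),
        }
        for seg, v in by_segment.items()
    }
```

A segment with nineteen 0.1 s responses and one 2.0 s response has a mean near 0.2 s and a p95 of 0.1 s, while its p99 is 2.0 s. Only the p99 exposes the slow request, which is the argument for reporting several percentiles rather than the average alone.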

Integration with CI/CD

Pre-deploy load tests. Catch performance regressions before production.

Scheduled periodic tests. Weekly or monthly sustained peak tests. Catch drift.

Gate on results. Fail CI if load tests fail. Deploy only if performance is maintained.
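A gate can be as simple as comparing measured metrics against upper limits and returning a nonzero exit code on any violation. The sketch below makes several assumptions: `check_thresholds` and `gate` are hypothetical names, the metric keys are made up, and every threshold is treated as an upper bound (latency, error rate). A throughput floor would need the comparison reversed.

```python
import sys


def check_thresholds(
    results: dict[str, float], thresholds: dict[str, float]
) -> list[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    failures = []
    for metric, limit in thresholds.items():
        value = results.get(metric)
        if value is None:
            # A missing metric fails the gate rather than passing silently.
            failures.append(f"{metric}: missing from results")
        elif value > limit:
            failures.append(f"{metric}: {value} exceeds limit {limit}")
    return failures


def gate(results: dict[str, float], thresholds: dict[str, float]) -> int:
    """Print violations to stderr and return a process exit code (0 = pass)."""
    failures = check_thresholds(results, thresholds)
    for line in failures:
        print("FAIL", line, file=sys.stderr)
    return 1 if failures else 0
```

In a pipeline, the load test step would end with `sys.exit(gate(results, thresholds))`, so the CI system sees the failure and blocks the deploy.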
