Replay testing captures production requests and re-runs them against candidate models. Unlike shadow mode, which runs candidates in parallel with live production, replay runs them after the fact against historical traffic. Both patterns have their place; replay is cheaper, faster to iterate on, and catches a different class of issues. This post describes the replay testing workflow we deploy and what it catches that evals miss.
Replay vs shadow mode
Shadow mode: live and synchronous, requires extra infrastructure, and doubles inference cost while the test runs. Captures the exact current-moment distribution.
Replay: offline, batch, cheaper, can iterate freely on candidate models. Uses captured historical traffic.
For most iteration phases, replay is more practical. An engineer proposes a prompt change, runs replay against yesterday's 10K sampled requests, and gets quality metrics in an hour. The same iteration loop in shadow mode would take days and be expensive.
Reserve shadow mode for final validation before promotion. Use replay for exploratory iteration.
Architecture
Traffic sampler: sits in the gateway and captures 1-5% of production requests. The request body, metadata (tenant, endpoint, timestamp), and the actual production response are stored.
Request store: requests are anonymized and PII is redacted before storage. Retention is bounded (typically 30-90 days). Access is controlled: not every engineer can read raw request data.
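As a rough illustration of the sampling step, here is a minimal sketch that picks a fixed fraction of requests by hashing the request ID. The `CapturedRequest` record, its field names, and the `should_sample` helper are assumptions made for this example, not the actual gateway code. Hashing rather than calling a random number generator is one possible design choice: it makes the decision repeatable for a given request ID.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CapturedRequest:
    """One sampled request; field names are illustrative."""
    request_id: str
    tenant: str
    endpoint: str
    timestamp: float
    body: str
    production_response: str


def should_sample(request_id: str, rate: float = 0.02) -> bool:
    """Return True for roughly `rate` of request IDs, deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32); scale it to [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

With `rate=0.02` this keeps about 2% of requests, inside the 1-5% range mentioned above.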
Replay engine: reads a time window of stored requests, executes each against a candidate model, and stores the candidate response. Parallelizable; a typical replay of 10K requests completes in 20-40 minutes using batch inference APIs. See batch inference post.
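The replay loop itself can be sketched in a few lines. This version assumes stored requests are plain dicts with `request_id`, `timestamp`, and `body` keys, and that the candidate model is wrapped in a hypothetical `candidate` callable. It uses a local thread pool for parallelism rather than a batch inference API, purely to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def select_window(stored: List[Dict], start: float, end: float) -> List[Dict]:
    """Keep requests whose timestamp falls in [start, end)."""
    return [r for r in stored if start <= r["timestamp"] < end]


def replay(requests: List[Dict], candidate: Callable[[str], str],
           max_workers: int = 8) -> List[Dict]:
    """Run every request body through `candidate` and collect the results."""

    def run_one(req: Dict) -> Dict:
        try:
            response = candidate(req["body"])
            return {"request_id": req["request_id"],
                    "candidate_response": response, "error": None}
        except Exception as exc:  # one failure should not abort the run
            return {"request_id": req["request_id"],
                    "candidate_response": None, "error": repr(exc)}

    # pool.map preserves input order, so results line up with requests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, requests))
```

Recording errors per request instead of raising keeps a single bad request from hiding the results of the other several thousand.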
Regression report: automated comparison of candidate responses vs stored production responses. Per-category breakdowns, diff examples, cost deltas.
What replay catches that evals miss
Real distribution of user queries. Your eval set has 300 cases; production saw 50,000 unique query patterns last week. Replay exercises the candidate on what users actually send, not only on the cases the eval authors thought to write.
Long-tail patterns. Issues that affect 1 in 10,000 queries don't appear in curated eval sets but do matter at scale. Replay at sufficient scale exposes them.
Integration bugs. A new model might call tools differently, produce different structured outputs, interact differently with retrieval. These integration patterns are hard to capture in evals. Replay through the full request pipeline exposes them.
Cost shifts. Suppose a candidate produces 30% longer responses on average. Evals typically miss this, because most evals don't track cost. Replay over real traffic shows the cost delta immediately.
PII handling is critical
Stored requests contain user data. Any replay system must handle this responsibly.
Redaction at capture time. PII scrubber on the sampler. Emails, phone numbers, names, account numbers, anything PII-adjacent. See PII redaction post.
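A capture-time scrubber might look roughly like the sketch below. The three regular expressions are illustrative assumptions and far from complete: they only cover simple email, phone, and long-digit account formats, and names generally need an entity-recognition step that a regex cannot provide.

```python
import re

# Illustrative patterns only; a real scrubber needs much broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ACCOUNT_RE = re.compile(r"\b\d{10,16}\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace matched spans with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    # Accounts run before phones: a long digit run would also match PHONE_RE.
    text = ACCOUNT_RE.sub("[ACCOUNT]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Placeholder tokens, as opposed to deleting the span, keep the request's shape intact so the replayed prompt still reads naturally to the candidate model.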
Retention policies. 30 days default; 90 days for compliance investigation needs. Auto-delete after retention period.
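The retention rule reduces to a small date comparison. In this sketch the record shape (`captured_at`, `compliance_hold`) is hypothetical, and `purge` returns the surviving records instead of deleting from a real store.

```python
from datetime import datetime, timedelta
from typing import Dict, List

DEFAULT_RETENTION = timedelta(days=30)
COMPLIANCE_RETENTION = timedelta(days=90)


def is_expired(captured_at: datetime, now: datetime,
               compliance_hold: bool = False) -> bool:
    """True once a record has outlived its retention period."""
    limit = COMPLIANCE_RETENTION if compliance_hold else DEFAULT_RETENTION
    return now - captured_at > limit


def purge(records: List[Dict], now: datetime) -> List[Dict]:
    """Return only the records that are still within retention."""
    return [r for r in records
            if not is_expired(r["captured_at"], now,
                              r.get("compliance_hold", False))]
```

A scheduled job could call `purge` once a day; passing `now` in explicitly keeps the logic easy to test.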
Access control. Replay data is privileged. Engineers who need it request access with justification; access is logged.
Scoring replays
Per-request: similarity score between production and candidate responses, any structural divergence (JSON valid, fields matched), cost and latency deltas.
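A per-request scorer along these lines could be sketched as follows. It uses `difflib.SequenceMatcher` from the standard library as a stand-in lexical similarity; an embedding-based measure may be a better fit in practice. The record keys (`response`, `cost_usd`, `latency_ms`) and the `required_fields` parameter are assumptions for the example.

```python
import difflib
import json
from typing import Dict, Iterable


def check_structure(candidate: str, required_fields: Iterable[str]) -> Dict[str, bool]:
    """Check that the candidate parses as JSON and has the expected keys."""
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return {"json_valid": False, "fields_matched": False}
    fields_ok = isinstance(parsed, dict) and all(f in parsed for f in required_fields)
    return {"json_valid": True, "fields_matched": fields_ok}


def score_request(prod: Dict, cand: Dict,
                  required_fields: Iterable[str] = ()) -> Dict:
    """Compare one production record with its candidate counterpart."""
    similarity = difflib.SequenceMatcher(
        None, prod["response"], cand["response"]).ratio()
    result = {
        "similarity": similarity,  # 1.0 means identical text
        "cost_delta": cand["cost_usd"] - prod["cost_usd"],
        "latency_delta_ms": cand["latency_ms"] - prod["latency_ms"],
    }
    result.update(check_structure(cand["response"], required_fields))
    return result
```

The structural check only makes sense for endpoints that are expected to return JSON; free-text endpoints would skip it.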
Aggregate: per-category pass rates, regression counts, cost summary. Sorted by impact so engineers see the biggest changes first.
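To make the aggregate view concrete, here is a minimal sketch that groups scored rows by category and sorts by regression count. The row fields (`category`, `similarity`, `json_valid`, `cost_delta`), the 0.8 similarity threshold, and the choice of regression count as the measure of "impact" are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate(scored: List[Dict], similarity_threshold: float = 0.8) -> List[Dict]:
    """Summarize scored rows per category, biggest regressions first."""
    buckets = defaultdict(list)
    for row in scored:
        buckets[row["category"]].append(row)

    report = []
    for category, rows in buckets.items():
        # A row regresses if its JSON broke or it drifted too far from production.
        regressions = sum(
            1 for r in rows
            if not r["json_valid"] or r["similarity"] < similarity_threshold)
        report.append({
            "category": category,
            "total": len(rows),
            "regressions": regressions,
            "pass_rate": (len(rows) - regressions) / len(rows),
            "cost_delta": sum(r["cost_delta"] for r in rows),
        })

    report.sort(key=lambda r: r["regressions"], reverse=True)
    return report
```

Sorting by a weighted score, for example regressions multiplied by category traffic share, would be another reasonable definition of impact.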
LLM judge for qualitative differences: when responses differ semantically, an LLM judge scores which one is better. Imperfect, but it scales.
Feedback loop to evals
When replay surfaces a regression pattern, add it to your eval set. That way the next iteration catches the same issue at eval time rather than waiting for replay. See eval infrastructure post.
The eval set grows organically from real-world regressions. Over a year, it comes to reflect the actual distribution of edge cases in your system.
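Promoting a replay regression into an eval case can be as simple as the sketch below. The eval case schema (`id`, `input`, `reference`, `source`, `note`) and the in-memory `eval_set` dict are hypothetical; a real eval set would more likely live in a versioned file or database.

```python
import hashlib
from typing import Dict


def to_eval_case(record: Dict, note: str) -> Dict:
    """Turn a regressed replay record into an eval case."""
    # Hash the input so the same request always maps to the same case ID.
    case_id = hashlib.sha256(record["body"].encode("utf-8")).hexdigest()[:12]
    return {
        "id": case_id,
        "input": record["body"],
        "reference": record["production_response"],
        "source": "replay",
        "note": note,
    }


def add_to_eval_set(eval_set: Dict[str, Dict], case: Dict) -> bool:
    """Add the case unless an identical input is already present."""
    if case["id"] in eval_set:
        return False
    eval_set[case["id"]] = case
    return True
```

Deduplicating on the input hash keeps the eval set from filling up with copies of the same regression when a pattern shows up many times in one replay.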
Rollout pattern
Most teams introduce replay after they have evals in place. Start with a small sample (1% of traffic, or 1,000 requests for a quick test). Scale up as it proves useful. Mature systems run replay as part of CI: every prompt change triggers a replay against recent traffic before merge.
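A CI gate on top of replay results might be sketched like this. The thresholds (98% pass rate, 10% cost increase) and the function names are assumptions chosen for illustration; each team would set its own limits.

```python
from typing import List


def replay_gate(pass_rate: float, cost_change: float,
                min_pass_rate: float = 0.98,
                max_cost_increase: float = 0.10) -> List[str]:
    """Return human-readable reasons to block the merge; empty means pass."""
    failures = []
    if pass_rate < min_pass_rate:
        failures.append(
            f"pass rate {pass_rate:.1%} is below {min_pass_rate:.1%}")
    if cost_change > max_cost_increase:
        failures.append(
            f"cost change {cost_change:+.1%} exceeds {max_cost_increase:+.1%}")
    return failures


def exit_code(failures: List[str]) -> int:
    """Map gate failures to a process exit code for the CI runner."""
    return 1 if failures else 0
```

A CI step would print the failure messages and exit with `exit_code(failures)`, so a non-zero status blocks the merge.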