Shadow mode testing is the practice of running a candidate model or prompt in parallel with production, silently, on real traffic. Outputs aren't returned to users; they're compared offline against the current production outputs. It's the gold standard for validating major AI changes before cutover, because it catches real-distribution issues that curated evals miss. This post covers the architecture we deploy and the specific comparison metrics that matter.
Why shadow testing complements evals
Evals test curated cases. Production traffic is messier — edge cases you didn't think to add, distributional patterns that differ from your eval set, integration interactions with retrieval or tools that evals simplify away.
Shadow mode captures all of this because it runs on actual production traffic. Whatever users are asking today, the candidate model is answering in parallel. You see real-distribution behavior.
Evals and shadow mode are complementary, not substitutes. Evals catch regressions in known categories quickly (minutes). Shadow mode catches regressions in unknown categories over days of traffic. Run both before cutover.
Architecture
Production path: user request -> current production model -> response to user (unchanged).
Shadow path: same request -> async fork -> candidate model -> candidate response stored in shadow log.
Async is critical. Shadow calls must not affect user-visible latency. Fire-and-forget with a queue; candidate model runs on a separate inference path.
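To make the fire-and-forget fork concrete, here is a minimal sketch using asyncio. The names call_production, call_candidate, shadow_worker, and the in-memory shadow_log are all hypothetical stand-ins; a real deployment would more likely use a durable queue and a separate worker process so the candidate runs on its own inference path.

```python
import asyncio

shadow_log = []  # hypothetical stand-in for a durable shadow log


async def call_production(request: str) -> str:
    # Placeholder for the real production model call.
    return f"prod:{request}"


async def call_candidate(request: str) -> str:
    # Placeholder for the candidate model call; may be slow or fail.
    await asyncio.sleep(0.01)
    return f"cand:{request}"


async def shadow_worker(queue: asyncio.Queue) -> None:
    # Drains the queue off the user path and records candidate outputs.
    while True:
        request_id, request, prod_response = await queue.get()
        try:
            cand_response = await call_candidate(request)
            shadow_log.append({"id": request_id, "production": prod_response,
                               "candidate": cand_response})
        except Exception as exc:
            # A candidate failure is logged, never surfaced to the user.
            shadow_log.append({"id": request_id, "production": prod_response,
                               "error": repr(exc)})
        finally:
            queue.task_done()


async def handle_request(request_id: int, request: str,
                         queue: asyncio.Queue) -> str:
    response = await call_production(request)
    try:
        # Non-blocking enqueue: the user never waits on the candidate.
        queue.put_nowait((request_id, request, response))
    except asyncio.QueueFull:
        pass  # drop the shadow sample rather than add latency
    return response


async def demo() -> list:
    queue = asyncio.Queue(maxsize=1000)
    worker = asyncio.create_task(shadow_worker(queue))
    responses = [await handle_request(i, f"q{i}", queue) for i in range(3)]
    await queue.join()  # only so the demo can inspect the log
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return responses
```

The design choice worth noting is the bounded queue with put_nowait: under load the system sheds shadow samples instead of slowing production.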
Comparator: daily batch job reads shadow log; for each request, compares production output vs candidate output; computes similarity score, quality score, cost delta, latency delta.
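One way the comparator's per-request step could look, as a sketch. The ShadowRecord fields and the pluggable similarity and quality scorers are assumptions for illustration, not a fixed schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ShadowRecord:
    # Hypothetical shape of one shadow log entry.
    request_id: str
    prod_output: str
    cand_output: str
    prod_tokens: int
    cand_tokens: int
    prod_latency_ms: float
    cand_latency_ms: float


def compare_record(rec: ShadowRecord,
                   similarity: Callable[[str, str], float],
                   quality: Callable[[str], float]) -> Dict[str, object]:
    # Positive deltas mean the candidate is higher than production.
    return {
        "request_id": rec.request_id,
        "similarity": similarity(rec.prod_output, rec.cand_output),
        "quality_delta": quality(rec.cand_output) - quality(rec.prod_output),
        "token_delta": rec.cand_tokens - rec.prod_tokens,
        "latency_delta_ms": rec.cand_latency_ms - rec.prod_latency_ms,
    }


def run_comparator(records: Iterable[ShadowRecord],
                   similarity: Callable[[str, str], float],
                   quality: Callable[[str], float]) -> List[Dict[str, object]]:
    # The daily batch job: one comparison row per shadowed request.
    return [compare_record(r, similarity, quality) for r in records]
```

Passing the scorers in as callables keeps the batch job independent of whichever embedding model or judge is used.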
What to compare
Semantic similarity. Embedding-based similarity between production and candidate outputs. High similarity means candidate is producing similar answers. Low similarity means candidate is meaningfully different (could be better or worse).
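As a sketch of the similarity check, assuming embeddings have already been computed by whatever embedding model you use (not shown here). The 0.8 threshold is an illustrative value, not a recommendation.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_a * norm_b)


def flag_divergent(
    pairs: Iterable[Tuple[str, Sequence[float], Sequence[float]]],
    threshold: float = 0.8,  # illustrative cutoff; tune per task
) -> List[str]:
    # Returns request ids whose production and candidate outputs diverge.
    return [rid for rid, prod_vec, cand_vec in pairs
            if cosine_similarity(prod_vec, cand_vec) < threshold]
```

Flagged requests are the ones worth sending to a judge or a human, since low similarity alone does not say which output is better.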
Output structure validity. For structured outputs (JSON, function calls), is the candidate producing valid output at the same rate? Regressions in validity are immediate red flags.
Key field accuracy. For extraction tasks, do specific fields (names, amounts, dates) match? Low field accuracy suggests the candidate is extracting differently in ways that matter.
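The structure validity and key field checks can be sketched as follows, assuming JSON outputs. Note that field_match_rates measures agreement with production, not correctness against ground truth; a disagreement may mean the candidate is right and production is wrong.

```python
import json
from typing import Dict, List, Sequence, Tuple


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def validity_rate(outputs: Sequence[str]) -> float:
    # Fraction of outputs that parse as JSON.
    if not outputs:
        return 0.0
    return sum(is_valid_json(o) for o in outputs) / len(outputs)


def field_match_rates(pairs: Sequence[Tuple[dict, dict]],
                      fields: List[str]) -> Dict[str, float]:
    # pairs holds (production_fields, candidate_fields) per request.
    rates = {}
    for field in fields:
        matches = sum(1 for prod, cand in pairs
                      if prod.get(field) == cand.get(field))
        rates[field] = matches / len(pairs) if pairs else 0.0
    return rates
```

Compute validity_rate separately for production and candidate outputs and compare the two rates; a drop on the candidate side is the red flag.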
Cost. Average tokens per response. A candidate that produces 2x longer outputs roughly doubles output-token cost, which might be a feature or a regression depending on quality.
Latency. p95 and p99 for the candidate, compared against production. If the candidate trades response quality against speed, is that trade acceptable?
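For the cost and latency comparison, a small sketch using a nearest-rank percentile. The function names are illustrative, and other percentile definitions (for example, interpolated ones) would give slightly different values on small samples.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    # Nearest-rank percentile: smallest value with at least pct% at or below it.
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def token_cost_ratio(prod_tokens: Sequence[int],
                     cand_tokens: Sequence[int]) -> float:
    # Ratio of average candidate tokens to average production tokens.
    return mean(cand_tokens) / mean(prod_tokens)
```

A ratio of 2.0 from token_cost_ratio corresponds to the "2x longer outputs" case; percentile(latencies, 95) and percentile(latencies, 99) give the tail figures to compare per model.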
LLM-as-judge for quality scoring
For qualitative comparison, use an LLM judge. Prompt: given the input and two responses, which is better and why? Compare thousands of cases in batch at reasonable cost.
Judges are imperfect — they have their own biases (preferring longer answers, verbose explanations). Use them as one signal among several, not as ground truth. See eval infrastructure post.
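A sketch of a pairwise judge call. The prompt wording and the call_judge callable (whatever client you use to reach the judge model) are assumptions. The order of the two responses is randomized per request because judges are also known to favor a particular position; that is an addition to the biases listed above, not a fix for length bias.

```python
import random
from typing import Callable

JUDGE_TEMPLATE = (
    "You are comparing two responses to the same input.\n"
    "Input:\n{input}\n\n"
    "Response A:\n{a}\n\n"
    "Response B:\n{b}\n\n"
    "Which response is better and why? Answer 'A', 'B', or 'TIE' on the "
    "first line, then give a brief justification."
)


def judge_pair(user_input: str, prod_output: str, cand_output: str,
               call_judge: Callable[[str], str],
               rng: random.Random) -> str:
    # Randomize which side the candidate appears on.
    cand_is_a = rng.random() < 0.5
    a, b = (cand_output, prod_output) if cand_is_a else (prod_output, cand_output)
    reply = call_judge(JUDGE_TEMPLATE.format(input=user_input, a=a, b=b))
    lines = reply.strip().splitlines()
    verdict = lines[0].strip().upper() if lines else ""
    # Map the positional verdict back to production vs candidate.
    if verdict == "A":
        return "candidate" if cand_is_a else "production"
    if verdict == "B":
        return "production" if cand_is_a else "candidate"
    return "tie"  # includes unparseable replies
```

Aggregating the returned labels over a batch gives a candidate win rate, which is one signal among several.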
When to promote
Our default gate: N days of shadow testing (typically 7-14), composite score staying within 5% of production across all dimensions, no regressions in specific critical categories.
If the candidate is meaningfully better (composite score 10%+ higher), surface that too: it is the clearest case for promoting the candidate.
If the scores are a wash, look at cost and latency. If the candidate is cheaper or faster with equivalent quality, promote.
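The gate can be written down as a small decision function. This is one reading of the rules above, under these assumptions: every score is higher-is-better, the composite is the mean relative change across dimensions, and the outcome labels are made up for illustration.

```python
from statistics import mean
from typing import Dict, List


def promotion_decision(prod_scores: Dict[str, float],
                       cand_scores: Dict[str, float],
                       critical_regressions: List[str],
                       days_observed: int,
                       min_days: int = 7,
                       tolerance: float = 0.05,
                       clear_win: float = 0.10,
                       cand_cheaper_or_faster: bool = False) -> str:
    if days_observed < min_days:
        return "keep_shadowing"
    if critical_regressions:
        return "reject"  # any regression in a critical category blocks
    # Relative change per dimension; assumes production scores are non-zero.
    rel = {dim: (cand_scores[dim] - prod) / prod
           for dim, prod in prod_scores.items()}
    if any(change < -tolerance for change in rel.values()):
        return "reject"  # some dimension fell more than 5% below production
    if mean(rel.values()) >= clear_win:
        return "promote"  # meaningfully better overall
    if cand_cheaper_or_faster:
        return "promote"  # quality is a wash, but cost or latency improves
    return "no_clear_benefit"
```

For example, a candidate that matches production quality after 10 days and is cheaper returns "promote", while the same candidate at day 3 returns "keep_shadowing".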
Privacy and cost considerations
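Shadow mode roughly doubles your inference cost for the duration of the test when every request is shadowed. Budget accordingly.
Sampling: for high-volume systems, shadow at 10-20% instead of 100%. That is usually still enough traffic for stable estimates on most metrics, though rare categories accumulate samples more slowly; it cuts shadow cost proportionally.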
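One way to implement the sampling is a deterministic hash of the request id, sketched below. The function name and the assumption that every request carries a stable id are illustrative; a random draw per request would also work but would not be reproducible.

```python
import hashlib


def in_shadow_sample(request_id: str, rate: float) -> bool:
    # Deterministic: the same request id always gets the same decision.
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)
```

Hashing rather than drawing a random number means retries of the same request land in the same bucket, and the comparator can later reconstruct which requests were eligible.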
PII: shadow log contains production data. Same retention, access control, and redaction as production logs. See PII redaction post.
Pairs with canary for full rollout
Shadow mode validates before any user sees the candidate. Canary deployment validates at small scale once the candidate goes live. Together with offline evals, they form a four-stage rollout: offline evals, shadow mode, canary, full. See canary deployments post.