eazyware
Engineering·September 4, 2023·11 min read

A/B testing AI features at scale

Statistical power for probabilistic systems, guardrail metrics, sequential testing. How to run meaningful experiments on AI features.

Kushal R.
Engineering lead

A/B testing AI features at scale requires statistical rigor adapted for probabilistic systems: larger samples to reach the same statistical power, guardrail metrics to avoid catastrophic deploys, sequential testing for fast iteration, and holdout cohorts to measure long-term effects. This post covers how to run meaningful AI A/B tests without falling into either false positives or paralysis-by-caution.

Key patterns
Statistical power. Probabilistic systems need larger samples.
Guardrail metrics. Watch latency, cost, and errors during the experiment.
Sequential testing. Check frequently, stop early on strong signals.
Cohort holdout. Measure long-term effects against a never-treated cohort.
Tools. Eppo, Statsig, Optimizely, or homegrown on Snowflake/BigQuery.

Statistical power for probabilistic systems

AI outputs vary per request. The same input can produce different outputs, and the same user can have different experiences across sessions.

Higher variance means larger samples for the same statistical power. Where a typical SaaS A/B test might need thousands of users, an AI A/B test may need tens of thousands, depending on the metric's variance and the effect size you want to detect.
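To make the variance point concrete, here is a minimal sketch of the standard normal-approximation sample size formula for comparing two means. The function name, the 5% significance level, the 80% power target, and the example numbers are assumptions for illustration; a dedicated power calculator should still be used for real planning.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(std_dev: float, min_detectable_diff: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm for a two-sided, two-sample z-test on a mean.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * std_dev / min_detectable_diff) ** 2
    return ceil(n)


# Hypothetical metric: same target difference, but the AI feature doubles
# the per-user standard deviation of the metric.
low_variance = samples_per_arm(std_dev=1.0, min_detectable_diff=0.1)
high_variance = samples_per_arm(std_dev=2.0, min_detectable_diff=0.1)
print(low_variance, high_variance)
```

Because the standard deviation enters the formula squared, doubling it roughly quadruples the required sample.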

Power calculations. Use proper power calculation tools (Evan Miller's calculator, or tool-specific calculators). Don't eyeball sample size.

Effect size considerations. Smaller effects require more samples. Required sample size grows roughly with the inverse square of the effect size, so detecting a 1% lift needs on the order of 100 times the sample needed for a 10% lift.
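The effect-size relationship can be sketched for a conversion-style metric as follows. This assumes "lift" means relative lift over a baseline rate, uses the simple unpooled normal approximation (dedicated calculators may use slightly different formulas), and the 20% baseline rate is a made-up example value.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm_for_lift(baseline_rate: float, relative_lift: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm to detect a relative lift in a proportion."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return ceil(n)


# Hypothetical baseline: 20% task completion rate.
n_for_10pct = users_per_arm_for_lift(baseline_rate=0.20, relative_lift=0.10)
n_for_1pct = users_per_arm_for_lift(baseline_rate=0.20, relative_lift=0.01)
print(n_for_10pct, n_for_1pct)
```

Under these assumptions the 1% case needs close to two orders of magnitude more users than the 10% case, which is why the minimum effect worth detecting should be agreed before the experiment starts.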

Guardrail metrics

Latency. Treatment arm must not dramatically slow the experience. Alert if p95 latency degrades beyond a threshold agreed before launch.

Cost. Treatment may use more expensive model or longer responses. Alert if cost per user rises significantly.

Error rate. Treatment may produce more errors. Alert if error rate exceeds baseline + tolerance.

User satisfaction proxy. Thumbs down rate, complaint rate. Alert if quality regresses visibly.

Guardrails trigger early stop. If a guardrail is breached, stop the experiment, even if the primary metric is improving.
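The guardrail rules above can be sketched as a small check that runs alongside the experiment. Everything here is hypothetical: the `Guardrail` class, the metric names, the baselines, and the choice of a relative tolerance over baseline are illustrative assumptions, not a specific platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    name: str
    baseline: float
    tolerance: float  # allowed relative increase over baseline, 0.10 = +10%

    def breached(self, observed: float) -> bool:
        # All guardrails here are "lower is better" metrics.
        return observed > self.baseline * (1 + self.tolerance)


def breached_guardrails(guardrails, observed):
    """Return the names of guardrails whose observed value exceeds the limit."""
    return [g.name for g in guardrails if g.breached(observed[g.name])]


# Hypothetical control-arm baselines and tolerances.
GUARDRAILS = [
    Guardrail("p95_latency_ms", baseline=1200.0, tolerance=0.10),
    Guardrail("cost_per_user_usd", baseline=0.05, tolerance=0.20),
    Guardrail("error_rate", baseline=0.01, tolerance=0.25),
    Guardrail("thumbs_down_rate", baseline=0.04, tolerance=0.15),
]

# Hypothetical treatment-arm readings.
treatment = {
    "p95_latency_ms": 1250.0,
    "cost_per_user_usd": 0.07,
    "error_rate": 0.011,
    "thumbs_down_rate": 0.041,
}

breaches = breached_guardrails(GUARDRAILS, treatment)
stop_experiment = bool(breaches)  # stop regardless of the primary metric
print(breaches, stop_experiment)
```

In practice the observed values are noisy estimates, so a production check would likely compare confidence bounds rather than point values; the sketch only shows the decision rule.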

Sequential testing

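Classical (fixed-horizon) A/B tests require a predetermined sample size, and their error guarantees only hold if results are evaluated once, at that size. That is inflexible when reality changes.

Sequential testing methods. Always-valid p-values, mSPRT (mixture sequential probability ratio test), group sequential tests. These allow looking at data during the experiment without inflating the false positive rate.

Practical effect. Can stop early if the signal is strong; continue longer if the signal is ambiguous. This typically reduces average experiment duration, at the cost of somewhat less power than a fixed-horizon test run to the same maximum sample size.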
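As a rough illustration of how an always-valid p-value works, here is a minimal mSPRT sketch for a stream of per-pair differences (treatment minus control) under a normal model. It assumes the variance of a single difference is known and uses a normal mixing distribution with variance `tau_sq` centered on zero effect; both parameters and the function name are assumptions for illustration, and commercial platforms use more elaborate variants.

```python
from math import exp, log


def always_valid_p_values(differences, sigma_sq: float, tau_sq: float):
    """Running always-valid p-values for H0: mean difference = 0.

    differences: observed treatment-minus-control differences, in arrival order
    sigma_sq:    assumed known variance of a single difference
    tau_sq:      variance of the normal mixing prior over the true effect
    """
    p_value = 1.0
    running_sum = 0.0
    p_values = []
    for n, diff in enumerate(differences, start=1):
        running_sum += diff
        mean = running_sum / n
        denom = sigma_sq + n * tau_sq
        # Log of the mixture likelihood ratio after n observations.
        log_lr = (0.5 * log(sigma_sq / denom)
                  + (n * n * tau_sq * mean * mean) / (2 * sigma_sq * denom))
        # The p-value can only go down: it is the running minimum of 1 / LR.
        p_value = min(p_value, exp(-log_lr))
        p_values.append(p_value)
    return p_values


# Toy inputs: no effect at all versus a large, consistent effect.
no_effect = always_valid_p_values([0.0] * 50, sigma_sq=1.0, tau_sq=1.0)
big_effect = always_valid_p_values([1.0] * 50, sigma_sq=1.0, tau_sq=1.0)
print(no_effect[-1], big_effect[-1])
```

The experiment can be stopped the first time the running p-value drops below the chosen significance level, at any point in the stream, which is what makes checking frequently safe under this model.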

Tools. Optimizely, Statsig, Eppo all support sequential methods. DIY implementations viable for sophisticated teams.

Cohort holdout for long-term effects

Some effects only show over time. User retention, engagement quality, compound effects.

Hold out cohort. 5-10% of users never see new feature, ever. Acts as long-term baseline.

Compare treatment users at 30, 60, 90 days to holdout. Long-term effects become measurable.
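A holdout only works if assignment is stable over months, so it is usually derived from the user ID and not stored per session. One common way to do that is deterministic hashing, sketched below; the salt string, the 5% share, and the function name are assumptions for illustration.

```python
import hashlib


def in_holdout(user_id: str, salt: str = "long-term-holdout-v1",
               holdout_pct: float = 0.05) -> bool:
    """Deterministically place a fixed share of users in the holdout cohort."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < holdout_pct


# Toy population to check that the share lands near the target.
users = [f"user-{i}" for i in range(20000)]
holdout_share = sum(in_holdout(u) for u in users) / len(users)
print(round(holdout_share, 3))
```

Keeping the salt fixed for the life of the holdout ensures the same users stay untreated at the 30, 60, and 90 day comparisons; changing the salt reshuffles the cohort and resets the baseline.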

Cost. Revenue forgone for holdout users. Worth it for systemic understanding.

Practical examples

Model version A/B. New model vs old. Primary: quality (user rating). Guardrails: latency, cost. Sequential testing common.

Prompt A/B. New system prompt vs old. Primary: task completion rate. Guardrails: cost (token usage), error rate.

Feature A/B. Show AI suggestion vs not. Primary: user action completion. Guardrails: session duration, errors.

UX A/B. How to present AI output. Primary: user engagement. Guardrails: subjective ratings, complaints.

Common pitfalls

Peeking. Looking at results before predetermined sample size without using sequential methods. Inflates false positive rate.
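The inflation from peeking is easy to reproduce in simulation. The sketch below generates A/A-style data with no true effect, applies an ordinary two-sided z-test at ten interim looks, and compares the resulting false positive rate with a single test at the end. The simulation sizes and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(n_sims: int = 2000, n_obs: int = 500,
                         n_looks: int = 10, alpha: float = 0.05,
                         seed: int = 0):
    """Return (rate with peeking, rate with one final look) under no effect."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    look_every = n_obs // n_looks
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        total = 0.0
        rejected_at_any_look = False
        for n in range(1, n_obs + 1):
            total += rng.gauss(0.0, 1.0)  # unit-variance noise, true effect 0
            if n % look_every == 0 and abs(total / sqrt(n)) > z_crit:
                rejected_at_any_look = True
        if rejected_at_any_look:
            peeking_hits += 1
        if abs(total / sqrt(n_obs)) > z_crit:
            fixed_hits += 1
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(peeking_rate, fixed_rate)
```

The single final look stays near the nominal 5%, while stopping at the first "significant" interim look produces a noticeably higher rate, even though nothing changed between the arms.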

Multiple comparisons. Running many tests; some appear significant by chance. Use Bonferroni correction or false discovery rate (FDR) methods such as Benjamini-Hochberg.
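Both corrections are short enough to sketch directly. The functions below take a list of p-values and return which hypotheses are rejected; the example p-values are made up for illustration.

```python
def bonferroni(p_values, alpha: float = 0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha: float = 0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below k * alpha / m.
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * alpha / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values from five metrics in one experiment.
p_values = [0.001, 0.012, 0.025, 0.038, 0.20]
bonferroni_result = bonferroni(p_values)
bh_result = benjamini_hochberg(p_values)
print(bonferroni_result)
print(bh_result)
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected share of false positives among the rejections, so it typically keeps more findings when many metrics are tested.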

Variance ignorance. Assuming low variance where AI systems produce high variance. Under-powered tests yield unclear results.

Guardrail neglect. Only measuring primary metric; ignoring side effects on cost, latency, errors. Deploying 'winner' that's actually worse for business.

Organizational practices

Experiment platform. Centralized service for running experiments. Shared tooling across teams.

Experiment review. Weekly review of active and recently completed experiments. Calibration, knowledge sharing.

Culture of experimentation. Default to A/B testing for user-visible changes. Results, not opinions, drive decisions.

Sanity checks. Periodic A/A tests (both arms identical) to validate infrastructure. Catch bugs in experiment platform.
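An A/A check can be as simple as running the usual significance test on two identical arms. The sketch below uses a pooled two-proportion z-test; the function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, nothing to distinguish
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical A/A result: both arms ran the identical experience.
healthy_aa = two_proportion_p_value(1000, 10000, 1010, 10000)
# Hypothetical A/A result that would point to an assignment or logging bug.
suspicious_aa = two_proportion_p_value(1000, 10000, 1200, 10000)
print(healthy_aa, suspicious_aa)
```

A single A/A test will still come out "significant" about 5% of the time at a 5% threshold, so the useful signal is the rate across many A/A runs: if it sits well above the threshold, the platform is likely miscounting or misassigning.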
