A/B testing for LLM features is a trap for teams that apply traditional experimentation playbooks without adjustment. The underlying statistics break in ways that aren't immediately obvious. Results that look statistically significant often aren't, and effects that look null can still matter. This post covers the specific adjustments we make to run A/B tests that produce actionable results, and the common failure modes that invalidate them.
Why traditional A/B breaks for LLMs
High output variance. With non-zero temperature, the same prompt produces different responses across calls, so user-observed quality varies widely even within a fixed variant. That extra variance increases the sample size needed to detect an effect.
User-level effects dominate session-level ones. A user who is frustrated with a new model for two sessions might churn; a test randomized at the session level misses this pattern entirely. Effects need to be measured across users and over time, not within sessions.
Novelty effects. Users exposed to a new AI feature engage more for the first 1-2 weeks, then regress. Short-duration tests systematically overestimate value. What looks like a 15% lift at week 1 is often flat by week 4.
Multi-dimensional quality. Traditional A/B testing optimizes a single metric (conversion, revenue). AI feature quality is multi-dimensional: accuracy, latency, cost, and user satisfaction. Optimizing one can regress another.
What to change
Run longer. A/B tests for AI features usually need at least 2-4 weeks. One week is rarely enough to see past novelty effects. Budget test duration accordingly. If you can't wait that long, don't A/B test; run offline evals instead.
Randomize at user level, not session level. A user sees the same variant across all their sessions. This captures the real effect of the experience over time.
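One common way to get sticky user-level assignment is to hash the user ID together with an experiment name. The sketch below is a minimal illustration, not a description of any particular experimentation platform; the function name `assign_variant`, the experiment label, and the variant names are all hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for one experiment.

    The same (user_id, experiment) pair always yields the same variant,
    so a user sees one experience across all of their sessions.
    """
    # Salt with the experiment name so that separate experiments
    # bucket the same user independently of each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Because the assignment is a pure function of the user ID, no session state or lookup table is needed to keep a user in the same variant.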
Track metrics by week. Performance at day 1, day 7, and day 14 often tells a different story than the aggregate. A sustained effect at week 4 is worth more than a spike at week 1.
Use composite quality scores. Weight accuracy, latency, and cost appropriately for your product. Report the components; decide on the composite. Optimizing a single metric misleads.
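A rough sketch of a week-by-week breakdown, assuming each logged event carries the variant, the date the user was first exposed, the event date, and a success flag. The row layout and the name `weekly_lift` are illustrative assumptions. For simplicity this pools events rather than aggregating per user first, which a real analysis would likely want to do.

```python
from collections import defaultdict
from datetime import date

def weekly_lift(rows):
    """Relative lift of treatment over control, by week since first exposure.

    rows: iterable of (variant, first_exposure: date, event_day: date, success)
          where variant is "control" or "treatment" (hypothetical layout).
    Returns {week_index: lift}, with week 0 being the first 7 days.
    """
    # totals[week][variant] = [successes, events]
    totals = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for variant, first_exposure, event_day, success in rows:
        week = (event_day - first_exposure).days // 7
        cell = totals[week][variant]
        cell[0] += int(success)
        cell[1] += 1

    lifts = {}
    for week in sorted(totals):
        control_s, control_n = totals[week]["control"]
        treat_s, treat_n = totals[week]["treatment"]
        if control_n == 0 or treat_n == 0 or control_s == 0:
            continue  # cannot form a ratio for this week
        lifts[week] = (treat_s / treat_n) / (control_s / control_n) - 1.0
    return lifts
```

A lift that is large in week 0 and near zero by week 3 is the novelty pattern described earlier.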
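One possible shape for a composite score is a weighted average of components normalized to a 0-1 scale. The sketch below assumes that form; the metric names, weights, and bounds are placeholders to be replaced with values that fit your product.

```python
def composite_score(metrics, weights, bounds):
    """Weighted average of normalized metrics; higher is better.

    metrics: {name: raw value}
    weights: {name: relative weight}
    bounds:  {name: (low, high, higher_is_better)} used for normalization
    """
    total = 0.0
    for name, weight in weights.items():
        low, high, higher_is_better = bounds[name]
        # Scale to [0, 1] and clamp values outside the expected range.
        x = (metrics[name] - low) / (high - low)
        x = min(1.0, max(0.0, x))
        if not higher_is_better:
            x = 1.0 - x  # flip so lower latency/cost scores higher
        total += weight * x
    return total / sum(weights.values())


# Illustrative values only.
example_weights = {"accuracy": 0.6, "latency_ms": 0.3, "cost_usd": 0.1}
example_bounds = {
    "accuracy": (0.0, 1.0, True),
    "latency_ms": (200.0, 2000.0, False),
    "cost_usd": (0.0, 0.02, False),
}
example_metrics = {"accuracy": 0.8, "latency_ms": 1100.0, "cost_usd": 0.01}
example_score = composite_score(example_metrics, example_weights, example_bounds)
```

Keeping the per-metric values alongside the composite makes it possible to see which component drove a change.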
Sample size reality
AI feature effect sizes are often small (1-5% improvement). Detecting a 3% relative lift at 95% confidence with typical variance requires tens of thousands of users per variant. If your product has 500 users, don't A/B test the details; run offline evals and ship.
A rough calculation: to detect a 3% relative effect on a binary metric with a 20% base rate (20.0% vs. 20.6%) at 95% confidence and 80% power, you need roughly 70K users per variant. With 30K per variant, power is a little under 50%, so a real 3% effect would be missed more often than not. For continuous metrics (latency, cost), sample sizes depend on the variance of the metric but can be of a similar order of magnitude.
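The required sample size can be estimated with the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test and takes the significance level and power as parameters; the defaults (5% and 80%) are conventional choices, not values stated above, and the result is sensitive to them.

```python
from math import ceil
from statistics import NormalDist

def users_per_variant(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-proportion test.

    base_rate:     control conversion rate, e.g. 0.20
    relative_lift: relative effect to detect, e.g. 0.03 for a 3% lift
    alpha:         two-sided significance level (assumed default)
    power:         probability of detecting a true effect (assumed default)
    """
    p1 = base_rate
    p2 = base_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

Calling `users_per_variant(0.20, 0.03)` covers the scenario in the text. Lowering `power` to 0.5 roughly halves the answer, which is why a stated sample size means little without the power it assumes.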
Offline evals first
Before A/B testing, run offline evals (see the eval infrastructure post). Offline evals answer 'does this new variant score better on our curated test set?' quickly and cheaply. Only variants that pass offline eval get A/B tested online.
This filters out variants that are obviously worse before you waste a week of A/B testing capacity on them. In practice, 70-80% of candidate variants never reach online A/B testing because they fail offline eval.
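The gate itself can be very simple. As a minimal sketch, assume each variant has been scored per example on the same curated test set, and a candidate passes if its mean paired improvement over the baseline exceeds a threshold. The function name and the `min_gain` parameter are hypothetical, and a real gate might also require a significance check.

```python
from statistics import mean

def passes_offline_gate(candidate_scores, baseline_scores, min_gain=0.0):
    """Return True if the candidate beats the baseline on the test set.

    Both lists hold per-example scores for the same examples in the
    same order, so the comparison is paired.
    """
    if not candidate_scores or len(candidate_scores) != len(baseline_scores):
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [c - b for c, b in zip(candidate_scores, baseline_scores)]
    return mean(diffs) > min_gain
```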
What to track
User-level: task completion rate, retention, session count, thumbs-up/thumbs-down. Request-level: quality score, latency, cost. Business: revenue, conversion, churn. AI-specific: hallucination rate, refusal rate, regeneration rate.
Instrument everything. You can't analyze what you didn't capture. It is better to have more data than you need than to realize after the fact that you missed a key metric.
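As an illustration, a request-level log record might look like the sketch below. The field names are assumptions chosen to mirror the metrics listed above; user-level and business metrics would typically be derived later by joining on `user_id`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class RequestEvent:
    """One logged LLM request (hypothetical schema)."""
    user_id: str
    session_id: str
    variant: str
    timestamp: float                    # Unix seconds
    latency_ms: float
    cost_usd: float
    quality_score: Optional[float] = None
    thumbs: Optional[int] = None        # +1, -1, or None if no feedback
    refused: bool = False
    regenerated: bool = False
    flagged_hallucination: Optional[bool] = None

def to_log_line(event: RequestEvent) -> str:
    """Serialize an event as one JSON line for downstream analysis."""
    return json.dumps(asdict(event), sort_keys=True)
```

Logging the variant on every record avoids having to reconstruct assignments after the fact.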
Making the call
At the end of the test, ask: does the new variant win on composite quality? If yes, is the effect stable across time and user segments? If yes, ship. If effects are mixed or specific to certain user segments, consider a targeted rollout instead of a full one. If effects are null, gather qualitative learnings and move on.
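The decision flow above can be written down as a small function. This is a simplified sketch under stated assumptions: a "win" is taken to mean the confidence interval for the composite lift lies entirely above zero, and "stable" means every weekly and per-segment lift is positive. Both definitions, and all names here, are illustrative rather than prescribed.

```python
def rollout_decision(overall_lift_ci, weekly_lifts, segment_lifts):
    """Map test results to one of three outcomes.

    overall_lift_ci: (low, high) confidence interval for composite lift
    weekly_lifts:    list of lifts, one per week of the test
    segment_lifts:   {segment name: lift}
    """
    low, _high = overall_lift_ci
    if low <= 0.0:
        # No clear overall win: treat as null.
        return "learn_and_move_on"

    stable_over_time = all(lift > 0.0 for lift in weekly_lifts)
    stable_across_segments = all(lift > 0.0 for lift in segment_lifts.values())
    if stable_over_time and stable_across_segments:
        return "ship"
    # A win overall, but fading over time or confined to some segments.
    return "targeted_rollout"
```

Encoding the rule before the test starts makes it harder to rationalize a decision after seeing the data.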