A/B testing for LLM features is a trap for teams that apply traditional experimentation playbooks without adjustment. The underlying statistics break in ways that aren't immediately obvious. Results that look statistically significant often aren't, and effects that look null can still matter. This post covers the specific adjustments we make to run A/B tests that produce actionable insights, and the common failure modes that invalidate results.

Pitfalls and fixes

Traditional assumptions break: high output variance, user-level effects, novelty, multi-dimensional quality. Fixes: longer runs, user-level randomization, weekly tracking, composite scores.

Why traditional A/B breaks for LLMs

High output variance. The same LLM prompt produces different responses across calls whenever temperature is non-zero. User-observed quality varies widely even within a fixed variant. Required sample size grows with metric variance, so this extra noise raises the number of users needed to detect a given effect.

User-level effects dominate session-level ones. A user who's frustrated with a new model for two sessions might churn; a test randomized at the session level misses this pattern entirely, because the same user can land in both variants and no one experiences either variant consistently. Effects need to be measured across users, over time, not within sessions.

Novelty effects. Users exposed to a new AI feature engage more for the first 1-2 weeks, then regress. Short-duration tests systematically overestimate value. What looks like a 15% lift at week 1 is often flat by week 4.

Multi-dimensional quality. Traditional A/B optimizes a single metric (conversion, revenue). AI feature quality is multi-dimensional — accuracy, latency, cost, user satisfaction. Optimizing one can regress another.

What to change

Run longer. A/B tests for AI features typically need at least 2-4 weeks. One week is rarely enough to see past novelty effects. Budget test duration accordingly: if you can't wait, don't A/B test; run offline evals instead.

Randomize at user level, not session level. A user sees the same variant across all their sessions. This captures the real effect of the experience over time.
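One common way to get sticky user-level assignment is to hash the user ID with an experiment-specific salt, so no assignment table is needed. This is a minimal sketch; the function name `assign_variant`, the variant labels, and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for one experiment.

    Hypothetical helper: the same (user_id, experiment) pair always
    returns the same variant, so a user never flips between sessions.
    """
    # Salting with the experiment name keeps buckets independent
    # across experiments that run on the same user base.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a uniform integer.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # The same user gets the same answer on every call.
    print(assign_variant("user-123", "new-model-rollout"))
    print(assign_variant("user-123", "new-model-rollout"))
```

Because assignment depends only on the user ID and the experiment name, it can be recomputed at analysis time without a lookup.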

Track metrics by week. Day 1 vs day 7 vs day 14 performance often tells a different story than the aggregate. A sustained effect at week 4 is worth more than a spike at week 1.
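A rough sketch of week-by-week tracking, assuming each logged event carries the variant, the number of days since the user's first exposure, and a success flag. The tuple layout and the name `weekly_lift` are assumptions for illustration.

```python
from collections import defaultdict


def weekly_lift(events, control="control", treatment="treatment"):
    """Relative lift of treatment over control, per week since exposure.

    events: iterable of (variant, days_since_first_exposure, success)
    tuples (a hypothetical log format). Returns {week_number: lift},
    with week 1 covering days 0-6.
    """
    # (week, variant) -> [successes, total events]
    totals = defaultdict(lambda: [0, 0])
    for variant, day, success in events:
        week = day // 7 + 1
        cell = totals[(week, variant)]
        cell[0] += int(success)
        cell[1] += 1

    lifts = {}
    for week in sorted({w for w, _ in totals}):
        c_succ, c_n = totals[(week, control)]
        t_succ, t_n = totals[(week, treatment)]
        if c_n == 0 or t_n == 0 or c_succ == 0:
            continue  # not enough data to form a ratio for this week
        lifts[week] = (t_succ / t_n) / (c_succ / c_n) - 1.0
    return lifts
```

A lift that is clearly positive in week 1 and near zero in week 4 is the novelty pattern described above, which an aggregate number would hide.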

Use composite quality scores. Weight accuracy, latency, cost appropriately for your product. Report the components; decide on the composite. Single-metric optimization misleads.
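As an illustration, a composite can be a weighted sum of components normalized to the 0-1 range. The weights, budgets, and field names below are made-up placeholders, not recommended values; pick them for your own product.

```python
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    accuracy: float          # fraction in [0, 1], higher is better
    p95_latency_s: float     # seconds, lower is better
    cost_per_request: float  # dollars, lower is better


# Hypothetical weights and budgets, chosen only for this example.
WEIGHTS = {"accuracy": 0.6, "latency": 0.25, "cost": 0.15}
LATENCY_BUDGET_S = 5.0
COST_BUDGET = 0.02


def components(m: VariantMetrics) -> dict:
    """Normalize each metric to [0, 1] where higher is better."""
    return {
        "accuracy": m.accuracy,
        "latency": max(0.0, 1.0 - m.p95_latency_s / LATENCY_BUDGET_S),
        "cost": max(0.0, 1.0 - m.cost_per_request / COST_BUDGET),
    }


def composite(m: VariantMetrics) -> float:
    """Weighted sum of the normalized components."""
    parts = components(m)
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


if __name__ == "__main__":
    variant = VariantMetrics(accuracy=0.8, p95_latency_s=2.5,
                             cost_per_request=0.01)
    print(components(variant))  # report the components
    print(composite(variant))   # decide on the composite
```

Keeping `components` separate from `composite` makes it easy to report the parts alongside the single number used for the decision.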

Sample size reality

AI feature effect sizes are often small (1-5% improvement). Detecting a 3% lift at 95% confidence requires thousands to tens of thousands of users per variant, depending on the base rate and the metric's variance. If your product has 500 users, don't A/B test the details; run offline evals and ship.

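A rough calculation: to detect a 3% relative effect on a binary metric with a 20% base rate (20.0% vs 20.6%), you need roughly 70K users per variant at a two-sided 5% significance level and 80% power, and still about 35K if you settle for 50% power. For continuous metrics (latency, cost), sample sizes depend on variance but are often of a similar order of magnitude.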
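A minimal sketch of that calculation, using the standard normal-approximation formula for comparing two proportions. The function name and the defaults (two-sided 5% significance, 80% power) are assumptions for illustration and not necessarily the settings behind the figure above.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(base_rate: float, relative_lift: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-proportion test.

    Assumes a two-sided test, equal group sizes, and one binary
    outcome per user (illustrative assumptions).
    """
    p1 = base_rate
    p2 = base_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two groups.
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    print(users_per_variant(0.20, 0.03))             # 80% power
    print(users_per_variant(0.20, 0.03, power=0.5))  # 50% power
```

With these defaults, a 20% base rate and a 3% relative lift come out at roughly 70K users per variant; lowering `power` to 0.5 cuts that to about half. The answer is sensitive to these choices, so state them whenever you quote a sample size.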

Offline evals first

Before A/B testing, run offline evals (see the post "Why evaluation infrastructure matters more than prompts"). Offline evals answer 'does this new variant score better on our curated test set?' quickly and cheaply. Only variants that pass offline eval get A/B tested online.

This filters out variants that are obviously worse before you waste a week of A/B testing capacity on them. In practice, 70-80% of candidate variants never reach online A/B testing because they fail offline eval.
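One way such a gate could look, assuming each variant has a per-example score on the same curated test set. The function name and thresholds (`min_gain`, `max_regression_rate`, `tolerance`) are hypothetical and should be tuned to your own evals.

```python
from statistics import mean


def passes_offline_gate(baseline, candidate, min_gain=0.0,
                        max_regression_rate=0.10, tolerance=0.05):
    """Decide whether a candidate variant may proceed to online A/B testing.

    baseline, candidate: per-example scores on the same test set, in the
    same order (a hypothetical format). The candidate must improve the
    mean score by at least min_gain and must not score meaningfully
    worse on too many individual examples.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("scores must be paired and non-empty")
    gain = mean(candidate) - mean(baseline)
    # Count examples where the candidate is worse by more than the tolerance.
    regressions = sum(
        1 for b, c in zip(baseline, candidate) if c < b - tolerance
    )
    regression_rate = regressions / len(baseline)
    return gain >= min_gain and regression_rate <= max_regression_rate
```

The per-example regression check is there because a better average can hide a variant that breaks a specific class of inputs.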

What to track

User-level: task completion rate, retention, session count, thumbs-up/thumbs-down. Request-level: quality score, latency, cost. Business: revenue, conversion, churn. AI-specific: hallucination rate, refusal rate, regeneration rate.

Instrument everything. You can't analyze what you didn't capture. It's better to have more data than you need than to realize after the fact that you missed a key metric.
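A sketch of what a per-request log record might contain, covering the request-level and AI-specific metrics listed above. The schema and field names are illustrative assumptions, not a prescribed format.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class RequestEvent:
    """One logged LLM request (hypothetical schema)."""
    user_id: str
    variant: str
    quality_score: float
    latency_s: float
    cost_usd: float
    regenerated: bool = False  # user asked for another response
    refused: bool = False      # model declined to answer
    thumbs: int = 0            # +1 up, -1 down, 0 no feedback
    timestamp: float = field(default_factory=time.time)


def to_log_line(event: RequestEvent) -> str:
    """Serialize the event as one JSON line for later analysis."""
    return json.dumps(asdict(event), sort_keys=True)


if __name__ == "__main__":
    event = RequestEvent(user_id="user-123", variant="treatment",
                         quality_score=0.8, latency_s=1.2, cost_usd=0.004)
    print(to_log_line(event))
```

Logging the variant and user ID on every event is what makes user-level and week-by-week analysis possible later.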

Making the call

At test conclusion: does the new variant win on composite quality? If yes, is the effect stable across time and user segments? If yes, ship. If effects are mixed or user-segment-specific, consider targeted rollout instead of full. If effects are null, gather qualitative learning and move on.
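The decision flow above can be written down as a small function so the rule is explicit before the test ends. This is a simplified sketch: the function name, the `min_effect` threshold, and the input shapes are assumptions, and it ignores statistical uncertainty, which a real decision should account for.

```python
def make_call(composite_delta, weekly_deltas, segment_deltas,
              min_effect=0.01):
    """Turn test results into a rollout recommendation.

    composite_delta: treatment minus control on the composite score.
    weekly_deltas: list of the same delta computed per week.
    segment_deltas: dict mapping segment name to its delta.
    All inputs and the min_effect threshold are hypothetical.
    """
    wins_overall = composite_delta >= min_effect
    stable_over_time = all(d >= min_effect for d in weekly_deltas)
    winning_segments = [
        name for name, d in segment_deltas.items() if d >= min_effect
    ]
    all_segments_win = len(winning_segments) == len(segment_deltas)

    if wins_overall and stable_over_time and all_segments_win:
        return "ship"
    if winning_segments:
        # Mixed or segment-specific effect.
        return "consider_targeted_rollout"
    # Null (or negative) effect everywhere.
    return "move_on"


if __name__ == "__main__":
    print(make_call(0.05, [0.06, 0.05, 0.05],
                    {"new_users": 0.05, "power_users": 0.04}))
```

Fixing the thresholds before looking at results reduces the temptation to rationalize a marginal outcome after the fact.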

A/B testing LLM features: the pitfalls that invalidate results
