AI ROI claims are usually wrong. Not maliciously — more often overconfidently. A CRO announces 'we saved 10 hours per rep per week with AI'; the actual causal savings are 2.5 hours, and half of them get spent on other work that doesn't show up on the ledger. The difference between 'AI delivered value' and 'we sincerely believe AI delivered value' is measurement discipline. This post is the discipline.

From claim to honest ROI

Anecdotal claim → survey confirmation → observed timings → randomized holdout. Each step adds rigor; only the randomized holdout produces defensible causal estimates.

The measurement ladder

Level 1: Anecdotes

'My team says AI saves them time.' Good signal that something is working. Worthless for quantifying how much. Anecdotes are the start of measurement, not the finish.

Level 2: Surveys

'In a 100-person survey, reps said they save 6 hours per week.' Better than anecdote, still unreliable. Response bias is large: people who were handed an AI tool by their company feel obligated to say it helped, and people estimating their own time saved systematically overestimate.

Use surveys for direction, not for dollar amounts. A survey showing 75% positive sentiment supports the story; the specific 6-hour figure is not reliable enough to put a dollar value on.

Level 3: Observed timings

Actually measure workflow times before and after AI rollout. If reps completed 8 calls per day before AI and 10 per day after, that's a 25% throughput increase. Real and quantified, but not yet causal.

Limitation: correlation, not causation. Maybe 10 calls reflects a seasonal uptick that happened independently. Maybe the rollout included training that would have boosted numbers without AI. Observed timings are suggestive; they don't isolate the AI's contribution.
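A minimal sketch of how a seasonal uptick can eat into a before/after number. The 8 and 10 calls per day come from the example above; the seasonal baseline (8 to 9 calls in the same period without AI) is a hypothetical figure chosen only for illustration.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after` (0.25 means +25%)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


# Figures from the text: 8 calls/day before rollout, 10 after.
naive_lift = pct_change(8, 10)

# Hypothetical assumption: in a comparable period with no AI,
# calls/day rose from 8 to 9 on seasonality alone.
seasonal_lift = pct_change(8, 9)

# Lift that remains once the assumed seasonal growth is factored out.
residual_lift = (1 + naive_lift) / (1 + seasonal_lift) - 1

print(f"naive: {naive_lift:.1%}, seasonal: {seasonal_lift:.1%}, "
      f"residual: {residual_lift:.1%}")
# prints: naive: 25.0%, seasonal: 12.5%, residual: 11.1%
```

Even this adjustment only removes the one confounder you thought to model; it does not turn observed timings into a causal estimate.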

Level 4: Randomized holdout

Some reps get AI, some don't. Compare outcomes. This is causal. This is the only level that produces ROI numbers a skeptical CFO should believe.
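One way this could look in practice, as a rough sketch rather than a full experimental design: assign reps at random, then compare mean outcomes with an approximate 95% interval. The function names, the 50/50 split, and the weekly outcome figures are all hypothetical, and the normal approximation (z = 1.96) is crude for groups this small.

```python
import math
import random
import statistics


def assign_holdout(rep_ids, holdout_fraction=0.5, seed=0):
    """Randomly split reps into (treated, holdout) lists."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(rep_ids)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def effect_estimate(treated_outcomes, holdout_outcomes, z=1.96):
    """Difference in means with an approximate confidence interval."""
    diff = statistics.mean(treated_outcomes) - statistics.mean(holdout_outcomes)
    # Standard error of a difference in means with unequal variances.
    se = math.sqrt(
        statistics.variance(treated_outcomes) / len(treated_outcomes)
        + statistics.variance(holdout_outcomes) / len(holdout_outcomes)
    )
    return diff, (diff - z * se, diff + z * se)


treated_reps, holdout_reps = assign_holdout(range(10))

# Hypothetical outcome data: deals closed per rep over the test period.
treated_outcomes = [12, 9, 11, 10, 13, 10]
holdout_outcomes = [9, 8, 10, 9, 11, 7]

diff, (low, high) = effect_estimate(treated_outcomes, holdout_outcomes)
print(f"estimated effect: {diff:.2f} deals per rep, "
      f"approx. 95% CI [{low:.2f}, {high:.2f}]")
```

If the interval includes zero, the honest claim is that no effect has been demonstrated yet, not that the tool saved a specific number of hours.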

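Practical consideration: most companies resist 'denying' a productivity tool to some reps. Mitigations: a phased rollout that creates natural holdout periods, ideally with the rollout order chosen at random; a geographic or team-based rollout with matched controls; time-shifted comparisons (same cohort, before/after, controlling for external trends). These are approximations of a true holdout, and the estimate is only as causal as the comparison group is comparable.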
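For a phased or team-based rollout, a common way to control for external trends is a difference-in-differences comparison. This is a swapped-in standard technique, not something the post prescribes, and it rests on the assumption that both groups would have trended the same way without AI. The function name and all figures below are hypothetical.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical calls/day per rep. The control team has not received AI yet.
effect = diff_in_diff(
    treated_before=[8, 8, 8],
    treated_after=[10, 10, 10],
    control_before=[8, 8, 8],
    control_after=[9, 9, 9],
)
print(effect)  # prints: 1
```

Here the treated team improved by 2 calls per day, but the control team improved by 1 with no AI at all, so only 1 call per day is credited to the rollout.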

At minimum, run the holdout for 6-8 weeks before declaring victory. Shorter periods get overwhelmed by noise. Also, measure outcomes (tickets resolved, deals closed) not just activity (time spent) — activity can go up without outcomes improving.

The metrics that matter

Leading indicators

Adoption rate (what fraction of eligible users actually use the AI). Intensity (how often, how deeply). Satisfaction (NPS on the AI feature). Time-to-outcome (how long from starting a workflow to completing it, with and without AI). These move quickly and signal whether deeper ROI is coming.
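As a rough illustration, adoption and intensity can be computed from a plain usage log. The function name, the log shape (one user ID per AI interaction), and the user names are hypothetical; a real event log would carry timestamps and feature names as well.

```python
from collections import Counter


def adoption_and_intensity(eligible_users, usage_events):
    """Return (adoption rate, mean events per active user).

    usage_events is a list of user IDs, one entry per AI interaction.
    """
    eligible = set(eligible_users)
    if not eligible:
        raise ValueError("need at least one eligible user")
    # Count interactions per user, ignoring anyone outside the eligible set.
    counts = Counter(user for user in usage_events if user in eligible)
    active = len(counts)
    adoption_rate = active / len(eligible)
    intensity = sum(counts.values()) / active if active else 0.0
    return adoption_rate, intensity


adoption, intensity = adoption_and_intensity(
    eligible_users=["ana", "ben", "cy", "dee"],
    usage_events=["ana", "ana", "ben", "ana", "cy"],
)
print(f"adoption: {adoption:.0%}, events per active user: {intensity:.2f}")
# prints: adoption: 75%, events per active user: 1.67
```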

Lagging indicators

Revenue impact (for sales tools), ticket volume or CSAT (for support tools), cycle time (for ops tools), quality metrics (for content or analysis tools). These take longer to move and are subject to more confounders, but they are the metrics that matter for ROI.

The denominator matters

Total cost includes model API, infrastructure, engineering to build, engineering to maintain, user training, and change management. Most teams understate it by 40-60% by counting only the model API. See our cost modeling post.
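A small sketch of why the denominator changes the answer. The cost categories are the ones listed above; the class name, the dollar amounts, and the benefit figure are hypothetical, chosen so that the API bill is half of the true total.

```python
from dataclasses import dataclass, fields


@dataclass
class AnnualCost:
    """Yearly cost components, all in the same currency."""
    model_api: float
    infrastructure: float
    build_engineering: float
    maintenance_engineering: float
    user_training: float
    change_management: float

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))


def roi(benefit: float, cost: float) -> float:
    """Net return per unit of cost: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


# Hypothetical figures for illustration only.
cost = AnnualCost(
    model_api=50_000,
    infrastructure=10_000,
    build_engineering=20_000,
    maintenance_engineering=10_000,
    user_training=5_000,
    change_management=5_000,
)
benefit = 150_000

print(f"ROI counting only model API: {roi(benefit, cost.model_api):.0%}")
print(f"ROI counting total cost:     {roi(benefit, cost.total()):.0%}")
# prints 200% for the API-only figure and 50% for the full-cost figure
```

The same benefit reads as a 200% return or a 50% return depending on which costs are counted.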

Common distortions

Time saved that becomes idle time. A rep saves 3 hours but doesn't use them for higher-value work — the ROI is zero. Measure what people do with saved time.

Quality degradation that's not measured. Faster output that's lower quality isn't savings; it's shifting costs elsewhere. Measure quality alongside speed.

Vanity metrics. 'AI handled 10,000 queries this month' means nothing without the comparison — would 10,000 queries have come in anyway? Did CSAT hold?

Over-attribution. All of a rep's improvement gets attributed to AI when half of it comes from the new process, other new tooling, and better onboarding that shipped with the AI rollout.
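Two of these distortions, idle time and over-attribution, can be made explicit as discount factors on claimed hours. This is a sketch under stated assumptions: the function name, the multiplicative model, the 0.5 factors, and the $60 hourly value are hypothetical. Only the 3-hour figure comes from the text.

```python
def realized_weekly_value(hours_saved, redeployed_fraction,
                          ai_attribution, hourly_value):
    """Value of saved time after discounting idle time and over-attribution.

    redeployed_fraction: share of saved hours spent on higher-value work.
    ai_attribution: share of the improvement actually caused by the AI.
    """
    for name, value in (("redeployed_fraction", redeployed_fraction),
                        ("ai_attribution", ai_attribution)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    return hours_saved * redeployed_fraction * ai_attribution * hourly_value


# The rep from the text: 3 hours saved, none of it redeployed.
idle_case = realized_weekly_value(3, 0.0, 1.0, 60)

# Hypothetical: half the time redeployed, half the gain due to the AI.
discounted_case = realized_weekly_value(3, 0.5, 0.5, 60)

print(idle_case, discounted_case)  # prints: 0.0 45.0
```

The point is less the formula than the inputs: each factor is something you have to measure, and leaving it out silently sets it to 1.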

The CFO test

Ask yourself: if a skeptical CFO reviewed our ROI claim, what would they challenge? Build the answer to every plausible challenge into the measurement methodology. If you can't answer the challenge, your number isn't defensible. If you can, ship the number with confidence.

How to measure AI ROI without fooling yourself
