Canary deployment is the rollout pattern that saves weekends. Instead of shipping a new model or prompt to 100% of traffic at once, you route 1% first, watch metrics, then gradually expand. The 2am pages you avoid more than pay for the additional deployment complexity. This post describes the specific pattern we use for AI deployments, including the AI-unique metrics to watch and the auto-rollback triggers that catch problems before users notice.

Canary flow

1% to 10% to 50% to 100% with auto-gates. AI-specific metrics: eval pass rate, per-response cost, thumbs-up rate, p95 latency.

Why AI needs canary more than traditional software

Traditional software: bugs usually surface as errors (500s, stack traces). Canary monitoring catches them quickly.

AI: bugs often surface as quality regressions (subtly worse answers, occasionally wrong outputs). These don't show up in HTTP status codes. A full rollout to 100% can run for hours before users surface complaints. By then, thousands of poor outputs have shipped.

Canary catches this early. At 1% traffic, a quality regression reaches roughly one in a hundred users instead of all of them, and the gating metrics can flag it before it turns into reputational damage.

Stage gates

1% stage: 4 hours minimum. Automated metrics check; auto-promote if all green. This stage catches outright failures — the new model is completely broken, cost exploded, latency tripled.

10% stage: 24 hours. Broader sample catches effects invisible at 1%. Enough users that thumbs-up/down rates are statistically meaningful.

50% stage: 48 hours, team review. Human eyes on the metrics. Any subtle patterns the automated gates missed?

100% stage: monitoring continues for 72 hours post-full-rollout. Some regressions only manifest at full scale (queue dynamics, cache pressure, upstream rate limits).
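The stage schedule above can be written down as data plus one decision function. This is a minimal sketch, not our actual tooling: the `Stage` type and `next_action` function are hypothetical names, the durations come from the stages described above, and auto-promotion out of the 10% stage is an assumption (the text only states it for 1%).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    traffic_pct: int          # share of traffic on the new version
    min_hours: int            # minimum time to stay in this stage
    needs_human_review: bool  # True if a person must approve promotion


# Percentages and durations from the stage descriptions above.
# The 100% entry models the 72-hour post-rollout watch period.
STAGES = [
    Stage(traffic_pct=1, min_hours=4, needs_human_review=False),
    Stage(traffic_pct=10, min_hours=24, needs_human_review=False),  # assumption
    Stage(traffic_pct=50, min_hours=48, needs_human_review=True),
    Stage(traffic_pct=100, min_hours=72, needs_human_review=False),
]


def next_action(stage_index, hours_elapsed, metrics_green, human_approved=False):
    """Return 'rollback', 'hold', 'promote', or 'done' for the current stage."""
    stage = STAGES[stage_index]
    if not metrics_green:
        return "rollback"
    if hours_elapsed < stage.min_hours:
        return "hold"
    if stage.needs_human_review and not human_approved:
        return "hold"
    if stage_index == len(STAGES) - 1:
        return "done"
    return "promote"
```

Keeping the schedule as data means a change to the soak times is a reviewable one-line diff rather than a change to control flow.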

AI-specific metrics to gate on

Eval pass rate on sampled live traffic. Periodically (every 15 minutes), sample N recent production requests, re-score them with your eval framework, and track the pass rate. A drop in pass rate beyond your threshold triggers a halt or rollback.
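As a rough sketch of that sampling loop's core, assuming requests are available as a list and your eval framework can be wrapped in a function that returns pass or fail. The names `sampled_pass_rate`, `pass_rate_gate`, and `score_fn`, and the 5-point `max_drop` default, are illustrative rather than taken from any particular framework.

```python
import random


def sampled_pass_rate(recent_requests, score_fn, sample_size, rng=None):
    """Score a random sample of recent requests and return the pass fraction.

    score_fn stands in for your eval framework: it takes one request and
    returns True (pass) or False (fail). Returns None if there is no traffic.
    """
    rng = rng or random.Random()
    if not recent_requests:
        return None
    k = min(sample_size, len(recent_requests))
    sample = rng.sample(recent_requests, k)
    passed = sum(1 for request in sample if score_fn(request))
    return passed / k


def pass_rate_gate(baseline_rate, canary_rate, max_drop=0.05):
    """Return 'halt' if the canary pass rate fell more than max_drop
    (absolute, e.g. 0.05 = five points) below baseline, else 'ok'.
    Returns 'hold' when there is no canary data yet."""
    if canary_rate is None:
        return "hold"
    return "halt" if (baseline_rate - canary_rate) > max_drop else "ok"
```

A scheduler would call `sampled_pass_rate` once per interval for both the baseline and canary populations and feed the two results to `pass_rate_gate`.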

Per-response cost. Watch for unexpected token growth. A new model with different output patterns might produce 2x longer responses, doubling cost silently. Set cost-per-response alert thresholds.
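A cost gate can be as simple as comparing mean cost per response between baseline and canary. The sketch below assumes you log input and output token counts per response; the function names, the per-1k-token price parameters, and the 1.2x alert ratio are placeholders to swap for your own values.

```python
def mean_cost_per_response(usages, in_price_per_1k, out_price_per_1k):
    """Average cost over a list of (input_tokens, output_tokens) pairs.

    Prices are per 1,000 tokens and are supplied by the caller; no real
    provider pricing is assumed here.
    """
    if not usages:
        raise ValueError("no responses to average")
    total = sum(
        (inp / 1000) * in_price_per_1k + (out / 1000) * out_price_per_1k
        for inp, out in usages
    )
    return total / len(usages)


def cost_alert(baseline_cost, canary_cost, max_ratio=1.2):
    """True if canary cost per response exceeds baseline by more than max_ratio."""
    return canary_cost > baseline_cost * max_ratio
```

Tracking the mean output token count alongside the cost makes it easier to tell token growth apart from a pricing change.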

User satisfaction signals. Thumbs-up rate, refund rate, escalation rate, session length. For chat systems, regeneration rate (user asking to try again) is a leading indicator of dissatisfaction.
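Rate-based signals like thumbs-up rate are noisy at small sample sizes, so it helps to ask whether a drop is larger than chance would explain. One common option, shown here as a sketch and not necessarily the test you should use, is a pooled two-proportion z-test. The function names are illustrative.

```python
import math


def thumbs_up_drop_z(base_up, base_total, canary_up, canary_total):
    """z-score for 'canary thumbs-up rate is lower than baseline'.

    Uses the pooled two-proportion z-test. Positive values mean the
    canary rate is below the baseline rate.
    """
    p_base = base_up / base_total
    p_canary = canary_up / canary_total
    pooled = (base_up + canary_up) / (base_total + canary_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / canary_total))
    if se == 0:
        return 0.0
    return (p_base - p_canary) / se


def one_sided_p_value(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

The same function works for regeneration rate or escalation rate if you flip the direction, since for those a higher canary rate is the bad outcome.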

Latency percentiles. p95 and p99 often move before p50. A new model might have a long-tail latency problem that shows up in p99 first. Tail latency regressions are user-visible even if averages look fine.
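To make the tail-first point concrete, here is a small sketch that compares percentiles between baseline and canary latency samples. It uses the nearest-rank percentile definition for simplicity; the function names and the 1.2x `max_ratio` default are assumptions, not values from our setup.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_regressions(baseline_ms, canary_ms, percentiles=(50, 95, 99), max_ratio=1.2):
    """Return {percentile: (baseline, canary)} for each percentile where
    the canary value exceeds baseline by more than max_ratio."""
    flagged = {}
    for p in percentiles:
        base = percentile(baseline_ms, p)
        cand = percentile(canary_ms, p)
        if cand > base * max_ratio:
            flagged[p] = (base, cand)
    return flagged
```

With 100 baseline samples at 100 ms and a canary where only 3 of 100 requests take 900 ms, p50 and p95 are unchanged and only p99 is flagged, which is the long-tail pattern described above.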

Auto-rollback

Roll back automatically when any of these holds: a gating metric regresses more than 20% from baseline; the error rate runs more than 2% above baseline and also crosses an absolute threshold; or the on-call engineer triggers a manual halt.
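The 20% rule needs to know which direction counts as "worse" for each metric. A minimal sketch, assuming each gating metric carries its baseline, its canary value, and a direction flag; `Metric`, `relative_regression`, and `should_rollback` are hypothetical names. The error-rate rule is left out here because its exact form depends on how you define your absolute threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    baseline: float
    canary: float
    higher_is_better: bool  # True for pass rate, False for cost or latency


def relative_regression(metric):
    """Fractional move in the bad direction; zero or negative means no regression."""
    if metric.baseline == 0:
        raise ValueError(f"{metric.name}: baseline is zero")
    change = (metric.canary - metric.baseline) / metric.baseline
    return -change if metric.higher_is_better else change


def should_rollback(metrics, manual_halt=False, max_regression=0.20):
    """Return (rollback?, reasons). A manual halt always wins."""
    if manual_halt:
        return True, ["manual halt"]
    reasons = [m.name for m in metrics if relative_regression(m) > max_regression]
    return bool(reasons), reasons
```

Returning the list of reasons alongside the decision gives the on-call engineer something to read in the rollback notification.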

Rollback is a path you should test monthly. A rollback script that fails because of config drift is a worse problem than the original regression. See AI ops runbook.

Integration with feature flags

Canary deployments are typically implemented via feature flags. See feature flags post. Percentage rollout, user-cohort targeting, kill switches — all feature flag primitives.

Gateway-layer routing handles the actual traffic distribution. A slice of the tenant_id hash space (say, half a percent) goes to the new model; the rest stays on the old one. Raising the percentage is an alias change, not a deploy.
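Hash-based routing on tenant_id can be sketched in a few lines. This is an illustration of the idea rather than any specific gateway's implementation: `bucket_for` and `route` are hypothetical names, and SHA-256 with 10,000 buckets is one reasonable choice among many.

```python
import hashlib

BUCKETS = 10_000  # 0.01% granularity, so fractional percentages like 0.5 work


def bucket_for(tenant_id):
    """Map a tenant to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(tenant_id, canary_pct):
    """Return 'canary' or 'stable' for a tenant.

    canary_pct is a percentage in [0, 100]. Because the bucket is fixed
    per tenant, raising canary_pct only ever adds tenants to the canary.
    """
    threshold = round(canary_pct * BUCKETS / 100)
    return "canary" if bucket_for(tenant_id) < threshold else "stable"
```

Hashing on the tenant rather than the request keeps each tenant on one model version for the whole stage, which makes their feedback signals attributable to a single version.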

Who owns the canary?

The team shipping the change owns the canary. Not ops, not an SRE team. The team that made the change is best positioned to recognize unexpected behavior in their system. Ops provides the framework; the team operates it.

Canary ownership comes with responsibility: monitor the deployment during each stage, respond to gates, decide on promotion or rollback. Passing off to ops creates latency in response to problems.

Anti-patterns

Skipping canary for small changes. Small changes are exactly where canary is cheap and rollback is easy. Skip for truly insignificant changes (typo fix in a log message), not for any change that touches model behavior.

Manual-only gates. If gates require a human to approve progression, they're slow and human-dependent. Automate the quantitative gates; keep human judgment for qualitative assessment at the 50% stage.
