Blue-green deployment patterns adapt to AI systems by treating model version changes, prompt changes, and infrastructure updates as distinct change categories. Canary deploys, rollback triggers, and deploy-time observability all work differently than in traditional SaaS. This post covers the specific patterns for deploying AI changes safely.

Change categories

Model version. Opus 4.6 → 4.7: canary at 5%, watch evals.

Prompt change. System prompt edit: A/B test.

Infra update. Vector store or cache change: parallel deploy.

Model version changes

Canary deployment is the standard. Route 5% of traffic to the new model; observe for hours or days; expand gradually.
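One common way to get a stable 5% split is to hash a per-user or per-session key into a bucket, so the same user keeps seeing the same model for the whole canary. A minimal sketch of that idea; the `candidate` and `stable` labels and the function names are hypothetical, and the choice of routing key is an assumption:

```python
import hashlib

CANARY_PERCENT = 5  # share of traffic sent to the new model


def bucket_for(request_key: str) -> int:
    """Map a stable key (user or session ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_model(request_key: str, canary_percent: int = CANARY_PERCENT) -> str:
    """Return the (hypothetical) label of the model that should serve this key."""
    if bucket_for(request_key) < canary_percent:
        return "candidate"
    return "stable"
```

Because the assignment depends only on the key, expanding the canary from 5% to 25% keeps the original 5% on the new model and adds to it, rather than reshuffling users.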

Metrics to watch. Eval pass rate, user feedback, cost per request, latency per request. All can change with model version.

Rollback triggers. Automated on a major regression (an eval drop beyond a threshold set in the 3-5% range). Manual, with judgment, for smaller regressions.

Full rollout timing. Days to weeks for significant model changes. Faster for minor updates.

Prompt changes

Version control prompts. Git-based. Every change reviewed, tested.

A/B test significant changes. Randomly assign users to new or old prompt; measure impact over days.

Canary for minor changes. 10% of traffic to new prompt; monitor; expand or rollback.

Quality guardrails. Primary metric on new prompt no worse than old; secondary metrics (cost, latency) within acceptable ranges.
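The guardrail rule can be written down as a small check. This is a sketch under assumptions: the `Metrics` fields are illustrative, the primary metric is taken to be higher-is-better, and the 10% tolerances for cost and latency are placeholder values, not recommendations from this post.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    primary: float           # e.g. eval pass rate; higher is better (assumption)
    cost_per_request: float
    latency_ms: float


def passes_guardrails(
    old: Metrics,
    new: Metrics,
    max_cost_increase: float = 0.10,     # placeholder tolerance
    max_latency_increase: float = 0.10,  # placeholder tolerance
) -> bool:
    """True if the new prompt is no worse on the primary metric and
    its cost and latency stay within the allowed relative increase."""
    if new.primary < old.primary:
        return False
    if new.cost_per_request > old.cost_per_request * (1 + max_cost_increase):
        return False
    if new.latency_ms > old.latency_ms * (1 + max_latency_increase):
        return False
    return True
```

In practice the primary-metric comparison would usually allow for measurement noise; the strict `<` here mirrors the "no worse than old" wording.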

Rollback mechanism. Instant reversion to previous prompt. No multi-step rollback.
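Instant reversion is easiest when prompt versions are immutable and "active" is just a pointer. A minimal in-memory sketch, assuming a hypothetical `PromptRegistry` class; a real system would back this with the Git history or a config store:

```python
class PromptRegistry:
    """Versions are append-only; the active prompt is a pointer into them."""

    def __init__(self, initial_prompt: str) -> None:
        self._versions = [initial_prompt]
        self._active = 0
        self._previous = 0

    def publish(self, prompt: str) -> int:
        """Store a new version, make it active, and remember what it replaced."""
        self._versions.append(prompt)
        self._previous = self._active
        self._active = len(self._versions) - 1
        return self._active

    def rollback(self) -> int:
        """Point back at the version that was active before the last publish."""
        self._active = self._previous
        return self._active

    @property
    def active_prompt(self) -> str:
        return self._versions[self._active]
```

Rollback is a single assignment, so it does not depend on a build, a redeploy, or any multi-step procedure.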

Infrastructure changes

Vector store updates. Index schema changes, embedding model changes. Parallel deploy: new index populated alongside old; traffic shifts gradually.
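Populating a new index alongside the old one usually means two things: writing every new document to both, and backfilling the documents that predate the dual writes. A toy sketch with plain dictionaries standing in for the vector stores; `DualIndexWriter` and the embedder callables are hypothetical names, not any particular library's API:

```python
from typing import Callable, Dict, List

Embedder = Callable[[str], List[float]]


class DualIndexWriter:
    """Keeps an old and a new index in step while traffic still reads the old one."""

    def __init__(self, old_embed: Embedder, new_embed: Embedder) -> None:
        self.old_embed = old_embed
        self.new_embed = new_embed
        self.old_index: Dict[str, List[float]] = {}
        self.new_index: Dict[str, List[float]] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        """Write the document to both indexes, each with its own embedding model."""
        self.old_index[doc_id] = self.old_embed(text)
        self.new_index[doc_id] = self.new_embed(text)

    def backfill(self, documents: Dict[str, str]) -> None:
        """Embed documents that are missing from the new index."""
        for doc_id, text in documents.items():
            if doc_id not in self.new_index:
                self.new_index[doc_id] = self.new_embed(text)

    def in_sync(self) -> bool:
        """True once both indexes hold the same set of document IDs."""
        return self.old_index.keys() == self.new_index.keys()
```

Read traffic would only start shifting once `in_sync()` holds, since a partially populated index would look like a retrieval-quality regression.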

Cache changes. New cache layer or policy. Parallel; validate cache hit rate and quality.

Serving infrastructure. Container images, orchestration changes. Standard blue-green with canary traffic.

Rollback criteria

Eval regression >3% in canary. Automatic rollback. Investigation before retry.

User feedback significantly worse. Thumbs down rate, complaints. Investigate, likely rollback.

Cost 20% higher than baseline. Alert; evaluate tradeoff. Sometimes acceptable for quality improvement; sometimes not.

Latency degraded significantly. User experience impact; typically rollback unless benefits clear.

Error rate elevated. Bugs in new version. Rollback while debugging.
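The numeric criteria above can be encoded as a small decision function. A sketch under stated assumptions: the 3% eval threshold and 20% cost threshold come from the text and are read here as relative changes, while the error-rate threshold is a made-up placeholder because the text gives no number. User feedback and latency are left out since they call for investigation rather than a fixed rule.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    ALERT = "alert"
    ROLLBACK = "rollback"


@dataclass
class Snapshot:
    eval_pass_rate: float
    cost_per_request: float
    error_rate: float


def decide(
    baseline: Snapshot,
    canary: Snapshot,
    max_eval_drop: float = 0.03,        # from the text, read as a relative drop
    max_cost_increase: float = 0.20,    # from the text
    max_error_increase: float = 0.01,   # placeholder, absolute increase
) -> Action:
    """Map canary metrics to continue, alert, or rollback."""
    if canary.eval_pass_rate < baseline.eval_pass_rate * (1 - max_eval_drop):
        return Action.ROLLBACK
    if canary.error_rate > baseline.error_rate + max_error_increase:
        return Action.ROLLBACK
    if canary.cost_per_request > baseline.cost_per_request * (1 + max_cost_increase):
        return Action.ALERT  # a human weighs the cost against the quality gain
    return Action.CONTINUE
```

Rollback conditions are checked before the cost alert so that a canary that is both worse and more expensive is reverted rather than merely flagged.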

Traffic shifting patterns

Percentage-based. 1% → 5% → 25% → 50% → 100% typical ramp. Time between steps depends on traffic volume and risk tolerance.
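The ramp itself is simple enough to express as a lookup. A minimal sketch, assuming a hypothetical `next_step` helper and assuming that an unhealthy canary drops straight to 0% rather than stepping back one level:

```python
RAMP_STEPS = [1, 5, 25, 50, 100]  # percent of traffic on the new version


def next_step(current_percent: int, healthy: bool) -> int:
    """Return the traffic percentage to apply next."""
    if not healthy:
        return 0  # assumption: any failed health check reverts fully
    for step in RAMP_STEPS:
        if step > current_percent:
            return step
    return 100  # already fully rolled out
```

The time between calls is deliberately not modeled here; it depends on traffic volume and risk tolerance, as noted above.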

User cohort-based. Specific user segments get new version first. Internal users, then beta users, then general.

Geographic. New version deployed in one region first; expand globally.

Feature flag integration. Programmatic control over who sees what. Instant rollback without redeploying.
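Cohort-based rollout and feature flags combine naturally: the flag records how far the rollout has progressed, and clearing it is the rollback. A toy in-memory sketch; `RolloutFlag` and the cohort names are hypothetical, and a real deployment would read this state from a flag service:

```python
from typing import Optional

COHORT_ORDER = ["internal", "beta", "general"]  # rollout order described above


class RolloutFlag:
    """Tracks the widest cohort that currently gets the new version."""

    def __init__(self) -> None:
        self.widest_cohort: Optional[str] = None  # None: nobody sees the new version

    def enable_through(self, cohort: str) -> None:
        """Enable the new version for this cohort and every earlier one."""
        if cohort not in COHORT_ORDER:
            raise ValueError(f"unknown cohort: {cohort}")
        self.widest_cohort = cohort

    def kill(self) -> None:
        """Instant rollback: turn the new version off for everyone."""
        self.widest_cohort = None

    def uses_new_version(self, user_cohort: str) -> bool:
        if self.widest_cohort is None or user_cohort not in COHORT_ORDER:
            return False
        return COHORT_ORDER.index(user_cohort) <= COHORT_ORDER.index(self.widest_cohort)
```

Unknown cohorts fall back to the old version, which keeps the failure mode conservative.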

Observability during deploys

Real-time dashboards. Eval pass rate, user satisfaction, cost, and latency, all tracked continuously during the canary.

Comparison views. New version vs old version side by side.

Alert thresholds. Alerts fire on regression; deployer notified.

Rollback mechanism. Single command or click. Must be fast; during an incident, rolling back shouldn't take minutes.

Coordinating multiple changes

Avoid simultaneous changes. Deploy one change, observe, proceed to next. Co-deployed changes confound attribution when issues arise.

Change windows. Certain periods avoided (high-traffic days, holidays, major events).

Cross-team coordination. When multiple teams deploy concurrently, coordinate to avoid overlapping changes.

Post-deploy validation

Smoke tests. Automated validation of critical paths immediately after deploy.

Extended monitoring. First hours post-deploy watched especially closely.

Customer feedback channels. Support ticket volume; social media mentions. User-level feedback.

Post-deploy retro. For significant deploys, retro to capture learnings.
