Feature flags are a core primitive for AI deployment. Gradual rollouts, user-cohort targeting, and kill switches on eval regressions are all feature flag capabilities. What distinguishes AI feature flags from traditional ones is mostly the kill-switch and metric-triggered rollback patterns. This post describes the flag architecture we deploy and the AI-specific use cases that justify it.

Four layers

The architecture has four layers: per-user debug overrides, user-cohort targeting, percentage rollout, and a global kill switch. The kill switch is the critical safety primitive.

Layer 1: Per-user overrides. Engineers can force a specific variant for a specific user (themselves or a debug account). This is essential for debugging but must be tightly scoped: overrides are limited to engineers and visible in audit logs.

Layer 2: User-cohort targeting. Examples are beta users, tenants on a specific plan, or customers in the EU region. This layer targets rollouts to specific groups before wider release.

Layer 3: Percentage rollout. 1%, 10%, 50%, then 100% of traffic is routed to the new variant. See the canary deployments post.

Layer 4: Global kill switch. Instant off across all traffic. This is the critical safety feature, triggered either automatically by eval regressions or manually by on-call engineers as a panic button.
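One way the four layers could compose into a single flag check is sketched below. The precedence order (kill switch first, then per-user override, then cohort, then percentage) is an assumption consistent with "instant off across all traffic", not a description of any particular flag product. `FlagConfig`, `bucket`, and `is_enabled` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FlagConfig:
    """Hypothetical flag record; field names are illustrative only."""
    name: str
    kill_switch: bool = False                            # Layer 4: global off
    user_overrides: dict = field(default_factory=dict)   # Layer 1: user_id -> bool
    enabled_cohorts: set = field(default_factory=set)    # Layer 2: cohort names
    rollout_percent: float = 0.0                         # Layer 3: 0 to 100


def bucket(flag_name: str, user_id: str) -> float:
    """Map (flag, user) to a stable value in [0, 100) so a user keeps
    the same variant as the rollout percentage grows."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return (int(digest[:8], 16) % 10000) / 100.0


def is_enabled(flag: FlagConfig, user_id: str, cohorts: set) -> bool:
    # Assumed precedence: the kill switch beats every other layer.
    if flag.kill_switch:
        return False
    # Per-user override forces a variant on or off for debugging.
    if user_id in flag.user_overrides:
        return flag.user_overrides[user_id]
    # Any overlap between the user's cohorts and the targeted cohorts enables.
    if flag.enabled_cohorts & cohorts:
        return True
    # Otherwise fall through to the percentage rollout.
    return bucket(flag.name, user_id) < flag.rollout_percent
```

Hashing on the flag name plus the user ID keeps bucket assignments independent across flags, so the same users are not always the first to receive every new variant.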

The kill switch is the critical primitive

When a new prompt or model causes issues in production, seconds matter. Rolling forward to find a fix takes minutes to hours. Rolling back via a feature flag takes seconds.

Our default: every AI feature has a kill switch. The on-call engineer can disable the feature within 10 seconds. The disabled state falls back to the previous behavior (previous prompt, previous model, or a gracefully degraded response).

Automated kill switches: eval regression detection flips the flag off. If sampled live-traffic evals show the pass rate dropping below a threshold for N consecutive windows, the flag automatically flips to off. The engineer gets paged to investigate, not to perform an emergency rollback.
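The "N consecutive windows below threshold" rule can be sketched as a small state machine. This is a minimal illustration under stated assumptions: `EvalKillSwitch` and `page_on_call` are hypothetical names, the paging call is a placeholder print, and empty windows are skipped rather than counted as breaches.

```python
from dataclasses import dataclass


@dataclass
class EvalKillSwitch:
    threshold: float           # minimum acceptable pass rate, e.g. 0.90
    consecutive_windows: int   # N breaching windows required to trip
    enabled: bool = True
    _breaches: int = 0         # current run of breaching windows

    def record_window(self, passed: int, total: int) -> bool:
        """Feed one window of sampled eval results; return the flag state."""
        # Once tripped, stay off until a human re-enables the flag.
        # Assumption: windows with no samples are ignored.
        if not self.enabled or total == 0:
            return self.enabled
        pass_rate = passed / total
        if pass_rate < self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0   # a healthy window resets the run
        if self._breaches >= self.consecutive_windows:
            self.enabled = False
            self.page_on_call(pass_rate)
        return self.enabled

    def page_on_call(self, pass_rate: float) -> None:
        # Placeholder for a real paging integration.
        print(f"flag auto-disabled, last pass rate {pass_rate:.2%}")
```

Requiring consecutive breaches trades a slower reaction for fewer false trips on a single noisy window; the right N depends on window size and sample volume.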

AI-specific patterns

Prompt switching. A flag selects between the current production prompt and a candidate prompt. Flip it to promote the candidate; flip it back to roll back.

Model routing. Requests are routed by user cohort (users on the Free plan go to a cheaper model, Enterprise users to a premium one). This is implemented as a flag check in the gateway.
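A gateway-side flag check for plan-based routing might look like the sketch below. The plan names, model names, and `route_model` function are placeholders for illustration, not real identifiers from any gateway or provider.

```python
# Hypothetical plan-to-model mapping; all names are placeholders.
MODEL_BY_PLAN = {
    "free": "small-model",
    "enterprise": "premium-model",
}
DEFAULT_MODEL = "small-model"


def route_model(plan: str, routing_flag_on: bool = True) -> str:
    """Pick a model for a request based on the user's plan.

    When the routing flag is off, every request uses the default model,
    which gives the routing logic its own rollback path.
    """
    if not routing_flag_on:
        return DEFAULT_MODEL
    # Unknown plans fall back to the default rather than failing the request.
    return MODEL_BY_PLAN.get(plan.lower(), DEFAULT_MODEL)
```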

Feature disabling. Example: an auto-summarize feature in a docs product. Turn the flag off when summarization quality regresses; users see the feature as temporarily unavailable instead of receiving broken output.

Guardrail toggling. Put a new safety filter behind a flag. Enable it for 10% of traffic, measure the false-positive rate, and adjust before wider rollout. See the guardrails post.
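Measuring the false-positive rate for a flagged guardrail could be as simple as the sketch below. It assumes a labeled sample of requests from the 10% cohort, where each item records whether the filter blocked the request and whether a reviewer judged it actually harmful. The function name and sample format are illustrative.

```python
from typing import Iterable, Optional, Tuple


def false_positive_rate(samples: Iterable[Tuple[bool, bool]]) -> Optional[float]:
    """Compute blocked-benign / all-benign over labeled samples.

    Each sample is (blocked, actually_harmful). Returns None when the
    sample contains no benign requests, since the rate is undefined.
    """
    benign_blocked = [blocked for blocked, harmful in samples if not harmful]
    if not benign_blocked:
        return None
    return sum(benign_blocked) / len(benign_blocked)
```

For example, a sample with four benign requests of which one was blocked yields a rate of 0.25.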

Tooling options

LaunchDarkly (commercial), Unleash (OSS), Split (commercial). All have percentage rollouts, user targeting, kill switches, audit logs. Pick based on team size and budget.

Self-built flag systems are common and often justified. A simple Redis-backed flag service with a UI is 1-2 weeks of engineering. For small teams, this is often the right choice.

Key capabilities to insist on: sub-second flag propagation (kill switches depend on fast activation), audit trail (who flipped what when), backup/restore (flag state is critical config).

Flag lifecycle management

Flags accumulate. A codebase with 50+ dormant flags is a maintenance nightmare. Enforce lifecycle:

Create each flag with an expiration date (3-6 months is typical). Audit regularly: flags at 100% for 30+ days should be removed via a cleanup PR, and flags at 0% for 60+ days should be either removed or resurrected (a feature abandonment check).
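The audit rules above can be sketched as a small classifier that a scheduled job might run over all flags. `FlagStatus`, its fields, and the returned labels are hypothetical; `last_changed` is assumed to be the date the rollout percentage last changed.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FlagStatus:
    name: str
    rollout_percent: float
    last_changed: date   # assumed: when the rollout percentage last changed
    expires: date        # expiration date set at creation


def audit_flag(flag: FlagStatus, today: date) -> str:
    """Classify a flag according to the lifecycle rules."""
    unchanged_for = today - flag.last_changed
    # Fully rolled out for 30+ days: the flag check is dead code.
    if flag.rollout_percent >= 100 and unchanged_for >= timedelta(days=30):
        return "cleanup"
    # Fully off for 60+ days: remove it or revive the feature.
    if flag.rollout_percent <= 0 and unchanged_for >= timedelta(days=60):
        return "abandonment_check"
    # Past its expiration date without matching either rule above.
    if today > flag.expires:
        return "expired"
    return "ok"
```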

This discipline prevents flag sprawl. The cost of managing 500 flags is much higher than the cost of 50.

Auditability

Every flag change is logged: who made it, when, and what changed. The audit trail is immutable. In regulated environments, this log is part of compliance. In all environments, it is essential for incident investigation: "why did behavior change on Thursday?" is answered by "someone flipped flag X."
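A minimal in-memory sketch of such a log follows. `FlagChange` and `AuditLog` are hypothetical names. Frozen entries and an append-only API only approximate immutability within a process; a real deployment would presumably back this with append-only or write-once storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, List, Optional


@dataclass(frozen=True)
class FlagChange:
    flag: str        # what changed
    actor: str       # who changed it
    old_value: Any
    new_value: Any
    at: datetime     # when it changed


class AuditLog:
    """Append-only log: entries can be recorded and queried, never edited."""

    def __init__(self) -> None:
        self._entries: List[FlagChange] = []

    def record(self, flag: str, actor: str, old_value: Any, new_value: Any,
               at: Optional[datetime] = None) -> FlagChange:
        entry = FlagChange(flag, actor, old_value, new_value,
                           at or datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def changes_between(self, start: datetime, end: datetime) -> List[FlagChange]:
        """Return all changes with start <= at < end, in recorded order."""
        return [e for e in self._entries if start <= e.at < end]
```

During an incident, `changes_between` over the window in which behavior shifted gives the list of flag flips to check first.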
