On-call for AI systems differs from on-call for traditional services. Quality regressions join availability as reasons to wake someone up. Provider dependencies complicate ownership. Cost anomalies need attention alongside errors. This post covers the on-call practices we use for AI workloads: what to page on, what to auto-remediate, and how to structure runbooks that work at 3am.

Alert tiers

SEV-1 (page): provider API down, p95 latency at 3x SLO, mass eval failure.

SEV-2 (business hours): elevated errors, cost anomaly, quality drift.

SEV-3 (ticket): partial degradation.

Info (dashboard): trends.
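One way to make the tiers explicit is a lookup table that alerting code consults. The following is a minimal sketch, not a description of any particular alerting tool; the alert names and the `route` helper are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    """Response channel for each tier."""
    SEV1 = "page"
    SEV2 = "business_hours"
    SEV3 = "ticket"
    INFO = "dashboard"


# Hypothetical alert names, mapped to the tiers listed above.
ALERT_TIERS = {
    "provider_api_down": Severity.SEV1,
    "p95_latency_3x_slo": Severity.SEV1,
    "mass_eval_failure": Severity.SEV1,
    "elevated_errors": Severity.SEV2,
    "cost_anomaly": Severity.SEV2,
    "quality_drift": Severity.SEV2,
    "partial_degradation": Severity.SEV3,
}


def route(alert_name: str) -> str:
    """Return the response channel; unknown alerts default to the dashboard."""
    return ALERT_TIERS.get(alert_name, Severity.INFO).value
```

Defaulting unknown alerts to the dashboard is a design choice: a new signal has to be promoted into a paging tier deliberately rather than paging by accident.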

What to page on

Provider API down. If Anthropic, OpenAI, or Google is down, your product is down. Page immediately: failover logic should fire automatically, but a human still needs to know.

p95 latency at 3x SLO. This is meaningful degradation that users feel. It requires investigation and likely remediation.

Mass eval failure. If your continuous eval suite regresses significantly on the production model, the quality issue is affecting users now. Page.

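Error rate spike. Application errors well above baseline, especially if still growing, warrant a page. A stable, modest elevation can wait for business hours.

Cost anomaly. Spending at 5x the normal rate is possibly a bug, possibly abuse, possibly misconfiguration. Page to investigate. Smaller anomalies can wait for business hours.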
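The latency and cost conditions above reduce to simple threshold checks. Here is a minimal sketch, assuming you already collect per-request latencies and an hourly spend figure; the function names and the nearest-rank percentile method are illustrative choices, not a prescribed implementation.

```python
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_breach(latencies_ms: Sequence[float], slo_ms: float,
                   multiplier: float = 3.0) -> bool:
    """True when observed p95 reaches `multiplier` times the SLO."""
    return p95(latencies_ms) >= multiplier * slo_ms


def cost_breach(spend_per_hour: float, baseline_per_hour: float,
                multiplier: float = 5.0) -> bool:
    """True when spend reaches `multiplier` times a known, positive baseline."""
    return baseline_per_hour > 0 and spend_per_hour >= multiplier * baseline_per_hour
```

In practice these checks usually live in the monitoring system's own rule language; the Python form just makes the thresholds concrete.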

What not to page on

Individual request failures. Retries should handle these. The pattern matters, not individual events.

Minor latency variation. p95 within SLO bounds is normal. Don't page on noise.

New usage patterns. These are information to track, not necessarily alerts.

Cost within forecast. Expected variation is not alertable.
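Alerting on the pattern rather than on individual failures can be sketched as a sliding window over recent request outcomes. The class name, window size, and thresholds below are assumptions for illustration only.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the failure rate over the most recent `size` requests."""

    def __init__(self, size: int = 100, threshold: float = 0.05,
                 min_samples: int = 20):
        self.outcomes = deque(maxlen=size)  # oldest outcomes drop off automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def should_alert(self) -> bool:
        # Too few samples means a single failure would dominate the rate.
        if len(self.outcomes) < self.min_samples:
            return False
        return sum(self.outcomes) / len(self.outcomes) > self.threshold
```

The `min_samples` guard is what keeps one failed request on a quiet night from looking like a 100% error rate.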

Auto-remediation patterns

Provider failover. When errors from the primary provider (Anthropic) exceed a threshold, route to the secondary (OpenAI). Monitor and page if the failure persists, but the initial response is automated.

Rate limiting. A traffic spike triggers rate limiting, which prevents cost explosions and provider throttling. Page if the spike is sustained.

Cache warming. After a deployment or outage, automatically warm caches. See the prompt cache warming post.

Model routing. If a specific model is degraded, route to a fallback. The quality SLO is maintained; notify the team.
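A token bucket is one common way to implement this kind of rate limiting. The sketch below is a generic version, not our production limiter; the class name and parameters are assumptions, and the injectable `clock` exists only to make the behavior testable.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For LLM traffic, `cost` could be an estimated token count rather than 1 per request, so one oversized prompt consumes more of the budget than a short one.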
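The failover and routing patterns above share one shape: try the primary, count failures, and switch once a threshold is crossed. This is a minimal sketch under stated assumptions: the `FailoverRouter` class is hypothetical, providers are modeled as plain callables rather than real SDK clients, and there is no recovery logic for switching back to the primary.

```python
from typing import Callable


class FailoverRouter:
    """Routes to a secondary provider once the primary fails repeatedly."""

    def __init__(
        self,
        primary: Callable[[str], str],
        secondary: Callable[[str], str],
        max_consecutive_errors: int = 3,
        notify: Callable[[str], None] = print,
    ):
        self.primary = primary
        self.secondary = secondary
        self.max_consecutive_errors = max_consecutive_errors
        self.notify = notify
        self.consecutive_errors = 0
        self.failed_over = False

    def call(self, prompt: str) -> str:
        if not self.failed_over:
            try:
                result = self.primary(prompt)
                self.consecutive_errors = 0  # any success resets the count
                return result
            except Exception as exc:  # a real client would catch narrower errors
                self.consecutive_errors += 1
                if self.consecutive_errors >= self.max_consecutive_errors:
                    self.failed_over = True
                    self.notify(f"failed over to secondary: {exc}")
        # Either already failed over, or this single request fell through.
        return self.secondary(prompt)
```

The `notify` hook is where the "monitor and page if the failure persists" step would attach; here it fires once, at the moment of failover.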

Runbooks that work at 3am

One runbook per alert type. Link from alert to runbook directly. Don't make on-call search for documentation.

Step-by-step: check X; if Y, do Z; if A, do B. Write a decision tree, not prose. Tired humans need structure.

Include diagnostic queries, kubectl commands, and dashboard URLs: direct links to what on-call needs.

Escalation paths. When to wake up senior engineers, when to page an exec, when to declare an incident. Clear thresholds.

Update after incidents. Every incident should improve the runbook. Retros feed into documentation.
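A decision-tree runbook can also be stored as data, which keeps the "check X, if Y do Z" structure enforceable and easy to review. This is an illustrative sketch: the `Check` and `Action` types, the `walk` helper, and the example runbook content are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Action:
    instruction: str


@dataclass
class Check:
    question: str
    key: str  # name of the yes/no fact this check depends on
    if_yes: Union["Check", Action]
    if_no: Union["Check", Action]


def walk(node: Union[Check, Action], facts: Dict[str, bool]) -> List[str]:
    """Follow the tree using observed facts; return the path taken."""
    steps = []
    while isinstance(node, Check):
        answer = facts[node.key]
        steps.append(f"{node.question} -> {'yes' if answer else 'no'}")
        node = node.if_yes if answer else node.if_no
    steps.append(node.instruction)
    return steps


# Hypothetical runbook for a provider-error alert.
PROVIDER_ERRORS = Check(
    question="Is the provider status page reporting an incident?",
    key="provider_incident",
    if_yes=Check(
        question="Did automatic failover fire?",
        key="failover_active",
        if_yes=Action("Monitor the secondary provider; post a status page update."),
        if_no=Action("Trigger failover manually; escalate to secondary on-call."),
    ),
    if_no=Action("Check recent deploys and roll back the latest change."),
)
```

Calling `walk(PROVIDER_ERRORS, {"provider_incident": True, "failover_active": False})` returns the two answered checks followed by the manual-failover instruction, which also doubles as a timeline entry for the postmortem.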

Handling provider dependencies

You don't own the provider. Their status pages, their RCA, their SLA. You own communication to your customers.

Customer comms during a provider outage. Status page, Twitter/X, proactive emails. Acknowledge impact even when the root cause isn't yours.

SLO accounting. A provider outage counts against your availability. Either absorb it or negotiate credit against the provider's SLA.
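The arithmetic behind SLO accounting is worth making concrete. The sketch below uses illustrative numbers (a 99.9% availability SLO over a 30-day month and a 90-minute provider outage); neither figure comes from a real incident.

```python
def error_budget_minutes(slo: float, period_minutes: float) -> float:
    """Downtime allowed by an availability SLO over the period."""
    return (1.0 - slo) * period_minutes


def budget_consumed(downtime_minutes: float, slo: float,
                    period_minutes: float) -> float:
    """Fraction of the error budget used; above 1.0 means the SLO is blown."""
    return downtime_minutes / error_budget_minutes(slo, period_minutes)


# Illustrative numbers only.
MONTH_MINUTES = 30 * 24 * 60                       # 43,200 minutes
budget = error_budget_minutes(0.999, MONTH_MINUTES)  # about 43.2 minutes
used = budget_consumed(90, 0.999, MONTH_MINUTES)     # roughly 2.08x the budget
```

Under these assumptions, a single 90-minute provider outage consumes about twice the monthly budget, which is why the choice between absorbing the outage and claiming SLA credit needs to be made before the incident, not during it.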

Multi-provider architecture. If uptime is critical, run primary-secondary across providers. More complex but essential for enterprise SLOs.

Rotation structure

Primary and secondary on-call. Primary handles pages; secondary backs up for escalation. Weekly rotation is typical.

AI-specific on-call skills. Engineers on rotation need to understand model serving, eval systems, and cost tracking. Traditional SRE skills alone are insufficient.

Cross-training. AI engineers learn on-call; SREs learn AI systems. Sustainable rotation requires shared capability.

Compensation. Paid on-call time is the industry norm. Burnout is real; compensate and rotate appropriately.

Postmortem practices

Blameless culture. Root causes, not finger-pointing. System issues, not individual mistakes.

Timeline reconstruction. Precise event sequence. What happened when? Often surprising.

Action items with owners. Each fix has an owner, a deadline, and a follow-up mechanism. Things that don't get owned don't get done.

Share broadly. Postmortems are learning material. Distribute them across engineering, and to customers when appropriate.

On-call health metrics

Pages per rotation. Track the trend over time. Rising pages per week signal system deterioration or expanded scope; investigate.

Time to acknowledge, time to mitigate, time to resolve. These track responsiveness at each step.

Action item completion rate. Shows whether postmortem actions get completed or drift.

On-call fatigue. Survey team quarterly. Burnout predicts turnover.
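The response-time metrics above can be computed directly from incident timestamps. This is a minimal sketch assuming each incident record carries four timestamps; the `Incident` fields and the choice of median (rather than mean) are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Dict, List


@dataclass
class Incident:
    paged_at: datetime
    acknowledged_at: datetime
    mitigated_at: datetime
    resolved_at: datetime


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60.0


def response_times(incidents: List[Incident]) -> Dict[str, float]:
    """Median minutes from page to acknowledge, mitigate, and resolve."""
    return {
        "time_to_acknowledge": median(
            [_minutes(i.paged_at, i.acknowledged_at) for i in incidents]),
        "time_to_mitigate": median(
            [_minutes(i.paged_at, i.mitigated_at) for i in incidents]),
        "time_to_resolve": median(
            [_minutes(i.paged_at, i.resolved_at) for i in incidents]),
    }
```

Median is used here because one long-running incident can distort a mean; tracking a high percentile alongside it would surface those outliers separately.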

