eazyware
Engineering·October 30, 2023·11 min read

On-call practices for AI systems

What to page on, what to auto-resolve, runbooks for AI incidents. On-call for AI workloads differs meaningfully from traditional services.

Kushal R.
Engineering lead

On-call for AI systems differs from on-call for traditional services. Quality regressions join availability as reasons to wake someone up. Provider dependencies complicate ownership. Cost anomalies need attention alongside errors. This post covers the on-call practices we use for AI workloads: what to page on, what to auto-remediate, and how to structure runbooks that work at 3am.

Alert tiers
AI on-call alert tiers:
SEV-1 (page): model API down, p95 latency at 3x SLO, mass eval failure.
SEV-2 (page during business hours): elevated error rate, cost anomaly, quality drift.
SEV-3 (ticket): partial degradation, non-critical feature affected.
Info (dashboard): usage spikes, new patterns, cost trends.
Each alert gets a runbook: auto-remediation for known failures, a human for unknowns.
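The tiers above can be written down as a simple lookup so that routing is data rather than tribal knowledge. This is a minimal sketch: the alert names and the `route` helper are hypothetical, and defaulting unknown alerts to a ticket is an assumption, not something the tiers prescribe.

```python
from enum import Enum


class Tier(Enum):
    SEV1 = "page immediately"
    SEV2 = "page during business hours"
    SEV3 = "open a ticket"
    INFO = "dashboard only"


# Hypothetical alert names; the mapping mirrors the tiers listed above.
ALERT_TIERS = {
    "provider_api_down": Tier.SEV1,
    "p95_latency_3x_slo": Tier.SEV1,
    "mass_eval_failure": Tier.SEV1,
    "elevated_error_rate": Tier.SEV2,
    "cost_anomaly": Tier.SEV2,
    "quality_drift": Tier.SEV2,
    "partial_degradation": Tier.SEV3,
    "usage_spike": Tier.INFO,
    "cost_trend": Tier.INFO,
}


def route(alert_name: str) -> Tier:
    # Assumption: an unmapped alert becomes a ticket so that someone triages it.
    return ALERT_TIERS.get(alert_name, Tier.SEV3)
```

Keeping the mapping in one place also makes it easy to review in a postmortem when an alert turned out to be in the wrong tier.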

What to page on

Provider API down. If Anthropic, OpenAI, or Google is down and you depend on it, your product is down. Page immediately; failover logic should fire automatically, but human awareness still matters.

p95 latency at 3x SLO. At this level users feel the degradation. It requires investigation and likely remediation.

Mass eval failure. If your continuous eval suite regresses significantly against the production model, a quality issue is affecting users right now. Page.

Error rate spike. Application errors are above baseline, especially if the rate is still growing. Investigate.

Cost anomaly. Spending at 5x the normal rate. It may be a bug, abuse, or misconfiguration. Page to investigate.
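The numeric conditions above (3x the latency SLO, 5x normal spend) translate directly into checks. The sketch below is illustrative only: the `Snapshot` fields and the `page_reasons` function are hypothetical names, and the 10-point eval drop used for "mass eval failure" is an assumed threshold, since the post does not give one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    provider_up: bool
    p95_latency_ms: float
    latency_slo_ms: float
    eval_pass_rate: float           # current pass rate of the continuous eval suite
    eval_baseline_pass_rate: float  # pass rate before the suspected regression
    spend_per_hour: float
    normal_spend_per_hour: float


def page_reasons(s: Snapshot, eval_drop_threshold: float = 0.10) -> List[str]:
    """Return the paging conditions that currently hold (empty list = no page)."""
    reasons = []
    if not s.provider_up:
        reasons.append("provider_api_down")
    if s.p95_latency_ms >= 3 * s.latency_slo_ms:
        reasons.append("p95_latency_3x_slo")
    # Assumed threshold: a drop of 10 points or more counts as a mass failure.
    if s.eval_baseline_pass_rate - s.eval_pass_rate >= eval_drop_threshold:
        reasons.append("mass_eval_failure")
    if s.spend_per_hour >= 5 * s.normal_spend_per_hour:
        reasons.append("cost_anomaly")
    return reasons
```

Returning the list of reasons rather than a boolean lets the page itself say why it fired, which is what on-call wants to see first.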

What not to page on

Individual request failures. Retries should handle these. Patterns matter, not individual events.

Minor latency variation. p95 within SLO bounds is normal. Don't page on noise.

New usage patterns. Information to track; not necessarily alerts.

Cost within forecast. Expected variation is not alertable.
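One way to alert on patterns rather than individual failures is to evaluate the error rate over a sliding window of recent requests. This is a rough sketch; the class name, window size, 5% threshold, and minimum sample count are all assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the outcome of the most recent requests and alerts on the rate."""

    def __init__(self, size: int = 200, threshold: float = 0.05, min_samples: int = 50):
        self.outcomes = deque(maxlen=size)  # True = success, False = failure
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def should_alert(self) -> bool:
        n = len(self.outcomes)
        if n < self.min_samples:
            # Too little data: a single failure would look like a huge rate.
            return False
        failures = n - sum(self.outcomes)
        return failures / n > self.threshold
```

A single failed request among a hundred leaves the rate at 1% and stays quiet; ten failures in the same window cross the assumed 5% line.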

Auto-remediation patterns

Provider failover. When errors from the primary provider (e.g. Anthropic) exceed a threshold, route to the secondary (e.g. OpenAI). Monitor and page if the failure persists, but the initial response is automated.
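A minimal sketch of threshold-based failover follows. It assumes each provider is wrapped in a plain callable that raises on error; `FailoverRouter` and the threshold of three consecutive errors are hypothetical, and no real provider SDK is used. Recovery (probing the primary and routing back) is deliberately left out.

```python
from typing import Callable


class FailoverRouter:
    """Routes to the secondary once the primary fails repeatedly."""

    def __init__(self, primary: Callable[[str], str],
                 secondary: Callable[[str], str],
                 error_threshold: int = 3):
        self.primary = primary
        self.secondary = secondary
        self.error_threshold = error_threshold
        self.primary_errors = 0  # consecutive primary failures

    @property
    def failed_over(self) -> bool:
        return self.primary_errors >= self.error_threshold

    def complete(self, prompt: str) -> str:
        if not self.failed_over:
            try:
                result = self.primary(prompt)
                self.primary_errors = 0  # a success resets the streak
                return result
            except Exception:
                self.primary_errors += 1
        # Either the threshold has tripped or this call just failed.
        return self.secondary(prompt)
```

In this sketch an exception from the secondary propagates to the caller, which is the point where a page is warranted: both providers are failing and automation has run out of options.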

Rate limiting. A traffic spike triggers rate limiting, which prevents cost explosions and provider throttling. Page if the spike is sustained.
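A token bucket is one common way to implement this kind of limit; the post does not say which algorithm is used, so treat the following as an illustrative sketch. The `TokenBucket` class and its parameters are assumptions, and the clock is injectable only to make the behaviour easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing a `cost` proportional to the expected token count of a request, rather than 1 per request, would make the limiter track spend more closely than raw request volume.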

Cache warming. After a deployment or outage, warm caches automatically. See our post on prompt cache warming.

Model routing. If a specific model is degraded, route to a fallback. The quality SLO is maintained; notify the team.
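Quality-based routing can be sketched as picking the first model in a preference list whose recent eval score still meets the quality SLO. The function name, the model names in the usage note, and the idea of a single numeric eval score per model are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple


def pick_model(preference: Sequence[str],
               eval_scores: Dict[str, float],
               quality_slo: float) -> Tuple[Optional[str], bool]:
    """Return (model, is_fallback); model is None if nothing meets the SLO."""
    for index, model in enumerate(preference):
        # A model with no recent score is treated as failing the SLO.
        if eval_scores.get(model, 0.0) >= quality_slo:
            return model, index > 0
    return None, True
```

For example, `pick_model(["model-large", "model-small"], scores, 0.90)` returns the fallback with `is_fallback=True` when only the second model clears 0.90. That flag is the hook for notifying the team, and a `None` result means no model meets the SLO and a human needs to decide.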

Runbooks that work at 3am

One runbook per alert type. Link from alert to runbook directly. Don't make on-call search for documentation.

Step-by-step: check X, if Y do Z, if A do B. Decision tree, not prose. Tired humans need structure.

Include diagnostic queries, kubectl commands, dashboard URLs. Direct links to what on-call needs.
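A decision-tree runbook can be kept as structured data, which makes it easy to render, link from an alert, and review after an incident. The sketch below is hypothetical throughout: the `Check` and `Action` types, the questions, and the placeholder links in angle brackets are illustrative, not taken from a real runbook.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Action:
    instruction: str


@dataclass
class Check:
    question: str
    how: str  # the query, command, or dashboard link used to answer the question
    if_yes: Union["Check", Action]
    if_no: Union["Check", Action]


def walk(node: Union[Check, Action], answers: Dict[str, bool]) -> List[str]:
    """Follow the tree using the given answers; return the path taken."""
    steps = []
    while isinstance(node, Check):
        answer = answers[node.question]
        steps.append(f"{node.question} ({node.how}) -> {'yes' if answer else 'no'}")
        node = node.if_yes if answer else node.if_no
    steps.append(node.instruction)
    return steps


# Hypothetical runbook for a "provider API down" alert.
PROVIDER_DOWN = Check(
    question="Is the provider status page reporting an incident?",
    how="<provider status page URL>",
    if_yes=Check(
        question="Has automatic failover moved traffic to the secondary?",
        how="<routing dashboard URL>",
        if_yes=Action("Monitor the secondary and post a customer update."),
        if_no=Action("Trigger failover manually, then escalate to secondary on-call."),
    ),
    if_no=Action("Check our own credentials, network, and recent deploys."),
)
```

Each `Check` carries its own `how` field so the diagnostic query or link sits next to the question it answers, rather than in a separate document.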

Escalation paths. When to wake up senior engineers, when to page an exec, when to declare an incident. Clear thresholds.

Update after incidents. Every incident should improve the runbook. Retros feed into documentation.

Handling provider dependencies

You don't own the provider. Their status pages, their RCA, their SLA. You own communication to your customers.

Customer comms during provider outage. Status page, Twitter/X, proactive emails. Acknowledge impact even when root cause isn't yours.

SLO accounting. A provider outage counts against your availability. Either absorb the hit or negotiate a credit under the provider's SLA.
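To make the accounting concrete, here is the error-budget arithmetic as a short sketch. The 99.9% SLO, 30-day window, and 30-minute outage are illustrative numbers, not figures from the post.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(outage_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget used up by one outage."""
    return outage_minutes / error_budget_minutes(slo, window_days)


# Illustrative: a 99.9% SLO over 30 days allows about 43.2 minutes of downtime,
# so a 30-minute provider outage uses roughly 69% of the budget.
budget = error_budget_minutes(0.999)
share = budget_consumed(30, 0.999)
```

Seen this way, a single provider outage of half an hour leaves little room for your own incidents in the same month, which is the practical argument for the multi-provider setup.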

Multi-provider architecture. If uptime is critical, run primary-secondary across providers. More complex but essential for enterprise SLOs.

Rotation structure

Primary and secondary on-call. Primary handles pages; secondary backs up for escalation. Weekly rotation is typical.

AI-specific on-call skills. Engineers on rotation need to understand model serving, eval systems, and cost tracking. Traditional SRE skills alone are insufficient.

Cross-training. AI engineers learn on-call; SREs learn AI systems. Sustainable rotation requires shared capability.

Compensation. Paid on-call time is the industry norm. Burnout is real; compensate and rotate appropriately.

Postmortem practices

Blameless culture. Root causes, not finger-pointing. System issues, not individual mistakes.

Timeline reconstruction. Precise event sequence. What happened when? Often surprising.

Action items with owners. Each fix has an owner, deadline, follow-up mechanism. Things that don't get owned don't get done.

Share broadly. Postmortems are learning. Distribute them across engineering, and to customers when appropriate.

On-call health metrics

Pages per rotation. Track the trend over time. Rising pages per week signal system deterioration or expanded scope; investigate.

Time to acknowledge, time to mitigate, time to resolve. Tracks responsiveness at each step.
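These three durations fall out of four timestamps per incident. The sketch below assumes all three are measured from the moment the page fired, which is a common convention but not one the post specifies; the `Incident` fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class Incident:
    paged_at: datetime
    acked_at: datetime
    mitigated_at: datetime
    resolved_at: datetime


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def response_times(incident: Incident) -> Dict[str, float]:
    """Minutes from the page to each milestone (assumed convention)."""
    return {
        "time_to_acknowledge": minutes_between(incident.paged_at, incident.acked_at),
        "time_to_mitigate": minutes_between(incident.paged_at, incident.mitigated_at),
        "time_to_resolve": minutes_between(incident.paged_at, incident.resolved_at),
    }
```

Tracking the three separately shows where time goes: a slow acknowledgement points at paging or rotation problems, while a long gap between mitigation and resolution usually points at the system or the runbook.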

Action item completion rate. Shows whether postmortem actions get followed through or drift.

On-call fatigue. Survey team quarterly. Burnout predicts turnover.
