eazyware
Ops·December 8, 2025·10 min read

The AI-ops runbook: what to do when things break at 3am

Concrete response patterns for the six AI-specific incidents, with exact first-five-minute actions.

Kushal R.
Engineering lead

AI systems break in specific, repeating ways. Once you've been on-call for a few of these, the incident patterns become familiar. But most teams face their first AI-specific incident without a runbook. This post is that runbook: the six incident types we see most often and the exact first-five-minute response for each.

First 5 minutes

AI incident: first 5 minutes
Cost spike (10x): 1. Pause batch jobs 2. Rate-limit top 1% users 3. Check for tool loops 4. Revert last deploy
Quality regression: 1. Check eval dashboard 2. Diff vs last good commit 3. Check provider version 4. Revert prompt or model
Latency spike: 1. Check provider status 2. Per-stage metrics 3. Route to alt provider 4. Graceful degradation
Data leak: 1. Kill affected endpoint 2. Rotate keys/tokens 3. Audit scope of leak 4. Notify per policy
Tool-call storm: 1. Cap max-steps globally 2. Check specific tool 3. Disable rogue tool 4. Fix loop logic
Silent hallucination: 1. Assume in-progress 2. Sample recent outputs 3. Tighten validator 4. Add case to eval set
Common to all: screenshot dashboard → open incident → post in channel → 30-min post-mortem within 72h
Six incident categories with the specific first-five-minute actions for each. All share: screenshot dashboard, open incident, post in channel, post-mortem within 72h.

Before any incident: the minimum viable setup

You need three things before on-call rotation starts. (1) Dashboards showing per-endpoint p50/p95 latency, error rate, cost rate, eval pass rate. (2) Alert routing that pages the right person for the right severity — don't alert on noise. (3) Rollback mechanism — a single command or PR merge that reverts the last production change. If you don't have these, the first five minutes of every incident will be spent scrambling for basic visibility.
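One way to keep alerts from firing on noise is to page only on a sustained breach. The sketch below is illustrative, not a prescribed setup: the function names, the severity labels, the routing table, and the three-sample window are all assumptions.

```python
def sustained_breach(samples, threshold, consecutive=3):
    """True only if the most recent `consecutive` samples all exceed the threshold."""
    if len(samples) < consecutive:
        return False
    return all(value > threshold for value in samples[-consecutive:])


# Hypothetical routing table: severity label -> action.
ROUTES = {"sev1": "page on-call", "sev2": "post in channel", "sev3": "log only"}


def route_alert(severity, samples, threshold):
    """Return the action for an alert, or "no alert" if the breach is a single blip."""
    if not sustained_breach(samples, threshold):
        return "no alert"
    return ROUTES.get(severity, "log only")


# Made-up p95 latency samples in milliseconds, oldest first.
p95_latency_ms = [900, 4200, 950, 4100, 4300, 4500]
print(route_alert("sev1", p95_latency_ms, threshold=4000))
```

A single spike (the 4200 early in the series) does not page anyone; three consecutive breaches do.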

Incident: cost spike (10x or more)

First minute: confirm it's real. Is the dashboard showing actual spend, or stuck on an old value? Look at the underlying signal — tokens/sec, calls/sec. Second minute: identify the surface. Is one endpoint dominating? One user? One tenant? Third minute: pause non-critical. Batch jobs, background summaries, scheduled reindex — pause them all. Fourth minute: rate-limit the top 1% of users if a specific user isn't obviously the cause. Fifth minute: check for tool loops — one agent stuck in a retry storm can burn $1000/hour on a single user. If a deploy went out in the last hour, consider reverting.
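The "identify the surface" step can be sketched as a small aggregation over usage records. This assumes you can export per-call records with `endpoint`, `user`, `tenant`, and `tokens` fields; those field names and the sample data are hypothetical.

```python
from collections import defaultdict


def top_spenders(records, key, limit=3):
    """Rank values of `key` by total tokens; return (value, tokens, share_of_total)."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += rec["tokens"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, tokens, tokens / grand_total) for name, tokens in ranked[:limit]]


# Hypothetical usage records for the spike window.
records = [
    {"endpoint": "/chat", "user": "u1", "tenant": "acme", "tokens": 1200},
    {"endpoint": "/summarize", "user": "u2", "tenant": "acme", "tokens": 300},
    {"endpoint": "/chat", "user": "u1", "tenant": "acme", "tokens": 90000},
    {"endpoint": "/chat", "user": "u3", "tenant": "globex", "tokens": 500},
]

for dimension in ("endpoint", "user", "tenant"):
    print(dimension, top_spenders(records, dimension, limit=1))
```

If one value holds most of the share on any dimension, that is the surface to pause or rate-limit first.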

Incident: quality regression

Detection usually comes from eval dashboards or user complaints. First minute: check the eval dashboard. Which metric dropped? By how much? Since when? Second minute: correlate with changes — prompt PRs, model upgrades, retrieval index rebuilds, provider version updates. Third minute: sample recent outputs. Pick 10 random production traces, read them, confirm the regression is real and find the failure mode. Fourth minute: diff the suspected change. Fifth minute: revert or hotfix. Don't debug in production — roll back, then debug.
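The first and third minutes can be sketched as two small helpers: one that reports which eval metrics dropped and by how much, and one that draws a random sample of traces to read. The metric names, the 0.02 threshold, and the trace shape are assumptions, and the code assumes higher scores are better.

```python
import random


def metric_drops(baseline, current, threshold=0.02):
    """Return {metric: delta} for metrics that fell by at least `threshold`."""
    drops = {}
    for name, before in baseline.items():
        if name not in current:
            continue
        delta = round(current[name] - before, 4)
        if delta <= -threshold:
            drops[name] = delta
    return drops


def sample_traces(traces, k=10, seed=None):
    """Pick up to k traces uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


# Hypothetical eval scores from the last good commit and from now.
baseline = {"faithfulness": 0.92, "format_valid": 0.99, "helpfulness": 0.88}
current = {"faithfulness": 0.80, "format_valid": 0.98, "helpfulness": 0.89}
print(metric_drops(baseline, current))

# Hypothetical production traces; in practice these come from your trace store.
traces = [{"trace_id": i} for i in range(500)]
to_read = sample_traces(traces, k=10)
print([t["trace_id"] for t in to_read])
```

Random sampling matters here: reading only the traces users complained about tells you the failure exists, not how common it is.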

Incident: latency spike

First minute: check provider status pages. OpenAI, Anthropic, Google all publish real-time status. If they're degraded, you're propagating their slowdown to your users. Second minute: per-stage metrics. Is it retrieval? Embedding? LLM first-token? The dashboard should show where the time is going. Third minute: if the cause is provider-side, route to a fallback provider if you have multi-model routing set up. Fourth minute: if the cause is internal, check recent deploys. Fifth minute: enable graceful degradation: cached responses, a simpler model, a user-facing message about slow AI.
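Fallback routing plus graceful degradation can be sketched as an ordered list of providers with a per-call time budget. Everything here is an assumption for illustration: `call_with_fallback`, the stub providers, and the timeout values stand in for real client calls.

```python
import concurrent.futures
import time


def call_with_fallback(providers, prompt, timeout_s, degraded_message):
    """Try each (name, callable) in order; fall back to a canned message if all fail."""
    for name, call in providers:
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call, prompt)
        try:
            return name, future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            pass  # Too slow: move on. The abandoned call keeps running in its thread.
        except Exception:
            pass  # Provider raised an error: move on to the next one.
        finally:
            pool.shutdown(wait=False)
    return "degraded", degraded_message


# Stub providers standing in for real SDK calls.
def slow_provider(prompt):
    time.sleep(0.3)
    return "slow answer"


def fast_provider(prompt):
    return "fast answer to: " + prompt


result = call_with_fallback(
    [("primary", slow_provider), ("fallback", fast_provider)],
    "summarize this ticket",
    timeout_s=0.05,
    degraded_message="Our AI features are slow right now. Please try again shortly.",
)
print(result)
```

In a real system you would also set the request timeout on the provider client itself, so a timed-out call is actually cancelled rather than left running.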

Incident: data leak

This is the one where speed matters most. First minute: kill the affected endpoint. Stop the bleed. Second minute: rotate any credentials that appeared in leaked output. Third minute: audit scope — how many users, how long has it been happening, what specifically leaked. Fourth minute: notify per your breach-notification policy (this varies by jurisdiction, have it pre-written). Fifth minute: begin the investigation properly. Don't skip the first minute — the instinct to 'investigate first' is wrong for data leaks.
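For the third-minute scope audit, a rough first pass is to scan logged outputs for leak patterns and summarize who was affected and when. The two regular expressions, the log fields (`user`, `ts`, `output`), and the sample data are illustrative assumptions; a real audit needs patterns matched to what actually leaked.

```python
import re
from datetime import datetime

# Illustrative patterns only; not an exhaustive secret or PII detector.
LEAK_PATTERNS = {
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def audit_leak_scope(logs):
    """Summarize affected users, leak kinds, and the time window of matching outputs."""
    users, kinds, timestamps = set(), set(), []
    for entry in logs:
        hits = {k for k, pat in LEAK_PATTERNS.items() if pat.search(entry["output"])}
        if not hits:
            continue
        users.add(entry["user"])
        kinds |= hits
        timestamps.append(datetime.fromisoformat(entry["ts"]))
    return {
        "affected_users": sorted(users),
        "kinds": sorted(kinds),
        "first_seen": min(timestamps) if timestamps else None,
        "last_seen": max(timestamps) if timestamps else None,
    }


# Hypothetical output log entries; the key below is an obviously fake placeholder.
logs = [
    {"user": "u1", "ts": "2025-12-08T02:41:00", "output": "Use key sk-EXAMPLEEXAMPLEEXAMPLE1234 to call it."},
    {"user": "u2", "ts": "2025-12-08T03:05:00", "output": "Contact jane.doe@example.com for access."},
    {"user": "u3", "ts": "2025-12-08T03:10:00", "output": "Here is your summary."},
]
print(audit_leak_scope(logs))
```

The `first_seen` timestamp is the number to distrust most: it only reflects the logs you scanned, so widen the window before reporting scope.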

Incident: tool-call storm

First minute: cap max-steps globally — an emergency environment variable or kill-switch. Second minute: identify the tool. Which specific tool is being called in the loop? Third minute: disable that tool. Better to lose a feature than burn money. Fourth minute: identify affected users; check if it's a specific query pattern. Fifth minute: plan the fix — usually a break condition the loop should have had.
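A minimal sketch of the emergency cap and tool kill-switch, assuming an agent loop you control. The environment variable names `AGENT_MAX_STEPS` and `AGENT_DISABLED_TOOLS`, the default of 25, and the `next_action` planner interface are all hypothetical.

```python
import os

DEFAULT_MAX_STEPS = 25  # Assumed default; choose one that fits your agent.


def max_steps():
    """Read the step cap from the environment, falling back to the default."""
    try:
        value = int(os.environ.get("AGENT_MAX_STEPS", ""))
    except ValueError:
        return DEFAULT_MAX_STEPS
    return value if value >= 0 else DEFAULT_MAX_STEPS


def disabled_tools():
    """Comma-separated tool names an operator has switched off."""
    raw = os.environ.get("AGENT_DISABLED_TOOLS", "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def run_agent(next_action, tools):
    """Run tool calls until the planner returns None or the step cap is hit."""
    limit, disabled, history = max_steps(), disabled_tools(), []
    for step in range(limit):
        action = next_action(history)
        if action is None:
            return {"status": "done", "steps": step, "history": history}
        tool_name, arg = action
        if tool_name in disabled:
            history.append((tool_name, "tool disabled by operator"))
            continue
        history.append((tool_name, tools[tool_name](arg)))
    return {"status": "step_limit", "steps": limit, "history": history}


# A planner stuck in a loop: it asks for the same search forever.
def looping_planner(history):
    return ("search", "same query")


os.environ["AGENT_MAX_STEPS"] = "3"
outcome = run_agent(looping_planner, {"search": lambda q: "no results"})
print(outcome["status"], outcome["steps"])
```

The values are read at the start of each run, so in this sketch a changed cap applies to new runs after the process picks up the new environment, not to runs already in flight.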

Incident: silent hallucination

Hardest to detect, hardest to remediate. First minute: assume this has been happening for longer than you think. Second minute: sample recent outputs for the affected category. Third minute: tighten the relevant validator or guardrail. Fourth minute: consider whether to notify affected users — depends on domain and severity. Fifth minute: add the case to your eval set. This becomes the canary for the regression.
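The fifth-minute step, adding the case to the eval set, can be as small as appending one record to a JSONL file. The file layout, the field names, and the substring check below are assumptions; a real validator is usually stricter than a substring match.

```python
import json
import tempfile
from pathlib import Path


def add_eval_case(path, prompt, bad_output, must_not_contain, incident_id):
    """Append one regression case to a JSONL eval file and return it."""
    case = {
        "incident_id": incident_id,
        "prompt": prompt,
        "bad_output": bad_output,
        "must_not_contain": must_not_contain,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
    return case


def passes(case, output):
    """A case passes if none of the forbidden strings appear in the output."""
    lowered = output.lower()
    return all(s.lower() not in lowered for s in case["must_not_contain"])


# Hypothetical incident: the model invented a 90-day refund window.
with tempfile.TemporaryDirectory() as tmp:
    eval_path = Path(tmp) / "regressions.jsonl"
    case = add_eval_case(
        eval_path,
        prompt="What is the refund window?",
        bad_output="Refunds are available for 90 days.",
        must_not_contain=["90 days"],
        incident_id="INC-EXAMPLE",
    )
    loaded = [json.loads(line) for line in eval_path.read_text(encoding="utf-8").splitlines()]

print(passes(case, "Refunds are available within 30 days."))
print(passes(case, case["bad_output"]))
```

Storing the bad output alongside the case keeps the original failure readable long after the incident channel has gone quiet.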

After every incident, a post-mortem within 72 hours. Blameless. What happened, why, what we'll change. Add the incident pattern to your eval suite. Over time, your eval suite becomes the accumulated memory of every way your system has broken — which is exactly what you want it to be.
