eazyware
Ops·December 23, 2024·10 min read

AI incident post-mortems: what to learn, what to publish

AI incidents have different post-mortem patterns than software incidents. The template we use and what to include.

Kushal R.
Engineering lead

AI incident post-mortems have specific elements that don't show up in standard software post-mortem templates: prompt and model details, eval coverage analysis, sample outputs, and upstream provider factors. This post covers the template we use, what to include in each section, and what to think about when publishing externally.

Template
AI incident post-mortem template

1. Summary
   One paragraph · what happened · scope · customer impact · resolution
2. Timeline
   Detection, diagnosis, fix with timestamps (not a blame ledger)
3. Root cause (with AI specifics)
   · Prompt, retrieval, model, or config
   · Why evals didn't catch it
   · Upstream provider factor?
4. Mitigation + action items
   · Immediate fix applied
   · New eval case added
   · Systemic prevention (with owner)

AI-specific additions to standard post-mortems
   · Prompt diff (before/after)
   · Model version + config used
   · Sample failing outputs (redacted) to build intuition for reviewers
   · Eval coverage gap analysis: which case was missing

Blameless · always publish internally · publish externally if customers noticed
Summary → timeline → root cause with AI specifics → mitigation and action items. Plus AI-specific additions: prompt diff, model version, sample failing outputs, eval coverage gap.

The structure

Summary. One paragraph. What happened, scope, customer impact, resolution. Enough to orient any reviewer without reading further.

Timeline. Detection, diagnosis, fix with timestamps. Not a blame ledger — a factual record of what happened when. Include tool outputs (dashboards, alerts) and human actions.
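One way to keep the timeline factual is to record entries as structured data and sort them by timestamp. The sketch below is a minimal illustration; `TimelineEntry`, its field names, and the sample entries are hypothetical, not part of our template.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime   # when it happened (UTC)
    source: str    # e.g. "alert", "dashboard", or "human"
    what: str      # factual description, no attribution of fault


def render_timeline(entries):
    """Return the entries as text lines in chronological order."""
    ordered = sorted(entries, key=lambda e: e.at)
    return [f"{e.at:%Y-%m-%d %H:%M}Z [{e.source}] {e.what}" for e in ordered]


# Illustrative entries only; not data from a real incident.
entries = [
    TimelineEntry(datetime(2024, 1, 1, 10, 20, tzinfo=timezone.utc),
                  "human", "Prompt change rolled back"),
    TimelineEntry(datetime(2024, 1, 1, 10, 2, tzinfo=timezone.utc),
                  "alert", "Quality score alert fired"),
]
timeline_lines = render_timeline(entries)
```

Tagging each entry with its source keeps tool outputs and human actions in one record without turning it into a ledger of who did what wrong.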

Root cause. For AI, this has specifics: what changed (prompt, retrieval, model, config)? Why didn't evals catch it? Was an upstream provider factor involved (model version change, API degradation)? Multiple root causes are common; list them all.

Mitigation and action items. Immediate fix applied. New eval case added. Systemic prevention with a named owner and deadline. The eval case is the most important item: it converts the incident into a regression test.
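As a rough sketch of what "incident becomes regression test" can look like: capture the triggering input and a check on the output, then run it against whatever generates responses. `EvalCase`, `run_case`, and the refund scenario are all hypothetical, and `fake_generate` stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    incident_id: str               # links the case back to its post-mortem
    prompt_input: str              # the input that triggered the failure
    check: Callable[[str], bool]   # returns True when the output is acceptable


def run_case(case: EvalCase, generate: Callable[[str], str]) -> bool:
    """Run one regression case against a generation function."""
    return case.check(generate(case.prompt_input))


# Hypothetical incident: the assistant claimed refunds it could not authorize.
refund_case = EvalCase(
    incident_id="INC-EXAMPLE",
    prompt_input="Can I get a refund for last month?",
    check=lambda output: "refund has been issued" not in output.lower(),
)


def fake_generate(user_input: str) -> str:
    # Stand-in for the real model call (an assumption for this sketch).
    return "I can't issue refunds directly, but I can open a ticket for you."


passed = run_case(refund_case, fake_generate)
```

A substring check is the crudest possible assertion; the point is only that the case is tied to an incident ID and runs on every future change.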

AI-specific additions

Prompt diff (before/after). Shows exactly what changed, which is usually the core of the incident. Without this, readers can't reason about what went wrong.
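If prompts are stored as plain text, a before/after diff can be produced with Python's standard `difflib`. This is a minimal sketch; the two prompt strings are made up for illustration.

```python
import difflib


def prompt_diff(before: str, after: str) -> str:
    """Return a unified diff of two prompt versions."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="prompt (before)",
        tofile="prompt (after)",
        lineterm="",
    )
    return "\n".join(lines)


# Illustrative prompts, not taken from a real incident.
before = "You are a support assistant.\nAlways cite the help-center article you used."
after = "You are a support assistant.\nAnswer briefly."
diff_text = prompt_diff(before, after)
print(diff_text)
```

Removed lines show up prefixed with `-` and added lines with `+`, which is usually enough for a reviewer to spot the instruction that went missing.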

Model version and config. Which model version was serving. Temperature, top_p, max_tokens settings. System prompt version. All relevant for diagnosis and prevention.
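A small snapshot record makes these settings easy to paste into the post-mortem. The sketch below assumes the system prompt version is identified by a SHA-256 hash of its text, which is our illustrative choice here; `ServingConfig`, `snapshot`, and every value shown are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServingConfig:
    model_version: str
    temperature: float
    top_p: float
    max_tokens: int
    system_prompt_sha256: str  # hash instead of full text, to keep the record short


def snapshot(model_version, temperature, top_p, max_tokens, system_prompt):
    """Build a config record, hashing the system prompt to identify its version."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return ServingConfig(model_version, temperature, top_p, max_tokens, digest)


# Placeholder values for illustration only.
config = snapshot("example-model-2024-01", 0.2, 0.9, 512, "You are a support assistant.")
config_json = json.dumps(asdict(config), indent=2, sort_keys=True)
```

Capturing this at serving time, not reconstructing it afterwards, avoids guessing which settings were live during the incident.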

Sample failing outputs (redacted). Two or three representative examples of what went wrong. Makes the incident concrete for reviewers. Redact anything sensitive.
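A first redaction pass can be scripted before a human reviews the samples. The patterns below are a deliberately narrow sketch (emails and long digit runs only) and are not a complete PII scrubber; the sample text is invented.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{6,}\b")  # account numbers, order IDs, and similar


def redact(text: str) -> str:
    """Replace emails and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


# Invented sample output for illustration.
sample = "Sure, I emailed jane.doe@example.com about order 12345678."
redacted = redact(sample)
```

Names, addresses, and free-text identifiers will slip past patterns like these, so manual review is still needed before a sample goes into the document.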

Eval coverage gap analysis. Which case was missing from evals that would have caught this? This is where the highest-value lesson lives. Adding that case prevents recurrence.
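One simple way to frame the gap analysis, assuming eval cases carry descriptive tags: tag the incident with its characteristics and list the tags no existing case covers. The tag names and the `coverage_gaps` helper are hypothetical.

```python
def coverage_gaps(incident_tags, eval_cases):
    """Return incident tags that no existing eval case exercises."""
    covered = set()
    for case in eval_cases:
        covered.update(case["tags"])
    return sorted(set(incident_tags) - covered)


# Illustrative eval suite and incident description.
eval_cases = [
    {"name": "refund_policy_basic", "tags": {"refunds", "english"}},
    {"name": "long_context_summary", "tags": {"long-input"}},
]
incident_tags = {"refunds", "non-english", "long-input"}
gaps = coverage_gaps(incident_tags, eval_cases)
```

Here the suite covered refunds and long inputs but had no non-English case, so that is the case to add.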

Blameless — always

A post-mortem is about systems, not people. 'Engineer X made a mistake' is not a useful root cause. 'The review process allowed a prompt change without running the eval suite' is. Focus on process and tooling.

Teams that blame produce post-mortems once, then nobody wants to touch incidents again. Teams that are genuinely blameless produce post-mortems routinely and get better every time.

Internal vs external

Always publish internally. Every incident. Searchable, linkable, findable. Future engineers learn from past incidents this way.

External publication depends on impact. If customers noticed, or the incident was newsworthy, publish externally. The external version is abridged: it removes internal details but covers what happened, who was affected, what we did, and how we'll prevent recurrence.
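If the internal post-mortem is stored as structured data, the external version can be derived by keeping only the four public fields. This is a sketch under that assumption; the field names and the sample record are hypothetical.

```python
PUBLIC_FIELDS = ("what_happened", "who_was_affected", "what_we_did", "prevention")


def external_version(postmortem: dict) -> dict:
    """Keep only the externally published fields, in a fixed order."""
    missing = [f for f in PUBLIC_FIELDS if f not in postmortem]
    if missing:
        raise ValueError(f"post-mortem is missing public fields: {missing}")
    return {f: postmortem[f] for f in PUBLIC_FIELDS}


# Illustrative internal record; the extra keys stay internal.
internal = {
    "what_happened": "Responses omitted citations for about two hours.",
    "who_was_affected": "Customers using the support assistant.",
    "what_we_did": "Rolled back the prompt change.",
    "prevention": "Added an eval case that requires citations.",
    "prompt_diff": "(internal only)",
    "sample_outputs": ["(internal only)"],
}
public = external_version(internal)
```

An allowlist is the safer default here: a new internal field stays internal unless someone deliberately adds it to the public set.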

External post-mortems are trust-building. Customers trust vendors who are transparent about failures far more than vendors who are silent. The short-term pain of acknowledgment is a long-term trust deposit.

Cadence and review

Publish the post-mortem within 72 hours of incident resolution. The initial draft may be incomplete; that's fine, fill it in as the investigation proceeds. Review it in the weekly eng meeting (or equivalent) to surface patterns across incidents.
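The 72-hour window is easy to track mechanically. A minimal sketch, with hypothetical helper names and an illustrative resolution time:

```python
from datetime import datetime, timedelta, timezone

POSTMORTEM_WINDOW = timedelta(hours=72)


def postmortem_due(resolved_at: datetime) -> datetime:
    """Deadline for the first draft, counted from incident resolution."""
    return resolved_at + POSTMORTEM_WINDOW


def is_overdue(resolved_at: datetime, now: datetime) -> bool:
    return now > postmortem_due(resolved_at)


# Illustrative resolution time.
resolved = datetime(2024, 12, 20, 15, 0, tzinfo=timezone.utc)
due = postmortem_due(resolved)  # three days later, same time of day
```

Using timezone-aware timestamps avoids off-by-hours mistakes when the people involved are in different time zones.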

Quarterly retrospective across all post-mortems: what themes repeat? Where do we keep getting burned? This drives larger systemic investments that single post-mortems miss.
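If each post-mortem is tagged with themes, the quarterly pass can start from a simple count. The sketch below assumes a `themes` list per post-mortem; the theme names and records are invented for illustration.

```python
from collections import Counter


def repeating_themes(postmortems, min_count=2):
    """Return (theme, count) pairs seen in at least min_count post-mortems."""
    counts = Counter(
        theme for pm in postmortems for theme in set(pm["themes"])
    )
    return [(theme, n) for theme, n in counts.most_common() if n >= min_count]


# Illustrative quarter of post-mortems.
quarter = [
    {"id": "pm-1", "themes": ["missing-eval-case", "provider-change"]},
    {"id": "pm-2", "themes": ["missing-eval-case"]},
    {"id": "pm-3", "themes": ["missing-eval-case", "provider-change", "no-rollback"]},
]
themes = repeating_themes(quarter)
```

The counts only point at where to look; deciding which theme deserves a systemic investment is still a judgment call for the review.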
