AI incident post-mortems have specific elements that don't show up in standard software post-mortem templates: prompt and model details, eval coverage analysis, sample outputs, and upstream provider factors. This post walks through the template we use, what to include in each section, and how to decide what to publish externally.
The structure
Summary. One paragraph. What happened, scope, customer impact, resolution. Enough to orient any reviewer without reading further.
Timeline. Detection, diagnosis, fix with timestamps. Not a blame ledger — a factual record of what happened when. Include tool outputs (dashboards, alerts) and human actions.
Root cause. For AI, this has specifics: what changed (prompt, retrieval, model, config)? Why didn't evals catch it? Was an upstream provider factor involved (model version change, API degradation)? Multiple root causes are common; list them all.
Mitigation and action items. Immediate fix applied. New eval case added. Systemic prevention with named owner and deadline. The eval case matters most: it converts the incident into a regression test.
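One way to keep the action-item list honest is a small completeness check run before the post-mortem is marked done. The sketch below is illustrative only: the `ActionItem` fields and the three `kind` labels are assumptions made for this example, not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical kinds, mirroring the three action items named above.
REQUIRED_KINDS = ("immediate_fix", "eval_case", "systemic")


@dataclass
class ActionItem:
    description: str
    kind: str                      # one of REQUIRED_KINDS
    owner: Optional[str] = None    # named owner, required for systemic items
    due: Optional[date] = None     # deadline, required for systemic items


def missing_pieces(items: list) -> list:
    """Return human-readable problems; an empty list means the section is complete."""
    problems = []
    kinds = {item.kind for item in items}
    for required in REQUIRED_KINDS:
        if required not in kinds:
            problems.append(f"no {required} item")
    for item in items:
        if item.kind == "systemic" and (not item.owner or item.due is None):
            problems.append(
                f"systemic item lacks owner or deadline: {item.description}"
            )
    return problems
```

A post-mortem whose systemic item has no owner or no date would show up here as a problem string rather than slipping through review.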
AI-specific additions
Prompt diff (before/after). Shows exactly what changed, usually the core of the incident. Without this, readers can't reason about what went wrong.
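A prompt diff can be produced with Python's standard `difflib` module. This is a minimal sketch: the two prompt strings and the version labels are made up for illustration, and a real incident would load both versions from wherever prompts are stored.

```python
import difflib


def prompt_diff(before: str, after: str,
                before_label: str = "prompt@before",
                after_label: str = "prompt@after") -> str:
    """Return a unified diff of two prompt versions, compared line by line."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=before_label,
        tofile=after_label,
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical prompt versions, for illustration only.
PROMPT_BEFORE = (
    "You are a support assistant.\n"
    "Always cite the relevant help article.\n"
    "Answer in under 100 words."
)
PROMPT_AFTER = (
    "You are a support assistant.\n"
    "Answer in under 100 words."
)

if __name__ == "__main__":
    print(prompt_diff(PROMPT_BEFORE, PROMPT_AFTER))
```

In this made-up example the diff shows a single removed line (the citation instruction), which is the kind of small change that tends to sit at the core of an incident.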
Model version and config. Which model version was serving. Temperature, top_p, max_tokens settings. System prompt version. All relevant for diagnosis and prevention.
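Recording the serving configuration as a structured snapshot makes it easy to compare what was live before and during the incident. The sketch below assumes a hypothetical `ServingConfig` record; the field names follow the settings listed above, and the model and prompt version strings are placeholders.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ServingConfig:
    model_version: str
    system_prompt_version: str
    temperature: float
    top_p: float
    max_tokens: int


def config_changes(before: ServingConfig, after: ServingConfig) -> dict:
    """Map each changed field name to its (before, after) values."""
    b, a = asdict(before), asdict(after)
    return {key: (b[key], a[key]) for key in b if b[key] != a[key]}


# Placeholder values, for illustration only.
CONFIG_BEFORE = ServingConfig("example-model-v1", "support-prompt-12", 0.2, 1.0, 512)
CONFIG_DURING = ServingConfig("example-model-v1", "support-prompt-13", 0.7, 1.0, 512)

if __name__ == "__main__":
    for field, (old, new) in config_changes(CONFIG_BEFORE, CONFIG_DURING).items():
        print(f"{field}: {old} -> {new}")
```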
Sample failing outputs (redacted). Two or three representative examples of what went wrong. Makes the incident concrete for reviewers. Redact anything sensitive.
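A first redaction pass can be scripted, as long as a human still reviews the result. The sketch below is deliberately simple and is an assumption about what counts as sensitive: it masks only email addresses and long digit runs, so it will miss names, addresses, and anything domain-specific.

```python
import re

# Illustrative patterns only; not a complete definition of sensitive data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER = re.compile(r"\b\d{6,}\b")  # account numbers, order ids, and similar


def redact(text: str) -> str:
    """Mask emails and long digit runs in a sample output."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com about order 12345678."
    print(redact(sample))
```

Treat the output as a draft for manual review, not as a guarantee that the sample is safe to share.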
Eval coverage gap analysis. Which case was missing from evals that would have caught this? This is where the highest-value lesson lives. Adding that case prevents recurrence.
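The missing case can be written down as data plus a check, so it runs with the rest of the eval suite. This is a sketch under assumptions: `EvalCase`, `check_case`, and the substring-based assertions are hypothetical, and the two `generate` stubs stand in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt_input: str
    must_contain: tuple = ()       # substrings the output has to include
    must_not_contain: tuple = ()   # substrings the output must never include


def check_case(case: EvalCase, generate: Callable[[str], str]) -> list:
    """Run one eval case; return a list of failure messages (empty means pass)."""
    output = generate(case.prompt_input)
    failures = []
    for required in case.must_contain:
        if required not in output:
            failures.append(f"missing required text: {required!r}")
    for forbidden in case.must_not_contain:
        if forbidden in output:
            failures.append(f"contains forbidden text: {forbidden!r}")
    return failures


# Hypothetical case derived from an incident where citations were dropped.
CITATION_CASE = EvalCase(
    case_id="incident-citation-missing",
    prompt_input="How do I reset my password?",
    must_contain=("help article",),
)


def broken_generate(_: str) -> str:
    """Stub standing in for the model during the incident."""
    return "Click 'Forgot password' on the login page."


def fixed_generate(_: str) -> str:
    """Stub standing in for the model after the fix."""
    return "Click 'Forgot password'. See the help article on account recovery."
```

Substring checks are the crudest possible assertion; a real suite might use a grader or structured output, but the shape (one incident, one named case) stays the same.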
Blameless — always
A post-mortem is about systems, not people. 'Engineer X made a mistake' is not a useful root cause. 'The review process allowed a prompt change without running the eval suite' is. Focus on process and tooling.
Teams that blame produce post-mortems once, then nobody wants to touch incidents again. Teams that are genuinely blameless produce post-mortems routinely and get better every time.
Internal vs external
Always publish internally. Every incident. Searchable, linkable, findable. Future engineers learn from past incidents this way.
External publication depends on visibility. If customers noticed, or the incident was newsworthy, publish externally. Use an abridged version that removes internal details but covers what happened, who was affected, what we did, and how we'll prevent recurrence.
External post-mortems are trust-building. Customers trust vendors who are transparent about failures far more than vendors who are silent. The short-term pain of acknowledgment is a long-term trust deposit.
Cadence and review
Write the post-mortem within 72 hours of incident resolution. The initial draft may be incomplete, and that's fine; fill it in as the investigation proceeds. Review it in the weekly eng meeting (or equivalent) to surface patterns across incidents.
Quarterly retrospective across all post-mortems: what themes repeat? Where do we keep getting burned? This drives larger systemic investments that single post-mortems miss.
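If each post-mortem carries a few root-cause tags, the "what themes repeat?" question can start from a simple count. The sketch below assumes a hypothetical record shape (a dict with a `root_cause_tags` list), and the tags shown are invented examples.

```python
from collections import Counter


def repeated_themes(postmortems: list, min_count: int = 2) -> list:
    """Return (tag, count) pairs for tags seen in at least min_count post-mortems."""
    counts = Counter(
        tag
        for pm in postmortems
        for tag in set(pm["root_cause_tags"])  # count each tag once per incident
    )
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Invented records, for illustration only.
QUARTER = [
    {"id": "pm-1", "root_cause_tags": ["prompt-change", "eval-gap"]},
    {"id": "pm-2", "root_cause_tags": ["provider-model-update", "eval-gap"]},
    {"id": "pm-3", "root_cause_tags": ["prompt-change", "eval-gap", "eval-gap"]},
]

if __name__ == "__main__":
    for tag, n in repeated_themes(QUARTER):
        print(f"{tag}: {n}")
```

The count only points at where to look; deciding which theme deserves a systemic investment is still a discussion for the retrospective.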