AI incident post-mortems have specific elements that don't show up in standard software post-mortem templates: prompt and model details, eval coverage analysis, sample outputs, and upstream provider factors. This post covers the template we use, what to include in each section, and what to consider before publishing externally.

Template

Summary → timeline → root cause with AI specifics → mitigation and action items. Plus AI-specific additions: prompt diff, model version, sample failing outputs, eval coverage gap.

The structure

Summary. One paragraph. What happened, scope, customer impact, resolution. Enough to orient any reviewer without reading further.

Timeline. Detection, diagnosis, fix with timestamps. Not a blame ledger — a factual record of what happened when. Include tool outputs (dashboards, alerts) and human actions.

Root cause. For AI, this has specifics: what changed (prompt, retrieval, model, config)? Why didn't evals catch it? Was an upstream provider factor involved (model version change, API degradation)? Multiple root causes are common; list them all.

Mitigation and action items. Immediate fix applied. New eval case added. Systemic prevention with a named owner and deadline. The eval case is the most important item: it converts the incident into a regression test.
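
One way to keep drafts honest about the four sections above is to model the post-mortem as data and check it for gaps before review. This is a minimal sketch, not our actual tooling: the class names, field names, and the `missing_sections` helper are all hypothetical, and the sample incident is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str                  # a named person or team
    deadline: date
    is_eval_case: bool = False  # True when the item adds a regression eval


@dataclass
class PostMortem:
    summary: str
    timeline: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)


def missing_sections(pm: PostMortem) -> list[str]:
    """Return human-readable gaps; an empty list means nothing is missing."""
    gaps = []
    if not pm.summary.strip():
        gaps.append("summary is empty")
    if not pm.timeline:
        gaps.append("timeline has no entries")
    if not pm.root_causes:
        gaps.append("no root cause listed")
    if not any(item.is_eval_case for item in pm.action_items):
        gaps.append("no action item adds an eval case")
    for item in pm.action_items:
        if not item.owner.strip():
            gaps.append(f"action item without owner: {item.description}")
    return gaps


# Invented example: a draft that has a fix but no regression eval yet.
draft = PostMortem(
    summary="Support bot quoted the refund policy for the wrong region.",
    timeline=["14:02 alert fired", "14:20 prompt change rolled back"],
    root_causes=["Prompt edit shipped without running the eval suite"],
    action_items=[
        ActionItem("Add CI gate for prompt changes", "platform-team", date(2025, 1, 31)),
    ],
)
print(missing_sections(draft))
```

Here the check flags the draft because no action item adds an eval case, which is the gap this template treats as most important.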

AI-specific additions

Prompt diff (before/after). Shows exactly what changed, usually the core of the incident. Without this, readers can't reason about what went wrong.
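
If prompts are stored as plain text, the before/after diff can be generated with Python's standard `difflib` module. The sketch below assumes that storage model; the `prompt_diff` helper and the two sample prompts are made up for illustration.

```python
import difflib


def prompt_diff(before: str, after: str) -> str:
    """Return a unified diff of two prompt versions, compared line by line."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="prompt@before",
        tofile="prompt@after",
        lineterm="",
    )
    return "\n".join(diff)


# Invented prompts: the "after" version silently drops one instruction.
before = (
    "You are a support assistant.\n"
    "Always cite the policy document.\n"
    "Answer concisely."
)
after = "You are a support assistant.\nAnswer concisely."

print(prompt_diff(before, after))
```

A line-level diff like this is usually enough for the post-mortem. For long single-line prompts, splitting on sentences before diffing may read better.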

Model version and config. Which model version was serving. Temperature, top_p, max_tokens settings. System prompt version. All relevant for diagnosis and prevention.
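
A serving-config snapshot is easy to capture as structured data and paste into the post-mortem. The following is a sketch under assumptions: the `ServingConfig` class and `changed_fields` helper are hypothetical, and the model and prompt version strings are placeholders rather than real identifiers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServingConfig:
    model_version: str
    system_prompt_version: str
    temperature: float
    top_p: float
    max_tokens: int


def snapshot(config: ServingConfig) -> str:
    """Serialize the config as stable, sorted JSON for the post-mortem."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)


def changed_fields(old: ServingConfig, new: ServingConfig) -> dict:
    """Map each changed field name to its (old, new) values."""
    old_d, new_d = asdict(old), asdict(new)
    return {k: (old_d[k], new_d[k]) for k in old_d if old_d[k] != new_d[k]}


# Placeholder values, not real model identifiers.
last_known_good = ServingConfig("example-model-v1", "sys-prompt-12", 0.2, 1.0, 512)
during_incident = ServingConfig("example-model-v2", "sys-prompt-12", 0.7, 1.0, 512)

print(snapshot(during_incident))
print(changed_fields(last_known_good, during_incident))
```

Comparing the incident-time config against the last known good one narrows down which settings are worth investigating.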

Sample failing outputs (redacted). Two or three representative examples of what went wrong. Makes the incident concrete for reviewers. Redact anything sensitive.
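
A first redaction pass can be automated, though it should not replace a human read before publishing. This sketch assumes emails and phone numbers are the sensitive fields; the patterns are deliberately simple and will both miss some cases and over-match others (for example, long digit runs such as timestamps).

```python
import re

# Simplistic illustrative patterns; real redaction needs a reviewed ruleset.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.com or +1 (555) 010-9999 for help."
print(redact(sample))
```

Names, account IDs, and free-text details need their own rules or a manual pass, since no pattern list covers everything a model might echo back.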

Eval coverage gap analysis. Which case was missing from evals that would have caught this? This is where the highest-value lesson lives. Adding that case prevents recurrence.
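
The missing case can be written down as a regression eval tied to the incident. This is a minimal sketch assuming string-containment checks are sufficient for the failure in question; `EvalCase`, `passes`, `run_suite`, and the stub model are all hypothetical, and many incidents will need richer graders.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    incident_id: str   # hypothetical ID linking back to the post-mortem
    prompt_input: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def passes(case: EvalCase, output: str) -> bool:
    """Case-insensitive check of required and forbidden substrings."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


def run_suite(cases: list[EvalCase], generate: Callable[[str], str]) -> list[str]:
    """Return the incident IDs of the cases that fail."""
    return [c.incident_id for c in cases if not passes(c, generate(c.prompt_input))]


# Invented case: the model must cite the policy and never promise a refund.
case = EvalCase(
    incident_id="INC-EXAMPLE-1",
    prompt_input="Can I get a refund after 60 days?",
    must_contain=["policy"],
    must_not_contain=["guaranteed refund"],
)


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call, reproducing the failing behavior.
    return "Yes, you get a guaranteed refund."


print(run_suite([case], stub_model))
```

Running the new case against the pre-fix configuration first confirms it actually reproduces the incident before it is trusted as a guard.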

Blameless — always

A post-mortem is about systems, not people. 'Engineer X made a mistake' is not a useful root cause. 'The review process allowed a prompt change without running the eval suite' is. Focus on process and tooling.

Teams that blame produce post-mortems once, then nobody wants to touch incidents again. Teams that are genuinely blameless produce post-mortems routinely and get better every time.

Internal vs external

Always publish internally. Every incident. Searchable, linkable, findable. Future engineers learn from past incidents this way.

External publication depends on visibility. If customers noticed, or the incident was newsworthy, publish externally. Use an abridged version that removes internal details but covers what happened, who was affected, what we did, and how we'll prevent recurrence.

External post-mortems are trust-building. Customers tend to trust vendors who are transparent about failures more than vendors who stay silent. The short-term pain of acknowledgment is a long-term trust deposit.

Cadence and review

Publish the post-mortem within 72 hours of incident resolution. The initial draft may be incomplete, and that's fine; fill it in as the investigation proceeds. Review it in the weekly eng meeting (or equivalent) to surface patterns across incidents.
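
The 72-hour window is simple to compute and check automatically. The sketch below assumes timestamps are timezone-aware UTC values; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DRAFT_WINDOW = timedelta(hours=72)


def draft_due(resolved_at: datetime) -> datetime:
    """The post-mortem draft is due 72 hours after incident resolution."""
    return resolved_at + DRAFT_WINDOW


def is_overdue(resolved_at: datetime, now: datetime, draft_published: bool) -> bool:
    """True when no draft exists and the 72-hour window has passed."""
    return not draft_published and now > draft_due(resolved_at)


resolved = datetime(2025, 3, 3, 14, 20, tzinfo=timezone.utc)
print(draft_due(resolved).isoformat())
```

A check like this can feed a reminder to the incident owner, leaving the weekly review to focus on content rather than chasing drafts.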

Quarterly retrospective across all post-mortems: what themes repeat? Where do we keep getting burned? This drives larger systemic investments that single post-mortems miss.
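
Spotting repeated themes gets easier if each post-mortem carries a few root-cause tags. That tagging scheme is an assumption of this sketch, not part of the template above; the tag names and records are invented.

```python
from collections import Counter


def repeated_themes(postmortems: list[dict], min_count: int = 2) -> list[tuple[str, int]]:
    """Count each tag once per post-mortem and keep those seen min_count+ times."""
    counts = Counter(tag for pm in postmortems for tag in set(pm["tags"]))
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Invented quarter of post-mortems with hypothetical root-cause tags.
quarter = [
    {"id": "pm-1", "tags": ["missing-eval", "prompt-change"]},
    {"id": "pm-2", "tags": ["missing-eval", "provider-change"]},
    {"id": "pm-3", "tags": ["prompt-change", "missing-eval"]},
]

print(repeated_themes(quarter))
```

Tags that recur across several incidents are natural candidates for the larger systemic investments.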

AI incident post-mortems: what to learn, what to publish
