
Every significant website outage deserves a post-incident report (also called a post-mortem or PIR). Without one, you lose the opportunity to learn from the incident, prevent recurrence, and build team knowledge.
A good post-incident report isn't about assigning blame — it's a structured process for understanding what happened and making systemic improvements. This guide provides a proven template and process.
Organizations that consistently write post-mortem reports build progressively more reliable systems. They accumulate knowledge about failure modes, catch systemic weaknesses before they cause repeated incidents, and create documentation that helps new team members understand the system's history.
Organizations that skip post-mortems fix the immediate problem, then repeat the same incident six months later.
The culture behind this is described as "blameless post-mortems" — pioneered at Google and now widely adopted in high-reliability engineering teams. The focus is on systems and processes, not individual mistakes.
Write a post-incident report for any incident with significant user impact, a novel failure mode, or a multi-engineer response. For minor incidents that were quickly resolved with no significant user impact, a brief log entry may be sufficient.
A brief, factual summary anyone can understand:
Title: Database connection pool exhaustion causing 503 errors
Date: 2026-03-17
Duration: 47 minutes (14:23 – 15:10 UTC)
Severity: P1 — Complete outage
Services affected: All application endpoints
Author: [Engineer name]
A chronological record of the incident from first failure to resolution. The monitoring system's timestamps are invaluable here — they give you the exact moment the failure started, which is often different from when it was detected.
14:23 — First external monitor check fails (detected by Domain Monitor)
14:24 — Second consecutive check fails; SMS alert sent to on-call engineer
14:26 — On-call engineer acknowledges alert, begins investigation
14:31 — Status page updated: "Investigating reports of service unavailability"
14:38 — Root cause identified: database connection pool at 100% capacity
14:45 — Temporary fix deployed: connection pool limit increased
14:52 — Services begin recovering; external monitors showing partial recovery
15:10 — Full recovery confirmed by external monitors
15:12 — Status page updated: "Issue resolved, service operating normally"
Tip: Your uptime monitoring tool provides the exact start time and recovery time. This is far more accurate than relying on when someone noticed the issue — especially if detection was delayed.
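Computing the duration figure for the summary from the monitor's timestamps is trivial and avoids arithmetic slips. A minimal sketch (the ISO 8601 timestamp strings are illustrative, matching the timeline above):

```python
from datetime import datetime, timezone

def incident_duration_minutes(first_failure: str, full_recovery: str) -> int:
    """Duration in whole minutes between two ISO 8601 timestamps (assumed UTC)."""
    start = datetime.fromisoformat(first_failure).replace(tzinfo=timezone.utc)
    end = datetime.fromisoformat(full_recovery).replace(tzinfo=timezone.utc)
    return int((end - start).total_seconds() // 60)

# First failed check vs. confirmed recovery, from the timeline above
print(incident_duration_minutes("2026-03-17T14:23:00", "2026-03-17T15:10:00"))  # 47
```

Using the first *failed check* rather than the alert or acknowledgement time is what makes the "Duration: 47 minutes" figure honest.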
Describe what actually caused the incident. Use the 5 Whys technique to get to root cause rather than proximate cause:
Problem: Users receiving 503 errors
Why? Application cannot connect to database
Why? Database connection pool is exhausted (100/100 connections in use)
Why? Connections aren't being released properly
Why? A code change in the 14:15 deployment introduced a connection leak in the user authentication handler
Why? The code change was not reviewed against the connection lifecycle requirements; no integration test covered this path
Root cause: Missing integration test and code review gap allowed a connection leak into production
The root cause here is not "database connections exhausted" (that's the symptom) — it's the process gap that allowed the connection leak to reach production.
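The class of bug described in the 5 Whys chain can be sketched in miniature. This is a hypothetical pool API for illustration only, not the incident's actual code; the point is the shape of the leak (an early return that skips the release) and the shape of the fix (a context manager that releases on every exit path):

```python
class Pool:
    """Toy connection pool: tracks how many connections are checked out."""

    def __init__(self, size):
        self.size, self.in_use = size, 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")  # this is where the 503s begin
        self.in_use += 1
        return self

    def release(self):
        self.in_use -= 1

    # Context-manager protocol guarantees release even on early return/exception
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.release()


def leaky_auth_handler(pool, user_ok):
    conn = pool.acquire()
    if not user_ok:      # early return skips release(): the leak
        return False
    pool.release()
    return True


def fixed_auth_handler(pool, user_ok):
    with pool.acquire():  # released on every exit path
        return user_ok


pool = Pool(size=100)
for _ in range(100):
    leaky_auth_handler(pool, user_ok=False)  # each failed login leaks one connection
print(pool.in_use)  # 100: pool exhausted, the next acquire() raises
```

An integration test that drives the handler through its failure path and then asserts the pool is back to zero connections in use is exactly the test the action items below call for.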
Even in bad incidents, something goes right, and documenting it reinforces good practices. In this incident: the SMS alert fired within one minute of the first failed check, and the status page was updated within eight minutes.
Also record an honest assessment of gaps. In this incident: there was no alert on connection pool utilization, so detection depended entirely on external monitors, and neither code review nor tests caught the leak.
The most important section. Specific, assignable, time-bound improvements:
| Action | Owner | Due Date | Priority |
|---|---|---|---|
| Add database connection pool alert (threshold: >80%) | @engineer | 2026-03-20 | P1 |
| Add integration test for connection lifecycle in auth handler | @engineer | 2026-03-24 | P1 |
| Update deployment runbook with connection pool check | @tech-lead | 2026-03-31 | P2 |
| Review all authentication handlers for connection leaks | @team | 2026-04-07 | P2 |
Action items without owners and due dates don't get done. Be specific.
Finally, capture broader insights for the team.
Post-incident reports frequently reveal monitoring gaps. After writing your report, ask whether an alert on the underlying metric (here, connection pool utilization) would have caught the problem before users saw errors, and whether your external monitors covered every affected endpoint.
The "What is incident management" guide provides the broader framework within which post-incident reports sit.
Your uptime monitoring data provides the factual timeline foundation for every post-incident report. Get accurate incident timestamps at Domain Monitor.