
Every significant website outage deserves a post-incident report (also called a post-mortem or PIR). Without one, you lose the opportunity to learn from the incident, prevent recurrence, and build team knowledge.
A good post-incident report isn't about assigning blame — it's a structured process for understanding what happened and making systemic improvements. This guide provides a proven template and process.
Organizations that consistently write post-mortem reports build progressively more reliable systems. They accumulate knowledge about failure modes, catch systemic weaknesses before they cause repeated incidents, and create documentation that helps new team members understand the system's history.
Organizations that skip post-mortems fix the immediate problem, then repeat the same incident six months later.
The culture behind this is described as "blameless post-mortems" — pioneered at Google and now widely adopted in high-reliability engineering teams. The focus is on systems and processes, not individual mistakes.
Write a post-incident report for any incident with significant user impact, a novel failure mode, or a multi-engineer response. For minor incidents that were quickly resolved with no significant user impact, a brief log entry may be sufficient.
A brief, factual summary anyone can understand:
Title: Database connection pool exhaustion causing 503 errors
Date: 2026-03-17
Duration: 47 minutes (14:23 – 15:10 UTC)
Severity: P1 — Complete outage
Services affected: All application endpoints
Author: [Engineer name]
A chronological record of the incident from first failure to resolution. The monitoring system's timestamps are invaluable here — they give you the exact moment the failure started, which is often different from when it was detected.
14:23 — First external monitor check fails (detected by Domain Monitor)
14:24 — Second consecutive check fails; SMS alert sent to on-call engineer
14:26 — On-call engineer acknowledges alert, begins investigation
14:31 — Status page updated: "Investigating reports of service unavailability"
14:38 — Root cause identified: database connection pool at 100% capacity
14:45 — Temporary fix deployed: connection pool limit increased
14:52 — Services begin recovering; external monitors showing partial recovery
15:10 — Full recovery confirmed by external monitors
15:12 — Status page updated: "Issue resolved, service operating normally"
Tip: Your uptime monitoring tool provides the exact start time and recovery time. This is far more accurate than relying on when someone noticed the issue — especially if detection was delayed.
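Computing the duration figure for the summary from the monitor's timestamps is trivial and avoids arithmetic slips. A minimal sketch (the ISO 8601 timestamp strings are illustrative, matching the timeline above):

```python
from datetime import datetime, timezone

def incident_duration_minutes(first_failure: str, full_recovery: str) -> int:
    """Duration in whole minutes between two ISO 8601 timestamps (assumed UTC)."""
    start = datetime.fromisoformat(first_failure).replace(tzinfo=timezone.utc)
    end = datetime.fromisoformat(full_recovery).replace(tzinfo=timezone.utc)
    return int((end - start).total_seconds() // 60)

# First failed check vs. confirmed recovery, from the timeline above
print(incident_duration_minutes("2026-03-17T14:23:00", "2026-03-17T15:10:00"))  # 47
```

Using the first *failed check* rather than the alert or acknowledgement time is what makes the "Duration: 47 minutes" figure honest.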
Describe what actually caused the incident. Use the 5 Whys technique to get to root cause rather than proximate cause:
Problem: Users receiving 503 errors
Why? Application cannot connect to database
Why? Database connection pool is exhausted (100/100 connections in use)
Why? Connections aren't being released properly
Why? A code change in the 14:15 deployment introduced a connection leak in the user authentication handler
Why? The code change was not reviewed against the connection lifecycle requirements; no integration test covered this path
Root cause: Missing integration test and code review gap allowed a connection leak into production
The root cause here is not "database connections exhausted" (that's the symptom) — it's the process gap that allowed the connection leak to reach production.
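The class of bug described in the 5 Whys chain can be sketched in miniature. This is a hypothetical pool API for illustration only, not the incident's actual code; the point is the shape of the leak (an early return that skips the release) and the shape of the fix (a context manager that releases on every exit path):

```python
class Pool:
    """Toy connection pool: tracks how many connections are checked out."""

    def __init__(self, size):
        self.size, self.in_use = size, 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("pool exhausted")  # this is where the 503s begin
        self.in_use += 1
        return self

    def release(self):
        self.in_use -= 1

    # Context-manager protocol guarantees release even on early return/exception
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.release()


def leaky_auth_handler(pool, user_ok):
    conn = pool.acquire()
    if not user_ok:      # early return skips release(): the leak
        return False
    pool.release()
    return True


def fixed_auth_handler(pool, user_ok):
    with pool.acquire():  # released on every exit path
        return user_ok


pool = Pool(size=100)
for _ in range(100):
    leaky_auth_handler(pool, user_ok=False)  # each failed login leaks one connection
print(pool.in_use)  # 100: pool exhausted, the next acquire() raises
```

An integration test that drives the handler through its failure path and then asserts the pool is back to zero connections in use is exactly the test the action items below call for.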
Even in bad incidents, something goes right, and documenting it reinforces good practices. In this incident: the SMS alert fired within one minute of the first failed check, and the status page was updated within eight minutes.
Also record an honest assessment of gaps. In this incident: there was no alert on connection pool utilization, so detection depended entirely on external monitors, and neither code review nor tests caught the leak.
The most important section. Specific, assignable, time-bound improvements:
| Action | Owner | Due Date | Priority |
|---|---|---|---|
| Add database connection pool alert (threshold: >80%) | @engineer | 2026-03-20 | P1 |
| Add integration test for connection lifecycle in auth handler | @engineer | 2026-03-24 | P1 |
| Update deployment runbook with connection pool check | @tech-lead | 2026-03-31 | P2 |
| Review all authentication handlers for connection leaks | @team | 2026-04-07 | P2 |
Action items without owners and due dates don't get done. Be specific.
Finally, capture broader insights for the team.
Post-incident reports frequently reveal monitoring gaps. After writing your report, ask whether an alert on the underlying metric (here, connection pool utilization) would have caught the problem before users saw errors, and whether your external monitors covered every affected endpoint.
The "What is incident management" guide provides the broader framework within which post-incident reports sit.
Your uptime monitoring data provides the factual timeline foundation for every post-incident report. Get accurate incident timestamps at Domain Monitor.