
How to Write a Post-Incident Report After Website Downtime

Every significant website outage deserves a post-incident report (also called a post-mortem or PIR). Without one, you lose the opportunity to learn from the incident, prevent recurrence, and build team knowledge.

A good post-incident report isn't about assigning blame — it's a structured process for understanding what happened and making systemic improvements. This guide provides a proven template and process.

Why Post-Incident Reports Matter

Organizations that consistently write post-mortem reports build progressively more reliable systems. They accumulate knowledge about failure modes, catch systemic weaknesses before they cause repeated incidents, and create documentation that helps new team members understand the system's history.

Organizations that skip post-mortems fix only the immediate problem and repeat the same incident six months later.

The culture behind this is described as "blameless post-mortems" — pioneered at Google and now widely adopted in high-reliability engineering teams. The focus is on systems and processes, not individual mistakes.

When to Write a Post-Incident Report

Write a post-incident report for:

  • Any complete production outage lasting more than 5 minutes
  • Significant performance degradation affecting users
  • Security incidents (even if no downtime occurred)
  • Near-misses — incidents that almost happened but were caught in time
  • Failed deployments that required rollback

For minor incidents that were quickly resolved with no significant user impact, a brief log entry may be sufficient.

The Post-Incident Report Template

1. Incident Summary

A brief, factual summary anyone can understand:

Title: Database connection pool exhaustion causing 503 errors
Date: 2026-03-17
Duration: 47 minutes (14:23 – 15:10 UTC)
Severity: P1 — Complete outage
Services affected: All application endpoints
Author: [Engineer name]

2. Timeline

A chronological record of the incident from first failure to resolution. The monitoring system's timestamps are invaluable here — they give you the exact moment the failure started, which is often different from when it was detected.

14:23 — First external monitor check fails (detected by Domain Monitor)
14:24 — Second consecutive check fails; SMS alert sent to on-call engineer
14:26 — On-call engineer acknowledges alert, begins investigation
14:31 — Status page updated: "Investigating reports of service unavailability"
14:38 — Root cause identified: database connection pool at 100% capacity
14:45 — Temporary fix deployed: connection pool limit increased
14:52 — Services begin recovering; external monitors showing partial recovery
15:10 — Full recovery confirmed by external monitors
15:12 — Status page updated: "Issue resolved, service operating normally"

Tip: Your uptime monitoring tool provides the exact start time and recovery time. This is far more accurate than relying on when someone noticed the issue — especially if detection was delayed.
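Given the first-failure and recovery timestamps your monitor records, the duration field in the incident summary can be computed directly rather than estimated. A minimal sketch, using the timestamps from the timeline above:

```python
from datetime import datetime, timezone

def outage_duration(first_failure: str, recovery: str) -> int:
    """Return the outage duration in whole minutes from two UTC timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    start = datetime.strptime(first_failure, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(recovery, fmt).replace(tzinfo=timezone.utc)
    return int((end - start).total_seconds() // 60)

# First failed check and confirmed recovery, as recorded by the monitor
print(outage_duration("2026-03-17 14:23", "2026-03-17 15:10"))  # 47
```

Using the monitor's timestamps keeps the "Duration: 47 minutes" line in the summary consistent with the timeline, with no manual arithmetic.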

3. Root Cause Analysis

Describe what actually caused the incident. Use the 5 Whys technique to get to root cause rather than proximate cause:

Problem: Users receiving 503 errors

Why? Application cannot connect to database
Why? Database connection pool is exhausted (100/100 connections in use)
Why? Connections aren't being released properly
Why? A code change in the 14:15 deployment introduced a connection leak in the user authentication handler
Why? The code change was not reviewed against the connection lifecycle requirements; no integration test covered this path

Root cause: Missing integration test and code review gap allowed a connection leak into production

The root cause here is not "database connections exhausted" (that's the symptom) — it's the process gap that allowed the connection leak to reach production.
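The failure mode the 5 Whys uncovered can be made concrete with a small sketch. The pool class and handlers below are illustrative stand-ins, not the actual application code or any specific driver API:

```python
# Minimal sketch of the bug class described above: a handler that acquires
# a pooled connection but skips the release on an early-return path.

class TinyPool:
    """Toy connection pool; stands in for a real database driver's pool."""
    def __init__(self, size: int):
        self.free = size

    def acquire(self):
        if self.free == 0:
            raise RuntimeError("connection pool exhausted")  # the 503 scenario
        self.free -= 1
        return object()  # stand-in for a real connection

    def release(self, conn):
        self.free += 1

def auth_leaky(pool, user_found: bool):
    conn = pool.acquire()
    if not user_found:
        return None            # BUG: conn is never released on this path
    pool.release(conn)
    return "ok"

def auth_fixed(pool, user_found: bool):
    conn = pool.acquire()
    try:
        return "ok" if user_found else None
    finally:
        pool.release(conn)     # released on every path, including exceptions

pool = TinyPool(size=2)
auth_leaky(pool, user_found=False)   # leaks one connection
auth_leaky(pool, user_found=False)   # leaks the second; the pool is now empty
```

Every failed login now permanently consumes a connection, so under normal traffic the pool drains until every request fails, which matches the timeline above. The try/finally pattern (or a context manager) closes exactly the gap the missing integration test should have caught.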

4. What Went Well

Even in bad incidents, something goes right. Documenting this reinforces good practices:

  • External monitoring detected the failure within 1 minute
  • On-call engineer was paged successfully and responded within 2 minutes
  • Status page was updated proactively, reducing inbound customer support tickets
  • Deployment rollback procedure worked as expected

5. What Went Wrong

An honest assessment of gaps:

  • Detection delay: 3 minutes elapsed between first failure and acknowledgment
  • Root cause identification took 12 minutes — longer than expected
  • Database connection metrics were available but not being monitored with an alert
  • The deployment pipeline didn't catch the connection leak in staging

6. Action Items

The most important section. Specific, assignable, time-bound improvements:

| Action | Owner | Due Date | Priority |
| --- | --- | --- | --- |
| Add database connection pool alert (threshold: >80%) | @engineer | 2026-03-20 | P1 |
| Add integration test for connection lifecycle in auth handler | @engineer | 2026-03-24 | P1 |
| Update deployment runbook with connection pool check | @tech-lead | 2026-03-31 | P2 |
| Review all authentication handlers for connection leaks | @team | 2026-04-07 | P2 |

Action items without owners and due dates don't get done. Be specific.
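As an illustration of the first action item, a utilization check along these lines could back the >80% alert. `send_alert` and the metric source are placeholders for whatever pager or chat integration your stack uses:

```python
# Sketch of the P1 action item: alert when pool utilization crosses 80%.

POOL_LIMIT = 100
ALERT_THRESHOLD = 0.80

def send_alert(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder: route to pager/Slack in practice

def check_pool_utilization(in_use: int, limit: int = POOL_LIMIT) -> bool:
    """Fire an alert and return True when utilization exceeds the threshold."""
    utilization = in_use / limit
    if utilization > ALERT_THRESHOLD:
        send_alert(f"DB connection pool at {utilization:.0%} ({in_use}/{limit})")
        return True
    return False

check_pool_utilization(85)  # fires: 85% > 80%
check_pool_utilization(60)  # quiet
```

With this alert in place, the incident above would have paged at ~80 connections in use, well before the pool reached 100/100 and requests started failing.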

7. Lessons Learned

Broader insights for the team:

  • Our database monitoring coverage had a blind spot around connection pool utilization
  • Staging environment wasn't running the same connection pool limits as production — this masked the issue
  • The blameless review process helped surface the process gap (missing integration test) rather than focusing on the individual who wrote the code

Process for Running a Post-Mortem

  1. Schedule within 48-72 hours — while memory is fresh and emotions have cooled
  2. Prepare the timeline — pull monitoring data, deployment logs, Slack history
  3. Run a 60-minute blameless meeting — focus on systems, not people
  4. Assign action items with owners and dates
  5. Publish the report — share with the broader team or make available in your knowledge base
  6. Follow up — track completion of action items in your next sprint

Connecting Post-Mortems to Monitoring Improvements

Post-incident reports frequently reveal monitoring gaps. After writing your report, ask:

  • Why didn't we detect this faster? → Improve detection: add monitors, reduce check interval, add content verification
  • Why didn't alerts reach the right people? → Improve alerting: review routing, add backup contacts
  • What metric would have predicted this failure? → Add proactive metrics alert
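As a sketch of what "add monitors with content verification" can mean in practice, here is a minimal external probe using only the Python standard library. The URL and marker text are placeholders; a real monitoring service would run checks like this from multiple regions on a schedule:

```python
import urllib.error
import urllib.request

def check_site(url: str, expected_text: str, timeout: float = 10.0) -> bool:
    """Return True only if the page returns HTTP 200 AND contains expected_text.

    Checking for a known marker string catches failures a bare status-code
    check misses, such as an error page served with a 200 response.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and expected_text in body
    except (urllib.error.URLError, TimeoutError):
        return False

# Example (placeholder URL and marker):
# check_site("https://example.com", "Example Domain")
```

The content check is the part most teams skip: it turns "the server answered" into "the server answered with the page users actually need".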

Our what is incident management guide provides the broader framework within which post-incident reports sit.


Your uptime monitoring data provides the factual timeline foundation for every post-incident report. Get accurate incident timestamps at Domain Monitor.

