
Post-Incident Monitoring Review Checklist

After an incident is resolved, most teams focus their post-mortem on the infrastructure failure — what broke and why. Fewer teams ask an equally important question: why didn't we know sooner?

Your monitoring setup is a system that can fail just like your application can fail. Alert contacts that no longer work, check frequencies too slow to detect short failures, monitors on the wrong endpoints, alerts that fire to a Slack channel nobody watches at 3am — these are monitoring failures that make infrastructure failures worse.

Use this checklist after any significant incident to find and close monitoring gaps.


Detection Timeline Review

  • When did the incident actually start? — determine the actual failure time from server logs, error logs, or external timeline data
  • When did your first monitoring alert fire? — record this timestamp
  • Calculate your MTTD (mean time to detect) for this incident — the gap between incident start and first alert. See what is mean time to detect
  • Was the MTTD acceptable? — for critical services, anything over five minutes warrants review
  • If the incident was discovered by users before monitoring alerted, investigate why — this is the most important finding
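The MTTD calculation above can be sketched in a few lines. This is a minimal illustration — the timestamps are made up; pull real values from your server logs and alert history.

```python
from datetime import datetime, timedelta

# Illustrative timestamps; use real values from logs and alert history.
incident_start = datetime.fromisoformat("2024-03-01T02:14:00")
first_alert = datetime.fromisoformat("2024-03-01T02:27:00")

mttd = first_alert - incident_start
print(f"MTTD: {mttd}")  # 0:13:00

# For critical services, flag anything over the 5-minute target for review.
if mttd > timedelta(minutes=5):
    print("MTTD exceeds 5-minute target -- review check frequency and coverage")
```

Track this number across incidents: a single slow detection is a data point, a trend is a monitoring problem.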

Alert Configuration Review

  • Did the correct monitors fire? — were the right endpoints being monitored?
  • Did any monitors fail to fire that should have? — gaps in endpoint coverage, wrong URLs, disabled monitors
  • Did any monitors fire too late? — check frequency may need to increase. See how to choose monitoring check frequency
  • Were there false-negative checks? — monitors that showed healthy while the service was degraded (status code 200 but wrong content, or a health check that didn't reflect real application state)
  • Review content check configuration — are checks verifying meaningful content, or just HTTP status codes?
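The false-negative case above is worth spelling out. A sketch of a content check that requires both a 200 status and a marker string in the body, so a cached error page or an empty response doesn't read as healthy — the URL and marker string here are illustrative, not part of any real configuration:

```python
import urllib.request

def is_healthy(status: int, body: str, expected_text: str) -> bool:
    """A 200 alone is not enough; the page must contain real content."""
    return status == 200 and expected_text in body

def check_endpoint(url: str, expected_text: str, timeout: int = 10) -> bool:
    """Fetch the page and apply the content check; network errors count as failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return is_healthy(resp.status, body, expected_text)
    except OSError:
        return False
```

A check like `check_endpoint("https://example.com/checkout", "Proceed to payment")` fails on a 200 that serves the wrong page, which a plain status-code check would miss.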

Alert Delivery Review

  • Did the right people receive alerts? — verify the alert contacts for the monitors involved
  • Were any alert contacts stale? — old phone numbers, former team members, unmaintained email addresses
  • What channel were alerts sent to? — if alerts went to email only, consider adding SMS for critical endpoints. See SMS alerts
  • Was anyone in a position to act on the alerts? — if the alert fired at 2am and went to a Slack channel, was anyone awake and watching it?
  • Was there an escalation path? — if the primary contact didn't respond, did the alert escalate to a backup?
  • How long between the alert firing and someone acknowledging it? — calculate MTTA (mean time to acknowledge)
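The MTTA calculation and the escalation-path question can both be sketched together. Timestamps, the ten-minute threshold, and the contact names are all illustrative assumptions:

```python
from datetime import datetime, timedelta

# Illustrative timestamps; pull real values from your alert delivery logs.
alert_fired = datetime.fromisoformat("2024-03-01T02:27:00")
acknowledged = datetime.fromisoformat("2024-03-01T02:41:00")
mtta = acknowledged - alert_fired  # 14 minutes for this incident

# A simple escalation rule: page the backup if the primary hasn't
# acknowledged within the threshold.
ESCALATE_AFTER = timedelta(minutes=10)

def who_to_page(fired: datetime, now: datetime, acked: bool) -> str:
    """Return which contact should be paged at this point in the incident."""
    if acked:
        return "nobody"
    return "backup on-call" if now - fired >= ESCALATE_AFTER else "primary on-call"
```

If `who_to_page` never returns "backup on-call" in your real setup because no backup exists, that itself is a finding for this review.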

Monitoring Coverage Gaps

  • Were all affected endpoints being monitored? — if the failure involved a service, API, or page that wasn't monitored, add it now
  • Were the right check types being used? — a failing database that returns 200 from a cached layer won't be caught by a simple HTTP check; a deep health check would have caught it
  • Were third-party dependencies monitored? — if a third-party service failure caused or contributed to the incident, add monitoring for that dependency. See how to monitor third-party API dependencies
  • Were response time alerts configured? — some incidents start as slowdowns before becoming outages; response time alerts provide earlier warning
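A response-time alert that catches slowdowns without flapping on a single slow sample can be sketched like this. The 2-second threshold and three-breach rule are assumptions to illustrate the idea, not recommended values:

```python
def is_degraded(response_times_ms, threshold_ms=2000, min_breaches=3):
    """Alert only when several consecutive checks exceed the threshold,
    so one slow sample doesn't trigger a false alarm."""
    streak = 0
    for t in response_times_ms:
        streak = streak + 1 if t > threshold_ms else 0
        if streak >= min_breaches:
            return True
    return False
```

Requiring consecutive breaches trades a few minutes of detection time for far fewer noisy pages — tune both knobs per endpoint.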

Monitoring During the Incident

  • Was monitoring data useful during the incident? — did your team refer to the monitoring dashboard for impact assessment?
  • Were incident start and end times accurately recorded by your monitoring tool? — these timestamps matter for post-mortems and SLA reporting
  • Was a maintenance window set while the incident was being resolved? — maintenance windows prevent alert storms during active incident response. See maintenance windows
  • Was the status page updated during the incident? — see how to communicate website downtime for communication best practices
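The maintenance-window behavior amounts to a simple suppression rule: alerts timestamped inside a declared window are held back so responders aren't flooded. A minimal sketch with illustrative window times:

```python
from datetime import datetime

def should_suppress(alert_time: datetime,
                    window_start: datetime,
                    window_end: datetime) -> bool:
    """Suppress any alert whose timestamp falls inside the maintenance window."""
    return window_start <= alert_time <= window_end
```

Real monitoring tools layer recurrence and per-monitor scoping on top of this, but the core decision is this comparison.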

Action Items From This Review

For each gap identified, create a specific action item:

| Gap Found | Action | Owner | By When |
| --- | --- | --- | --- |
| Example: API endpoint not monitored | Add monitor for /api/health | DevOps | Within 24 hours |
| Example: Alert contact stale | Update alert contacts | Team lead | This week |
| Example: 5-min interval too slow | Move to 1-min for checkout | DevOps | Within 24 hours |

Updating Your Runbook

  • Document what you learned about the incident in your runbook or incident response documentation
  • Update your incident response procedure if the response process had gaps
  • Add a checklist item to your regular monitoring audit for any recurring issue pattern found. See monthly uptime audit checklist

Post-Mortem Integration

This monitoring review should feed into your broader post-incident report. See how to write a post-incident report for a complete post-mortem structure.

Domain Monitor provides accurate incident timestamps, alert delivery logs, and uptime history — the data you need for a thorough post-incident review. Create a free account.
