
Post-Incident Monitoring Review Checklist

After an incident is resolved, most teams focus their post-mortem on the infrastructure failure — what broke and why. Fewer teams ask an equally important question: why didn't we know sooner?

Your monitoring setup is a system that can fail just like your application can fail. Alert contacts that no longer work, check frequencies too slow to detect short failures, monitors on the wrong endpoints, alerts that fire to a Slack channel nobody watches at 3am — these are monitoring failures that make infrastructure failures worse.

Use this checklist after any significant incident to find and close monitoring gaps.


Detection Timeline Review

  • When did the incident actually start? — determine the actual failure time from server logs, error logs, or external timeline data
  • When did your first monitoring alert fire? — record this timestamp
  • Calculate your MTTD (mean time to detect) for this incident — the gap between incident start and first alert. See what is mean time to detect
  • Was the MTTD acceptable? — for critical services, anything over five minutes warrants review
  • If the incident was discovered by users before monitoring alerted, investigate why — this is the most important finding
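The MTTD calculation above can be sketched in a few lines. This is a minimal illustration — the timestamps are made up; pull real values from your server logs and alert history.

```python
from datetime import datetime, timedelta

# Illustrative timestamps; use real values from logs and alert history.
incident_start = datetime.fromisoformat("2024-03-01T02:14:00")
first_alert = datetime.fromisoformat("2024-03-01T02:27:00")

mttd = first_alert - incident_start
print(f"MTTD: {mttd}")  # 0:13:00

# For critical services, flag anything over the 5-minute target for review.
if mttd > timedelta(minutes=5):
    print("MTTD exceeds 5-minute target -- review check frequency and coverage")
```

Track this number across incidents: a single slow detection is a data point, a trend is a monitoring problem.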

Alert Configuration Review

  • Did the correct monitors fire? — were the right endpoints being monitored?
  • Did any monitors fail to fire that should have? — gaps in endpoint coverage, wrong URLs, disabled monitors
  • Did any monitors fire too late? — check frequency may need to increase. See how to choose monitoring check frequency
  • Were there false-negative checks? — monitors that showed healthy while the service was degraded (status code 200 but wrong content, or a health check that didn't reflect real application state)
  • Review content check configuration — are checks verifying meaningful content, or just HTTP status codes?
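The false-negative case above is worth spelling out. A sketch of a content check that requires both a 200 status and a marker string in the body, so a cached error page or an empty response doesn't read as healthy — the URL and marker string here are illustrative, not part of any real configuration:

```python
import urllib.request

def is_healthy(status: int, body: str, expected_text: str) -> bool:
    """A 200 alone is not enough; the page must contain real content."""
    return status == 200 and expected_text in body

def check_endpoint(url: str, expected_text: str, timeout: int = 10) -> bool:
    """Fetch the page and apply the content check; network errors count as failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return is_healthy(resp.status, body, expected_text)
    except OSError:
        return False
```

A check like `check_endpoint("https://example.com/checkout", "Proceed to payment")` fails on a 200 that serves the wrong page, which a plain status-code check would miss.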

Alert Delivery Review

  • Did the right people receive alerts? — verify the alert contacts for the monitors involved
  • Were any alert contacts stale? — old phone numbers, former team members, unmaintained email addresses
  • What channel were alerts sent to? — if alerts went to email only, consider adding SMS for critical endpoints. See SMS alerts
  • Was anyone in a position to act on the alerts? — if the alert fired at 2am and went to a Slack channel, was anyone awake and watching it?
  • Was there an escalation path? — if the primary contact didn't respond, did the alert escalate to a backup?
  • How long between the alert firing and someone acknowledging it? — calculate MTTA (mean time to acknowledge)
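The MTTA calculation and the escalation-path question can both be sketched together. Timestamps, the ten-minute threshold, and the contact names are all illustrative assumptions:

```python
from datetime import datetime, timedelta

# Illustrative timestamps; pull real values from your alert delivery logs.
alert_fired = datetime.fromisoformat("2024-03-01T02:27:00")
acknowledged = datetime.fromisoformat("2024-03-01T02:41:00")
mtta = acknowledged - alert_fired  # 14 minutes for this incident

# A simple escalation rule: page the backup if the primary hasn't
# acknowledged within the threshold.
ESCALATE_AFTER = timedelta(minutes=10)

def who_to_page(fired: datetime, now: datetime, acked: bool) -> str:
    """Return which contact should be paged at this point in the incident."""
    if acked:
        return "nobody"
    return "backup on-call" if now - fired >= ESCALATE_AFTER else "primary on-call"
```

If `who_to_page` never returns "backup on-call" in your real setup because no backup exists, that itself is a finding for this review.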

Monitoring Coverage Gaps

  • Were all affected endpoints being monitored? — if the failure involved a service, API, or page that wasn't monitored, add it now
  • Were the right check types being used? — a failing database that returns 200 from a cached layer won't be caught by a simple HTTP check; a deep health check would have caught it
  • Were third-party dependencies monitored? — if a third-party service failure caused or contributed to the incident, add monitoring for that dependency. See how to monitor third-party API dependencies
  • Were response time alerts configured? — some incidents start as slowdowns before becoming outages; response time alerts provide earlier warning
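A response-time alert that catches slowdowns without flapping on a single slow sample can be sketched like this. The 2-second threshold and three-breach rule are assumptions to illustrate the idea, not recommended values:

```python
def is_degraded(response_times_ms, threshold_ms=2000, min_breaches=3):
    """Alert only when several consecutive checks exceed the threshold,
    so one slow sample doesn't trigger a false alarm."""
    streak = 0
    for t in response_times_ms:
        streak = streak + 1 if t > threshold_ms else 0
        if streak >= min_breaches:
            return True
    return False
```

Requiring consecutive breaches trades a few minutes of detection time for far fewer noisy pages — tune both knobs per endpoint.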

Monitoring During the Incident

  • Was monitoring data useful during the incident? — did your team refer to the monitoring dashboard for impact assessment?
  • Were incident start and end times accurately recorded by your monitoring tool? — these timestamps matter for post-mortems and SLA reporting
  • Was a maintenance window set while the incident was being resolved? — maintenance windows prevent alert storms during active incident response. See maintenance windows
  • Was the status page updated during the incident? — see how to communicate website downtime for communication best practices
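The maintenance-window behavior amounts to a simple suppression rule: alerts timestamped inside a declared window are held back so responders aren't flooded. A minimal sketch with illustrative window times:

```python
from datetime import datetime

def should_suppress(alert_time: datetime,
                    window_start: datetime,
                    window_end: datetime) -> bool:
    """Suppress any alert whose timestamp falls inside the maintenance window."""
    return window_start <= alert_time <= window_end
```

Real monitoring tools layer recurrence and per-monitor scoping on top of this, but the core decision is this comparison.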

Action Items From This Review

For each gap identified, create a specific action item:

| Gap Found | Action | Owner | By When |
| --- | --- | --- | --- |
| Example: API endpoint not monitored | Add monitor for /api/health | DevOps | Within 24 hours |
| Example: Alert contact stale | Update alert contacts | Team lead | This week |
| Example: 5-min interval too slow | Move to 1-min for checkout | DevOps | Within 24 hours |

Updating Your Runbook

  • Document what you learned about the incident in your runbook or incident response documentation
  • Update your incident response procedure if the response process had gaps
  • Add a checklist item to your regular monitoring audit for any recurring issue pattern found. See monthly uptime audit checklist

Post-Mortem Integration

This monitoring review should feed into your broader post-incident report. See how to write a post-incident report for a complete post-mortem structure.

Domain Monitor provides accurate incident timestamps, alert delivery logs, and uptime history — the data you need for a thorough post-incident review. Create a free account.
