
When your website goes down, chaos is optional. With a documented incident response plan, your team knows exactly who does what, in what order, to get back online as fast as possible.
This guide provides a ready-to-use incident response plan template that you can adapt for your organisation.
Without a plan, the typical response to website downtime involves:
With a plan, your team moves from detection to resolution in a coordinated, efficient process that you can repeat and improve every time it's needed.
A plan doesn't need to be complex. A one-page document that everyone knows exists and knows where to find is far more valuable than an elaborate playbook nobody reads.
How you find out: Your uptime monitoring should be the primary detection mechanism, not customer complaints, a colleague happening to notice, or chance.
Document:
Target detection time: < 2 minutes from failure start
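As a rough sanity check on that target, worst-case detection time can be estimated from the monitor's settings. The sketch below assumes a simple polling monitor with a fixed check interval, a request timeout, and a number of consecutive failed checks required before an alert fires. The function and parameter names are illustrative and not taken from any particular monitoring tool.

```python
TARGET_SECONDS = 120  # the "< 2 minutes" target from this plan


def worst_case_detection_seconds(check_interval_s: float,
                                 timeout_s: float,
                                 failures_to_alert: int = 1) -> float:
    """Estimate the longest gap between failure start and the alert firing.

    Assumed model (hypothetical): checks start every `check_interval_s`,
    a failing check takes up to `timeout_s` to be declared failed, and the
    alert fires after `failures_to_alert` consecutive failed checks.
    """
    if failures_to_alert < 1:
        raise ValueError("failures_to_alert must be at least 1")
    # Worst case: the failure begins right after a successful check, so the
    # first failing check only starts one full interval later.
    return check_interval_s * failures_to_alert + timeout_s


def meets_target(check_interval_s: float, timeout_s: float,
                 failures_to_alert: int = 1) -> bool:
    """True when the worst-case estimate is under the 2-minute target."""
    worst = worst_case_detection_seconds(
        check_interval_s, timeout_s, failures_to_alert)
    return worst < TARGET_SECONDS
```

Under this model, a 60-second interval with a 10-second timeout stays inside the target when a single failed check triggers the alert, but not when two consecutive failures are required.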
Who responds: The on-call engineer (or primary responder if no rotation exists).
First 5 minutes:
#incidents channel)
Escalation trigger: If the primary responder can't make progress within 15 minutes, escalate to the secondary responder.
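The escalation trigger is simple enough to encode as a rule, which can help if you want a bot or runbook script to prompt the hand-off. This is a minimal sketch: `should_escalate` and its arguments are hypothetical names, and "making progress" is assumed to be a judgement the responder reports themselves.

```python
from datetime import datetime, timedelta, timezone

ESCALATION_AFTER = timedelta(minutes=15)  # threshold stated in this plan


def should_escalate(acknowledged_at: datetime, now: datetime,
                    making_progress: bool) -> bool:
    """Return True when the incident should go to the secondary responder.

    Escalates only if the primary responder reports no progress AND at
    least 15 minutes have passed since they picked up the incident.
    """
    elapsed = now - acknowledged_at
    return (not making_progress) and elapsed >= ESCALATION_AFTER
```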
Internal communication template:
🔴 INCIDENT STARTED — [Service Name]
Time detected: [HH:MM UTC]
Symptoms: [what users are experiencing]
Assigned to: [@engineer]
Status: Investigating
Update this every 15-30 minutes until resolved.
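If incidents are announced from a script or chat bot, the internal template above can be filled in programmatically so that nobody has to type it from memory at 3am. The helper below is a sketch; the function name and arguments are assumptions, and posting the message to your chat tool is left out.

```python
from datetime import datetime, timezone


def incident_started_message(service: str, detected_at: datetime,
                             symptoms: str, engineer: str,
                             status: str = "Investigating") -> str:
    """Fill in the internal incident template with concrete values.

    `detected_at` should be timezone-aware; it is converted to UTC so that
    everyone reads the same clock regardless of location.
    """
    detected_utc = detected_at.astimezone(timezone.utc)
    return "\n".join([
        # \u2014 is the long dash used in the template's first line.
        f"🔴 INCIDENT STARTED \u2014 {service}",
        f"Time detected: {detected_utc:%H:%M} UTC",
        f"Symptoms: {symptoms}",
        f"Assigned to: @{engineer}",
        f"Status: {status}",
    ])
```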
External communication:
Update your status page immediately with a brief acknowledgement: "We are aware of an issue affecting [service]. We are actively investigating and will provide updates every 30 minutes."
Update the status page every 30 minutes — even if there's no new information. Silence during an incident is more damaging than frequent updates saying "still investigating."
A structured investigation checklist:
Time correlation is your most important tool. The monitoring tool's timestamp tells you when the failure started — correlate this with deployment logs, infrastructure changes, and external service status.
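One way to make time correlation concrete is to gather recent changes into a single list and filter it by the failure timestamp. The sketch below assumes you can export deployments and infrastructure changes as timestamped records; `ChangeEvent`, its fields, and the one-hour lookback are hypothetical choices, not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ChangeEvent:
    at: datetime       # when the change happened (timezone-aware)
    kind: str          # e.g. "deploy", "infra", "dns" (illustrative labels)
    description: str


def changes_before_failure(events: list[ChangeEvent],
                           failure_started_at: datetime,
                           lookback: timedelta = timedelta(hours=1)
                           ) -> list[ChangeEvent]:
    """Return changes made in the lookback window before the failure.

    Results are sorted most-recent-first, since the change closest to the
    failure start is usually the first one worth checking.
    """
    window_start = failure_started_at - lookback
    candidates = [e for e in events
                  if window_start <= e.at <= failure_started_at]
    return sorted(candidates, key=lambda e: e.at, reverse=True)
```

Changes that happened after the failure started are excluded on purpose: they cannot have caused it, though they may be worth noting in the timeline.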
Resolution options in order of preference:
After applying the fix:
When monitoring confirms full recovery:
Classify incidents to prioritise response:
| Severity | Definition | Response Time |
|---|---|---|
| P1 — Critical | Complete production outage; all users affected | Immediate (< 5 min) |
| P2 — High | Major feature broken; significant user impact | < 15 minutes |
| P3 — Medium | Partial degradation; some users/features affected | < 1 hour |
| P4 — Low | Minor issue; minimal user impact | Next business day |
P1 and P2 warrant immediate SMS escalation and status page updates. P3 can be handled with a Slack notification and less urgent communication. P4 is tracked and handled through your normal work queue.
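The severity table and its notification rules can be captured as data so that alert routing stays consistent with the written plan. This is a sketch under assumptions: the channel labels (`"sms"`, `"status_page"`, `"slack"`, `"tracker"`) are placeholder names, and P4 has no fixed time target here because "next business day" depends on your working calendar.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    P1 = "Critical"
    P2 = "High"
    P3 = "Medium"
    P4 = "Low"


# Response-time targets from the severity table.
RESPONSE_TARGET: dict[Severity, Optional[timedelta]] = {
    Severity.P1: timedelta(minutes=5),
    Severity.P2: timedelta(minutes=15),
    Severity.P3: timedelta(hours=1),
    Severity.P4: None,  # next business day; no fixed duration
}


def notification_channels(severity: Severity) -> list[str]:
    """Map a severity to the channels that should be notified."""
    if severity in (Severity.P1, Severity.P2):
        return ["sms", "status_page"]
    if severity is Severity.P3:
        return ["slack"]
    return ["tracker"]  # P4: tracked, no urgent notification
```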
For teams larger than 2-3 people, define clear roles:
| Role | Responsibility |
|---|---|
| Incident Commander | Overall coordination, final decisions, keeps timeline |
| Technical Lead | Investigation and fix implementation |
| Communications Lead | Status page updates, customer communication |
| Scribe | Documents the timeline in real time |
For small teams, one person covers multiple roles. What matters is that someone owns each function — especially communications, which often gets neglected during the technical scramble to fix the issue.
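A quick way to confirm that every function has an owner is to check the role assignments at the start of an incident. The sketch below is illustrative: `unowned_roles` is a hypothetical helper, and the same person may appear against several roles.

```python
ROLES = (
    "Incident Commander",
    "Technical Lead",
    "Communications Lead",
    "Scribe",
)


def unowned_roles(assignments: dict[str, str]) -> list[str]:
    """Return the roles that nobody has been assigned to.

    `assignments` maps a role name to a person. A missing key or an empty
    name both count as unowned.
    """
    return [role for role in ROLES if not assignments.get(role)]
```

For example, a two-person team that assigns the Incident Commander, Technical Lead, and Scribe roles but forgets communications would get `["Communications Lead"]` back, which is exactly the gap this plan warns about.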
If your service requires 24/7 coverage, document the on-call rotation:
Alerting to the right people is covered in detail in the downtime alerts guide.
After every P1 and P2 incident, schedule a post-mortem within 48-72 hours. See how to write a post-incident report for the full process and template.
The post-mortem closes the improvement loop: each incident makes the next one easier to handle.
The best incident response plan is the one your team can find at 3am when the site is down:
#engineering Slack channel
The foundation of any incident response plan is fast detection: set up monitoring at Domain Monitor.