
Incident Response Plan for Website Downtime: A Template

When your website goes down, chaos is optional. With a documented incident response plan, your team knows exactly who does what, in what order, to get back online as fast as possible.

This guide provides a ready-to-use incident response plan template that you can adapt for your organisation.

Why You Need an Incident Response Plan

Without a plan, the typical response to website downtime involves:

  • Multiple people independently discovering the issue and stepping on each other
  • Time wasted figuring out who should do what
  • Communication gaps leaving users without information
  • No clear "all clear" moment — the incident drags on

With a plan, your team moves from detection to resolution in a coordinated, efficient process that you can repeat and improve every time it's needed.

A plan doesn't need to be complex. A one-page document that everyone knows exists and knows where to find is far more valuable than an elaborate playbook nobody reads.

The Core Components

1. Detection

How you find out: Your uptime monitoring should be the primary detection mechanism — not customers, colleagues noticing, or chance.

Document:

  • What monitoring tools are in place
  • What alert channels are configured (SMS, Slack, email)
  • Who receives which alerts
  • What the on-call rotation is (if applicable)

Target detection time: < 2 minutes from failure start
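As a rough sketch of what a single detection probe does under the hood (the URL and thresholds below are placeholders, not a real endpoint — a hosted monitor runs checks like this from multiple locations on a fixed interval):

```python
from urllib import request, error
import time

def check_uptime(url: str, timeout: float = 10.0) -> dict:
    """Run a single HTTP check and return a structured result."""
    start = time.monotonic()
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return {
                "up": 200 <= resp.status < 400,   # treat 2xx/3xx as healthy
                "status": resp.status,
                "latency_s": round(time.monotonic() - start, 3),
            }
    except (error.URLError, OSError) as exc:
        # DNS failure, connection refused, timeout, TLS error, etc.
        return {
            "up": False,
            "error": str(exc),
            "latency_s": round(time.monotonic() - start, 3),
        }
```

A result dict like this is what feeds the alert channels listed above: `up: False` for longer than your confirmation threshold triggers the page.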

2. Initial Response

Who responds: The on-call engineer (or primary responder if no rotation exists).

First 5 minutes:

  1. Acknowledge the alert (prevents escalation)
  2. Verify the issue is real (check from a different device/network)
  3. Assess scope: complete outage or partial degradation?
  4. Notify the team (Slack #incidents channel)
  5. Start a brief log noting the time and initial observations

Escalation trigger: If the primary responder can't make progress within 15 minutes, escalate to secondary.
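Step 5, the running log, can be as simple as appending timestamped lines. A minimal sketch (function and field names are illustrative; a real team might append to a shared doc or Slack thread instead):

```python
from datetime import datetime, timezone

def start_incident_log(service: str, symptoms: str) -> list[str]:
    """Open a new incident timeline with the detection entry."""
    now = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return [
        f"{now} — INCIDENT STARTED: {service}",
        f"{now} — Symptoms: {symptoms}",
    ]

def log_entry(log: list[str], note: str) -> None:
    """Append a timestamped observation to the running timeline."""
    now = datetime.now(timezone.utc).strftime("%H:%M UTC")
    log.append(f"{now} — {note}")
```

The timestamps matter more than the prose: they become the raw material for the post-incident timeline.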

3. Communication

Internal communication template:

🔴 INCIDENT STARTED — [Service Name]
Time detected: [HH:MM UTC]
Symptoms: [what users are experiencing]
Assigned to: [@engineer]
Status: Investigating

Update this every 15-30 minutes until resolved.
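Slack incoming webhooks accept a JSON payload with a `text` field, so the template above is easy to fill and post programmatically. A sketch (the webhook URL is a placeholder you would generate in your Slack workspace):

```python
import json
from urllib import request

def format_incident_message(service: str, detected: str, symptoms: str,
                            assignee: str, status: str = "Investigating") -> str:
    """Fill the internal incident template above."""
    return (
        f"🔴 INCIDENT STARTED — {service}\n"
        f"Time detected: {detected}\n"
        f"Symptoms: {symptoms}\n"
        f"Assigned to: {assignee}\n"
        f"Status: {status}"
    )

def post_to_slack(webhook_url: str, text: str) -> bool:
    """POST the message to a Slack incoming webhook (placeholder URL)."""
    payload = json.dumps({"text": text}).encode()
    req = request.Request(
        webhook_url, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status == 200
```

Wiring this into your monitoring tool means the first internal update happens automatically, before anyone has typed a word.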

External communication:

Update your status page immediately with a brief acknowledgement: "We are aware of an issue affecting [service]. We are actively investigating and will provide updates every 30 minutes."

Update the status page every 30 minutes — even if there's no new information. Silence during an incident is more damaging than frequent updates saying "still investigating."

4. Investigation and Diagnosis

A structured investigation checklist:

  • When exactly did the failure start? (check monitoring dashboard)
  • What changed just before the failure? (deployments, config changes, infrastructure)
  • What does the error look like? (status code, error message, timeout?)
  • What do the application logs say around the start time?
  • Is the issue affecting all users or a subset?
  • Is the issue at our infrastructure or at a third party?
  • Is the issue growing, stable, or recovering?

Time correlation is your most important tool. The monitoring tool's timestamp tells you when the failure started — correlate this with deployment logs, infrastructure changes, and external service status.
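That correlation step can be mechanised: given the failure start time from your monitor and a list of recent change events, filter to the ones that landed just before the failure. A sketch (the event structure and 30-minute window are assumptions for illustration):

```python
from datetime import datetime, timedelta

def changes_near_failure(failure_start: datetime, events: list[dict],
                         window_minutes: int = 30) -> list[dict]:
    """Return change events (deploys, config edits) from the window
    just before the failure started — the first suspects."""
    window = timedelta(minutes=window_minutes)
    suspects = [
        e for e in events
        if failure_start - window <= e["time"] <= failure_start
    ]
    # Most recent change first: it is the likeliest culprit.
    return sorted(suspects, key=lambda e: e["time"], reverse=True)
```

If the list comes back empty, that itself is informative: suspicion shifts from your own changes to infrastructure or third-party dependencies.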

5. Resolution

Resolution options in order of preference:

  1. Rollback — if a deployment caused the issue, roll back immediately
  2. Quick fix — a configuration change or restart that resolves the immediate issue
  3. Workaround — disable the broken functionality while a proper fix is prepared
  4. Escalate — if the issue is beyond your team's capacity, bring in additional expertise

After applying the fix:

  • Monitor recovery via your uptime monitoring tool — wait for monitors to return green
  • Don't declare resolution until monitoring confirms recovery
  • Update the status page: "The issue has been resolved. Service is operating normally."
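The "wait for green before declaring victory" step amounts to a polling loop like the following sketch (the thresholds are illustrative, and in practice your monitoring tool does this for you):

```python
import time

def wait_for_recovery(check, consecutive_needed: int = 3,
                      interval_s: float = 30, max_attempts: int = 20) -> bool:
    """Poll a health check until it passes several times in a row.

    Requiring consecutive successes avoids declaring resolution on a
    single lucky response while the service is still flapping.
    """
    streak = 0
    for _ in range(max_attempts):
        if check():
            streak += 1
            if streak >= consecutive_needed:
                return True
        else:
            streak = 0  # any failure resets the streak
        time.sleep(interval_s)
    return False
```

The consecutive-success requirement is the code equivalent of "don't declare resolution until monitoring confirms recovery."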

6. All-Clear

When monitoring confirms full recovery:

  • Update all communication channels
  • Record the exact recovery time
  • Calculate total incident duration
  • Begin the post-incident review process
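Recording exact timestamps makes the duration calculation trivial (the times below are hypothetical, pulled from an example incident log):

```python
from datetime import datetime

# Hypothetical timestamps from the incident log
detected = datetime.fromisoformat("2024-05-01T14:02:00+00:00")
recovered = datetime.fromisoformat("2024-05-01T14:47:00+00:00")

duration_minutes = int((recovered - detected).total_seconds() // 60)
print(f"Total incident duration: {duration_minutes} minutes")  # → 45 minutes
```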

Incident Severity Levels

Classify incidents to prioritise response:

| Severity | Definition | Response Time |
| --- | --- | --- |
| P1 — Critical | Complete production outage; all users affected | Immediate (< 5 min) |
| P2 — High | Major feature broken; significant user impact | < 15 minutes |
| P3 — Medium | Partial degradation; some users/features affected | < 1 hour |
| P4 — Low | Minor issue; minimal user impact | Next business day |

P1 and P2 warrant immediate SMS escalation and status page updates. P3 can be handled with Slack notification and less urgent communication. P4 is tracked but handled normally.
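The table and escalation policy can be encoded so the alerting logic doesn't rest on a judgment call at 3am. A sketch mirroring the paragraph above (channel names are illustrative):

```python
# Which channels fire for each severity, per the policy above
SEVERITY_CHANNELS = {
    "P1": ["sms", "slack", "status_page"],
    "P2": ["sms", "slack", "status_page"],
    "P3": ["slack"],
    "P4": ["ticket"],
}

def classify_incident(complete_outage: bool, major_feature_broken: bool,
                      partial_degradation: bool) -> str:
    """Map observed impact onto the severity table above."""
    if complete_outage:
        return "P1"
    if major_feature_broken:
        return "P2"
    if partial_degradation:
        return "P3"
    return "P4"
```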

Roles During an Incident

For teams larger than 2-3 people, define clear roles:

| Role | Responsibility |
| --- | --- |
| Incident Commander | Overall coordination, final decisions, keeps the timeline |
| Technical Lead | Investigation and fix implementation |
| Communications Lead | Status page updates, customer communication |
| Scribe | Documents the timeline in real time |

For small teams, one person covers multiple roles. What matters is that someone owns each function — especially communications, which often gets neglected during the technical scramble to fix the issue.

On-Call Rotation

If your service requires 24/7 coverage, document the on-call rotation:

  • Who is primary on-call this week?
  • Who is the secondary escalation?
  • How do they get paged? (Phone, PagerDuty, SMS)
  • What's the escalation timeout? (Typical: 5-10 minutes before escalating to secondary)
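A simple weekly rotation can be derived from the ISO week number, so the same code answers "who is primary this week?" all year (the engineer names used below are placeholders):

```python
from datetime import date

def on_call_for(day: date, rotation: list[str]) -> dict:
    """Look up primary and secondary responders for the ISO week
    containing `day`. `rotation` is an ordered list of engineers."""
    week = day.isocalendar()[1]
    primary = rotation[week % len(rotation)]
    # Next person in the rotation backs up the primary
    secondary = rotation[(week + 1) % len(rotation)]
    return {"primary": primary, "secondary": secondary}
```

Dedicated paging tools (PagerDuty, Opsgenie) handle overrides, holidays, and handoffs, but the core scheduling logic is no more complicated than this.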

Alerting to the right people is covered in detail in the downtime alerts guide.

Post-Incident Review

After every P1 and P2 incident, schedule a post-mortem within 48-72 hours. See how to write a post-incident report for the full process and template.

The post-mortem closes the improvement loop: each incident makes the next one easier to handle.

Making the Plan Accessible

The best incident response plan is the one your team can find at 3am when the site is down:

  • Store it in your team wiki (Confluence, Notion, GitHub)
  • Share the link in your #engineering Slack channel
  • Include it in onboarding documentation for new engineers
  • Review and update it annually or after significant incidents

The foundation of any incident response plan is fast detection — set up monitoring at Domain Monitor.

