
Incident Response Plan for Website Downtime: A Template

When your website goes down, chaos is optional. With a documented incident response plan, your team knows exactly who does what, in what order, to get back online as fast as possible.

This guide provides a ready-to-use incident response plan template that you can adapt for your organisation.

Why You Need an Incident Response Plan

Without a plan, the typical response to website downtime involves:

  • Multiple people independently discovering the issue and stepping on each other
  • Time wasted figuring out who should do what
  • Communication gaps leaving users without information
  • No clear "all clear" moment — the incident drags on

With a plan, your team moves from detection to resolution in a coordinated, efficient process that you can repeat and improve every time it's needed.

A plan doesn't need to be complex. A one-page document that everyone knows exists and knows where to find is far more valuable than an elaborate playbook nobody reads.

The Core Components

1. Detection

How you find out: Your uptime monitoring should be the primary detection mechanism — not customers, colleagues noticing, or chance.

Document:

  • What monitoring tools are in place
  • What alert channels are configured (SMS, Slack, email)
  • Who receives which alerts
  • What the on-call rotation is (if applicable)

Target detection time: < 2 minutes from failure start
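As a rough sketch of what a single detection probe does under the hood (the URL and thresholds below are placeholders, not a real endpoint — a hosted monitor runs checks like this from multiple locations on a fixed interval):

```python
from urllib import request, error
import time

def check_uptime(url: str, timeout: float = 10.0) -> dict:
    """Run a single HTTP check and return a structured result."""
    start = time.monotonic()
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return {
                "up": 200 <= resp.status < 400,   # treat 2xx/3xx as healthy
                "status": resp.status,
                "latency_s": round(time.monotonic() - start, 3),
            }
    except (error.URLError, OSError) as exc:
        # DNS failure, connection refused, timeout, TLS error, etc.
        return {
            "up": False,
            "error": str(exc),
            "latency_s": round(time.monotonic() - start, 3),
        }
```

A result dict like this is what feeds the alert channels listed above: `up: False` for longer than your confirmation threshold triggers the page.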

2. Initial Response

Who responds: The on-call engineer (or primary responder if no rotation exists).

First 5 minutes:

  1. Acknowledge the alert (prevents escalation)
  2. Verify the issue is real (check from a different device/network)
  3. Assess scope: complete outage or partial degradation?
  4. Notify the team (Slack #incidents channel)
  5. Start a brief log noting the time and initial observations

Escalation trigger: If the primary responder can't make progress within 15 minutes, escalate to secondary.
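Step 5, the running log, can be as simple as appending timestamped lines. A minimal sketch (function and field names are illustrative; a real team might append to a shared doc or Slack thread instead):

```python
from datetime import datetime, timezone

def start_incident_log(service: str, symptoms: str) -> list[str]:
    """Open a new incident timeline with the detection entry."""
    now = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return [
        f"{now} — INCIDENT STARTED: {service}",
        f"{now} — Symptoms: {symptoms}",
    ]

def log_entry(log: list[str], note: str) -> None:
    """Append a timestamped observation to the running timeline."""
    now = datetime.now(timezone.utc).strftime("%H:%M UTC")
    log.append(f"{now} — {note}")
```

The timestamps matter more than the prose: they become the raw material for the post-incident timeline.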

3. Communication

Internal communication template:

🔴 INCIDENT STARTED — [Service Name]
Time detected: [HH:MM UTC]
Symptoms: [what users are experiencing]
Assigned to: [@engineer]
Status: Investigating

Update this every 15-30 minutes until resolved.
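Slack incoming webhooks accept a JSON payload with a `text` field, so the template above is easy to fill and post programmatically. A sketch (the webhook URL is a placeholder you would generate in your Slack workspace):

```python
import json
from urllib import request

def format_incident_message(service: str, detected: str, symptoms: str,
                            assignee: str, status: str = "Investigating") -> str:
    """Fill the internal incident template above."""
    return (
        f"🔴 INCIDENT STARTED — {service}\n"
        f"Time detected: {detected}\n"
        f"Symptoms: {symptoms}\n"
        f"Assigned to: {assignee}\n"
        f"Status: {status}"
    )

def post_to_slack(webhook_url: str, text: str) -> bool:
    """POST the message to a Slack incoming webhook (placeholder URL)."""
    payload = json.dumps({"text": text}).encode()
    req = request.Request(
        webhook_url, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status == 200
```

Wiring this into your monitoring tool means the first internal update happens automatically, before anyone has typed a word.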

External communication:

Update your status page immediately with a brief acknowledgement: "We are aware of an issue affecting [service]. We are actively investigating and will provide updates every 30 minutes."

Update the status page every 30 minutes — even if there's no new information. Silence during an incident is more damaging than frequent updates saying "still investigating."

4. Investigation and Diagnosis

A structured investigation checklist:

  • When exactly did the failure start? (check monitoring dashboard)
  • What changed just before the failure? (deployments, config changes, infrastructure)
  • What does the error look like? (status code, error message, timeout?)
  • What do the application logs say around the start time?
  • Is the issue affecting all users or a subset?
  • Is the issue at our infrastructure or at a third party?
  • Is the issue growing, stable, or recovering?

Time correlation is your most important tool. The monitoring tool's timestamp tells you when the failure started — correlate this with deployment logs, infrastructure changes, and external service status.
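That correlation step can be mechanised: given the failure start time from your monitor and a list of recent change events, filter to the ones that landed just before the failure. A sketch (the event structure and 30-minute window are assumptions for illustration):

```python
from datetime import datetime, timedelta

def changes_near_failure(failure_start: datetime, events: list[dict],
                         window_minutes: int = 30) -> list[dict]:
    """Return change events (deploys, config edits) from the window
    just before the failure started — the first suspects."""
    window = timedelta(minutes=window_minutes)
    suspects = [
        e for e in events
        if failure_start - window <= e["time"] <= failure_start
    ]
    # Most recent change first: it is the likeliest culprit.
    return sorted(suspects, key=lambda e: e["time"], reverse=True)
```

If the list comes back empty, that itself is informative: suspicion shifts from your own changes to infrastructure or third-party dependencies.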

5. Resolution

Resolution options in order of preference:

  1. Rollback — if a deployment caused the issue, roll back immediately
  2. Quick fix — a configuration change or restart that resolves the immediate issue
  3. Workaround — disable the broken functionality while a proper fix is prepared
  4. Escalate — if the issue is beyond your team's capacity, bring in additional expertise

After applying the fix:

  • Monitor recovery via your uptime monitoring tool — wait for monitors to return green
  • Don't declare resolution until monitoring confirms recovery
  • Update the status page: "The issue has been resolved. Service is operating normally."
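The "wait for green before declaring victory" step amounts to a polling loop like the following sketch (the thresholds are illustrative, and in practice your monitoring tool does this for you):

```python
import time

def wait_for_recovery(check, consecutive_needed: int = 3,
                      interval_s: float = 30, max_attempts: int = 20) -> bool:
    """Poll a health check until it passes several times in a row.

    Requiring consecutive successes avoids declaring resolution on a
    single lucky response while the service is still flapping.
    """
    streak = 0
    for _ in range(max_attempts):
        if check():
            streak += 1
            if streak >= consecutive_needed:
                return True
        else:
            streak = 0  # any failure resets the streak
        time.sleep(interval_s)
    return False
```

The consecutive-success requirement is the code equivalent of "don't declare resolution until monitoring confirms recovery."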

6. All-Clear

When monitoring confirms full recovery:

  • Update all communication channels
  • Record the exact recovery time
  • Calculate total incident duration
  • Begin the post-incident review process
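Recording exact timestamps makes the duration calculation trivial (the times below are hypothetical, pulled from an example incident log):

```python
from datetime import datetime

# Hypothetical timestamps from the incident log
detected = datetime.fromisoformat("2024-05-01T14:02:00+00:00")
recovered = datetime.fromisoformat("2024-05-01T14:47:00+00:00")

duration_minutes = int((recovered - detected).total_seconds() // 60)
print(f"Total incident duration: {duration_minutes} minutes")  # → 45 minutes
```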

Incident Severity Levels

Classify incidents to prioritise response:

| Severity | Definition | Response Time |
| --- | --- | --- |
| P1 — Critical | Complete production outage; all users affected | Immediate (< 5 min) |
| P2 — High | Major feature broken; significant user impact | < 15 minutes |
| P3 — Medium | Partial degradation; some users/features affected | < 1 hour |
| P4 — Low | Minor issue; minimal user impact | Next business day |

P1 and P2 warrant immediate SMS escalation and status page updates. P3 can be handled with Slack notification and less urgent communication. P4 is tracked but handled normally.
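The table and escalation policy can be encoded so the alerting logic doesn't rest on a judgment call at 3am. A sketch mirroring the paragraph above (channel names are illustrative):

```python
# Which channels fire for each severity, per the policy above
SEVERITY_CHANNELS = {
    "P1": ["sms", "slack", "status_page"],
    "P2": ["sms", "slack", "status_page"],
    "P3": ["slack"],
    "P4": ["ticket"],
}

def classify_incident(complete_outage: bool, major_feature_broken: bool,
                      partial_degradation: bool) -> str:
    """Map observed impact onto the severity table above."""
    if complete_outage:
        return "P1"
    if major_feature_broken:
        return "P2"
    if partial_degradation:
        return "P3"
    return "P4"
```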

Roles During an Incident

For teams larger than 2-3 people, define clear roles:

| Role | Responsibility |
| --- | --- |
| Incident Commander | Overall coordination, final decisions, keeps the timeline |
| Technical Lead | Investigation and fix implementation |
| Communications Lead | Status page updates, customer communication |
| Scribe | Documents the timeline in real time |

For small teams, one person covers multiple roles. What matters is that someone owns each function — especially communications, which often gets neglected during the technical scramble to fix the issue.

On-Call Rotation

If your service requires 24/7 coverage, document the on-call rotation:

  • Who is primary on-call this week?
  • Who is the secondary escalation?
  • How do they get paged? (Phone, PagerDuty, SMS)
  • What's the escalation timeout? (Typical: 5-10 minutes before escalating to secondary)
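A simple weekly rotation can be derived from the ISO week number, so the same code answers "who is primary this week?" all year (the engineer names used below are placeholders):

```python
from datetime import date

def on_call_for(day: date, rotation: list[str]) -> dict:
    """Look up primary and secondary responders for the ISO week
    containing `day`. `rotation` is an ordered list of engineers."""
    week = day.isocalendar()[1]
    primary = rotation[week % len(rotation)]
    # Next person in the rotation backs up the primary
    secondary = rotation[(week + 1) % len(rotation)]
    return {"primary": primary, "secondary": secondary}
```

Dedicated paging tools (PagerDuty, Opsgenie) handle overrides, holidays, and handoffs, but the core scheduling logic is no more complicated than this.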

Alerting to the right people is covered in detail in the downtime alerts guide.

Post-Incident Review

After every P1 and P2 incident, schedule a post-mortem within 48-72 hours. See how to write a post-incident report for the full process and template.

The post-mortem closes the improvement loop: each incident makes the next one easier to handle.

Making the Plan Accessible

The best incident response plan is the one your team can find at 3am when the site is down:

  • Store it in your team wiki (Confluence, Notion, GitHub)
  • Share the link in your #engineering Slack channel
  • Include it in onboarding documentation for new engineers
  • Review and update it annually or after significant incidents

The foundation of any incident response plan is fast detection — set up monitoring at Domain Monitor.

