
Incident Severity Levels Explained for Startups and Small Teams

When something breaks, the severity level determines how you respond. Without a shared framework, you get inconsistent responses — one person wakes up an engineer at 2am for a cosmetic UI bug, another waits until morning to investigate a broken payment flow.

Severity levels give your team a shared vocabulary for urgency and a clear guide for who does what, when. Here's a practical framework that works for small teams without requiring a full SRE function.

Why Severity Levels Matter

In a small team, everyone knows everything and communication is fast — so it can feel like formal severity levels are overkill. But they serve purposes beyond communication:

Consistent response: Without defined levels, response quality depends on who's on call and their personal judgement about urgency. With levels, the response is predictable regardless of who picks up the alert.

Preventing alert fatigue: If every alert is treated as critical, people stop responding urgently to anything. Severity levels let you reserve the truly urgent response for truly urgent situations.

Post-incident clarity: When writing post-incident reviews, having a defined severity makes it easier to measure response times, identify patterns, and compare incidents over time.

Customer communication: The severity level informs what you say on your status page and whether you send a proactive email. A prolonged P1 gets an email; a P4 doesn't even need a status page update.


A Four-Level Framework

P1 — Critical: Full Service Down

Definition: Your core product is completely unavailable to all or most users.

Examples:

  • Application returning 500 errors on every request
  • Authentication completely broken — no user can log in
  • Database unreachable, application crashes on every request
  • Complete DNS failure — domain doesn't resolve

Response:

  • Wake up the on-call engineer immediately, regardless of time
  • Post a status page update within 5 minutes of detection
  • All-hands response — whoever is available joins the incident
  • Customer communication: email or in-app notification if outage exceeds 15 minutes
  • Target: service restored within 60 minutes

Monitoring trigger: Your uptime monitor fires because the site is returning non-2xx responses or no response at all. This is why uptime monitoring matters: by the time users report a P1, the 5-minute status page window has usually already passed.
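The trigger condition above (non-2xx or no response at all) can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not how any particular monitoring product works; the function names `is_down` and `probe` and the 10-second timeout are assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Optional


def is_down(status: Optional[int]) -> bool:
    """Return True when a probe result should count as "down".

    `status` is the HTTP status code, or None when no response arrived
    (timeout, DNS failure, connection refused).
    """
    return status is None or not (200 <= status < 300)


def probe(url: str, timeout: float = 10.0) -> Optional[int]:
    """Fetch `url` once and return its status code, or None on no response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code
    except OSError:
        # URLError and timeouts are OSError subclasses: no usable response.
        return None
```

Running `is_down(probe(url))` on a schedule against a health endpoint approximates the check. A real monitor would also probe from several locations and require consecutive failures before alerting.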


P2 — High: Major Feature Broken

Definition: A significant feature is unavailable or severely degraded, affecting a meaningful portion of users or a critical user flow.

Examples:

  • Payment processing failing (site is up, checkout is broken)
  • Email sending broken — password resets not arriving
  • API returning errors for authenticated requests
  • Performance severely degraded (pages taking 10+ seconds)
  • A specific customer segment can't access their data

Response:

  • Engineer notified within 15 minutes; during business hours the response starts immediately, and out of hours the on-call engineer is paged
  • Status page update within 15 minutes
  • Investigation starts immediately; fix or workaround targeted within 2 hours
  • Customer communication on status page; direct email if a major customer is impacted

P3 — Medium: Minor Feature Degraded

Definition: A non-critical feature is degraded or behaving incorrectly. Users can work around it or the impact is limited.

Examples:

  • Report generation slower than usual
  • A non-critical notification not sending
  • UI bug affecting a small number of users
  • Export function failing for some file formats

Response:

  • Notification during business hours only
  • Acknowledged within 2 hours during business hours
  • Fix targeted in the next sprint or within 24–48 hours
  • Status page update optional; post one if users are likely to notice

P4 — Low: Minor Issues

Definition: Low-impact issues, cosmetic problems, or items that don't affect functionality for most users.

Examples:

  • Minor UI inconsistency
  • Non-critical logging errors
  • Performance issues only visible in metrics, not user experience
  • Feature request or improvement masquerading as a bug

Response:

  • Logged as a ticket in your issue tracker
  • No immediate action required
  • Fixed as part of normal sprint planning
  • No status page update needed

Quick Reference

| Level | What it means | Response time | Time-of-day | Status page? |
|---|---|---|---|---|
| P1 | Core product down | Immediate | Any | Yes, within 5 min |
| P2 | Major feature broken | 15 minutes | Any (P2 pages on-call) | Yes, within 15 min |
| P3 | Minor feature degraded | 2 hours (business hours) | Business hours only | Optional |
| P4 | Low-impact issue | Next sprint | Business hours | No |
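If you want the quick reference to live next to your alerting code, it can be written down as data. The sketch below is one possible encoding, not a standard. The `SeverityPolicy` class, its field names, and the `notify_now` helper are assumptions made for illustration; the values come from the quick reference.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    meaning: str
    # How quickly someone should respond; None means "next sprint".
    response_time: Optional[timedelta]
    # "always", "business_hours", or "ticket" (no notification, just a ticket).
    notify: str
    # Deadline for the first status page update; None means optional or not needed.
    status_page_within: Optional[timedelta]


POLICIES = {
    "P1": SeverityPolicy("Core product down", timedelta(0), "always", timedelta(minutes=5)),
    "P2": SeverityPolicy("Major feature broken", timedelta(minutes=15), "always", timedelta(minutes=15)),
    "P3": SeverityPolicy("Minor feature degraded", timedelta(hours=2), "business_hours", None),
    "P4": SeverityPolicy("Low-impact issue", None, "ticket", None),
}


def notify_now(level: str, business_hours: bool) -> bool:
    """Return True if someone should be notified right away for this level."""
    notify = POLICIES[level].notify
    if notify == "always":
        return True
    if notify == "business_hours":
        return business_hours
    return False  # "ticket": logged and handled in sprint planning
```

Keeping the policy in one place means the status page deadline and the paging rule cannot drift apart between the runbook and the code.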

Assigning Severity in Practice

Two questions determine severity:

How many users are affected? A bug hitting every user is more severe than one hitting 1% of users.

How critical is the affected functionality? Core product functionality (login, core features, payments) is more critical than peripheral features.

When in doubt, escalate. It's better to treat a P2 like a P1 and downgrade after investigation than to treat a P1 like a P2 and respond too slowly.
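The two questions, plus the "when in doubt" rule, can be sketched as a small function. The framework does not define numeric thresholds, so `MOST_USERS = 0.5` is purely an assumption for illustration, as are the function and parameter names. Treat the result as a starting point for a human decision, not a verdict.

```python
LEVELS = ["P1", "P2", "P3", "P4"]

# Assumption: "all or most users" is read as half or more. Tune for your product.
MOST_USERS = 0.5


def assign_severity(fraction_affected: float, core_functionality: bool,
                    unsure: bool = False) -> str:
    """Suggest a severity from user impact and how critical the feature is."""
    if not 0.0 <= fraction_affected <= 1.0:
        raise ValueError("fraction_affected must be between 0 and 1")

    if fraction_affected == 0:
        level = "P4"  # nothing user-visible
    elif not core_functionality:
        level = "P3"  # peripheral feature degraded
    elif fraction_affected >= MOST_USERS:
        level = "P1"  # core functionality down for most users
    else:
        level = "P2"  # core functionality broken for a segment

    # When in doubt, go one level up and downgrade after investigation.
    if unsure and level != "P1":
        level = LEVELS[LEVELS.index(level) - 1]
    return level
```

For example, `assign_severity(0.01, True)` suggests P2, and the same call with `unsure=True` suggests P1.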

The Ambiguous Middle Ground

The hardest cases are degraded-but-not-down scenarios:

  • Site is up, but very slow
  • Feature works for most users but fails for a specific segment
  • A background job is failing but users haven't noticed yet

For these, assess user impact. If users are actively experiencing problems, it's P2. If users haven't noticed yet but will, it's at least P3 (investigate now, before it escalates). If it's purely an internal signal with no user impact, it's P3 or P4.
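That rule of thumb fits in a few lines. This is a sketch with a hypothetical function name, and it returns a floor rather than a final answer. The internal-only case may be P3 or P4, so the function returns P4 and leaves the upgrade to your judgement.

```python
def minimum_severity(users_affected_now: bool, users_will_notice: bool) -> str:
    """Return the severity floor for a degraded-but-not-down incident."""
    if users_affected_now:
        return "P2"
    if users_will_notice:
        # "At least P3": investigate now, before it escalates.
        return "P3"
    # Internal signal only: P3 or P4 are both acceptable, so P4 is the floor.
    return "P4"
```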


Connecting Severity to Monitoring

Severity levels are only useful if you know about incidents promptly. A P1 detected by a customer email is already a late response by any reasonable standard.

Uptime monitoring gives you the first signal for P1 incidents, and often for P2 ones too. Domain Monitor checks your application every minute from multiple locations and alerts you immediately when your service goes down or starts returning errors. Create a free account and configure alerts to go to the right channel for each severity: PagerDuty or SMS for P1 and for out-of-hours P2, Slack for P3 and for P2 during business hours.

See "how to set up downtime alerts" for alert configuration and "incident response plan for website downtime" for the full response framework. Once severity levels are defined, the next challenge is making sure alerts stay meaningful. See "how to reduce alert fatigue without missing real incidents" for how to tune routing so P1 alerts always get urgent attention.
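The per-severity routing can be sketched as a lookup table. The channel names below are plain labels, not calls to any real PagerDuty, SMS, or Slack API, and `channels_for` is a hypothetical helper. The out-of-hours paging for P2 follows the P2 response rules earlier in this post, and the ticket for P4 follows the P4 response rules.

```python
from typing import Dict, List

# Baseline routing per severity; the strings are labels for your own integrations.
ROUTES: Dict[str, List[str]] = {
    "P1": ["pagerduty", "sms"],
    "P2": ["slack"],
    "P3": ["slack"],
    "P4": ["ticket"],
}


def channels_for(level: str, business_hours: bool) -> List[str]:
    """Return the channels an alert of this severity should go to."""
    channels = list(ROUTES[level])  # copy so callers cannot mutate ROUTES
    if level == "P2" and not business_hours:
        # P2 pages the on-call engineer out of hours.
        channels.insert(0, "pagerduty")
    return channels
```

A dispatcher would then loop over `channels_for(level, business_hours)` and hand the alert to whichever integration each label stands for.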


Post-Incident Process

Every P1 and P2 deserves a post-incident review. The goal isn't blame — it's improvement.

A good post-incident review covers:

  • Timeline: when did the incident start, when was it detected, when was it resolved?
  • Root cause: what actually caused it?
  • Detection gap: why didn't we know sooner?
  • Action items: what changes will prevent recurrence?

See how to write a post-incident report for a template. Recording these consistently lets you identify patterns — if P1 incidents are often caused by deployment failures, that's a signal to invest in staging and testing.
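The timeline questions become easier to compare across incidents if you record them as timestamps and derive the durations. The sketch below assumes three recorded times per incident; the `IncidentTimeline` class and its property names are made up for this example, and the dates are illustrative rather than from a real incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    started: datetime   # when the incident actually began
    detected: datetime  # when the team first knew about it
    resolved: datetime  # when service was restored

    @property
    def time_to_detect(self) -> timedelta:
        """The detection gap: how long the incident ran unnoticed."""
        return self.detected - self.started

    @property
    def time_to_restore(self) -> timedelta:
        """Time from detection to resolution."""
        return self.resolved - self.detected

    @property
    def total_duration(self) -> timedelta:
        """Time users were affected, from start to resolution."""
        return self.resolved - self.started


timeline = IncidentTimeline(
    started=datetime(2024, 1, 1, 2, 0),
    detected=datetime(2024, 1, 1, 2, 4),
    resolved=datetime(2024, 1, 1, 2, 50),
)
print(timeline.time_to_detect)  # 0:04:00
print(timeline.total_duration)  # 0:50:00
```

Whether a restore target such as the P1 60-minute goal is measured from start or from detection is a choice your team has to make. Exposing both durations keeps that choice explicit.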

