
When something breaks, the severity level determines how you respond. Without a shared framework, you get inconsistent responses — one person wakes up an engineer at 2am for a cosmetic UI bug, another waits until morning to investigate a broken payment flow.
Severity levels give your team a shared vocabulary for urgency and a clear guide for who does what, when. Here's a practical framework that works for small teams without requiring a full SRE function.
In a small team, everyone knows everything and communication is fast — so it can feel like formal severity levels are overkill. But they serve purposes beyond communication:
- **Consistent response** — Without defined levels, response quality depends on who's on call and their personal judgement about urgency. With levels, the response is predictable regardless of who picks up the alert.
- **Preventing alarm fatigue** — If every alert is treated as critical, people stop responding urgently to anything. Severity levels let you reserve the truly urgent response for truly urgent situations.
- **Post-incident clarity** — When writing post-incident reviews, having a defined severity makes it easier to measure response times, identify patterns, and compare incidents over time.
- **Customer communication** — The severity level informs what you say on your status page and whether you send a proactive email. P1 gets an email; P4 might not even get a status page update.
**P1 — Core product down**

Definition: Your core product is completely unavailable to all or most users.
Examples:

- The site returns errors or no response at all, for everyone
- Login fails for all users, locking everyone out
- Payments fail across the board

Response:

- Respond immediately, at any time of day — this pages whoever is on call
- Post to the status page within 5 minutes
Monitoring trigger: Your uptime monitor fires — site is returning non-2xx or no response at all. This is why uptime monitoring matters: P1 incidents detected by users are already past the 5-minute window.
**P2 — Major feature broken**

Definition: A significant feature is unavailable or severely degraded, affecting a meaningful portion of users or a critical user flow.
Examples:

- The payment flow is broken while the rest of the product works
- A core feature errors or times out for a significant share of users

Response:

- Page on-call within 15 minutes, at any time of day
- Post to the status page within 15 minutes
**P3 — Minor feature degraded**

Definition: A non-critical feature is degraded or behaving incorrectly. Users can work around it or the impact is limited.
Examples:

- A secondary feature fails intermittently, but users have a workaround
- A non-critical page is slow or renders incorrectly for some users

Response:

- Investigate within 2 hours during business hours; no overnight paging
- Status page update is optional
**P4 — Low-impact issue**

Definition: Low-impact issues, cosmetic problems, or items that don't affect functionality for most users.
Examples:

- A cosmetic UI bug or typo
- A minor visual glitch that doesn't affect functionality

Response:

- Ticket it and schedule it for the next sprint, during business hours
- No status page update
| Level | What it means | Response time | Time-of-day | Status page? |
|---|---|---|---|---|
| P1 | Core product down | Immediate | Any | Yes, within 5 min |
| P2 | Major feature broken | 15 minutes | Any (P2 pages on-call) | Yes, within 15 min |
| P3 | Minor feature degraded | 2 hours (business hours) | Business hours only | Optional |
| P4 | Low-impact issue | Next sprint | Business hours | No |
Two questions determine severity:
1. **How many users are affected?** A bug hitting every user is more severe than one hitting 1% of users.
2. **How critical is the affected functionality?** Core product functionality (login, core features, payments) is more critical than peripheral features.
When in doubt, escalate up — it's always better to treat a P2 like a P1 and downgrade after investigation than to treat a P1 like a P2 and be slow to respond.
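The two questions can be sketched as a tiny classifier. This is a hedged sketch, not a rule book: the function name and the 50% cutoff are hypothetical starting points to tune for your product.

```python
def classify(users_affected: float, core_functionality: bool) -> str:
    """Map the two severity questions to a level.

    users_affected: rough fraction of users hit (0.0 to 1.0).
    The 0.5 cutoff is illustrative — adjust it to your product,
    and when in doubt, escalate up.
    """
    if core_functionality:
        return "P1" if users_affected >= 0.5 else "P2"
    return "P3" if users_affected >= 0.5 else "P4"
```

For example, `classify(0.9, True)` returns `"P1"` (core functionality, most users), while `classify(0.01, False)` returns `"P4"`.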
The hardest cases are degraded-but-not-down scenarios: elevated error rates, rising latency, or an internal metric creeping toward a threshold.
For these, assess user impact. If users are actively experiencing problems, it's a P2. If users haven't noticed yet but will, it's at least a P3 — investigate now, before it escalates. If it's purely an internal signal with no user impact, it's a P3 or P4.
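That rule can be written as a short triage function — the name and signature are purely illustrative:

```python
def triage_degraded(users_affected_now: bool, users_will_notice: bool) -> str:
    """Triage a degraded-but-not-down signal by user impact."""
    if users_affected_now:
        return "P2"  # users are actively experiencing problems
    if users_will_notice:
        return "P3"  # investigate now, before it escalates
    return "P4"      # internal-only signal; bump to P3 if it needs this week's attention
```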
Severity levels are only useful if you know about incidents promptly. A P1 detected by a customer email is already a late response by any reasonable standard.
Uptime monitoring gives you the first signal for P1 incidents — and often P2 ones too. Domain Monitor monitors your application every minute from multiple locations and alerts you immediately when your service goes down or starts returning errors. Create a free account and configure alerts to go to the right channel for each severity: PagerDuty or SMS for P1, Slack for P2 and P3.
See how to set up downtime alerts for alert configuration, and incident response plan for website downtime for the full response framework. Once severity levels are defined, the next challenge is making sure alerts stay meaningful — see how to reduce alert fatigue without missing real incidents for how to tune routing so P1 alerts always get urgent attention.
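One way to keep routing consistent is a severity-to-channel table. This sketch assumes the channels suggested above; the names and the fallback behaviour are illustrative, not a specific tool's configuration:

```python
# Hypothetical routing table: PagerDuty or SMS for P1,
# Slack for P2 and P3, the backlog for P4.
ROUTES = {
    "P1": ["pagerduty", "sms"],
    "P2": ["slack"],
    "P3": ["slack"],
    "P4": ["backlog"],
}

def route_alert(severity: str) -> list[str]:
    """Return the channels an alert of this severity should go to."""
    # Unknown severity: escalate up rather than drop the alert.
    return ROUTES.get(severity, ["pagerduty"])
```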
Every P1 and P2 deserves a post-incident review. The goal isn't blame — it's improvement.
A good post-incident review covers:

- A timeline of the incident, from first signal to resolution
- The severity and the user impact
- The root cause
- How long detection and response took
- Action items to prevent a repeat
See how to write a post-incident report for a template. Recording these consistently lets you identify patterns — if P1 incidents are often caused by deployment failures, that's a signal to invest in staging and testing.
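As a small illustration of pattern-spotting, a few lines of Python can tally root causes from recorded reviews. The incident records below are made up; in practice they would come from your post-incident reports.

```python
from collections import Counter

# Hypothetical incident log, one entry per post-incident review.
incidents = [
    {"severity": "P1", "cause": "deployment failure"},
    {"severity": "P2", "cause": "third-party outage"},
    {"severity": "P1", "cause": "deployment failure"},
]

def common_causes(incidents, severity="P1"):
    """Tally root causes for a given severity to surface patterns."""
    return Counter(i["cause"] for i in incidents if i["severity"] == severity)
```

If "deployment failure" dominates the P1 tally, that's the signal to invest in staging and testing.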