
When something breaks, the severity level determines how you respond. Without a shared framework, you get inconsistent responses — one person wakes up an engineer at 2am for a cosmetic UI bug, another waits until morning to investigate a broken payment flow.
Severity levels give your team a shared vocabulary for urgency and a clear guide for who does what, when. Here's a practical framework that works for small teams without requiring a full SRE function.
In a small team, everyone knows everything and communication is fast — so it can feel like formal severity levels are overkill. But they serve purposes beyond communication:
Consistent response — Without defined levels, response quality depends on who's on call and their personal judgement about urgency. With levels, the response is predictable regardless of who picks up the alert.
Preventing alert fatigue — If every alert is treated as critical, people stop responding urgently to anything. Severity levels let you reserve the truly urgent response for truly urgent situations.
Post-incident clarity — When writing post-incident reviews, having a defined severity makes it easier to measure response times, identify patterns, and compare incidents over time.
Customer communication — The severity level informs what you say on your status page and whether you send a proactive email. P1 gets an email; P4 might not even get a status page update.
## P1 — Critical: core product down

Definition: Your core product is completely unavailable to all or most users.

Examples: the site returns errors or no response at all; login fails for everyone; payments are down across the board.

Response: immediate, at any time of day. Page on-call and post a status page update within 5 minutes.
Monitoring trigger: Your uptime monitor fires — site is returning non-2xx or no response at all. This is why uptime monitoring matters: P1 incidents detected by users are already past the 5-minute window.
## P2 — Major: significant feature broken

Definition: A significant feature is unavailable or severely degraded, affecting a meaningful portion of users or a critical user flow.

Examples: a critical flow such as checkout fails for a meaningful subset of users; a major feature is down while the rest of the product works.

Response: within 15 minutes, at any time of day (P2 pages on-call). Post a status page update within 15 minutes.
## P3 — Minor: non-critical feature degraded

Definition: A non-critical feature is degraded or behaving incorrectly. Users can work around it or the impact is limited.

Examples: a peripheral feature misbehaves for a small share of users; a bug with a known workaround.

Response: within 2 hours, during business hours only. A status page update is optional.
## P4 — Low: cosmetic or minimal impact

Definition: Low-impact issues, cosmetic problems, or items that don't affect functionality for most users.

Examples: a cosmetic UI bug; a misaligned element or a typo.

Response: ticket it for the next sprint; no status page update needed.
| Level | What it means | Response time | Time-of-day | Status page? |
|---|---|---|---|---|
| P1 | Core product down | Immediate | Any | Yes, within 5 min |
| P2 | Major feature broken | 15 minutes | Any (P2 pages on-call) | Yes, within 15 min |
| P3 | Minor feature degraded | 2 hours (business hours) | Business hours only | Optional |
| P4 | Low-impact issue | Next sprint | Business hours | No |
Two questions determine severity:
How many users are affected? A bug hitting every user is more severe than one hitting 1% of users.
How critical is the affected functionality? Core product functionality (login, core features, payments) is more critical than peripheral features.
When in doubt, escalate up — it's always better to treat a P2 like a P1 and downgrade after investigation than to treat a P1 like a P2 and be slow to respond.
The hardest cases are degraded-but-not-down scenarios: elevated error rates, rising latency, or a background job quietly failing while the site stays up.
For these, assess user impact. If users are actively experiencing problems, it's a P2. If users haven't noticed yet but will, it's at least a P3 (investigate now, before it escalates). If it's purely an internal signal with no user impact, it's a P3 or P4.
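The two questions above, plus the rule for internal-only signals, can be sketched as a triage helper. The thresholds here are illustrative assumptions, not part of the framework — tune them to your product:

```python
def triage(fraction_affected: float, critical_flow: bool,
           user_visible: bool = True) -> str:
    """Map the two severity questions to a P1-P4 level.

    fraction_affected: share of users hit (0.0 to 1.0).
    critical_flow: is core functionality (login, payments) involved?
    user_visible: False for purely internal signals with no user impact.
    """
    if not user_visible:
        # Internal-only signal: investigate now, before it escalates.
        return "P3"
    if critical_flow:
        # Core functionality down for most users is a P1; otherwise P2.
        return "P1" if fraction_affected >= 0.5 else "P2"
    if fraction_affected >= 0.5:
        return "P2"
    return "P3" if fraction_affected >= 0.01 else "P4"
```

Note that the thresholds lean toward the higher severity, matching the "when in doubt, escalate up" rule: it is cheaper to downgrade after investigation than to respond slowly.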
Severity levels are only useful if you know about incidents promptly. A P1 detected by a customer email is already a late response by any reasonable standard.
Uptime monitoring gives you the first signal for P1 incidents — and often P2 ones too. Domain Monitor monitors your application every minute from multiple locations and alerts you immediately when your service goes down or starts returning errors. Create a free account and configure alerts to go to the right channel for each severity: PagerDuty or SMS for P1, Slack for P2 and P3.
See "how to set up downtime alerts" for alert configuration and "incident response plan for website downtime" for the full response framework. Once severity levels are defined, the next challenge is making sure alerts stay meaningful — see "how to reduce alert fatigue without missing real incidents" for how to tune routing so P1 alerts always get urgent attention.
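The per-severity routing suggested above can be captured in a small lookup table. The channel names here are placeholders, not Domain Monitor's API:

```python
# Hypothetical routing table: pager/SMS for P1, chat for P2 and P3,
# and the backlog for P4, as suggested above.
ROUTES: dict[str, list[str]] = {
    "P1": ["pagerduty", "sms"],
    "P2": ["slack"],
    "P3": ["slack"],
    "P4": ["backlog"],
}


def channels_for(severity: str) -> list[str]:
    """Return alert channels for a severity level.

    Unknown levels fall back to chat rather than being dropped silently,
    consistent with the escalate-up principle.
    """
    return ROUTES.get(severity, ["slack"])
```

Keeping the mapping in one place makes the policy auditable: anyone on the team can see, in one table, which incidents wake someone up.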
Every P1 and P2 deserves a post-incident review. The goal isn't blame — it's improvement.
A good post-incident review covers:

- A timeline of what happened, when it started, and when it was detected
- The severity and user impact
- The root cause
- Follow-up actions to prevent recurrence or speed up detection
See "how to write a post-incident report" for a template. Recording these consistently lets you identify patterns — if P1 incidents are often caused by deployment failures, that's a signal to invest in staging and testing.