
When something breaks, the severity level determines how you respond. Without a shared framework, you get inconsistent responses — one person wakes up an engineer at 2am for a cosmetic UI bug, another waits until morning to investigate a broken payment flow.
Severity levels give your team a shared vocabulary for urgency and a clear guide for who does what, when. Here's a practical framework that works for small teams without requiring a full SRE function.
In a small team, everyone knows everything and communication is fast — so it can feel like formal severity levels are overkill. But they serve purposes beyond communication:
Consistent response — Without defined levels, response quality depends on who's on call and their personal judgement about urgency. With levels, the response is predictable regardless of who picks up the alert.
Preventing alert fatigue — If every alert is treated as critical, people stop responding urgently to anything. Severity levels let you reserve the truly urgent response for truly urgent situations.
Post-incident clarity — When writing post-incident reviews, having a defined severity makes it easier to measure response times, identify patterns, and compare incidents over time.
Customer communication — The severity level informs what you say on your status page and whether you send a proactive email. P1 gets an email; P4 might not even get a status page update.
## P1 — Critical: core product down

Definition: Your core product is completely unavailable to all or most users.

Examples: the site returns errors or no response at all; login fails for everyone; payments are down across the board.

Response: immediate, at any time of day. Page on-call and post a status page update within 5 minutes.
Monitoring trigger: Your uptime monitor fires — site is returning non-2xx or no response at all. This is why uptime monitoring matters: P1 incidents detected by users are already past the 5-minute window.
## P2 — Major: significant feature broken

Definition: A significant feature is unavailable or severely degraded, affecting a meaningful portion of users or a critical user flow.

Examples: a critical flow such as checkout fails for a meaningful subset of users; a major feature is down while the rest of the product works.

Response: within 15 minutes, at any time of day (P2 pages on-call). Post a status page update within 15 minutes.
## P3 — Minor: non-critical feature degraded

Definition: A non-critical feature is degraded or behaving incorrectly. Users can work around it or the impact is limited.

Examples: a peripheral feature misbehaves for a small share of users; a bug with a known workaround.

Response: within 2 hours, during business hours only. A status page update is optional.
## P4 — Low: cosmetic or minimal impact

Definition: Low-impact issues, cosmetic problems, or items that don't affect functionality for most users.

Examples: a cosmetic UI bug; a misaligned element or a typo.

Response: ticket it for the next sprint; no status page update needed.
| Level | What it means | Response time | Time-of-day | Status page? |
|---|---|---|---|---|
| P1 | Core product down | Immediate | Any | Yes, within 5 min |
| P2 | Major feature broken | 15 minutes | Any (P2 pages on-call) | Yes, within 15 min |
| P3 | Minor feature degraded | 2 hours (business hours) | Business hours only | Optional |
| P4 | Low-impact issue | Next sprint | Business hours | No |
Two questions determine severity:
How many users are affected? A bug hitting every user is more severe than one hitting 1% of users.
How critical is the affected functionality? Core product functionality (login, core features, payments) is more critical than peripheral features.
When in doubt, escalate up — it's always better to treat a P2 like a P1 and downgrade after investigation than to treat a P1 like a P2 and be slow to respond.
The hardest cases are degraded-but-not-down scenarios: elevated error rates, rising latency, or a background job quietly failing while the site stays up.
For these, assess user impact. If users are actively experiencing problems, it's a P2. If users haven't noticed yet but will, it's at least a P3 (investigate now, before it escalates). If it's purely an internal signal with no user impact, it's a P3 or P4.
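The two questions above, plus the rule for internal-only signals, can be sketched as a triage helper. The thresholds here are illustrative assumptions, not part of the framework — tune them to your product:

```python
def triage(fraction_affected: float, critical_flow: bool,
           user_visible: bool = True) -> str:
    """Map the two severity questions to a P1-P4 level.

    fraction_affected: share of users hit (0.0 to 1.0).
    critical_flow: is core functionality (login, payments) involved?
    user_visible: False for purely internal signals with no user impact.
    """
    if not user_visible:
        # Internal-only signal: investigate now, before it escalates.
        return "P3"
    if critical_flow:
        # Core functionality down for most users is a P1; otherwise P2.
        return "P1" if fraction_affected >= 0.5 else "P2"
    if fraction_affected >= 0.5:
        return "P2"
    return "P3" if fraction_affected >= 0.01 else "P4"
```

Note that the thresholds lean toward the higher severity, matching the "when in doubt, escalate up" rule: it is cheaper to downgrade after investigation than to respond slowly.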
Severity levels are only useful if you know about incidents promptly. A P1 detected by a customer email is already a late response by any reasonable standard.
Uptime monitoring gives you the first signal for P1 incidents — and often P2 ones too. Domain Monitor monitors your application every minute from multiple locations and alerts you immediately when your service goes down or starts returning errors. Create a free account and configure alerts to go to the right channel for each severity: PagerDuty or SMS for P1, Slack for P2 and P3.
See "how to set up downtime alerts" for alert configuration and "incident response plan for website downtime" for the full response framework. Once severity levels are defined, the next challenge is making sure alerts stay meaningful — see "how to reduce alert fatigue without missing real incidents" for how to tune routing so P1 alerts always get urgent attention.
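The per-severity routing suggested above can be captured in a small lookup table. The channel names here are placeholders, not Domain Monitor's API:

```python
# Hypothetical routing table: pager/SMS for P1, chat for P2 and P3,
# and the backlog for P4, as suggested above.
ROUTES: dict[str, list[str]] = {
    "P1": ["pagerduty", "sms"],
    "P2": ["slack"],
    "P3": ["slack"],
    "P4": ["backlog"],
}


def channels_for(severity: str) -> list[str]:
    """Return alert channels for a severity level.

    Unknown levels fall back to chat rather than being dropped silently,
    consistent with the escalate-up principle.
    """
    return ROUTES.get(severity, ["slack"])
```

Keeping the mapping in one place makes the policy auditable: anyone on the team can see, in one table, which incidents wake someone up.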
Every P1 and P2 deserves a post-incident review. The goal isn't blame — it's improvement.
A good post-incident review covers:

- A timeline of what happened, when it started, and when it was detected
- The severity and user impact
- The root cause
- Follow-up actions to prevent recurrence or speed up detection
See "how to write a post-incident report" for a template. Recording these consistently lets you identify patterns — if P1 incidents are often caused by deployment failures, that's a signal to invest in staging and testing.