
Alert fatigue is a monitoring failure mode that's easy to miss because it looks like success. You have lots of alerts configured. Notifications fire frequently. And then, gradually, the team stops treating them as urgent. Someone mutes a noisy channel. Alerts go unacknowledged for hours. And eventually, a real incident is missed because it looked the same as all the noise.
The fix isn't to monitor less — it's to make every alert actionable. Here's how.
Before fixing alert fatigue, understand what creates it:
Alerts on symptoms that aren't problems. CPU at 60% is a normal operating state for many servers, not an emergency. Alerting at that threshold means notifications fire constantly without meaning anything.
No confirmation threshold. An alert that fires the moment a single check fails creates noise from transient network blips. A service that fails one check in 60 and succeeds all the others doesn't need an alert.
Alerts without owners. If an alert goes to a shared channel where no specific person is responsible, it gets treated as someone else's problem.
Duplicate alerts for the same root cause. If your database goes down, you may get alerts from your uptime monitor, your application error tracker, your log aggregator, and your queue monitor — all for the same root cause. The team receives 4 alerts about 1 incident.
Non-critical alerts in critical channels. If P3 and P4 alerts go to the same channel as P1 alerts, the constant P3/P4 noise trains people to ignore the channel entirely.
Before you create any alert, answer: "What specific action will someone take when this fires?"
If you can't answer that clearly, the alert shouldn't exist or isn't configured correctly yet.
Good answers:

- "Roll back the deploy that went out in the last hour."
- "Restart the stuck worker and open a ticket if it fails again."
- "Fail over to the secondary database."

Bad answers:

- "Be aware of it."
- "Investigate." (Investigate what, starting where?)
- "Nothing, but it's good to know."
Most uptime monitors let you configure a confirmation count — alert only after N consecutive failures. Use it.
A single failed check often represents:

- a transient network blip between the monitoring location and your server
- a brief timeout during a deploy or restart
- a momentary load spike that resolved on its own
Two or three consecutive failures across multiple monitoring locations is almost always a real problem.
```
# Good configuration
Fail condition: 2 consecutive checks fail from 3+ locations
Alert condition: ALL of the above
```

```
# Noisy configuration
Fail condition: 1 check fails from any location
Alert condition: immediately
```
For most services, alerting after 2 consecutive failures (2 minutes if checking every minute) is the right balance. It eliminates almost all false positives without significantly delaying detection of real incidents.
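The confirmation logic itself is a small state machine: reset a counter on a passing check, increment on a failure, and alert only at the threshold. A minimal sketch (names are illustrative, not any particular monitor's API):

```python
def should_alert(check_results, threshold=2):
    """Fire only after `threshold` consecutive failed checks.

    check_results: iterable of booleans, True = check passed.
    """
    consecutive_failures = 0
    for passed in check_results:
        # A passing check resets the streak; a failure extends it.
        consecutive_failures = 0 if passed else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False
```

With `threshold=2`, the isolated blip in `[True, False, True, False]` never alerts, while a genuine outage like `[True, False, False]` does.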
Not every alert deserves the same response or the same channel:
| Severity | Routing | Response time |
|---|---|---|
| P1 — service down | PagerDuty / phone call / SMS | Immediate, any time |
| P2 — major degradation | Slack + email | Immediate during business hours; on-call after hours |
| P3 — minor issues | Slack | Business hours only |
| P4 — informational | Email digest | Next business day |
The key principle: P1 alerts should make noise at 3am and be impossible to miss. P4 alerts should never make noise at 3am under any circumstances.
Separate channels for different severity levels, and train your team: if it's in #incidents-critical, respond immediately; if it's in #incidents-info, review during business hours.
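The routing table above reduces to a small dispatch function. A sketch (the channel names, function signature, and 9-to-5 weekday definition of business hours are assumptions for illustration):

```python
from datetime import datetime, time

ROUTES = {
    "P1": ["pagerduty", "sms"],
    "P2": ["slack", "email"],
    "P3": ["slack"],
    "P4": [],  # batched into a next-day email digest, never real-time
}

def route_alert(severity, now):
    """Return the channels an alert should go to right now."""
    in_hours = now.weekday() < 5 and time(9) <= now.time() <= time(17)
    if severity == "P1":
        return ROUTES["P1"]  # always page, any time of day
    if severity == "P2":
        # Immediate during business hours; page on-call after hours.
        return ROUTES["P2"] if in_hours else ["pagerduty"]
    if severity == "P3":
        # Hold until business hours rather than notifying at night.
        return ROUTES["P3"] if in_hours else []
    return ROUTES["P4"]
```

A P1 at 3am on a Saturday still pages; a P3 at the same moment routes nowhere until morning.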
Alerts from a single monitoring location are more prone to false positives because they can be caused by:

- a network problem between that one location and your server, not an outage
- an issue at the monitoring location itself rather than your service
Alerts that require confirmation from multiple locations are almost always real:
```
Alert condition: 3+ locations reporting failure
```
This is especially important for suppressing the "phantom downtime" alerts that happen once or twice a week and train your team to dismiss alerts.
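The quorum rule is simple to express: count failing locations and compare against a threshold. A sketch, assuming each location reports a pass/fail boolean:

```python
def confirmed_outage(location_reports, quorum=3):
    """Treat an incident as real only when `quorum` locations agree.

    location_reports: dict mapping location name -> bool (True = check passed).
    """
    failing = [loc for loc, passed in location_reports.items() if not passed]
    return len(failing) >= quorum
```

One location disagreeing with the rest, like `{"fra": False, "nyc": True, "syd": True}`, is treated as a local network issue, not an outage.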
Scheduled maintenance, deployments, and restarts should not generate alerts. Configure maintenance windows to suppress alerts during these periods.
This is more important than it seems. Every false alert during a deployment trains your team to dismiss alerts that fire around deployment time — including the ones that indicate the deployment actually broke something.
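Maintenance suppression reduces to a time-window check before dispatching anything. A minimal sketch (the window times and `suppressed` helper are illustrative, not a specific tool's configuration):

```python
from datetime import datetime

# Illustrative window: suppress alerts during a planned 02:00-04:00 maintenance.
MAINTENANCE_WINDOWS = [
    (datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 4, 0)),
]

def suppressed(fired_at, windows=MAINTENANCE_WINDOWS):
    """True if the alert fired inside a configured maintenance window."""
    return any(start <= fired_at < end for start, end in windows)
```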
Alert on what users experience, not server internals.
Instead of: "CPU is above 70%" (users may not notice)
Alert on: "Response time exceeds 5 seconds" (users definitely notice)

Instead of: "Database has 80 active connections" (may be fine)
Alert on: "Health check returning 503" (users can't use the app)
Internal metrics are useful for dashboards and post-incident analysis. Alerts should generally reflect user-facing outcomes.
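A user-facing rule like "response time exceeds 5 seconds" is often evaluated against a high percentile rather than a single sample, so one slow request doesn't page anyone. A sketch using the nearest-rank 95th percentile (all names are illustrative):

```python
def p95(samples):
    """Nearest-rank 95th percentile of a list of response times."""
    ordered = sorted(samples)
    rank = max(1, int(0.95 * len(ordered)))  # 1-based nearest-rank index
    return ordered[rank - 1]

def should_alert_on_latency(samples_ms, threshold_ms=5000):
    """Alert on what users feel: p95 response time over the threshold."""
    return p95(samples_ms) > threshold_ms
```

One 6-second outlier among twenty fast requests stays quiet; a sustained slowdown affecting the slowest 5%+ of requests fires.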
Schedule a quarterly alert audit:

- For each alert, check when it last fired and whether anyone acted on it.
- Delete alerts nobody acts on; retune thresholds on alerts that fire too often.
- Confirm every alert still has an owner and the right severity.
This is the maintenance work that keeps monitoring useful. Alerts configured once and never reviewed drift over time as your system changes.
When configuring any new alert:

- Name the specific action someone will take when it fires.
- Set a confirmation threshold of consecutive failures from multiple locations.
- Assign a severity and route it to the matching channel.
- Give it an owner.
- Make sure it's suppressed during maintenance windows.
Good alert configuration is a feature, not just a setting. Domain Monitor checks your services every minute from multiple global locations and only alerts when multiple consecutive checks fail from multiple locations — eliminating the majority of false positive alerts while detecting real incidents quickly. Create a free account and configure alerts that your team will actually respond to.
See incident severity levels explained for the severity framework, and uptime monitoring best practices for broader monitoring configuration guidance.