
Most teams set up monitoring, confirm that the dashboard shows green, and consider the job done. Then the first real incident reveals that alerts were misconfigured, the on-call engineer did not receive the notification, or the monitoring check was looking at the wrong endpoint.
Testing your monitoring setup before a real incident is essential. This guide covers how to validate every layer of your monitoring stack.
A monitoring system that has never been tested is a monitoring system you cannot trust. Its failure modes tend to surface only during a real incident.
The cost of discovering these in a test is zero. The cost of discovering them during a production incident is measured in hours of undetected downtime.
The simplest test: temporarily return a non-200 status from your application.
Maintenance mode (temporary 503 response):
# nginx: return 503 for every request while the test runs (revert afterwards)
location / {
    return 503 "Testing monitoring";
}
Express.js test endpoint:
// Temporarily enable via environment variable
app.get('/', (req, res) => {
  if (process.env.SIMULATE_DOWN === 'true') {
    return res.status(503).json({ error: 'Service unavailable' });
  }
  // ... normal handler
});
After triggering the simulated failure, verify that the monitor detects it and that the alert reaches the person who is supposed to receive it.
Allow at least 5 minutes of simulated downtime to ensure the alert triggers. Most monitoring tools require 2-3 consecutive failures before alerting, so a 1-minute failure may not trigger anything.
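As a rough way to size the test window, the sketch below estimates the worst-case time before an alert can fire. It assumes the monitor checks at a fixed interval and alerts after a set number of consecutive failures; the function name, the 60-second interval, and the optional notification delay are illustrative assumptions, not settings from any particular tool.

```javascript
// Worst-case seconds from the start of an outage until an alert can fire.
// Assumes fixed-interval checks and an alert after `failuresRequired`
// consecutive failed checks (both values are hypothetical examples).
function worstCaseAlertDelaySeconds(checkIntervalSec, failuresRequired, notifyDelaySec = 0) {
  if (checkIntervalSec <= 0 || failuresRequired < 1) {
    throw new RangeError('interval must be positive and failuresRequired at least 1');
  }
  // The outage may begin just after a check, so the first failing check can be
  // up to one full interval away; each further failure adds another interval.
  return checkIntervalSec * failuresRequired + notifyDelaySec;
}

// Example: 60-second checks, 3 consecutive failures, no notification delay.
const delay = worstCaseAlertDelaySeconds(60, 3);
console.log(`Keep the simulated outage running for more than ${delay} seconds`);
```

With these example values the alert can take up to 180 seconds to fire, which is why a 1-minute failure may never trigger anything and a 5-minute window leaves a comfortable margin.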
If your monitors check for specific text content (e.g., verifying a string that should appear on your homepage), test the negative case: temporarily remove or change that string and confirm the monitor reports a failure.
This validates that your content checks are actually working, not just checking that the server returns 200.
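To make the difference concrete, here is a minimal sketch of the logic a content check is expected to apply. The function and the sample page bodies are hypothetical; a real monitoring tool implements this internally, and its exact matching rules may differ.

```javascript
// Decide whether a keyword (content) check passes.
// `expectedText` is the string the monitor is configured to look for.
function contentCheckPasses(statusCode, body, expectedText) {
  const statusOk = statusCode >= 200 && statusCode < 300;
  return statusOk && typeof body === 'string' && body.includes(expectedText);
}

// Normal page: status 200 and the expected text is present.
console.log(contentCheckPasses(200, '<h1>Welcome to Example Shop</h1>', 'Example Shop')); // true

// Negative case: still 200, but the expected text is gone (e.g. an error page).
console.log(contentCheckPasses(200, '<h1>Something went wrong</h1>', 'Example Shop')); // false
```

If your monitor stays green in the second situation, it is behaving like a plain status check rather than a content check.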
If you use multi-location monitoring, verify that each location reports independently. A useful test: restrict access from one geographic region (firewall rule or geo-block) and confirm the monitor for that region fires while others remain green.
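One way to evaluate the outcome of that test is sketched below. It assumes you can read each region's latest check result from your monitoring tool and copy it into a plain object; the region names and the function are hypothetical.

```javascript
// Given per-region check results, confirm that only the blocked region failed.
// `results` maps a region name to true (check passed) or false (check failed).
function verifyRegionIsolation(results, blockedRegion) {
  if (!(blockedRegion in results)) {
    throw new Error(`No result reported for region: ${blockedRegion}`);
  }
  const unexpectedFailures = Object.keys(results)
    .filter((region) => region !== blockedRegion && results[region] === false);
  const blockedRegionFailed = results[blockedRegion] === false;
  return {
    blockedRegionFailed,
    unexpectedFailures,
    ok: blockedRegionFailed && unexpectedFailures.length === 0,
  };
}

// Example: eu-west was geo-blocked, the other regions should stay green.
const outcome = verifyRegionIsolation(
  { 'eu-west': false, 'us-east': true, 'ap-south': true },
  'eu-west'
);
console.log(outcome.ok); // true
```

A result where the blocked region still passes, or where other regions fail too, suggests the locations are not reporting independently.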
SSL expiry alerts are often set up and forgotten. Test them by:
Reviewing configured thresholds: Check that your alert fires at 60, 30, and 14 days before expiry — not just one threshold.
Checking the right domains: List all domains your SSL monitor covers. Missing a subdomain is a common oversight.
Verifying alert routing: SSL expiry alerts often go to a different team (DevOps, platform) than uptime alerts (engineering on-call). Confirm the routing is correct.
You can also use SSL Labs to check your certificate details and verify that your monitoring tool is reporting the same expiry date.
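When comparing expiry dates, it can help to work out which thresholds should already have fired. The following sketch does the date arithmetic, using the 60, 30, and 14 day thresholds mentioned above; the function names and the example dates are illustrative assumptions.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining until the certificate's expiry (notAfter) date.
function daysUntilExpiry(notAfter, now = new Date()) {
  return Math.floor((new Date(notAfter).getTime() - now.getTime()) / MS_PER_DAY);
}

// Thresholds (in days) that should already have produced an alert.
function crossedThresholds(daysLeft, thresholds = [60, 30, 14]) {
  return thresholds.filter((t) => daysLeft <= t);
}

// Example with fixed dates so the result is reproducible.
const now = new Date('2025-01-01T00:00:00Z');
const daysLeft = daysUntilExpiry('2025-01-26T00:00:00Z', now);
// 25 days left: the 60-day and 30-day alerts should have fired, the 14-day one not yet.
console.log(daysLeft, crossedThresholds(daysLeft));
```

If your alert history does not show an alert for every crossed threshold, the threshold configuration or the alert routing needs another look.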
Domain expiry alerts have longer timescales than SSL (domains renew annually, not every 90 days), but the test approach is similar.
For WHOIS monitoring, rather than changing a live record, confirm that your tool reports the current registrar and nameservers correctly, and review the alert configuration for registrar changes.
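A simple comparison like the one sketched here can support that review. It assumes you copy the registrar and nameservers shown by your monitoring tool into an object by hand; the function, the field names, and the example values are hypothetical.

```javascript
// Compare the registrar and nameservers a monitor reports with the values you expect.
function compareWhois(expected, reported) {
  // Nameservers are case-insensitive and may carry a trailing dot.
  const normalize = (list) => list.map((ns) => ns.toLowerCase().replace(/\.$/, '')).sort();
  const expectedNs = normalize(expected.nameservers);
  const reportedNs = normalize(reported.nameservers);
  return {
    registrarMatches:
      expected.registrar.trim().toLowerCase() === reported.registrar.trim().toLowerCase(),
    nameserversMatch:
      expectedNs.length === reportedNs.length &&
      expectedNs.every((ns, i) => ns === reportedNs[i]),
  };
}

// Example values are placeholders, not real registrar data.
const whoisResult = compareWhois(
  { registrar: 'Example Registrar, Inc.', nameservers: ['ns1.example.com', 'ns2.example.com'] },
  { registrar: 'example registrar, inc.', nameservers: ['NS2.EXAMPLE.COM.', 'ns1.example.com'] }
);
console.log(whoisResult); // both fields are true for this example
```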
Heartbeat monitors detect missed cron jobs and background processes. Testing them is straightforward: simply do not send the expected ping.
Method 1: Disable the job temporarily
# Comment out the cron job
# 0 3 * * * /scripts/backup.sh && curl https://domain-monitor.io/heartbeat/abc123
Wait for the grace period to expire and verify the alert fires.
Method 2: Send the ping with a test flag
Some heartbeat services support a test mode that triggers an alert without requiring you to wait for a missed interval.
Method 3: Check the "last ping" timestamp
Verify the monitoring dashboard shows the correct last ping time. If the timestamp is stale, the job may have stopped running without triggering an alert (if within the grace period).
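The relationship between the expected interval, the grace period, and a stale timestamp can be sketched as follows. The interval and grace values are hypothetical settings, and real heartbeat services may calculate lateness differently.

```javascript
// Classify a heartbeat monitor from its last ping time.
// intervalSec: how often the job is expected to ping (hypothetical setting).
// graceSec: extra time allowed before the monitor alerts (hypothetical setting).
function heartbeatStatus(lastPing, intervalSec, graceSec, now = new Date()) {
  const ageSec = (now.getTime() - new Date(lastPing).getTime()) / 1000;
  if (ageSec <= intervalSec) return 'healthy';
  if (ageSec <= intervalSec + graceSec) return 'late'; // stale, but no alert yet
  return 'alerting';
}

// Daily job (86400 s) with a 1-hour grace period, last ping at 03:00 the day before.
const checkedAt = new Date('2025-01-02T04:30:00Z');
console.log(heartbeatStatus('2025-01-01T03:00:00Z', 86400, 3600, checkedAt)); // alerting
```

The 'late' state is the window described in Method 3: the timestamp is already stale, but no alert has fired because the grace period has not yet expired.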
See how to monitor cron jobs for heartbeat implementation details.
Slack webhook URLs stop working when the integration is removed or reinstalled. Test them actively:
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test alert from monitoring system"}' \
  YOUR_WEBHOOK_URL
If this returns HTTP 200 and the message appears in the channel, the webhook is valid. If not, recreate the webhook integration.
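If you would rather run this check from a script, for example as part of a scheduled test, the same request can be sketched in Node.js. This assumes Node 18 or later, where fetch is available globally; the function name and the injectable fetchImpl parameter are illustrative choices, not part of any Slack library.

```javascript
// Post a test message to a Slack incoming webhook and report whether it was accepted.
// `fetchImpl` defaults to the global fetch available in Node 18+; it is a
// parameter only so the function can be exercised without a network call.
async function testSlackWebhook(webhookUrl, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: 'Test alert from monitoring system' }),
  });
  return { valid: response.status === 200, status: response.status };
}

// Usage (YOUR_WEBHOOK_URL is a placeholder, as in the curl command above):
// testSlackWebhook('YOUR_WEBHOOK_URL').then((result) => console.log(result));
```

A script can confirm that Slack accepted the request, but it is still worth looking at the channel to confirm the message landed where the on-call engineer will see it.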
Test the full escalation chain:
See "What is on-call management" for escalation policy design.
If your monitoring tool auto-updates a status page during incidents, verify this works:
Run through this checklist quarterly or after any significant infrastructure change:
Every real incident is also a monitoring test. In your post-incident report, include a monitoring review section:
Use the answers to improve your monitoring setup before the next incident. The goal is continuous improvement: each incident should make your monitoring more reliable than it was before.
Run a monitoring test at Domain Monitor — verify your alerts, SSL checks, and heartbeat monitors are all working correctly.