
# The Complete Guide to Website Downtime: Causes, Prevention and Response

Every website goes down at some point. The best ones go down rarely, recover quickly, and handle it transparently. The worst ones go down without anyone noticing for hours, lose significant revenue and user trust, and scramble to figure out what happened.

The difference between those two outcomes is largely preparation: understanding what causes downtime, having monitoring in place to detect it immediately, and knowing exactly what to do when it happens.

This guide covers all of it.


## What Is Website Downtime?

Website downtime is any period during which your website or application is unavailable or failing to function correctly for users. That includes:

  • Complete outages — The server is unreachable, returning no response
  • Error states — The server is running but returning error codes (500, 502, 503, 504)
  • Partial failures — Some features work but others don't (checkout broken, login failing, API down)
  • Performance degradation — The site is technically accessible but so slow it's unusable

"Uptime" refers to the percentage of time your service is available. A site that's down for roughly 44 minutes per month has 99.9% uptime. One hour of downtime per month is 99.86%. These small differences compound — for a high-traffic commercial site, even tenths of a percentage point represent significant revenue loss.
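These figures follow directly from the percentage. A quick sketch of the arithmetic (using a flat 30-day month and 365-day year, so the numbers differ very slightly from averaged-month figures):

```python
# Convert an uptime percentage into an allowed-downtime budget.
# Assumes a flat 30-day month and 365-day year for simplicity.

def downtime_budget(uptime_pct: float) -> dict:
    """Allowed downtime, in minutes, for a given uptime percentage."""
    down_fraction = 1 - uptime_pct / 100
    return {
        "per_month_min": down_fraction * 30 * 24 * 60,
        "per_year_min": down_fraction * 365 * 24 * 60,
    }

for pct in (99.0, 99.9, 99.99):
    b = downtime_budget(pct)
    print(f"{pct}% uptime allows {b['per_month_min']:.1f} min/month, "
          f"{b['per_year_min']:.1f} min/year of downtime")
```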


## The Real Cost of Downtime

The direct cost is measurable: for any site generating revenue, every minute of downtime is revenue not being made. But the indirect costs are often larger:

Trust damage — Users who encounter a down site don't assume it's a one-off. They wonder how often it happens, whether their data is safe, whether the company is reliable. Recovering that trust takes far longer than fixing the downtime itself.

SEO impact — Extended downtime or repeated short outages can affect search rankings. Crawlers that repeatedly fail to access your site may reduce crawl frequency or drop pages from the index.

Support burden — Downtime generates support tickets. Even a short outage can take hours of support time to address.

Churn — For SaaS products, downtime during a customer's workflow is one of the highest-churn-risk events. Customers who hit downtime and find an alternative may not come back.

Developer time — Every incident requires investigation, resolution, and post-incident review. This is time not spent building.


## Common Causes of Website Downtime

### 1. Server and Infrastructure Failures

Hardware fails. Disks die, network cards fail, power supplies go. Hosting providers have their own failure events — data centre issues, network problems, hardware failures at scale.

Mitigation: Redundancy. Use a hosting provider with strong uptime SLAs. Consider multiple availability zones for critical applications. Have clear runbooks for infrastructure failures.

### 2. Application Code Errors

A bad deployment can take down a site immediately. A slow memory leak can take it down hours later. An unhandled exception in a code path that gets hit at scale can cause cascading failures.

Mitigation: Thorough staging environments. Gradual rollouts. Automated tests. Rollback procedures. Monitoring that catches application errors before they become total outages.

### 3. Database Problems

The database is often the first thing to buckle under load or fail when something goes wrong. Full disks, exhausted connection pools, runaway queries, failed replication — any of these can take your entire application down.

Mitigation: Monitor database connection counts and query times. Set up disk space alerts. Use connection pooling. Have a replication strategy and documented failover procedure.

### 4. Traffic Spikes

Unexpected traffic — from a viral post, a product launch, press coverage, or a DDoS attack — can overwhelm servers that are sized for normal load. The result is slow responses that tip into 503 errors.

Mitigation: Load testing before major traffic events. Auto-scaling where possible. Rate limiting. A CDN to handle static assets and reduce origin load. See our guide *What Is a CDN?*.

### 5. SSL Certificate Expiry

An expired SSL certificate triggers security warnings in browsers that block most users from accessing your site. This is entirely preventable — it happens when certificate renewal fails silently and no one is monitoring expiry dates.

Mitigation: Use automated certificate renewal (Let's Encrypt with Certbot or Caddy). Monitor certificate expiry dates with an automated tool. See our *complete guide to SSL certificates*.
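For illustration, expiry can be checked with nothing but Python's standard library. This is a sketch, not a full monitor: `example.com` is a placeholder for your own domain, the 14-day threshold is arbitrary, and a dedicated tool remains the more robust option.

```python
# Sketch: days remaining on a site's TLS certificate, standard library only.
# "example.com" is a placeholder; point this at your own domain.
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> float:
    """Days until the certificate presented by host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # getpeercert()'s notAfter looks like: 'Jun  1 12:00:00 2026 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

try:
    remaining = days_until_expiry("example.com")
    status = "WARNING" if remaining < 14 else "OK"  # alert two weeks out
    print(f"{status}: certificate expires in {remaining:.0f} days")
except OSError as exc:
    print(f"check failed: {exc}")
```

Run something like this daily from a scheduler and route the WARNING line to your alert channel; renewal failures then surface weeks before browsers start blocking users.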

### 6. DNS Failures

If your DNS records are misconfigured, deleted, or hijacked, your domain stops resolving — meaning nobody can find your server, even if it's running perfectly.

DNS failures can come from:

  • Accidentally deleted or modified records
  • Nameserver changes that miss some records
  • Domain expiry (the registration lapses)
  • DNS provider outage

Mitigation: DNS monitoring. Automated domain expiry alerts. Use a reliable DNS provider. See our *ultimate guide to DNS*.
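As a starting point, basic resolution can be verified from the standard library alone. This sketch only covers A/AAAA resolution; a real DNS monitor would query specific record types (MX, NS, TXT) and compare them against expected values with a dedicated library such as dnspython.

```python
# Sketch: verify that a hostname still resolves. getaddrinfo covers only
# basic A/AAAA resolution, not specific record types or expected values.
import socket

def resolves(hostname: str) -> bool:
    """True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False

print(resolves("localhost"))              # resolves on any normal system
print(resolves("no-such-host.invalid"))   # .invalid is reserved and never resolves
```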

### 7. DDoS Attacks

A distributed denial-of-service attack floods your server with traffic from many sources, overwhelming its capacity to handle legitimate requests.

Mitigation: A CDN or DDoS mitigation service (Cloudflare is the most common). Rate limiting at the application and infrastructure level. A hosting provider with DDoS protection.

### 8. Third-Party Dependencies

If your application depends on external services — payment processors, email APIs, AI services, identity providers — their downtime becomes your downtime. A Stripe outage means payments fail. An email service outage means verification emails don't send.

Mitigation: Monitor your integration points, not just your own infrastructure. Have graceful degradation for non-critical dependencies. Display clear user-facing messages when a dependency fails.

### 9. Deployment Failures

A misconfigured environment variable, a missing migration, an incompatible dependency — deployments fail in many ways. When they do, they often take the site down with them.

Mitigation: Staged deployments. Automated smoke tests post-deploy. Instant rollback capability. Monitoring that catches post-deployment failures within seconds.
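A post-deploy smoke test can be as small as requesting a handful of critical URLs and failing the release if any of them errors. The URLs below are placeholders for your own endpoints:

```python
# Sketch of a post-deploy smoke test: request critical URLs, collect failures.
# SMOKE_URLS are placeholders; list your real homepage, login, health check.
import urllib.error
import urllib.request

SMOKE_URLS = [
    "https://example.com/",
    "https://example.com/health",
]

def smoke_test(urls) -> list:
    """Return failure descriptions; an empty list means all checks passed."""
    failures = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status >= 400:
                    failures.append(f"{url} -> HTTP {resp.status}")
        except (urllib.error.URLError, TimeoutError) as exc:
            failures.append(f"{url} -> {exc}")
    return failures

# In a deploy pipeline: abort or roll back if smoke_test(SMOKE_URLS) is non-empty.
```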

### 10. Resource Exhaustion

Disk full, file descriptor limit reached, memory exhausted, CPU pegged at 100% — any of these can cause slow degradation into outright failure.

Mitigation: Resource monitoring with alerts for approaching limits. Log rotation. Database archiving. Right-sizing servers for actual load.
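A minimal disk-space alert in this spirit, standard library only (a fuller setup would also watch memory, CPU, and file descriptors, for example with psutil):

```python
# Sketch: alert when a filesystem is nearly full. The 90% threshold is
# arbitrary; tune it to how quickly your disk actually fills.
import shutil

def disk_usage_pct(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

THRESHOLD = 90.0  # percent used

pct = disk_usage_pct("/")
print(f"{'ALERT' if pct >= THRESHOLD else 'OK'}: disk {pct:.1f}% used")
```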


## Detecting Downtime Quickly

The single most impactful thing you can do is ensure you know about downtime immediately — not when a user emails you, not when you happen to visit your own site. Immediately.

### Uptime Monitoring

An uptime monitoring service makes automated requests to your URLs every minute (or faster) from multiple locations and alerts you the moment something fails. This is the baseline requirement for any production website.

The key characteristics of effective uptime monitoring:

Minute-by-minute checks — Every minute matters. A five-minute check interval means downtime could go undetected for five minutes; with minute-by-minute checks, you know within 60 seconds.

Multi-location — Checks from multiple geographic locations distinguish real outages from regional routing issues and reduce false positives.

Immediate alerting — Alert after a single failed check, or at most two, to avoid false positives from transient issues. Every alert configuration choice trades detection speed against false-positive rate.

Multiple alert channels — Email is reliable but can be slow if not actively monitored. SMS and Slack integrations ensure critical alerts reach someone quickly.
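The speed-versus-false-positive trade-off above can be sketched as a check-then-confirm loop. The URL handling and alert wiring here are simplified placeholders, not a full monitoring agent:

```python
# Sketch: alert only after `threshold` consecutive failed checks, so a
# single transient blip does not page anyone.
import urllib.error
import urllib.request

def check_once(url: str, timeout: float = 10.0) -> bool:
    """One availability check: True if the URL answers below HTTP 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False

def should_alert(results, threshold: int = 2) -> bool:
    """Alert once the most recent `threshold` checks have all failed."""
    return len(results) >= threshold and not any(results[-threshold:])

# Each minute: results.append(check_once(url)); page someone if should_alert(results).
```

With `threshold=1` you detect failures one minute sooner at the cost of alerting on blips; `threshold=2` is the usual compromise.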

Domain Monitor does all of this — create a free account and set up your first monitor in minutes.

### What to Monitor

  • Your homepage
  • Your login and signup pages (including /account/create/)
  • Your API health check endpoint
  • Key transactional pages (checkout, payment, critical features)
  • SSL certificate expiry
  • DNS record integrity

See our *website monitoring checklist for developers* for a complete list.


## Responding to Downtime

When an alert fires, how you respond determines how quickly service is restored and how much trust you lose.

### The First Five Minutes

  1. Acknowledge — Let your team know an alert has fired and someone is investigating. Mark the incident as acknowledged in your monitoring tool.

  2. Check your monitoring dashboard — Is it one location or all? One endpoint or all? When did it start? This shapes your diagnosis.

  3. Try to reproduce — Visit the URL yourself from a fresh browser or mobile device. What do you see?

  4. Post a status update — Within the first few minutes, post to your status page: "We are investigating an issue affecting [service]. We will provide an update shortly." Users who see this immediately know you're aware.

### Diagnosing the Cause

Work through the likely causes systematically:

  • Recent deployments — Did something change in the last hour? A deployment is the most common cause of sudden downtime.
  • Server logs — What errors are being logged? Nginx/Apache error logs, application logs, database logs.
  • Server resources — Is the CPU pegged, memory exhausted, or the disk full? Check with `top`, `free -h`, and `df -h`.
  • Database — Can you connect? Are queries running? Is the connection pool exhausted?
  • External dependencies — Are third-party services reporting issues?

### Recovery

  • Deploy a fix if code is the cause
  • Roll back the last deployment if you can't identify a specific fix quickly
  • Restart services if a process has crashed or hung (but understand why before doing this repeatedly)
  • Scale resources if overload is the cause

### Post-Incident

Once service is restored:

  • Update your status page with resolution details
  • Conduct a post-incident review — what happened, what was the timeline, what caused it, what will prevent recurrence
  • Implement the preventive measures identified in the review

See *how to write a post-incident report* for how to structure the review, and *how to communicate website downtime* for communication guidance throughout the incident.


## Preventing Downtime

Monitoring catches downtime quickly. Prevention reduces how often it happens.

Staging environments — Test everything in an environment that mirrors production before deploying. Catch database migration failures, environment variable issues, and dependency conflicts before they hit production.

Gradual rollouts — Deploy to a percentage of users first. If something breaks, it affects a fraction of users and can be caught before full rollout.

Automated testing — Tests that run on every commit and block deployment if they fail prevent a large class of application errors from reaching production.

Dependency monitoring — Track your external dependencies. Many have status pages and incident notifications. Subscribe to them.

Regular audits — Review your certificate expiry dates, domain expiry dates, and DNS records periodically. Monitoring handles this automatically, but periodic manual review catches configuration drift.

Capacity planning — If your application is growing, plan for that growth in infrastructure before you hit limits. Reactive scaling is expensive and stressful; proactive scaling is straightforward.


## Building a Downtime Response Plan

Having a documented response plan before an incident means faster, calmer responses during one.

Your plan should cover:

  • Who is notified (and via what channel) when an alert fires
  • Who has access to investigate (server access, database access, deployment access)
  • Escalation path if the primary on-call can't resolve it
  • Communication responsibilities (who posts status updates)
  • Rollback procedure for deployments
  • Contact information for hosting provider, DNS provider, and other critical vendors

See our *incident response plan for website downtime* for a template.


## Measuring Uptime

Uptime is expressed as a percentage of time your service was available over a period. The common benchmarks:

| Uptime % | Downtime per month | Downtime per year |
| --- | --- | --- |
| 99% | ~7.3 hours | ~3.65 days |
| 99.9% | ~43.8 minutes | ~8.77 hours |
| 99.95% | ~21.9 minutes | ~4.38 hours |
| 99.99% | ~4.4 minutes | ~52.6 minutes |

"Five nines" (99.999%) — less than 5 minutes of downtime per year — is the standard for highly critical infrastructure. For most websites, 99.9% is a realistic and respectable target.

Domain Monitor tracks your uptime percentage and provides detailed reports showing when incidents occurred, how long they lasted, and your overall availability. Create a free account to start tracking.
