
If you've committed to a 99.9% uptime SLA, you've implicitly created an error budget: 0.1% of the time, or about 8.76 hours per year, during which your service is allowed to be unavailable. How you spend that budget determines how quickly you can ship features and how much risk you can take with deployments.
Error budgets are a concept from Site Reliability Engineering (SRE) popularised by Google, and they provide a systematic framework for making reliability vs. velocity trade-offs.
An error budget is the maximum amount of unreliability you're willing to tolerate in a given period, derived directly from your uptime target.
Error budget = 1 − SLA target
For common SLA targets:
| Uptime SLA | Error Budget (Annual) | Error Budget (Monthly) |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.5% | 1.83 days | 3.65 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
See "What does 99.9% uptime really mean?" for a more detailed breakdown of what these numbers mean in practice.
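The figures in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes a 365-day year and treats a month as one twelfth of a year; the function name `error_budget_hours` is chosen here for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def error_budget_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime in hours: (1 - SLA target) * length of the period."""
    return (1 - sla_percent / 100) * period_hours


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    annual_hours = error_budget_hours(target)
    monthly_minutes = annual_hours / 12 * 60  # a month taken as 1/12 of a year
    print(f"{target}%: {annual_hours:.2f} h/year, {monthly_minutes:.1f} min/month")
```

If your SLA is measured over a different window, such as a rolling 30 days, pass that window's length as `period_hours` instead.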
The key insight of the error budget model is this: if you're not spending your error budget, you're being too conservative. An organisation with a 99.9% SLA that achieves 99.99% uptime has "left error budget on the table" — they could have shipped more features, experimented more aggressively, or deployed more frequently.
Error budgets give product and engineering teams a shared language for reliability trade-offs:
This removes the subjective argument of "is it safe to ship?" and replaces it with a data-driven answer: "how much budget do we have left?"
Organisations with mature reliability practices document their error budget policies explicitly:
Your uptime SLA is your starting point. If you haven't formally defined one, start with a realistic target based on your infrastructure and team capacity.
This is where website uptime monitoring becomes essential. You can't track error budget consumption without accurate uptime data.
Your monitoring tool's reports give you:
If your monthly error budget for 99.9% SLA is 43.8 minutes, and you had 12 minutes of downtime this month:
Budget consumed: 12 / 43.8 = 27.4%
Budget remaining: 72.6%
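The same arithmetic as a small Python helper, in case you would rather script it than work it out by hand. The name `budget_status` is illustrative, not part of any monitoring tool.

```python
def budget_status(downtime_minutes: float, budget_minutes: float) -> tuple[float, float]:
    """Return (consumed, remaining) as fractions of the error budget."""
    consumed = downtime_minutes / budget_minutes
    return consumed, 1 - consumed


consumed, remaining = budget_status(downtime_minutes=12, budget_minutes=43.8)
print(f"Budget consumed: {consumed:.1%}")
print(f"Budget remaining: {remaining:.1%}")
```

A negative `remaining` value means the budget for the period has been overspent.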
Track your error budget consumption week over week. A budget that's 80% consumed in week 2 of 4 is a signal to slow down.
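One way to turn that week-over-week check into a number is to compare the share of budget consumed with the share of the period elapsed. The sketch below assumes an even spending pace as the baseline; the function name `burn_rate` and the threshold of 1.0 are illustrative choices, not a fixed rule.

```python
def burn_rate(consumed_fraction: float, elapsed_fraction: float) -> float:
    """Budget consumed relative to time elapsed; above 1.0 means spending faster than an even pace."""
    if elapsed_fraction <= 0:
        raise ValueError("elapsed_fraction must be greater than zero")
    return consumed_fraction / elapsed_fraction


# 80% of the budget consumed by week 2 of a 4-week period
rate = burn_rate(consumed_fraction=0.80, elapsed_fraction=2 / 4)
if rate > 1.0:
    print(f"Burning budget at {rate:.1f}x an even pace: consider slowing down")
```

At that pace the budget would run out before the period ends, which is the signal to slow down.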
Error budgets originated in large-scale SRE organisations like Google, but the concept scales down to small teams and products.
A simplified approach for smaller organisations:
You don't need a formal error budget policy to benefit from this thinking. Simply knowing that you have X minutes of allowed downtime this month — and knowing how much you've used — changes how you make deployment decisions.
Any period where your service is unavailable or degraded below your SLA threshold consumes error budget. This includes:
Your uptime monitoring records all of these, providing the data you need for accurate budget calculations.
Error budget consumption is the product of incident frequency and incident duration. Reducing either reduces budget consumption:
This is why MTTR and error budget tracking work together — minimising MTTR directly reduces error budget consumption.
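A rough sketch of that relationship, using made-up incident counts and recovery times purely for illustration, and treating MTTR as the average duration of an incident:

```python
MONTHLY_BUDGET_MINUTES = 43.8  # 99.9% SLA, from the table above


def expected_downtime_minutes(incidents_per_month: float, mttr_minutes: float) -> float:
    """Expected budget consumption: incident frequency times average incident duration."""
    return incidents_per_month * mttr_minutes


# Hypothetical scenarios: the same incident count with MTTR halved
baseline = expected_downtime_minutes(incidents_per_month=3, mttr_minutes=20)  # 60 minutes
improved = expected_downtime_minutes(incidents_per_month=3, mttr_minutes=10)  # 30 minutes

print(f"Baseline: {baseline:.0f} min, within budget: {baseline <= MONTHLY_BUDGET_MINUTES}")
print(f"Improved: {improved:.0f} min, within budget: {improved <= MONTHLY_BUDGET_MINUTES}")
```

In this example, halving MTTR brings the month back within budget without reducing the number of incidents at all.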
You don't need complex tooling to start tracking error budgets. A monitoring tool like Domain Monitor provides the uptime data you need. A simple spreadsheet can calculate budget consumption from that data.
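If you prefer a script to a spreadsheet, the calculation is just as short. The sketch below assumes you can export outage start and end times from your monitoring tool; the records shown are hypothetical and the export format will vary by tool.

```python
from datetime import datetime

# Hypothetical outage records as (start, end) pairs; real data would come from your monitoring export
incidents = [
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 10, 7)),
    (datetime(2024, 3, 18, 22, 15), datetime(2024, 3, 18, 22, 20)),
]


def total_downtime_minutes(records: list[tuple[datetime, datetime]]) -> float:
    """Sum the duration of every outage window, in minutes."""
    return sum((end - start).total_seconds() for start, end in records) / 60


monthly_budget_minutes = 43.8  # 99.9% SLA
downtime = total_downtime_minutes(incidents)
print(f"Downtime this month: {downtime:.0f} of {monthly_budget_minutes} budgeted minutes")
```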
As your reliability practices mature, dedicated SLO tracking tools (Grafana SLO, Datadog SLOs, Google Cloud SLOs) can automate the calculation and alerting.
Track your uptime data with precision at Domain Monitor.