What Is an Error Budget? Balancing Reliability and Development Speed

If you've committed to a 99.9% uptime SLA, you've implicitly created an error budget: 0.1% of the time, or about 8.76 hours per year, during which your service is allowed to be unavailable. How you spend that budget determines how quickly you can ship features and how much risk you can take with deployments.

Error budgets are a concept from Site Reliability Engineering (SRE) popularised by Google, and they provide a systematic framework for making reliability vs. velocity trade-offs.

What Is an Error Budget?

An error budget is the maximum amount of unreliability you're willing to tolerate in a given period, derived directly from your uptime target.

Error budget = 1 − SLA target

For common SLA targets:

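| Uptime SLA | Error Budget (Annual) | Error Budget (Monthly) |
|------------|-----------------------|------------------------|
| 99%        | 3.65 days             | 7.3 hours              |
| 99.5%      | 1.83 days             | 3.65 hours             |
| 99.9%      | 8.76 hours            | 43.8 minutes           |
| 99.95%     | 4.38 hours            | 21.9 minutes           |
| 99.99%     | 52.6 minutes          | 4.38 minutes           |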

See what does 99.9% uptime really mean? for a more detailed breakdown of what these numbers mean in practice.
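The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, assuming a 365-day year and an average month of one twelfth of that (730 hours); the function and constant names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24               # 8,760 hours, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730 hours, an average month


def error_budget_fraction(sla_percent: float) -> float:
    """Error budget as a fraction of total time: 1 - SLA target."""
    return 1 - sla_percent / 100


def allowed_downtime_hours(sla_percent: float, period_hours: float) -> float:
    """Hours of downtime the SLA permits over a period of the given length."""
    return error_budget_fraction(sla_percent) * period_hours


for sla in (99.0, 99.5, 99.9, 99.95, 99.99):
    yearly_hours = allowed_downtime_hours(sla, HOURS_PER_YEAR)
    monthly_minutes = allowed_downtime_hours(sla, HOURS_PER_MONTH) * 60
    print(f"{sla}%: {yearly_hours:.2f} h/year, {monthly_minutes:.1f} min/month")
```

If your SLA is measured over calendar months rather than an average month, the monthly figure will vary slightly with the number of days in the month.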

How Error Budgets Are Used

The key insight of the error budget model is this: if you're not spending your error budget, you're being too conservative. An organisation with a 99.9% SLA that achieves 99.99% uptime has "left error budget on the table" — they could have shipped more features, experimented more aggressively, or deployed more frequently.

Linking Reliability to Velocity

Error budgets give product and engineering teams a shared language for reliability trade-offs:

  • Budget available: Go ahead and ship that risky refactor, deploy that major feature, run that database migration
  • Budget nearly exhausted: Slow down. Focus on reliability improvements. Defer risky changes until the budget resets.
  • Budget exceeded: Freeze all non-critical deployments. Focus entirely on stabilising the service.

This removes the subjective argument of "is it safe to ship?" and replaces it with a data-driven answer: "how much budget do we have left?"
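As a rough sketch, the three states above could be encoded as a small helper that turns the remaining budget into a recommendation. The 20% cut-off for "nearly exhausted" is an assumption chosen for illustration, not a standard value, and the function name is hypothetical.

```python
def deployment_guidance(budget_remaining_percent: float,
                        nearly_exhausted_below: float = 20.0) -> str:
    """Map remaining error budget to one of three deployment stances.

    nearly_exhausted_below is an illustrative threshold; pick a value
    that matches your own risk tolerance.
    """
    if budget_remaining_percent <= 0:
        return "freeze"      # budget exceeded: stabilise the service
    if budget_remaining_percent < nearly_exhausted_below:
        return "slow down"   # nearly exhausted: defer risky changes
    return "ship"            # budget available: risky changes are acceptable


print(deployment_guidance(72.6))
print(deployment_guidance(10.0))
print(deployment_guidance(-5.0))
```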

Error Budget Policies

Organisations with mature reliability practices document their error budget policies explicitly:

  • What triggers a budget freeze? (e.g., 50% of monthly budget consumed in the first week)
  • Which decisions can engineers make autonomously, and which require manager approval?
  • How are budget resets handled? (monthly, quarterly)
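One way to make such a policy explicit is to write it down as data. The sketch below is hypothetical: the class, its field names, and the default values (taken from the example trigger above) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetPolicy:
    reset_period: str = "monthly"      # or "quarterly"
    early_burn_percent: float = 50.0   # share of budget that triggers a freeze...
    early_burn_window_days: int = 7    # ...if consumed within this many days

    def should_freeze(self, consumed_percent: float, day_of_period: int) -> bool:
        """Return True when the policy calls for a deployment freeze."""
        if consumed_percent >= 100:
            return True  # budget fully spent
        burned_early = (day_of_period <= self.early_burn_window_days
                        and consumed_percent >= self.early_burn_percent)
        return burned_early


policy = ErrorBudgetPolicy()
print(policy.should_freeze(consumed_percent=55, day_of_period=5))
print(policy.should_freeze(consumed_percent=55, day_of_period=20))
```

Keeping the thresholds in one place makes it easier to review and change the policy without touching the logic that applies it.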

Calculating and Tracking Your Error Budget

Step 1: Define Your SLA

Your uptime SLA is your starting point. If you haven't formally defined one, start with a realistic target based on your infrastructure and team capacity.

Step 2: Measure Your Actual Uptime

This is where website uptime monitoring becomes essential. You can't track error budget consumption without accurate uptime data.

Your monitoring tool's reports give you:

  • Total downtime duration for a period
  • Number of incidents
  • Uptime percentage

Step 3: Calculate Budget Consumption

If your monthly error budget for a 99.9% SLA is 43.8 minutes, and you had 12 minutes of downtime this month:

Budget consumed: 12 / 43.8 = 27.4%
Budget remaining: 100% − 27.4% = 72.6%
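The same calculation as a small Python helper, using the figures from this example. The function name is made up for illustration.

```python
def budget_consumption(downtime_minutes: float, budget_minutes: float) -> tuple[float, float]:
    """Return (consumed %, remaining %) of the error budget for a period."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    consumed = downtime_minutes / budget_minutes * 100
    return consumed, 100 - consumed


consumed, remaining = budget_consumption(downtime_minutes=12, budget_minutes=43.8)
print(f"Budget consumed: {consumed:.1f}%")    # Budget consumed: 27.4%
print(f"Budget remaining: {remaining:.1f}%")  # Budget remaining: 72.6%
```

A negative remaining value means the budget has been exceeded for the period.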

Step 4: Trend Over Time

Track your error budget consumption week over week. A budget that's 80% consumed in week 2 of 4 is a signal to slow down.
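A simple way to quantify that signal is a burn rate: the share of budget consumed divided by the share of the period elapsed. The sketch below assumes consumption continues at the same average pace, which real incidents rarely do, so treat the projection as a rough indicator. The function names are illustrative.

```python
def burn_rate(consumed_fraction: float, elapsed_fraction: float) -> float:
    """Budget consumed relative to time elapsed; 1.0 means exactly on pace."""
    if elapsed_fraction <= 0:
        raise ValueError("elapsed_fraction must be positive")
    return consumed_fraction / elapsed_fraction


def projected_exhaustion(consumed_fraction: float, elapsed_weeks: float) -> float:
    """Weeks from the start of the period until the budget runs out,
    assuming consumption continues at the same average weekly pace."""
    if consumed_fraction <= 0:
        return float("inf")  # nothing consumed yet, so no projected exhaustion
    return elapsed_weeks / consumed_fraction


# The example above: 80% of the budget consumed after week 2 of 4.
rate = burn_rate(0.80, 2 / 4)
weeks = projected_exhaustion(0.80, 2)
print(f"Burn rate: {rate:.1f}x, budget exhausted around week {weeks:.1f}")
```

A burn rate above 1.0 means the budget will run out before the period ends if nothing changes.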

Error Budgets for Small Teams

Error budgets originated in large-scale SRE organisations like Google, but the concept scales down to small teams and products.

A simplified approach for smaller organisations:

  1. Define a monthly uptime target (e.g., 99.9%)
  2. Calculate your monthly budget (43.8 minutes of allowed downtime)
  3. Track your actual downtime from your monitoring reports
  4. Review monthly: are you consuming budget faster than expected?

You don't need a formal error budget policy to benefit from this thinking. Simply knowing that you have X minutes of allowed downtime this month — and knowing how much you've used — changes how you make deployment decisions.

What Consumes Error Budget?

Any period where your service is unavailable or degraded below your SLA threshold consumes error budget. This includes:

  • Unplanned outages: server failures, application crashes, infrastructure issues
  • Planned maintenance: some teams count planned downtime against the budget, others exclude it
  • Partial outages: degraded performance that crosses a defined threshold (e.g., error rate > 5%)
  • Regional outages: periods when users in one geographic region can't access your service

Your uptime monitoring records all of these, providing the data you need for accurate budget calculations.
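Whether planned maintenance counts is a policy choice, and it changes the result. The following sketch totals consumed minutes from a list of incident records under either choice. The `Incident` type, its `kind` labels, and the sample durations are hypothetical and do not reflect any particular monitoring tool's export format.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    kind: str        # "unplanned", "planned", "partial", or "regional"
    minutes: float   # duration during which the service was below threshold


def consumed_minutes(incidents: list[Incident], count_planned: bool = True) -> float:
    """Sum the downtime that counts against the error budget."""
    return sum(
        incident.minutes
        for incident in incidents
        if count_planned or incident.kind != "planned"
    )


# Illustrative data only.
incidents = [
    Incident("unplanned", 8),
    Incident("planned", 15),
    Incident("partial", 4),
]
print(consumed_minutes(incidents))                       # planned maintenance counted
print(consumed_minutes(incidents, count_planned=False))  # planned maintenance excluded
```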

Error Budgets and MTTR

Error budget consumption is roughly the product of how many incidents you have and how long each one lasts on average. Reducing either reduces budget consumption:

  • Fewer incidents → more budget available (improve code quality, testing, infrastructure reliability)
  • Shorter incidents → more budget available (improve monitoring, alerting, runbooks, and on-call procedures)

This is why MTTR (mean time to recovery) and error budget tracking work together: minimising MTTR directly reduces error budget consumption.
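A quick worked example shows both levers. The incident counts and recovery times below are invented for illustration, and the model assumes every incident lasts exactly the average duration.

```python
MONTHLY_BUDGET_MINUTES = 43.8  # monthly budget for a 99.9% SLA


def expected_downtime_minutes(incidents_per_month: float, mttr_minutes: float) -> float:
    """Approximate monthly downtime as incident count times mean recovery time."""
    return incidents_per_month * mttr_minutes


scenarios = {
    "baseline (3 incidents, 12 min MTTR)": expected_downtime_minutes(3, 12),
    "fewer incidents (2 incidents, 12 min MTTR)": expected_downtime_minutes(2, 12),
    "faster recovery (3 incidents, 8 min MTTR)": expected_downtime_minutes(3, 8),
}

for name, downtime in scenarios.items():
    share = downtime / MONTHLY_BUDGET_MINUTES * 100
    print(f"{name}: {downtime:.0f} min, {share:.0f}% of budget")
```

In this toy model, removing one incident and cutting recovery time by a third save the same amount of budget, so the cheaper of the two improvements is usually the better place to start.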

Getting Started

You don't need complex tooling to start tracking error budgets. A monitoring tool like Domain Monitor provides the uptime data you need. A simple spreadsheet can calculate budget consumption from that data.

As your reliability practices mature, dedicated SLO tracking tools (Grafana SLO, Datadog SLOs, Google Cloud SLOs) can automate the calculation and alerting.


