What Is an Error Budget? Balancing Reliability and Development Speed

If you've committed to a 99.9% uptime SLA, you've implicitly created an error budget: 0.1% of the time, or about 8.76 hours per year, during which your service is allowed to be unavailable. How you spend that budget determines how quickly you can ship features and how much risk you can take with deployments.

Error budgets are a concept from Site Reliability Engineering (SRE) popularised by Google, and they provide a systematic framework for making reliability vs. velocity trade-offs.

What Is an Error Budget?

An error budget is the maximum amount of unreliability you're willing to tolerate in a given period, derived directly from your uptime target.

Error budget = 1 − SLA target

For common SLA targets:

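| Uptime SLA | Error Budget (Annual) | Error Budget (Monthly) |
|------------|-----------------------|------------------------|
| 99%        | 3.65 days             | 7.3 hours              |
| 99.5%      | 1.83 days             | 3.65 hours             |
| 99.9%      | 8.76 hours            | 43.8 minutes           |
| 99.95%     | 4.38 hours            | 21.9 minutes           |
| 99.99%     | 52.6 minutes          | 4.38 minutes           |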

See "What Does 99.9% Uptime Really Mean?" for a more detailed breakdown of what these numbers mean in practice.
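As a rough check on these figures, the formula can be applied directly in a few lines of Python. This is a minimal sketch, not a standard tool: the function name `error_budget` is made up for illustration, and it assumes a 365-day year and an average month of one twelfth of a year (about 730 hours).

```python
from datetime import timedelta


def error_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Allowed downtime for a period: (1 - SLA target) x period length."""
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in (0, 100]")
    return period * (1 - sla_percent / 100)


YEAR = timedelta(days=365)   # assumption: non-leap year
MONTH = YEAR / 12            # assumption: average month, about 30.4 days

for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    annual_hours = error_budget(target, YEAR).total_seconds() / 3600
    monthly_minutes = error_budget(target, MONTH).total_seconds() / 60
    print(f"{target}%: {annual_hours:.2f} h/year, {monthly_minutes:.1f} min/month")
```

If you measure over calendar months or a rolling 30-day window instead, the monthly figures shift slightly, so use whichever period your SLA actually specifies.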

How Error Budgets Are Used

The key insight of the error budget model is this: if you're not spending your error budget, you're being too conservative. An organisation with a 99.9% SLA that achieves 99.99% uptime has "left error budget on the table": it could have shipped more features, experimented more aggressively, or deployed more frequently.

Linking Reliability to Velocity

Error budgets give product and engineering teams a shared language for reliability trade-offs:

  • Budget available: Go ahead and ship that risky refactor, deploy that major feature, run that database migration.
  • Budget nearly exhausted: Slow down. Focus on reliability improvements. Defer risky changes until the budget resets.
  • Budget exceeded: Freeze all non-critical deployments. Focus entirely on stabilising the service.
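These three states can be expressed as a small decision function. The sketch below is illustrative only: the name `deployment_guidance` and the 80% cut-off for "nearly exhausted" are assumptions, and your own policy should set the real threshold.

```python
def deployment_guidance(consumed_fraction: float, warn_at: float = 0.8) -> str:
    """Map the share of error budget consumed to one of three states.

    consumed_fraction: 0.0 means nothing used, 1.0 means fully used.
    warn_at: assumed cut-off for "nearly exhausted" (hypothetical default).
    """
    if consumed_fraction >= 1.0:
        return "freeze"      # budget exhausted or exceeded
    if consumed_fraction >= warn_at:
        return "slow down"   # budget nearly exhausted
    return "ship"            # budget available


print(deployment_guidance(0.27))
print(deployment_guidance(0.85))
print(deployment_guidance(1.10))
```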

This removes the subjective argument of "is it safe to ship?" and replaces it with a data-driven answer: "how much budget do we have left?"

Error Budget Policies

Organisations with mature reliability practices document their error budget policies explicitly:

  • What triggers a budget freeze? (e.g., 50% of monthly budget consumed in the first week)
  • Which decisions can engineers make autonomously, and which require manager approval?
  • How are budget resets handled? (monthly, quarterly)
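One way to make such a policy unambiguous is to write the freeze trigger down as data. The following sketch encodes the example trigger above (50% of the budget consumed within the first week); `FreezeRule` and `should_freeze` are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreezeRule:
    max_consumed_fraction: float  # e.g. 0.5 means 50% of the period's budget
    within_days: int              # e.g. 7 means the first week of the period


def should_freeze(rule: FreezeRule, consumed_fraction: float, day_of_period: int) -> bool:
    """True if the budget was burned too fast, too early in the period."""
    return (
        day_of_period <= rule.within_days
        and consumed_fraction >= rule.max_consumed_fraction
    )


early_burn = FreezeRule(max_consumed_fraction=0.5, within_days=7)
print(should_freeze(early_burn, consumed_fraction=0.55, day_of_period=5))
print(should_freeze(early_burn, consumed_fraction=0.55, day_of_period=12))
```

A real policy would likely combine several rules, but even one written this explicitly removes room for argument when the trigger fires.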

Calculating and Tracking Your Error Budget

Step 1: Define Your SLA

Your uptime SLA is your starting point. If you haven't formally defined one, start with a realistic target based on your infrastructure and team capacity.

Step 2: Measure Your Actual Uptime

This is where website uptime monitoring becomes essential. You can't track error budget consumption without accurate uptime data.

Your monitoring tool's reports give you:

  • Total downtime duration for a period
  • Number of incidents
  • Uptime percentage
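If your monitoring tool only exports raw incident start and end times, the same three figures can be derived from them. This is a sketch under assumptions: the incident data is made up, the function name `uptime_report` is hypothetical, and incidents are assumed not to overlap one another.

```python
from datetime import datetime, timedelta


def uptime_report(incidents, period_start, period_end):
    """Summarise (start, end) downtime intervals that fall inside a period.

    Assumes the intervals do not overlap each other.
    """
    period = period_end - period_start
    downtime = timedelta()
    count = 0
    for start, end in incidents:
        # Clip each incident to the reporting period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            downtime += clipped_end - clipped_start
            count += 1
    return {
        "downtime": downtime,
        "incidents": count,
        "uptime_percent": 100 * (1 - downtime / period),
    }


# Made-up example data: two incidents in a 30-day month.
incidents = [
    (datetime(2024, 6, 3, 14, 0), datetime(2024, 6, 3, 14, 7)),
    (datetime(2024, 6, 18, 2, 30), datetime(2024, 6, 18, 2, 35)),
]
report = uptime_report(incidents, datetime(2024, 6, 1), datetime(2024, 7, 1))
print(report["downtime"], report["incidents"], f"{report['uptime_percent']:.3f}%")
```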

Step 3: Calculate Budget Consumption

If your monthly error budget for 99.9% SLA is 43.8 minutes, and you had 12 minutes of downtime this month:

Budget consumed: 12 / 43.8 = 27.4%
Budget remaining: 72.6%
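The same arithmetic as a small helper, which is handy if you want to recompute it each week. The function name `budget_consumption` is illustrative; the inputs are the figures from the example above.

```python
def budget_consumption(downtime_minutes: float, budget_minutes: float):
    """Return (consumed, remaining) as fractions of the error budget.

    remaining goes negative once the budget has been exceeded.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    consumed = downtime_minutes / budget_minutes
    return consumed, 1 - consumed


consumed, remaining = budget_consumption(downtime_minutes=12, budget_minutes=43.8)
print(f"{consumed:.1%} consumed, {remaining:.1%} remaining")
# prints "27.4% consumed, 72.6% remaining"
```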

Step 4: Trend Over Time

Track your error budget consumption week over week. A budget that's 80% consumed in week 2 of 4 is a signal to slow down.
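A simple way to quantify that signal is a burn rate: the share of budget consumed divided by the share of the period that has elapsed. A value of 1.0 means you are exactly on pace to use the whole budget by the end of the period. The sketch below uses a hypothetical `burn_rate` helper and assumes consumption continues at the same pace.

```python
def burn_rate(consumed_fraction: float, elapsed: float, period_length: float) -> float:
    """Budget burn relative to an even pace; 1.0 means exactly on pace."""
    if not 0 < elapsed <= period_length:
        raise ValueError("elapsed must be in (0, period_length]")
    return consumed_fraction / (elapsed / period_length)


# The example from the text: 80% consumed at week 2 of 4.
rate = burn_rate(0.80, elapsed=2, period_length=4)
print(f"burn rate {rate:.1f}x, projected {rate:.0%} of budget by period end")
```

A burn rate of 1.6 means that, at the current pace, the budget would be overspent by 60% before it resets.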

Error Budgets for Small Teams

Error budgets originated in large-scale SRE organisations like Google, but the concept scales down to small teams and products.

A simplified approach for smaller organisations:

  1. Define a monthly uptime target (e.g., 99.9%)
  2. Calculate your monthly budget (43.8 minutes of allowed downtime)
  3. Track your actual downtime from your monitoring reports
  4. Review monthly: are you consuming budget faster than expected?

You don't need a formal error budget policy to benefit from this thinking. Simply knowing that you have X minutes of allowed downtime this month — and knowing how much you've used — changes how you make deployment decisions.

What Consumes Error Budget?

Any period where your service is unavailable or degraded below your SLA threshold consumes error budget. This includes:

  • Unplanned outages — server failures, application crashes, infrastructure issues
  • Planned maintenance — some teams count planned downtime against the budget, others exclude it
  • Partial outages: degraded performance that crosses a defined threshold (e.g., error rate > 5%)
  • Regional outages — if users in one geographic region can't access your service

A well-configured uptime monitor can record all of these, providing the data you need for accurate budget calculations.
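Because teams differ on whether planned maintenance counts, it helps to make that choice an explicit switch in whatever calculation you use. A minimal sketch, with made-up outage data and hypothetical names (`Outage`, `budget_minutes_consumed`):

```python
from dataclasses import dataclass


@dataclass
class Outage:
    kind: str       # "unplanned", "planned", "partial" or "regional"
    minutes: float  # time spent below the SLA threshold


def budget_minutes_consumed(outages, count_planned: bool = True) -> float:
    """Sum outage minutes, optionally excluding planned maintenance."""
    return sum(
        o.minutes for o in outages
        if count_planned or o.kind != "planned"
    )


# Made-up month: one crash, one maintenance window, one partial outage.
month = [Outage("unplanned", 12), Outage("planned", 30), Outage("partial", 5)]
print(budget_minutes_consumed(month, count_planned=True))
print(budget_minutes_consumed(month, count_planned=False))
```

With these numbers, counting planned maintenance takes the total from 17 to 47 minutes, which is more than the whole monthly budget for a 99.9% target. Whichever convention you pick, record it in your policy.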

Error Budgets and MTTR

Error budget consumption is roughly the number of incidents in a period multiplied by their average duration. Reducing either reduces budget consumption:

  • Fewer incidents → more budget available (improve code quality, testing, infrastructure reliability)
  • Shorter incidents → more budget available (improve monitoring, alerting, runbooks, and on-call procedures)

This is why MTTR and error budget tracking work together — minimising MTTR directly reduces error budget consumption.
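To see the effect in numbers, here is a small sketch that estimates budget consumption from incident count and MTTR. The function name `expected_consumption` and the incident figures are illustrative; the 43.8-minute default is the monthly budget for a 99.9% target from the table earlier.

```python
def expected_consumption(incidents: int, mttr_minutes: float,
                         budget_minutes: float = 43.8) -> float:
    """Estimated share of budget used: incidents x MTTR / budget."""
    return incidents * mttr_minutes / budget_minutes


baseline = expected_consumption(incidents=3, mttr_minutes=10)
faster_recovery = expected_consumption(incidents=3, mttr_minutes=5)
fewer_incidents = expected_consumption(incidents=2, mttr_minutes=10)

print(f"baseline:        {baseline:.1%}")
print(f"halved MTTR:     {faster_recovery:.1%}")
print(f"one incident fewer: {fewer_incidents:.1%}")
```

Halving MTTR halves the budget consumed for the same number of incidents, which is why investment in alerting and runbooks pays off even when incident frequency stays flat.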

Getting Started

You don't need complex tooling to start tracking error budgets. A monitoring tool like Domain Monitor provides the uptime data you need. A simple spreadsheet can calculate budget consumption from that data.

As your reliability practices mature, dedicated SLO tracking tools (Grafana SLO, Datadog SLOs, Google Cloud SLOs) can automate the calculation and alerting.


