What Is an Error Budget? Balancing Reliability and Development Speed

If you've committed to a 99.9% uptime SLA, you've implicitly created an error budget: 0.1% of the time, or about 8.76 hours per year, during which your service is allowed to be unavailable. How you spend that budget determines how quickly you can ship features and how much risk you can take with deployments.

Error budgets are a concept from Site Reliability Engineering (SRE) popularised by Google, and they provide a systematic framework for making reliability vs. velocity trade-offs.

What Is an Error Budget?

An error budget is the maximum amount of unreliability you're willing to tolerate in a given period, derived directly from your uptime target.

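Error budget = 1 − SLA target

This gives the budget as a fraction of the period (0.1% for a 99.9% target). Multiply it by the length of the period to express it as allowed downtime.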

For common SLA targets, assuming a 365-day year and an average month of 730 hours:

Uptime SLA	Error Budget (Annual)	Error Budget (Monthly)
99%	3.65 days	7.3 hours
99.5%	1.83 days	3.65 hours
99.9%	8.76 hours	43.8 minutes
99.95%	4.38 hours	21.9 minutes
99.99%	52.6 minutes	4.38 minutes

See "What does 99.9% uptime really mean?" for a more detailed breakdown of what these numbers mean in practice.
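The table values can be reproduced with a few lines of Python. This is a minimal sketch: the function name `error_budget_hours` and the two period constants are illustrative choices, and it assumes a 365-day year and an average month of 730 hours.

```python
HOURS_PER_YEAR = 365 * 24               # 8760, ignoring leap years (assumption)
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730, an average month (assumption)


def error_budget_hours(sla_percent: float, period_hours: float) -> float:
    """Return the allowed downtime in hours for an uptime target in percent."""
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in (0, 100]")
    # Error budget = 1 - SLA target, scaled by the length of the period.
    return (1 - sla_percent / 100) * period_hours


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    annual = error_budget_hours(target, HOURS_PER_YEAR)
    monthly = error_budget_hours(target, HOURS_PER_MONTH)
    print(f"{target}%: {annual:.2f} hours/year, {monthly * 60:.1f} minutes/month")
```

If your budget period is a specific calendar month or a rolling 30-day window, pass that period's length in hours instead of the averaged constant.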

How Error Budgets Are Used

The key insight of the error budget model is this: if you're not spending your error budget, you're being too conservative. An organisation with a 99.9% SLA that achieves 99.99% uptime has "left error budget on the table" — they could have shipped more features, experimented more aggressively, or deployed more frequently.

Linking Reliability to Velocity

Error budgets give product and engineering teams a shared language for reliability trade-offs:

Budget available: Go ahead and ship that risky refactor, deploy that major feature, or run that database migration.
Budget nearly exhausted: Slow down. Focus on reliability improvements. Defer risky changes until the budget resets.
Budget exceeded: Freeze all non-critical deployments. Focus entirely on stabilising the service.

This removes the subjective argument of "is it safe to ship?" and replaces it with a data-driven answer: "how much budget do we have left?"
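The three states above can be expressed as a small lookup. This is a sketch rather than a prescribed policy: the function name `budget_status` and the 80% warning threshold are assumptions you would tune to your own service.

```python
def budget_status(consumed_fraction: float, warn_at: float = 0.8) -> str:
    """Map the fraction of budget consumed (0.274 means 27.4%) to a status.

    warn_at is a hypothetical threshold for "nearly exhausted".
    """
    if consumed_fraction < 0:
        raise ValueError("consumed_fraction cannot be negative")
    if consumed_fraction >= 1.0:
        return "exceeded"          # budget fully spent or overspent
    if consumed_fraction >= warn_at:
        return "nearly exhausted"
    return "available"


GUIDANCE = {
    "available": "Ship risky changes as planned.",
    "nearly exhausted": "Slow down and defer risky changes until the budget resets.",
    "exceeded": "Freeze non-critical deployments and stabilise the service.",
}

status = budget_status(0.274)
print(f"{status}: {GUIDANCE[status]}")
```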

Error Budget Policies

Organisations with mature reliability practices document their error budget policies explicitly:

What triggers a budget freeze? (e.g., 50% of monthly budget consumed in the first week)
Which decisions can engineers make autonomously, and which require manager approval?
How are budget resets handled? (monthly, quarterly)
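Writing the policy down as data makes it easy to check automatically. The sketch below encodes the example freeze trigger (50% of the monthly budget consumed in the first week); the `FreezeRule` class and `should_freeze` function are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreezeRule:
    max_consumed_fraction: float  # e.g. 0.5 means 50% of the budget
    within_days: int              # counted from the start of the budget period


def should_freeze(rule: FreezeRule, consumed_fraction: float, day_of_period: int) -> bool:
    """Return True if the current snapshot trips the freeze rule.

    Only the current day is evaluated: once the window has passed,
    this rule no longer fires, so other rules would cover later weeks.
    """
    in_window = day_of_period <= rule.within_days
    return in_window and consumed_fraction >= rule.max_consumed_fraction


first_week_rule = FreezeRule(max_consumed_fraction=0.5, within_days=7)
print(should_freeze(first_week_rule, consumed_fraction=0.55, day_of_period=5))
```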

Calculating and Tracking Your Error Budget

Step 1: Define Your SLA

Your uptime SLA is your starting point. If you haven't formally defined one, start with a realistic target based on your infrastructure and team capacity.

Step 2: Measure Your Actual Uptime

This is where website uptime monitoring becomes essential. You can't track error budget consumption without accurate uptime data.

Your monitoring tool's reports give you:

Total downtime duration for a period
Number of incidents
Uptime percentage

Step 3: Calculate Budget Consumption

If your monthly error budget for a 99.9% SLA is 43.8 minutes, and you had 12 minutes of downtime this month:

Budget consumed: 12 / 43.8 = 27.4%
Budget remaining: 72.6%
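The same arithmetic as a small helper, if you would rather script it than keep it in a spreadsheet. The function name `budget_consumption` is an illustrative choice.

```python
def budget_consumption(downtime_minutes: float, budget_minutes: float) -> tuple[float, float]:
    """Return (consumed, remaining) as fractions of the error budget."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    consumed = downtime_minutes / budget_minutes
    # remaining goes negative once the budget has been exceeded
    return consumed, 1 - consumed


consumed, remaining = budget_consumption(downtime_minutes=12, budget_minutes=43.8)
print(f"Budget consumed: {consumed:.1%}")
print(f"Budget remaining: {remaining:.1%}")
```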

Step 4: Trend Over Time

Track your error budget consumption week over week. A budget that's 80% consumed by week 2 of 4 is being spent faster than the period allows, and is a signal to slow down.
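One common way to quantify the trend is a burn rate: the share of budget consumed divided by the share of the period elapsed. A value above 1 means the budget will run out before the period ends if the pace holds. The term and the `burn_rate` function below are additions for illustration, not something your monitoring tool necessarily reports.

```python
def burn_rate(consumed_fraction: float, elapsed_fraction: float) -> float:
    """Budget consumed relative to time elapsed; 1.0 means exactly on pace."""
    if not 0 < elapsed_fraction <= 1:
        raise ValueError("elapsed_fraction must be in (0, 1]")
    return consumed_fraction / elapsed_fraction


# The example from the text: 80% of the budget gone at week 2 of 4.
rate = burn_rate(consumed_fraction=0.80, elapsed_fraction=2 / 4)

# At a constant pace, the burn rate is also the projected share of the
# budget that will have been used by the end of the period.
print(f"Burn rate: {rate:.1f}x, projected end-of-period consumption: {rate:.0%}")
```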

Error Budgets for Small Teams

Error budgets originated in large-scale SRE organisations like Google, but the concept scales down to small teams and products.

A simplified approach for smaller organisations:

Define a monthly uptime target (e.g., 99.9%)
Calculate your monthly budget (43.8 minutes of allowed downtime)
Track your actual downtime from your monitoring reports
Review monthly: are you consuming budget faster than expected?

You don't need a formal error budget policy to benefit from this thinking. Simply knowing that you have X minutes of allowed downtime this month — and knowing how much you've used — changes how you make deployment decisions.

What Consumes Error Budget?

Any period where your service is unavailable or degraded below your SLA threshold consumes error budget. This includes:

Unplanned outages: server failures, application crashes, infrastructure issues
Planned maintenance: some teams count planned downtime against the budget, others exclude it
Partial outages: degraded performance that crosses a threshold (e.g., error rate > 5%)
Regional outages: users in one geographic region can't access your service

Your uptime monitoring records all of these, providing the data you need for accurate budget calculations.
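How you count these categories is a policy choice, and it helps to make that choice explicit. The sketch below totals downtime from a list of incidents; the `Incident` record, its field names, and the counting rules (a flag for planned maintenance, a 5% error-rate threshold for partial outages, regional outages counted in full) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    minutes: float
    kind: str                 # "unplanned", "planned", "partial", or "regional"
    error_rate: float = 1.0   # fraction of failed requests during the incident


def budget_minutes_consumed(
    incidents: list[Incident],
    count_planned: bool = True,
    error_rate_threshold: float = 0.05,
) -> float:
    """Sum the downtime that counts against the error budget."""
    total = 0.0
    for incident in incidents:
        if incident.kind == "planned" and not count_planned:
            continue  # this team excludes planned maintenance
        if incident.kind == "partial" and incident.error_rate <= error_rate_threshold:
            continue  # degradation stayed within the tolerated error rate
        total += incident.minutes
    return total


incidents = [
    Incident(minutes=10, kind="unplanned"),
    Incident(minutes=30, kind="planned"),
    Incident(minutes=5, kind="partial", error_rate=0.08),
    Incident(minutes=20, kind="partial", error_rate=0.02),
]
print(budget_minutes_consumed(incidents))
print(budget_minutes_consumed(incidents, count_planned=False))
```

Some teams weight regional or partial outages by the share of users affected instead of counting them in full; that would be a straightforward extension of the loop.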

Error Budgets and MTTR

Error budget consumption is roughly the product of incident frequency and average incident duration, which is what mean time to recovery (MTTR) measures. Reducing either reduces budget consumption:

Fewer incidents → more budget available (improve code quality, testing, infrastructure reliability)
Shorter incidents → more budget available (improve monitoring, alerting, runbooks, and on-call procedures)

This is why MTTR and error budget tracking work together — minimising MTTR directly reduces error budget consumption.
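A quick comparison shows how the two levers interact. The incident counts and MTTR values below are made-up numbers for illustration, set against the 43.8-minute monthly budget of a 99.9% target.

```python
def expected_downtime_minutes(incidents: float, mttr_minutes: float) -> float:
    """Approximate downtime as incident count times mean time to recovery."""
    return incidents * mttr_minutes


MONTHLY_BUDGET_MINUTES = 43.8  # 99.9% target over an average month

# Hypothetical scenarios: (incidents per month, MTTR in minutes)
scenarios = {
    "baseline (3 incidents, 20 min MTTR)": (3, 20),
    "fewer incidents (2 incidents, 20 min MTTR)": (2, 20),
    "shorter incidents (3 incidents, 10 min MTTR)": (3, 10),
}

for name, (count, mttr) in scenarios.items():
    downtime = expected_downtime_minutes(count, mttr)
    share = downtime / MONTHLY_BUDGET_MINUTES
    print(f"{name}: {downtime:.0f} min, {share:.0%} of budget")
```

In this example the baseline overspends the budget, while either improvement alone brings the month back within it.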

Getting Started

You don't need complex tooling to start tracking error budgets. A monitoring tool like Domain Monitor provides the uptime data you need. A simple spreadsheet can calculate budget consumption from that data.

As your reliability practices mature, dedicated SLO tracking tools (Grafana SLO, Datadog SLOs, Google Cloud SLOs) can automate the calculation and alerting.

Track your uptime data with precision at Domain Monitor.
