What Is Mean Time to Recovery (MTTR) and How Do You Improve It?

When something goes wrong with your website or service, two things matter: how quickly you find out, and how quickly you fix it. Mean Time to Recovery (MTTR) measures the average total duration of an outage — from the moment it starts until service is fully restored.

MTTR is one of the most important reliability metrics because it directly measures the customer impact of incidents, not just their frequency.

MTTR Definition

Mean Time to Recovery is calculated as:

MTTR = Total Downtime Duration ÷ Number of Incidents

For example, if you had three incidents last month with durations of 15 minutes, 45 minutes, and 30 minutes, your MTTR would be (15 + 45 + 30) ÷ 3 = 30 minutes.
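
The same arithmetic can be sketched in a few lines of Python. The function name `mean_time_to_recovery` is illustrative, not part of any monitoring tool; the input is assumed to be a list of incident durations in minutes.

```python
def mean_time_to_recovery(durations_minutes):
    """Return MTTR: total downtime divided by the number of incidents."""
    if not durations_minutes:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(durations_minutes) / len(durations_minutes)


# The three incidents from the example above, in minutes.
last_month = [15, 45, 30]
print(mean_time_to_recovery(last_month))  # 30.0
```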

Lower MTTR means shorter outages and less customer impact. Improving MTTR is one of the highest-leverage improvements you can make to your reliability.

MTTR vs. MTBF

MTTR is often discussed alongside Mean Time Between Failures (MTBF):

Metric | Measures | Goal
MTTR | Average duration of incidents | Lower is better
MTBF | Average time between incidents | Higher is better

MTBF tells you how often things break. MTTR tells you how quickly you recover when they do. Both matter, but for customer experience, MTTR often has the bigger impact — a service that breaks frequently but recovers in 2 minutes may be less disruptive than one that breaks rarely but takes 3 hours to restore.
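
One common way to combine the two metrics is the steady-state availability formula, availability = MTBF ÷ (MTBF + MTTR), where MTBF is taken as the average uptime between incidents. The sketch below applies it to the comparison above. The failure frequencies (once a day and once every 30 days) are assumptions chosen for illustration; only the 2-minute and 3-hour recovery times come from the text.

```python
def availability(mtbf_minutes, mttr_minutes):
    """Steady-state availability: share of time the service is up."""
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


# Assumed: fails about once a day, recovers in 2 minutes.
frequent_but_fast = availability(mtbf_minutes=24 * 60, mttr_minutes=2)

# Assumed: fails about once every 30 days, takes 3 hours to restore.
rare_but_slow = availability(mtbf_minutes=30 * 24 * 60, mttr_minutes=180)

print(f"frequent but fast: {frequent_but_fast:.4%}")
print(f"rare but slow:     {rare_but_slow:.4%}")
```

Under these assumed numbers the frequently failing service ends up with the higher availability. Different frequencies can reverse the result, which is why both metrics need tracking.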

The Components of MTTR

MTTR consists of several phases, each of which can be optimised:

1. Detection Time

How long between the incident starting and you knowing about it?

Without monitoring, detection can take hours: you find out when a customer complains, when a staff member notices, or during a routine check. With automated uptime monitoring and alerts, detection time typically drops to a minute or two, depending on the check interval.

Detection time is the single biggest lever for improving MTTR. You can't start recovering until you know there's something to recover from.

2. Diagnosis Time

How long to identify the root cause?

This depends on:

  • Quality and accessibility of logs
  • Clarity of error messages
  • Team experience with the system
  • Whether the issue is obvious (server down) or subtle (memory leak)

Good monitoring helps here too. A monitor that shows you the exact HTTP status code, response body, and the time the issue started gives you a head start on diagnosis.

3. Resolution Time

How long to actually fix the problem once you know what it is?

This varies enormously — from restarting a crashed process (30 seconds) to rolling back a bad deploy (5 minutes) to fixing a database corruption (hours). Automation, runbooks, and practiced incident response all reduce resolution time.

4. Verification Time

How long to confirm the fix worked and service is restored?

Your uptime monitoring plays a role here — watching for your monitors to turn green confirms service has been restored and tells you the exact restoration time.
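
To see which phase dominates a given incident, you can split it by timestamp. This is a minimal sketch: the five timestamps (`started`, `detected`, `diagnosed`, `fixed`, `verified`) are assumed to come from your own incident notes, and the sample values are made up.

```python
from datetime import datetime


def phase_durations(started, detected, diagnosed, fixed, verified):
    """Split one incident into the four MTTR phases, in minutes."""
    def minutes(a, b):
        return (b - a).total_seconds() / 60

    return {
        "detection": minutes(started, detected),
        "diagnosis": minutes(detected, diagnosed),
        "resolution": minutes(diagnosed, fixed),
        "verification": minutes(fixed, verified),
    }


# Hypothetical incident lasting 30 minutes in total.
phases = phase_durations(
    started=datetime(2024, 1, 10, 9, 0),
    detected=datetime(2024, 1, 10, 9, 1),
    diagnosed=datetime(2024, 1, 10, 9, 16),
    fixed=datetime(2024, 1, 10, 9, 26),
    verified=datetime(2024, 1, 10, 9, 30),
)
for name, duration in phases.items():
    print(f"{name}: {duration:.0f} min")
```

In this made-up incident, diagnosis takes half of the total, so better logs or runbooks would help more than faster detection.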

How to Measure Your MTTR

To calculate MTTR, you need accurate incident records. Uptime monitoring provides this automatically:

  • Incident start time — when the monitor first detected the failure
  • Incident end time — when the monitor confirmed recovery
  • Duration — calculated automatically

Good monitoring tools provide historical incident data and reporting that makes MTTR calculation trivial. Review your monthly uptime reports to track MTTR trends over time.
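
If your tool only exports raw check results, the incident records can be derived from them. The sketch below assumes a time-sorted list of `(timestamp, is_up)` pairs, which is a hypothetical export format rather than any specific product's. An incident starts at the first failed check and ends at the next successful one.

```python
from datetime import datetime, timedelta


def incidents_from_checks(checks):
    """Turn time-sorted (timestamp, is_up) pairs into (start, end) incidents."""
    incidents = []
    start = None
    for timestamp, is_up in checks:
        if not is_up and start is None:
            start = timestamp  # first failed check opens an incident
        elif is_up and start is not None:
            incidents.append((start, timestamp))  # recovery closes it
            start = None
    return incidents  # an incident still open at the end is left out


def mttr_minutes(incidents):
    """Average incident duration in minutes."""
    durations = [(end - start).total_seconds() / 60 for start, end in incidents]
    return sum(durations) / len(durations)


# Hypothetical one-minute checks: two outages, lasting 2 and 1 minutes.
t0 = datetime(2024, 1, 10, 9, 0)
results = [True, False, False, True, True, False, True]
checks = [(t0 + timedelta(minutes=i), up) for i, up in enumerate(results)]

found = incidents_from_checks(checks)
print(len(found), mttr_minutes(found))  # 2 1.5
```

Durations measured this way are only as precise as the check interval, so an outage that starts between two checks is slightly under-counted.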

How to Improve MTTR

1. Reduce Detection Time with Monitoring

This is the fastest MTTR improvement available to most teams. If you're currently detecting outages when customers complain (average detection time: 30-120 minutes), moving to automated uptime monitoring (average detection time: 1-2 minutes) immediately cuts MTTR dramatically.

Set up website uptime monitoring with 1-minute check intervals and multi-channel alerts (email, SMS, Slack).
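
For a sense of what a single uptime check does, here is a minimal sketch using only Python's standard library. It is not how any particular monitoring service works: the function name `check_once`, the 10-second timeout, and the rule that any successful response counts as "up" are all assumptions. A hosted monitor would also check from several locations and handle alerting.

```python
import urllib.error
import urllib.request


def check_once(url, timeout=10, opener=urllib.request.urlopen):
    """Run one HTTP check and return (is_up, detail)."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return 200 <= status < 400, f"HTTP {status}"
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"unreachable: {exc}"


# Example call (performs a real network request):
#   is_up, detail = check_once("https://example.com")
# Running this once a minute from a scheduler gives a 1-minute check interval.
```

The `opener` parameter exists only so the function can be exercised without network access.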

2. Improve Alert Routing

Fast detection is useless if the alert sits unread in a shared inbox. Configure alerts to reach the right person immediately:

  • SMS to the on-call engineer's mobile
  • Slack to a dedicated monitoring channel
  • Escalation paths if the primary contact doesn't respond
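
An escalation path can be written down as plain data. In this sketch the contacts and the 5- and 15-minute thresholds are hypothetical. Each entry says who is notified once an alert has gone unacknowledged for that many minutes.

```python
def current_contact(escalation_path, minutes_unacknowledged):
    """Return the furthest contact reached after this many minutes."""
    reached = None
    for after_minutes, contact in sorted(escalation_path):
        if minutes_unacknowledged >= after_minutes:
            reached = contact
    return reached


# Hypothetical path: (minutes without acknowledgement, who to notify).
path = [
    (0, "on-call engineer (SMS)"),
    (5, "secondary on-call (SMS)"),
    (15, "engineering manager (phone)"),
]

print(current_contact(path, 7))  # secondary on-call (SMS)
```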

3. Build Runbooks

A runbook is a documented procedure for responding to a specific type of incident, for example: "Database connection pool exhausted: restart the app server and run X command." Written runbooks reduce diagnosis time by giving engineers a structured path to follow under pressure.

4. Practice Incident Response

Teams that regularly respond to incidents develop muscle memory. Consider staging deliberate test incidents (with team awareness) to practice your incident response process.

5. Invest in Automation

Automated remediation — scripts that automatically restart crashed processes, roll back bad deploys, or scale resources — can reduce resolution time to near zero for common failure types.
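
A restart-until-healthy loop is one of the simplest forms of automated remediation. The sketch below is deliberately generic: `is_healthy` and `restart` are hypothetical callables you would supply (for example, a health check request and a call to your process manager), and the attempt limit and wait time are arbitrary defaults.

```python
import time


def remediate(is_healthy, restart, max_attempts=3, wait_seconds=5, sleep=time.sleep):
    """Restart the service until it is healthy; return True if it recovered."""
    for _ in range(max_attempts):
        if is_healthy():
            return True
        restart()
        sleep(wait_seconds)  # give the service time to come back up
    return is_healthy()
```

If this returns `False`, the automation has run out of attempts and the incident should go to a human through your normal alert routing.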

MTTR and Uptime SLAs

Uptime SLAs promise a certain availability percentage. MTTR is directly related to how you achieve that SLA:

  • 99.9% uptime = about 8.76 hours (roughly 526 minutes) of downtime per year
  • With 12 incidents per year, you can afford an average MTTR of about 44 minutes
  • With 50 incidents per year, you need an average MTTR of about 10 minutes

Understanding your incident rate helps you set realistic MTTR targets that are compatible with your uptime commitments.
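
The budget arithmetic is easy to rerun for your own SLA and incident rate. This sketch assumes a 365-day year and that every minute of every incident counts against the SLA; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(uptime_percent):
    """Minutes of downtime per year allowed by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def max_average_mttr(uptime_percent, incidents_per_year):
    """Largest average MTTR (minutes) that still fits the downtime budget."""
    return downtime_budget_minutes(uptime_percent) / incidents_per_year


print(round(downtime_budget_minutes(99.9), 1))  # 525.6
print(round(max_average_mttr(99.9, 12), 1))     # 43.8
print(round(max_average_mttr(99.9, 50), 1))     # 10.5
```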

MTTR as a Team Metric

Tracking MTTR over time gives you a concrete measure of improvement. If you implement better monitoring, better runbooks, and better alert routing this quarter, your MTTR should drop. If it doesn't, you need to investigate why.
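
A per-month breakdown is usually enough to see the trend. The sketch below assumes incidents are available as `(month, duration_minutes)` pairs; the sample data is invented for illustration.

```python
from collections import defaultdict


def mttr_by_month(incidents):
    """Group (month, duration_minutes) pairs and average each month."""
    by_month = defaultdict(list)
    for month, duration in incidents:
        by_month[month].append(duration)
    return {month: sum(d) / len(d) for month, d in sorted(by_month.items())}


# Hypothetical quarter: MTTR falls as monitoring and runbooks improve.
quarter = [
    ("2024-01", 60), ("2024-01", 40),
    ("2024-02", 35), ("2024-02", 25),
    ("2024-03", 20),
]
for month, mttr in mttr_by_month(quarter).items():
    print(month, mttr)
```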

Make MTTR a standard metric in your engineering team's retrospectives and quarterly reviews — alongside incident count and uptime percentage.


Cut your detection time to under 60 seconds with monitoring at Domain Monitor.
