
When something goes wrong with your website or service, two things matter: how quickly you find out, and how quickly you fix it. Mean Time to Recovery (MTTR) measures the average total duration of an outage — from the moment it starts until service is fully restored.
MTTR is one of the most important reliability metrics because it directly measures the customer impact of incidents, not just their frequency.
Mean Time to Recovery is calculated as:
MTTR = Total Downtime Duration ÷ Number of Incidents
For example, if you had three incidents last month with durations of 15 minutes, 45 minutes, and 30 minutes, your MTTR would be 30 minutes.
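The arithmetic above can be sketched in a few lines of Python. The function name `mttr_minutes` is made up for this illustration; the durations are the three from the example.

```python
def mttr_minutes(durations_minutes):
    """Mean time to recovery: total downtime divided by number of incidents."""
    if not durations_minutes:
        raise ValueError("MTTR is undefined when there are no incidents")
    return sum(durations_minutes) / len(durations_minutes)


# The three incidents from the example: 15, 45, and 30 minutes.
incident_durations = [15, 45, 30]
print(mttr_minutes(incident_durations))  # 30.0
```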
Lower MTTR means shorter outages and less customer impact. Reducing MTTR is one of the highest-leverage reliability improvements you can make.
MTTR is often discussed alongside Mean Time Between Failures (MTBF):
| Metric | Measures | Goal |
|---|---|---|
| MTTR | Average duration of incidents | Lower is better |
| MTBF | Average time between incidents | Higher is better |
MTBF tells you how often things break. MTTR tells you how quickly you recover when they do. Both matter, but for customer experience, MTTR often has the bigger impact — a service that breaks frequently but recovers in 2 minutes may be less disruptive than one that breaks rarely but takes 3 hours to restore.
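One way to put numbers on that comparison is the standard steady-state availability formula, MTBF ÷ (MTBF + MTTR). The sketch below uses it with invented figures: how often each service fails is an assumption made purely for illustration, and only the 2-minute and 3-hour recovery times come from the text.

```python
def availability(mtbf_minutes, mttr_minutes):
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


# Hypothetical service A: fails about every 36 hours, recovers in 2 minutes.
frequent_but_fast = availability(mtbf_minutes=36 * 60, mttr_minutes=2)

# Hypothetical service B: fails about every 30 days, takes 3 hours to restore.
rare_but_slow = availability(mtbf_minutes=30 * 24 * 60, mttr_minutes=180)

print(f"A: {frequent_but_fast:.3%}  B: {rare_but_slow:.3%}")
```

Under these assumed failure rates, the frequently failing service ends up with the higher availability, because each of its outages is so short.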
MTTR consists of several phases, each of which can be optimised:
How long between the incident starting and you knowing about it?
Without monitoring, detection time can be hours: you find out when a customer complains, when a staff member notices, or during a routine check. With automated uptime monitoring and alerts, detection time typically drops to a minute or two, depending on your check interval.
Detection time is the single biggest lever for improving MTTR. You can't start recovering until you know there's something to recover from.
How long to identify the root cause?
This depends on:
Good monitoring helps here too. A monitor that shows you the exact HTTP status code, response body, and the time the issue started gives you a head start on diagnosis.
How long to actually fix the problem once you know what it is?
This varies enormously — from restarting a crashed process (30 seconds) to rolling back a bad deploy (5 minutes) to fixing a database corruption (hours). Automation, runbooks, and practiced incident response all reduce resolution time.
How long to confirm the fix worked and service is restored?
Your uptime monitoring plays a role here — watching for your monitors to turn green confirms service has been restored and tells you the exact restoration time.
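The four phases above (detection, diagnosis, resolution, verification) can be recorded per incident so you can see where the time actually goes. This is a minimal sketch: the `IncidentTimeline` class, its field names, and the timestamps are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    started: datetime    # outage begins
    detected: datetime   # first alert or report
    diagnosed: datetime  # root cause identified
    fixed: datetime      # fix applied
    verified: datetime   # service confirmed restored

    def phases(self) -> dict[str, timedelta]:
        """Split the outage into the four phases that make up recovery time."""
        return {
            "detection": self.detected - self.started,
            "diagnosis": self.diagnosed - self.detected,
            "resolution": self.fixed - self.diagnosed,
            "verification": self.verified - self.fixed,
        }

    def total(self) -> timedelta:
        """Full outage duration, which is what feeds into MTTR."""
        return self.verified - self.started


# Hypothetical incident: 1 min to detect, 10 to diagnose, 5 to fix, 2 to verify.
incident = IncidentTimeline(
    started=datetime(2024, 1, 10, 10, 0),
    detected=datetime(2024, 1, 10, 10, 1),
    diagnosed=datetime(2024, 1, 10, 10, 11),
    fixed=datetime(2024, 1, 10, 10, 16),
    verified=datetime(2024, 1, 10, 10, 18),
)
for name, duration in incident.phases().items():
    print(f"{name}: {duration}")
```

Breaking incidents down this way shows which phase dominates, which in turn suggests where improvement effort is best spent.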
To calculate MTTR, you need accurate incident records. Uptime monitoring provides this automatically:
Good monitoring tools provide historical incident data and reporting that makes MTTR calculation trivial. Review your monthly uptime reports to track MTTR trends over time.
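If your monitoring tool lets you export incident start and end times, a trend like the one described above can be computed with a short script. The sketch below assumes the records are available as `(start, end)` datetime pairs; that format, the function name `monthly_mttr`, and the sample data are assumptions, not any particular tool's export.

```python
from collections import defaultdict
from datetime import datetime


def monthly_mttr(incidents):
    """Return MTTR in minutes per month, keyed by the month each incident started."""
    durations = defaultdict(list)
    for start, end in incidents:
        minutes = (end - start).total_seconds() / 60
        durations[start.strftime("%Y-%m")].append(minutes)
    return {month: sum(d) / len(d) for month, d in sorted(durations.items())}


# Hypothetical incident log: three outages in January, two in February.
incident_log = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 15)),
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 45)),
    (datetime(2024, 1, 25, 22, 30), datetime(2024, 1, 25, 23, 0)),
    (datetime(2024, 2, 7, 8, 0), datetime(2024, 2, 7, 8, 10)),
    (datetime(2024, 2, 19, 16, 0), datetime(2024, 2, 19, 16, 20)),
]
print(monthly_mttr(incident_log))  # {'2024-01': 30.0, '2024-02': 15.0}
```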
This is the fastest MTTR improvement available to most teams. If you're currently detecting outages when customers complain (average detection time: 30-120 minutes), moving to automated uptime monitoring (average detection time: 1-2 minutes) immediately cuts MTTR dramatically.
Set up website uptime monitoring with 1-minute check intervals and multi-channel alerts (email, SMS, Slack).
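To make the idea of an uptime check concrete, here is a rough sketch of a single HTTP probe using only Python's standard library. It is not how any particular monitoring service works; the function name `check_once` and the `opener` parameter (included so the check can be tested without network access) are inventions for this example.

```python
import urllib.error
import urllib.request


def check_once(url, timeout=10, opener=urllib.request.urlopen):
    """Run one HTTP check and return (is_up, detail)."""
    try:
        with opener(url, timeout=timeout) as response:
            # urlopen follows redirects and raises on 4xx/5xx,
            # so reaching this line means the request succeeded.
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts
        # (URLError and TimeoutError are both OSError subclasses).
        return False, f"unreachable: {exc}"
```

Running something like `check_once("https://example.com")` from a scheduler once a minute, and sending an alert whenever it returns `False`, gives a bare-bones version of the 1-minute check interval described above. A hosted monitoring service adds checks from multiple locations, retries to filter out false alarms, and the alert channels themselves.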
Fast detection is useless if the alert sits unread in a shared inbox. Configure alerts to reach the right person immediately:
A runbook is a documented procedure for responding to a specific type of incident, for example: "Database connection pool exhausted: restart the app server and run X command." Written runbooks reduce diagnosis time by giving engineers a structured path to follow under pressure.
Teams that regularly respond to incidents develop muscle memory. Consider staging deliberate test incidents (with team awareness) to practice your incident response process.
Automated remediation — scripts that automatically restart crashed processes, roll back bad deploys, or scale resources — can reduce resolution time to near zero for common failure types.
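As an illustration of the restart case, the sketch below retries a restart until a health check passes and signals when a human needs to take over. Everything here is hypothetical: `is_healthy` and `restart` stand in for whatever health check and restart command your own service uses.

```python
import time


def remediate(is_healthy, restart, max_attempts=3, wait_seconds=5, sleep=time.sleep):
    """Restart a service until it reports healthy.

    Returns the number of restarts used (0 if it was already healthy),
    or None if every attempt failed and a human should be paged.
    """
    if is_healthy():
        return 0
    for attempt in range(1, max_attempts + 1):
        restart()
        sleep(wait_seconds)  # give the process time to come back up
        if is_healthy():
            return attempt
    return None  # automation could not fix it; escalate


# Simulated service that only comes back after the second restart.
state = {"up": False, "restarts": 0}

def fake_restart():
    state["restarts"] += 1
    state["up"] = state["restarts"] >= 2

print(remediate(lambda: state["up"], fake_restart, sleep=lambda s: None))  # 2
```

Capping the number of attempts matters: a script that restarts forever can hide a real problem, so the `None` result is the cue to hand the incident to a person.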
Uptime SLAs promise a certain availability percentage. MTTR is directly related to how you achieve that SLA:
Understanding your incident rate helps you set realistic MTTR targets that are compatible with your uptime commitments.
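The relationship between an SLA, MTTR, and incident rate comes down to simple arithmetic. The sketch below assumes a 30-day month and uses a 99.9% SLA with a 30-minute MTTR as illustrative figures; the function names are made up for this example.

```python
def downtime_budget_minutes(sla_percent, period_days=30):
    """Total downtime an availability SLA allows over the period."""
    return period_days * 24 * 60 * (1 - sla_percent / 100)


def max_incidents(sla_percent, mttr_minutes, period_days=30):
    """How many average-length incidents fit inside the downtime budget."""
    return downtime_budget_minutes(sla_percent, period_days) / mttr_minutes


budget = downtime_budget_minutes(99.9)
print(f"Budget: {budget:.1f} minutes per 30 days")                 # 43.2
print(f"Incidents at 30-min MTTR: {max_incidents(99.9, 30):.2f}")  # 1.44
```

In other words, at 99.9% a team with a 30-minute MTTR can afford barely one incident a month, while cutting MTTR to 10 minutes leaves room for about four.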
Tracking MTTR over time gives you a concrete measure of improvement. If you implement better monitoring, better runbooks, and better alert routing this quarter, your MTTR should drop. If it doesn't, you need to investigate why.
Make MTTR a standard metric in your engineering team's retrospectives and quarterly reviews — alongside incident count and uptime percentage.
Cut your detection time to under 60 seconds with monitoring at Domain Monitor.