What Is On-Call Management for Website Incidents?

When your uptime monitoring fires an alert at 2am, someone needs to be ready to respond. On-call management is the system that ensures there's always a designated person ready to handle incidents — with clear escalation paths when the primary responder can't be reached.

Why On-Call Management Matters

For any production service that requires high availability, someone needs to be reachable 24/7. Without a formal on-call system:

  • Alerts may go to a shared email that nobody monitors overnight
  • Multiple people get paged for the same incident (confusion and duplication)
  • Nobody knows who is responsible, so everyone assumes someone else is handling it
  • The person who notices the alert first deals with it by chance rather than design

An on-call system makes the responsibility explicit, fair, and well-understood.

The On-Call Rotation

An on-call rotation is a schedule defining who is the designated incident responder at any given time. Common rotation patterns:

Weekly rotation: One person is primary on-call for a week at a time. Simple to schedule, but can be exhausting for the on-call person.

Daily handoff: On-call shifts change daily. More complex to schedule but distributes the burden more evenly.

Follow-the-sun: For global teams, on-call shifts align with working hours in different time zones: the European team covers European hours and the US team covers US hours, so no one is on-call outside their working day.

Pooled rotation: A group of on-call engineers share responsibility, rotating primary and secondary positions.
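As a rough sketch of how these patterns can be computed, the Python below resolves a weekly rotation with primary and secondary positions, and picks a region for follow-the-sun coverage. The engineer names, the start date, the UTC hour blocks, and the third "apac" region are all assumptions made for illustration; a real schedule would come from your on-call tool.

```python
from datetime import date, datetime, timezone

# Hypothetical team and rotation start date, for illustration only.
ENGINEERS = ["alice", "bob", "carol", "dave"]
ROTATION_START = date(2024, 1, 1)  # a Monday; week 0 begins here


def weekly_on_call(day, engineers=ENGINEERS, start=ROTATION_START):
    """Return (primary, secondary) for the week containing `day`."""
    if day < start:
        raise ValueError("day is before the rotation start")
    week_index = (day - start).days // 7
    primary = engineers[week_index % len(engineers)]
    # The secondary is the next person in the list, so this week's
    # secondary becomes next week's primary.
    secondary = engineers[(week_index + 1) % len(engineers)]
    return primary, secondary


# Follow-the-sun: each region covers a block of UTC hours (assumed values).
# Each entry is (region, first hour covered, first hour NOT covered).
REGION_HOURS_UTC = [
    ("europe", 7, 15),
    ("us", 15, 23),
    ("apac", 23, 7),  # wraps past midnight
]


def follow_the_sun_region(moment):
    """Return the region on call at `moment` (a timezone-aware datetime)."""
    hour = moment.astimezone(timezone.utc).hour
    for region, start_hour, end_hour in REGION_HOURS_UTC:
        if start_hour < end_hour:
            if start_hour <= hour < end_hour:
                return region
        elif hour >= start_hour or hour < end_hour:
            # The block wraps around midnight UTC.
            return region
    raise LookupError(f"no region covers hour {hour} UTC")
```

Making the secondary the next person in line is one design choice among several: it gives each engineer a week of lighter backup duty before their primary week.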

Primary and Secondary On-Call

Most on-call systems have at least two tiers:

Primary on-call: Receives initial alerts. Expected to acknowledge within 5-10 minutes and begin investigation.

Secondary on-call: Receives escalated alerts if the primary doesn't acknowledge within the escalation timeout. Backup when the primary is unreachable.

This two-tier system prevents alerts from going unacknowledged: if the primary is asleep with their phone on silent, the secondary catches it.

Alert Routing in Practice

Configure your uptime monitoring to deliver alerts at the right severity to the right people:

| Severity | Initial Alert | Escalation |
|---|---|---|
| P1 (complete outage) | SMS to primary on-call | After 5 min: SMS to secondary |
| P2 (major degradation) | SMS to primary on-call | After 10 min: Slack to team |
| P3 (partial issue) | Slack to team | Manual escalation if needed |
| P4 (minor) | Email | Next business day |

The downtime alerts guide covers configuring multiple recipients and alert channels.
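One way to keep a severity table like this in a form that code can act on is a small lookup structure. The sketch below is only illustrative: the `Route` class, its field names, and `route_for` are hypothetical, and the recipient for P4 email is an assumption because the table does not name one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    initial_channel: str               # "sms", "slack" or "email"
    initial_target: str                # who receives the first alert
    escalate_after_min: Optional[int]  # None means no automatic escalation
    escalation: str                    # what happens next


# Values mirror the severity table; the P4 target is an assumption.
ROUTES = {
    "P1": Route("sms", "primary on-call", 5, "SMS to secondary"),
    "P2": Route("sms", "primary on-call", 10, "Slack to team"),
    "P3": Route("slack", "team", None, "Manual escalation if needed"),
    "P4": Route("email", "team", None, "Next business day"),
}


def route_for(severity):
    """Look up the route for a severity label such as "P1" or "p3"."""
    try:
        return ROUTES[severity.upper()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

For example, `route_for("P2").escalate_after_min` gives the 10-minute timeout, while a P3 route has no timeout because escalation is manual.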

Escalation Policies

An escalation policy defines what happens when alerts aren't acknowledged:

  1. Alert fires → Primary on-call receives SMS
  2. 5 minutes without acknowledgement → Secondary on-call receives SMS
  3. 10 minutes without acknowledgement → Engineering manager receives SMS
  4. Alert acknowledged at any step → Escalation stops

Escalation ensures critical incidents always get a response, even when individual people are unreachable.
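The policy above can be sketched as a short function. This assumes both timeouts are measured from the moment the alert fires, and that an acknowledgement arriving exactly at a timeout is too late to stop that step; the names `ESCALATION_STEPS` and `pages_sent` are hypothetical.

```python
# (minutes after the alert fires, who gets paged) - taken from the steps above.
ESCALATION_STEPS = [
    (0, "primary"),
    (5, "secondary"),
    (10, "engineering_manager"),
]


def pages_sent(now_min, acked_at_min=None):
    """Return everyone paged by `now_min` minutes after the alert fired.

    `acked_at_min` is the minute the alert was acknowledged, or None if it
    is still unacknowledged. Steps scheduled after the acknowledgement
    never fire, because escalation stops.
    """
    cutoff = now_min if acked_at_min is None else min(now_min, acked_at_min)
    return [who for delay, who in ESCALATION_STEPS if delay <= cutoff]
```

With no acknowledgement, `pages_sent(12)` includes the engineering manager; if the secondary acknowledges at minute 6, the manager is never paged.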

On-Call Fatigue and Burnout

On-call duty is stressful. Teams that don't manage it well experience:

  • Alert fatigue — too many false positive alerts, responders start ignoring them
  • Burnout — too much on-call duty, especially for small teams
  • Inequity — some team members carrying disproportionate on-call burden

Mitigation strategies:

  • Reduce alert noise: Configure confirmation counts to eliminate false positives
  • Rotate fairly: Distribute on-call weeks equitably across the team
  • Compensate: Pay on-call allowances or time off in lieu
  • Run post-mortems to prevent recurrence: Being paged repeatedly for the same incident is demoralising, so fix root causes
  • Set standards: Define what constitutes a page-worthy incident vs. a next-day email
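To illustrate the confirmation-count idea from the first strategy, here is a minimal sketch that fires an alert only after several consecutive failed checks. The class name and the default of three failures are assumptions, not the behaviour of any specific monitoring product.

```python
class ConfirmationCounter:
    """Fire an alert only after N consecutive failed checks."""

    def __init__(self, required_failures=3):
        self.required_failures = required_failures
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, check_ok):
        """Record one check result; return True when a new alert should fire."""
        if check_ok:
            # Any successful check resets the count and clears the alert state.
            self.consecutive_failures = 0
            self.alerting = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.required_failures and not self.alerting:
            self.alerting = True
            return True
        # Either not yet confirmed, or already alerting for this outage.
        return False
```

A single failed check followed by a success never pages anyone, which is the point: transient blips are filtered out at the cost of a slightly later alert for real outages.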

On-Call Runbooks

When the on-call engineer is paged at 3am, they shouldn't need to remember everything about the system. Write runbooks for your most common incidents:

  • How to restart the web server
  • How to check database connectivity
  • How to roll back a deployment
  • How to scale up infrastructure
  • Who to call if the issue is beyond your capability

Good runbooks shorten mean time to recovery (MTTR) because the responder follows documented steps instead of working the problem out from scratch. See also: incident response plan template.

Tools for On-Call Management

| Tool | What It Provides |
|---|---|
| PagerDuty | Full on-call rotation, escalation, incident management |
| OpsGenie | On-call scheduling, alerts, escalation |
| VictorOps (Splunk) | Incident response platform with on-call features |
| Better Uptime | Built-in on-call with monitoring |
| Domain Monitor | Monitoring + configurable multi-contact alerting |

For small teams, configuring multiple alert recipients with priority escalation in Domain Monitor handles basic on-call routing without a dedicated tool. As the team grows, dedicated on-call tools provide more sophisticated rotation management.

Transitioning from Informal to Formal On-Call

If your team currently handles incidents informally ("whoever notices the alert deals with it"), you can transition to formal on-call in six steps:

  1. Document what services need 24/7 coverage
  2. Define severity levels and response time expectations
  3. Set up an explicit on-call rotation, starting next week
  4. Configure monitoring to route to the designated on-call person
  5. Write basic runbooks for the 3 most common incidents
  6. Review and adjust after the first rotation cycle

The transition is uncomfortable but worth it — clarity about who is responsible reduces both response time and team stress.


Set up alert routing for your on-call team at Domain Monitor.
