[Figure: background job monitoring dashboard showing queue depth, worker health, and missed cron job alerts]
# Monitoring Background Jobs and Scheduled Tasks

Background jobs are the silent workhorses of modern applications. They send emails, process payments, resize images, generate reports, sync data with external services, and handle anything that should not block a web request. When they fail, they fail silently — no error page, no 500 response, no visible signal to end users.

That silence is the problem. A failed email queue means users never receive password resets. A stalled payment reconciliation job means your financial records are wrong. A missed nightly report means your stakeholders are working with stale data. The failure mode of background jobs is not a crash — it is a slow, invisible degradation that compounds over time.

## Types of Background Work

Understanding what you are monitoring starts with knowing the types:

Queue-based workers process jobs submitted to a queue. Examples include Sidekiq (Ruby), Celery (Python), Bull (Node.js), Horizon (Laravel), and Resque. Jobs are created by web requests and consumed by separate worker processes. The queue accumulates if workers are down or overwhelmed.

Scheduled tasks (cron jobs) run on a time-based schedule regardless of external triggers. Examples include nightly reconciliation, daily report generation, weekly digest emails, and SSL certificate renewal (certbot). These are time-critical — a missed window may mean the next run is 24 hours away.

Long-running processes perform ongoing work — data pipeline processors, message stream consumers (Kafka, SQS), real-time sync services. These should run continuously; gaps indicate process death.

One-off background tasks are triggered by specific events: post-signup onboarding sequences, export generation, bulk data migrations.
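For long-running processes in particular, liveness is the thing to watch: the process should prove, continuously, that it is still consuming. One way to do that is to emit a periodic heartbeat from inside the consume loop. The sketch below is a minimal illustration, not any particular library's API; the `send_ping` callable stands in for whatever call hits your monitoring service's heartbeat URL, and the 60-second interval is an assumed value.

```python
import time

HEARTBEAT_INTERVAL = 60  # seconds between liveness pings; hypothetical value

def heartbeat_due(last_ping: float, now: float,
                  interval: float = HEARTBEAT_INTERVAL) -> bool:
    """Return True when enough time has passed to send the next liveness ping."""
    return now - last_ping >= interval

def consume_forever(get_message, handle, send_ping, clock=time.monotonic):
    """Consume messages in a loop, pinging a monitor roughly every interval.

    get_message, handle, and send_ping are injected so the loop stays
    testable; in production send_ping would issue an HTTP GET to the
    heartbeat URL. If the process dies, pings stop and the monitor alerts.
    """
    last_ping = clock()
    while True:
        msg = get_message()   # blocks until a message arrives
        if msg is None:       # sentinel: shut down cleanly
            break
        handle(msg)
        if heartbeat_due(last_ping, clock()):
            send_ping()
            last_ping = clock()
```

Because the ping is sent from inside the loop, a consumer that is alive but wedged (stuck on one message, deadlocked) also stops pinging, which is exactly the failure you want surfaced.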

## Why Background Jobs Fail

  • Worker process crashes — the worker process dies due to an unhandled exception, OOM kill, or OS signal, and is not restarted
  • Dependency failures — the job depends on a database, cache, external API, or file system that is unavailable
  • Queue accumulation — jobs are submitted faster than workers can process them; the queue grows until jobs time out or expire
  • Deadlocks and locks — jobs that acquire database locks can deadlock, stalling other jobs waiting for the same lock
  • Configuration drift — cron schedule changed on one server but not others; job runs on wrong schedule or not at all
  • Deployment gaps — workers were not restarted after a deployment, running old code against a new database schema

## Monitoring Strategies

### Heartbeat Monitoring for Cron Jobs

The most reliable way to monitor scheduled tasks is heartbeat monitoring — also called dead man's switch monitoring. The concept is simple: at the end of a successful job run, the job sends an HTTP ping to a monitoring service. If the ping is not received within the expected window, an alert fires.

```bash
# Example: daily backup job with heartbeat
0 3 * * * /scripts/backup.sh && curl -s https://domain-monitor.io/heartbeat/abc123
```

The key advantage is that heartbeat monitoring detects absence of success rather than presence of failure. If the job never runs (process died, server rebooted, cron daemon stopped), the ping never arrives, and you are alerted. This catches failure modes that exception handlers cannot.

Domain Monitor supports heartbeat monitoring with configurable intervals and grace periods. Set the expected interval to match your cron schedule plus a realistic runtime buffer.

See how to monitor cron jobs for a deeper guide on heartbeat setup patterns.
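The same "ping only on success" pattern applies inside application code, not just crontab lines. One lightweight way to express it is a decorator that pings after the job body returns and skips the ping when the job raises. This is a sketch, not a library API; `HEARTBEAT_URL` is a hypothetical check URL, and the `pinger` parameter exists so the behavior can be exercised without a network.

```python
import functools
import urllib.request

HEARTBEAT_URL = "https://domain-monitor.io/heartbeat/abc123"  # hypothetical check ID

def with_heartbeat(url=HEARTBEAT_URL, pinger=None):
    """Decorator: ping the heartbeat URL only after the job body succeeds.

    `pinger` is injectable for testing; by default it issues a GET request.
    Any exception raised by the job skips the ping, so the monitor's
    missed-heartbeat alert fires for failed runs as well as missed ones.
    """
    def default_pinger():
        urllib.request.urlopen(url, timeout=10)
    ping = pinger or default_pinger

    def decorator(job):
        @functools.wraps(job)
        def wrapper(*args, **kwargs):
            result = job(*args, **kwargs)  # an exception here skips the ping
            ping()
            return result
        return wrapper
    return decorator
```

Placing the ping after the job body, rather than in a `finally` block, is deliberate: the heartbeat must signal success, not mere execution.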

### Queue Depth Monitoring

For queue-based systems, monitor the number of jobs waiting to be processed. A growing queue is an early warning sign — either workers have stopped, or demand has outpaced capacity.

Sidekiq (Ruby):

```ruby
# Expose queue stats via a health endpoint (Sinatra-style route)
require 'sidekiq/api'

get '/health/workers' do
  stats = Sidekiq::Stats.new
  {
    enqueued: stats.enqueued,
    processed: stats.processed,
    failed: stats.failed,
    workers: Sidekiq::Workers.new.size
  }.to_json
end
```

Bull (Node.js):

```javascript
app.get('/health/queue', async (req, res) => {
  const waiting = await queue.getWaitingCount();
  const active = await queue.getActiveCount();
  const failed = await queue.getFailedCount();
  res.json({ waiting, active, failed });
});
```

Set an alert threshold for queue depth — for example, alert when the queue exceeds 500 jobs if your normal steady state is under 50. This gives you early warning before users are affected.
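One practical refinement when turning those counts into alerts is to use separate fire and clear thresholds, so an alert does not flap while the queue hovers around a single cutoff. A minimal sketch of that logic, with the 500/100 values assumed for illustration:

```python
QUEUE_ALERT_THRESHOLD = 500   # hypothetical: roughly 10x the normal steady state
QUEUE_CLEAR_THRESHOLD = 100   # clear the alert only once the backlog has drained

def queue_alert_state(depth: int, alerting: bool) -> bool:
    """Return the new alert state for the current queue depth.

    Separate fire/clear thresholds (hysteresis) prevent a queue that
    oscillates around one cutoff from opening and closing the alert
    on every poll.
    """
    if depth > QUEUE_ALERT_THRESHOLD:
        return True
    if depth < QUEUE_CLEAR_THRESHOLD:
        return False
    return alerting  # between thresholds: keep the previous state
```

Wire this into whatever polls your `/health/workers` or `/health/queue` endpoint, and page only on a transition from `False` to `True`.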

### Worker Process Health Checks

Monitor that worker processes are running. The mechanism depends on how workers are managed:

Systemd:

```bash
systemctl is-active sidekiq.service
# Prints the unit state ('active', 'inactive', 'failed');
# exit code is 0 only when the unit is active
```

Supervisor:

```bash
supervisorctl status celery
# Shows RUNNING/STOPPED status and uptime
```

Docker:

```yaml
healthcheck:
  # verify a Sidekiq process is alive inside the container
  test: ["CMD-SHELL", "pgrep -f sidekiq || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
```

Expose worker health through your application health endpoint so external monitoring can verify workers are running. A health endpoint that returns HTTP 200 only when all critical workers are active lets Domain Monitor alert you the moment workers stop.
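The aggregation logic behind such an endpoint is simple enough to sketch. The function below builds the status code and body from a map of worker states; the worker names are hypothetical, and in a real application this would be wired into a framework route (Flask, Sinatra, Express) that serves the returned status code.

```python
CRITICAL_WORKERS = ("sidekiq", "scheduler")  # hypothetical worker names

def health_response(worker_status: dict) -> tuple:
    """Build an (http_status, body) pair for a worker health endpoint.

    Returns 200 only when every critical worker reports active, so an
    external monitor polling this URL alerts the moment one stops.
    A worker missing from the map counts as down.
    """
    down = [w for w in CRITICAL_WORKERS if not worker_status.get(w, False)]
    if down:
        return 503, {"status": "degraded", "down": down}
    return 200, {"status": "ok"}
```

Returning 503 rather than 200-with-an-error-body matters: uptime monitors key off the status code, not the payload.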

### Failed Job Tracking

Most queue systems retain failed jobs for inspection. Monitor the failed job count over time — a sudden spike in failures indicates a dependency problem or code bug.

Configure your queue system to:

  • Retry failed jobs with exponential backoff (3-5 retries over increasing intervals)
  • Move jobs to a dead letter queue after maximum retries
  • Alert when the dead letter queue exceeds a threshold
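The backoff schedule in the first point is worth making concrete. The helper below computes an exponential schedule with a cap and a small random jitter; the base, cap, and jitter values are illustrative assumptions, not defaults of any particular queue library — most queue systems (Sidekiq, Celery, Bull) have their own built-in retry configuration that implements the same curve.

```python
import random

def retry_delays(base=30, retries=5, cap=3600, jitter=0.1, rng=random.random):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`.

    Jitter spreads retries out so that a dependency outage does not
    produce a synchronized thundering herd when the dependency recovers.
    """
    delays = []
    for attempt in range(retries):
        delay = min(base * (2 ** attempt), cap)
        delay += delay * jitter * rng()  # up to +10% random spread
        delays.append(delay)
    return delays
```

With `base=30` and five retries, the schedule is roughly 30s, 1m, 2m, 4m, 8m before the job lands in the dead letter queue.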

Tools like Sentry can capture exceptions within background jobs, giving you full stack traces for each failure.

### Scheduled Task Execution Logs

For critical scheduled tasks, write a log entry at the start and end of each run, including the timestamp and outcome. Store these logs in a central location. A missing end-of-run log entry is a failure signal.

```python
import logging
import requests
from datetime import datetime, timezone

HEARTBEAT_URL = "https://domain-monitor.io/heartbeat/abc123"  # your check's ping URL

def run_reconciliation():
    logging.info("Reconciliation started at %s", datetime.now(timezone.utc))
    try:
        # ... reconciliation logic ...
        logging.info("Reconciliation completed at %s", datetime.now(timezone.utc))
        requests.get(HEARTBEAT_URL, timeout=10)  # ping monitoring only on success
    except Exception as e:
        logging.error("Reconciliation failed: %s", e)
        raise
```

## Alerting for Background Job Failures

| Failure Type | Detection Method | Alert Priority |
| --- | --- | --- |
| Cron job missed entirely | Heartbeat not received | P1 — immediate |
| Queue depth growing | Queue depth threshold | P2 — within 15 min |
| Worker process down | Health endpoint failure | P1 — immediate |
| Failed job count spike | Queue system metrics | P2 — within 15 min |
| Job taking too long | Execution time threshold | P3 — investigate |

## Recovery and Runbooks

When a background job failure is detected, engineers need to know what to do. Write a runbook for each critical job:

  1. How to check if the job is currently running
  2. How to restart the worker if it has stopped
  3. How to reprocess failed jobs from the dead letter queue
  4. How to determine if any data was corrupted or lost
  5. Who to notify if data integrity is affected

Keep runbooks updated in your team's documentation alongside your incident response procedures. See how to write a post-incident report for documenting failures after the fact.

## The Invisible Failure Problem

The core challenge with background jobs is that they are invisible to users until the impact surfaces — and by then, the damage is done. A payment that was not processed, a welcome email that was not sent, a report that contains yesterday's data.

The only solution is proactive monitoring. External heartbeat monitoring, queue depth tracking, and worker health endpoints transform invisible failures into detectable events. Build this monitoring into every background job you write, not as an afterthought.


Set up heartbeat monitoring for your scheduled tasks at Domain Monitor — detect missed jobs before users notice the consequences.
