[Figure: background job monitoring dashboard showing queue depth, worker health, and missed cron job alerts]
# Monitoring Background Jobs and Scheduled Tasks

Background jobs are the silent workhorses of modern applications. They send emails, process payments, resize images, generate reports, sync data with external services, and handle anything that should not block a web request. When they fail, they fail silently — no error page, no 500 response, no visible signal to end users.

That silence is the problem. A failed email queue means users never receive password resets. A stalled payment reconciliation job means your financial records are wrong. A missed nightly report means your stakeholders are working with stale data. The failure mode of background jobs is not a crash — it is a slow, invisible degradation that compounds over time.

## Types of Background Work

Understanding what you are monitoring starts with knowing the types:

Queue-based workers process jobs submitted to a queue. Examples include Sidekiq (Ruby), Celery (Python), Bull (Node.js), Horizon (Laravel), and Resque. Jobs are created by web requests and consumed by separate worker processes. The queue accumulates if workers are down or overwhelmed.

Scheduled tasks (cron jobs) run on a time-based schedule regardless of external triggers. Examples include nightly reconciliation, daily report generation, weekly digest emails, and SSL certificate renewal (certbot). These are time-critical — a missed window may mean the next run is 24 hours away.

Long-running processes perform ongoing work — data pipeline processors, message stream consumers (Kafka, SQS), real-time sync services. These should run continuously; gaps indicate process death.

One-off background tasks are triggered by specific events: post-signup onboarding sequences, export generation, bulk data migrations.
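For long-running processes in particular, liveness is the thing to watch: the process should prove, continuously, that it is still consuming. One way to do that is to emit a periodic heartbeat from inside the consume loop. The sketch below is a minimal illustration, not any particular library's API; the `send_ping` callable stands in for whatever call hits your monitoring service's heartbeat URL, and the 60-second interval is an assumed value.

```python
import time

HEARTBEAT_INTERVAL = 60  # seconds between liveness pings; hypothetical value

def heartbeat_due(last_ping: float, now: float,
                  interval: float = HEARTBEAT_INTERVAL) -> bool:
    """Return True when enough time has passed to send the next liveness ping."""
    return now - last_ping >= interval

def consume_forever(get_message, handle, send_ping, clock=time.monotonic):
    """Consume messages in a loop, pinging a monitor roughly every interval.

    get_message, handle, and send_ping are injected so the loop stays
    testable; in production send_ping would issue an HTTP GET to the
    heartbeat URL. If the process dies, pings stop and the monitor alerts.
    """
    last_ping = clock()
    while True:
        msg = get_message()   # blocks until a message arrives
        if msg is None:       # sentinel: shut down cleanly
            break
        handle(msg)
        if heartbeat_due(last_ping, clock()):
            send_ping()
            last_ping = clock()
```

Because the ping is sent from inside the loop, a consumer that is alive but wedged (stuck on one message, deadlocked) also stops pinging, which is exactly the failure you want surfaced.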

## Why Background Jobs Fail

  • Worker process crashes — the worker process dies due to an unhandled exception, OOM kill, or OS signal, and is not restarted
  • Dependency failures — the job depends on a database, cache, external API, or file system that is unavailable
  • Queue accumulation — jobs are submitted faster than workers can process them; the queue grows until jobs time out or expire
  • Deadlocks and locks — jobs that acquire database locks can deadlock, stalling other jobs waiting for the same lock
  • Configuration drift — cron schedule changed on one server but not others; job runs on wrong schedule or not at all
  • Deployment gaps — workers were not restarted after a deployment, running old code against a new database schema

## Monitoring Strategies

### Heartbeat Monitoring for Cron Jobs

The most reliable way to monitor scheduled tasks is heartbeat monitoring — also called dead man's switch monitoring. The concept is simple: at the end of a successful job run, the job sends an HTTP ping to a monitoring service. If the ping is not received within the expected window, an alert fires.

```bash
# Example: daily backup job with heartbeat
0 3 * * * /scripts/backup.sh && curl -s https://domain-monitor.io/heartbeat/abc123
```

The key advantage is that heartbeat monitoring detects absence of success rather than presence of failure. If the job never runs (process died, server rebooted, cron daemon stopped), the ping never arrives, and you are alerted. This catches failure modes that exception handlers cannot.

Domain Monitor supports heartbeat monitoring with configurable intervals and grace periods. Set the expected interval to match your cron schedule plus a realistic runtime buffer.

See how to monitor cron jobs for a deeper guide on heartbeat setup patterns.
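The same "ping only on success" pattern applies inside application code, not just crontab lines. One lightweight way to express it is a decorator that pings after the job body returns and skips the ping when the job raises. This is a sketch, not a library API; `HEARTBEAT_URL` is a hypothetical check URL, and the `pinger` parameter exists so the behavior can be exercised without a network.

```python
import functools
import urllib.request

HEARTBEAT_URL = "https://domain-monitor.io/heartbeat/abc123"  # hypothetical check ID

def with_heartbeat(url=HEARTBEAT_URL, pinger=None):
    """Decorator: ping the heartbeat URL only after the job body succeeds.

    `pinger` is injectable for testing; by default it issues a GET request.
    Any exception raised by the job skips the ping, so the monitor's
    missed-heartbeat alert fires for failed runs as well as missed ones.
    """
    def default_pinger():
        urllib.request.urlopen(url, timeout=10)
    ping = pinger or default_pinger

    def decorator(job):
        @functools.wraps(job)
        def wrapper(*args, **kwargs):
            result = job(*args, **kwargs)  # an exception here skips the ping
            ping()
            return result
        return wrapper
    return decorator
```

Placing the ping after the job body, rather than in a `finally` block, is deliberate: the heartbeat must signal success, not mere execution.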

### Queue Depth Monitoring

For queue-based systems, monitor the number of jobs waiting to be processed. A growing queue is an early warning sign — either workers have stopped, or demand has outpaced capacity.

Sidekiq (Ruby):

```ruby
# Expose queue stats via a health endpoint (Sinatra-style route)
require 'sidekiq/api'

get '/health/workers' do
  stats = Sidekiq::Stats.new
  {
    enqueued: stats.enqueued,
    processed: stats.processed,
    failed: stats.failed,
    workers: Sidekiq::Workers.new.size
  }.to_json
end
```

Bull (Node.js):

```javascript
app.get('/health/queue', async (req, res) => {
  const waiting = await queue.getWaitingCount();
  const active = await queue.getActiveCount();
  const failed = await queue.getFailedCount();
  res.json({ waiting, active, failed });
});
```

Set an alert threshold for queue depth — for example, alert when the queue exceeds 500 jobs if your normal steady state is under 50. This gives you early warning before users are affected.
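One practical refinement when turning those counts into alerts is to use separate fire and clear thresholds, so an alert does not flap while the queue hovers around a single cutoff. A minimal sketch of that logic, with the 500/100 values assumed for illustration:

```python
QUEUE_ALERT_THRESHOLD = 500   # hypothetical: roughly 10x the normal steady state
QUEUE_CLEAR_THRESHOLD = 100   # clear the alert only once the backlog has drained

def queue_alert_state(depth: int, alerting: bool) -> bool:
    """Return the new alert state for the current queue depth.

    Separate fire/clear thresholds (hysteresis) prevent a queue that
    oscillates around one cutoff from opening and closing the alert
    on every poll.
    """
    if depth > QUEUE_ALERT_THRESHOLD:
        return True
    if depth < QUEUE_CLEAR_THRESHOLD:
        return False
    return alerting  # between thresholds: keep the previous state
```

Wire this into whatever polls your `/health/workers` or `/health/queue` endpoint, and page only on a transition from `False` to `True`.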

### Worker Process Health Checks

Monitor that worker processes are running. The mechanism depends on how workers are managed:

Systemd:

```bash
systemctl is-active sidekiq.service
# Prints the unit state ('active', 'inactive', 'failed');
# exit code is 0 only when the unit is active
```

Supervisor:

```bash
supervisorctl status celery
# Shows RUNNING/STOPPED status and uptime
```

Docker:

```yaml
healthcheck:
  # verify a Sidekiq process is alive inside the container
  test: ["CMD-SHELL", "pgrep -f sidekiq || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
```

Expose worker health through your application health endpoint so external monitoring can verify workers are running. A health endpoint that returns HTTP 200 only when all critical workers are active lets Domain Monitor alert you the moment workers stop.
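The aggregation logic behind such an endpoint is simple enough to sketch. The function below builds the status code and body from a map of worker states; the worker names are hypothetical, and in a real application this would be wired into a framework route (Flask, Sinatra, Express) that serves the returned status code.

```python
CRITICAL_WORKERS = ("sidekiq", "scheduler")  # hypothetical worker names

def health_response(worker_status: dict) -> tuple:
    """Build an (http_status, body) pair for a worker health endpoint.

    Returns 200 only when every critical worker reports active, so an
    external monitor polling this URL alerts the moment one stops.
    A worker missing from the map counts as down.
    """
    down = [w for w in CRITICAL_WORKERS if not worker_status.get(w, False)]
    if down:
        return 503, {"status": "degraded", "down": down}
    return 200, {"status": "ok"}
```

Returning 503 rather than 200-with-an-error-body matters: uptime monitors key off the status code, not the payload.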

### Failed Job Tracking

Most queue systems retain failed jobs for inspection. Monitor the failed job count over time — a sudden spike in failures indicates a dependency problem or code bug.

Configure your queue system to:

  • Retry failed jobs with exponential backoff (3-5 retries over increasing intervals)
  • Move jobs to a dead letter queue after maximum retries
  • Alert when the dead letter queue exceeds a threshold
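The backoff schedule in the first point is worth making concrete. The helper below computes an exponential schedule with a cap and a small random jitter; the base, cap, and jitter values are illustrative assumptions, not defaults of any particular queue library — most queue systems (Sidekiq, Celery, Bull) have their own built-in retry configuration that implements the same curve.

```python
import random

def retry_delays(base=30, retries=5, cap=3600, jitter=0.1, rng=random.random):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`.

    Jitter spreads retries out so that a dependency outage does not
    produce a synchronized thundering herd when the dependency recovers.
    """
    delays = []
    for attempt in range(retries):
        delay = min(base * (2 ** attempt), cap)
        delay += delay * jitter * rng()  # up to +10% random spread
        delays.append(delay)
    return delays
```

With `base=30` and five retries, the schedule is roughly 30s, 1m, 2m, 4m, 8m before the job lands in the dead letter queue.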

Tools like Sentry can capture exceptions within background jobs, giving you full stack traces for each failure.

### Scheduled Task Execution Logs

For critical scheduled tasks, write a log entry at the start and end of each run, including the timestamp and outcome. Store these logs in a central location. A missing end-of-run log entry is a failure signal.

```python
import logging
import requests
from datetime import datetime, timezone

HEARTBEAT_URL = "https://domain-monitor.io/heartbeat/abc123"  # your check's ping URL

def run_reconciliation():
    logging.info("Reconciliation started at %s", datetime.now(timezone.utc))
    try:
        # ... reconciliation logic ...
        logging.info("Reconciliation completed at %s", datetime.now(timezone.utc))
        requests.get(HEARTBEAT_URL, timeout=10)  # ping monitoring only on success
    except Exception as e:
        logging.error("Reconciliation failed: %s", e)
        raise
```

## Alerting for Background Job Failures

| Failure Type | Detection Method | Alert Priority |
| --- | --- | --- |
| Cron job missed entirely | Heartbeat not received | P1 — immediate |
| Queue depth growing | Queue depth threshold | P2 — within 15 min |
| Worker process down | Health endpoint failure | P1 — immediate |
| Failed job count spike | Queue system metrics | P2 — within 15 min |
| Job taking too long | Execution time threshold | P3 — investigate |

## Recovery and Runbooks

When a background job failure is detected, engineers need to know what to do. Write a runbook for each critical job:

  1. How to check if the job is currently running
  2. How to restart the worker if it has stopped
  3. How to reprocess failed jobs from the dead letter queue
  4. How to determine if any data was corrupted or lost
  5. Who to notify if data integrity is affected

Keep runbooks updated in your team's documentation alongside your incident response procedures. See how to write a post-incident report for documenting failures after the fact.

## The Invisible Failure Problem

The core challenge with background jobs is that they are invisible to users until the impact surfaces — and by then, the damage is done. A payment that was not processed, a welcome email that was not sent, a report that contains yesterday's data.

The only solution is proactive monitoring. External heartbeat monitoring, queue depth tracking, and worker health endpoints transform invisible failures into detectable events. Build this monitoring into every background job you write, not as an afterthought.


Set up heartbeat monitoring for your scheduled tasks at Domain Monitor — detect missed jobs before users notice the consequences.
