
Background jobs are the silent workhorses of modern applications. They send emails, process payments, resize images, generate reports, sync data with external services, and handle anything that should not block a web request. When they fail, they fail silently — no error page, no 500 response, no visible signal to end users.
That silence is the problem. A failed email queue means users never receive password resets. A stalled payment reconciliation job means your financial records are wrong. A missed nightly report means your stakeholders are working with stale data. The failure mode of background jobs is not a crash — it is a slow, invisible degradation that compounds over time.
Understanding what you are monitoring starts with knowing the types:
Queue-based workers process jobs submitted to a queue. Examples include Sidekiq (Ruby), Celery (Python), Bull (Node.js), Horizon (Laravel), and Resque. Jobs are created by web requests and consumed by separate worker processes. The queue accumulates if workers are down or overwhelmed.
Scheduled tasks (cron jobs) run on a time-based schedule regardless of external triggers. Examples include nightly reconciliation, daily report generation, weekly digest emails, and SSL certificate renewal (certbot). These are time-critical — a missed window may mean the next run is 24 hours away.
Long-running processes perform ongoing work — data pipeline processors, message stream consumers (Kafka, SQS), real-time sync services. These should run continuously; gaps indicate process death.
One-off background tasks are triggered by specific events: post-signup onboarding sequences, export generation, bulk data migrations.
The most reliable way to monitor scheduled tasks is heartbeat monitoring — also called dead man's switch monitoring. The concept is simple: at the end of a successful job run, the job sends an HTTP ping to a monitoring service. If the ping is not received within the expected window, an alert fires.
```
# Example: daily backup job with heartbeat
0 3 * * * /scripts/backup.sh && curl -s https://domain-monitor.io/heartbeat/abc123
```
The key advantage is that heartbeat monitoring detects absence of success rather than presence of failure. If the job never runs (process died, server rebooted, cron daemon stopped), the ping never arrives, and you are alerted. This catches failure modes that exception handlers cannot.
Domain Monitor supports heartbeat monitoring with configurable intervals and grace periods. Set the expected interval to match your cron schedule plus a realistic runtime buffer.
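The success-only ping pattern can also be applied from a wrapper script rather than inline in the crontab. Here is a minimal Python sketch: `run_with_heartbeat` is a hypothetical helper (not part of any library), and the injectable `ping` parameter exists only to make the success/failure logic testable.

```python
import subprocess
import urllib.request

def run_with_heartbeat(command, heartbeat_url, ping=None):
    """Run a job command; ping the heartbeat URL only on success."""
    if ping is None:
        ping = lambda url: urllib.request.urlopen(url, timeout=10)
    result = subprocess.run(command)
    if result.returncode == 0:
        ping(heartbeat_url)  # success: the monitor resets its timer
    # On failure we deliberately send nothing -- the *missing*
    # heartbeat is what triggers the alert.
    return result.returncode
```

The key design choice is that failure is signaled by silence, not by an explicit error ping, so even a crash before the ping line still raises an alert.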
See how to monitor cron jobs for a deeper guide on heartbeat setup patterns.
For queue-based systems, monitor the number of jobs waiting to be processed. A growing queue is an early warning sign — either workers have stopped, or demand has outpaced capacity.
Sidekiq (Ruby):
```ruby
# Expose queue stats via a health endpoint
require 'sidekiq/api'

get '/health/workers' do
  stats = Sidekiq::Stats.new
  {
    queued: stats.enqueued,
    processed: stats.processed,
    failed: stats.failed,
    workers: Sidekiq::Workers.new.size
  }.to_json
end
```
Bull (Node.js):
```javascript
app.get('/health/queue', async (req, res) => {
  const waiting = await queue.getWaitingCount();
  const active = await queue.getActiveCount();
  const failed = await queue.getFailedCount();
  res.json({ waiting, active, failed });
});
```
Set an alert threshold for queue depth — for example, alert when the queue exceeds 500 jobs if your normal steady state is under 50. This gives you early warning before users are affected.
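Alongside a hard threshold, sustained growth across consecutive polls is worth flagging even before the threshold is hit. A minimal, illustrative Python sketch — the `QueueDepthMonitor` class and its numbers are assumptions to tune against your own steady state, not part of any queue library:

```python
from collections import deque

class QueueDepthMonitor:
    """Track queue-depth samples; flag threshold breaches and
    sustained growth. Thresholds here are illustrative."""

    def __init__(self, max_depth=500, window=5):
        self.max_depth = max_depth
        self.samples = deque(maxlen=window)

    def record(self, depth):
        self.samples.append(depth)
        if depth >= self.max_depth:
            return "critical"  # hard threshold breached: likely worker outage
        # Warn early if the depth grew on every sample in the window.
        full = len(self.samples) == self.samples.maxlen
        if full and all(b > a for a, b in zip(self.samples, list(self.samples)[1:])):
            return "warning"  # sustained growth: demand outpacing capacity
        return "ok"
```

Feed it readings from whichever health endpoint you expose, on whatever polling interval your monitoring uses.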
Monitor that worker processes are running. The mechanism depends on how workers are managed:
Systemd:
```bash
systemctl is-active sidekiq.service
# Returns 'active' or 'inactive'
```
Supervisor:
```bash
supervisorctl status celery
# Shows running/stopped status
```
Docker:
```yaml
healthcheck:
  test: ["CMD", "sidekiqmon", "--check"]
  interval: 30s
  timeout: 10s
  retries: 3
```
Expose worker health through your application health endpoint so external monitoring can verify workers are running. A health endpoint that returns HTTP 200 only when all critical workers are active lets Domain Monitor alert you the moment workers stop.
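One way to build such an endpoint, sketched in Python: a hypothetical `worker_health` function that returns 200 only when every critical systemd unit reports active. The unit names, the 503 payload, and the injectable `is_active` checker (which just makes the logic testable) are all assumptions, not a prescribed API.

```python
import subprocess

# Illustrative unit names -- substitute your own worker services.
CRITICAL_WORKERS = ["sidekiq.service", "scheduler.service"]

def unit_is_active(unit):
    """True if `systemctl is-active` reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0

def worker_health(units=CRITICAL_WORKERS, is_active=unit_is_active):
    """Return (status_code, body) for a /health/workers endpoint:
    200 only when every critical worker is active, 503 otherwise."""
    down = [u for u in units if not is_active(u)]
    if down:
        return 503, {"status": "down", "inactive": down}
    return 200, {"status": "ok"}
```

Wire the returned status code into whatever web framework serves your health routes; the 503 on any inactive worker is what lets an external uptime check catch the outage.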
Most queue systems retain failed jobs for inspection. Monitor the failed job count over time — a sudden spike in failures indicates a dependency problem or code bug.
Configure your queue system to:

- Retain failed jobs long enough to inspect and retry them
- Retry transient failures with exponential backoff and a capped attempt count
- Move jobs that exhaust their retries to a dead-letter queue or failed set rather than discarding them
Tools like Sentry can capture exceptions within background jobs, giving you full stack traces for each failure.
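A spike is easiest to detect as a delta between polls of the failed-job count. A minimal, illustrative Python sketch — the `FailureSpikeDetector` name and its threshold are assumptions to tune against your normal failure rate, not part of any queue library:

```python
class FailureSpikeDetector:
    """Flag a spike when the cumulative failed-job count grows by
    more than `max_new_failures` between consecutive polls."""

    def __init__(self, max_new_failures=25):
        self.max_new_failures = max_new_failures
        self.last_count = None  # no baseline until the first poll

    def check(self, failed_count):
        spike = (
            self.last_count is not None
            and failed_count - self.last_count > self.max_new_failures
        )
        self.last_count = failed_count
        return spike
```

Comparing deltas rather than absolute counts matters because most queue systems report a cumulative failed total that never resets between polls.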
For critical scheduled tasks, write a log entry at the start and end of each run, including the timestamp and outcome. Store these logs in a central location. A missing end-of-run log entry is a failure signal.
```python
import logging
from datetime import datetime

import requests

HEARTBEAT_URL = "https://domain-monitor.io/heartbeat/abc123"

def run_reconciliation():
    logging.info(f"Reconciliation started at {datetime.utcnow()}")
    try:
        # ... reconciliation logic ...
        logging.info(f"Reconciliation completed at {datetime.utcnow()}")
        requests.get(HEARTBEAT_URL)  # ping monitoring on success only
    except Exception as e:
        logging.error(f"Reconciliation failed: {e}")
        raise
```
| Failure Type | Detection Method | Alert Priority |
|---|---|---|
| Cron job missed entirely | Heartbeat not received | P1 — immediate |
| Queue depth growing | Queue depth threshold | P2 — within 15 min |
| Worker process down | Health endpoint failure | P1 — immediate |
| Failed job count spike | Queue system metrics | P2 — within 15 min |
| Job taking too long | Execution time threshold | P3 — investigate |
When a background job failure is detected, engineers need to know what to do. Write a runbook for each critical job:

- What the job does and which systems depend on its output
- How to verify whether the last run succeeded
- How to re-run the job safely (and whether it is idempotent)
- Who to escalate to if the re-run fails
Keep runbooks updated in your team's documentation alongside your incident response procedures. See how to write a post-incident report for documenting failures after the fact.
The core challenge with background jobs is that they are invisible to users until the impact surfaces — and by then, the damage is done. A payment that was not processed, a welcome email that was not sent, a report that contains yesterday's data.
The only solution is proactive monitoring. External heartbeat monitoring, queue depth tracking, and worker health endpoints transform invisible failures into detectable events. Build this monitoring into every background job you write, not as an afterthought.
Set up heartbeat monitoring for your scheduled tasks at Domain Monitor — detect missed jobs before users notice the consequences.