
Queue workers are silent infrastructure. When they're healthy, nobody notices. When they fail, jobs accumulate silently, emails stop sending, reports stop generating, and webhooks stop processing — and nobody knows until a user complains or you happen to check.
Unlike a crashed web server (which immediately returns errors), a dead queue worker leaves your application appearing healthy from the outside while background work quietly piles up. This is what makes queue monitoring different from uptime monitoring — and why it needs a separate approach.
Queue worker monitoring has three distinct concerns:

- **Worker liveness** — Are workers running at all? A worker process that has crashed, been OOM-killed, or failed to restart after a deploy means no jobs are being processed.
- **Queue depth** — How many jobs are waiting? A growing queue indicates workers can't keep up with inflow, even if workers are technically running.
- **Job failure rate** — Are jobs completing successfully? Workers can be running while most jobs are failing, which is just as bad as no workers.

All three need monitoring. Any one of them can be the failure point.
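As a rough sketch, the three signals reduce to a single health status. The function name, signature, and thresholds here are illustrative assumptions, not from any framework:

```python
# Hypothetical sketch: combining the three signals into one status.
# The thresholds (1000 jobs, 5% failures) are placeholders to tune.

def queue_status(workers_alive: bool, queue_depth: int, failure_rate: float) -> str:
    """Reduce the three monitoring signals to a single health status."""
    if not workers_alive:
        return "down"        # no liveness: nothing is being processed at all
    if queue_depth > 1000 or failure_rate > 0.05:
        return "degraded"    # workers run, but can't keep up or jobs are failing
    return "ok"

print(queue_status(True, 12, 0.01))   # ok
print(queue_status(True, 5000, 0.0))  # degraded
print(queue_status(False, 0, 0.0))    # down
```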
Laravel's Horizon dashboard gives you a UI overview, but it's not enough on its own for production alerting.
Expose a dedicated health check for your queue system:
```php
// routes/api.php
Route::get('/health/queue', function () {
    $failedJobs = DB::table('failed_jobs')->count();
    $horizonStatus = Cache::get('horizon:status', 'inactive');

    $health = [
        'status' => $horizonStatus === 'running' ? 'ok' : 'degraded',
        'horizon' => $horizonStatus,
        'failed_jobs' => $failedJobs,
    ];

    $statusCode = $horizonStatus === 'running' ? 200 : 503;

    return response()->json($health, $statusCode);
});
```
Horizon writes its status to the cache — `horizon:status` will be `running`, `paused`, or `inactive`. Your uptime monitor can check this endpoint every minute and alert when the status isn't `running`.
Queue depth deserves its own check, since Horizon can report `running` while a backlog quietly grows:

```php
Route::get('/health/queue', function () {
    $queues = ['default', 'emails', 'reports'];
    $depths = [];

    foreach ($queues as $queue) {
        $depths[$queue] = Queue::size($queue);
    }

    $maxDepth = max($depths);
    $status = $maxDepth > 1000 ? 'degraded' : 'ok';

    return response()->json([
        'status' => $status,
        'queues' => $depths,
        'horizon' => Cache::get('horizon:status'),
    ], $status === 'ok' ? 200 : 503);
});
```
In `config/horizon.php`:

```php
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue' => ['default'],
            'balance' => 'auto',
            'maxProcesses' => 10,
            'minProcesses' => 3,
            'tries' => 3,
            'timeout' => 60,
        ],
    ],
],

'metrics' => [
    'trim_snapshots' => [
        'job' => 24,
        'queue' => 24,
    ],
],

'waits' => [
    'redis:default' => 60, // Alert if jobs wait longer than 60 seconds
],
```
For an end-to-end liveness signal, use a heartbeat pattern: dispatch a lightweight job through the queue on a schedule, and alert when the value it writes goes stale. Because the cache key is only refreshed when a worker actually processes the job, a stale heartbeat means the pipeline — scheduler, queue, or workers — has stopped moving:

```php
// app/Jobs/QueueHeartbeat.php
class QueueHeartbeat implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function handle()
    {
        // Only written when a worker picks up and runs the job
        Cache::put('queue:heartbeat', now()->timestamp, 300);
    }
}

// app/Console/Kernel.php — dispatch every 5 minutes
$schedule->job(new QueueHeartbeat)->everyFiveMinutes();
```

```php
// Health check
Route::get('/health/queue', function () {
    $heartbeat = Cache::get('queue:heartbeat');
    $age = $heartbeat ? now()->timestamp - $heartbeat : null;

    // Degraded if there is no heartbeat, or the last one is over 10 minutes old
    $status = (!$heartbeat || $age > 600) ? 'degraded' : 'ok';

    return response()->json([
        'status' => $status,
        'heartbeat_age_seconds' => $age,
    ], $status === 'ok' ? 200 : 503);
});
```
BullMQ is the standard choice for Node.js queue processing. Its built-in metrics make monitoring straightforward.
```javascript
const { Queue } = require('bullmq');
const IORedis = require('ioredis');

// BullMQ requires an ioredis connection (it does not work with node-redis)
const connection = new IORedis(process.env.REDIS_URL, {
  maxRetriesPerRequest: null,
});

const emailQueue = new Queue('emails', { connection });
const reportQueue = new Queue('reports', { connection });

app.get('/health/queue', async (req, res) => {
  try {
    const [emailCounts, reportCounts] = await Promise.all([
      emailQueue.getJobCounts('waiting', 'active', 'failed', 'completed'),
      reportQueue.getJobCounts('waiting', 'active', 'failed', 'completed'),
    ]);

    const totalFailed = emailCounts.failed + reportCounts.failed;
    const totalWaiting = emailCounts.waiting + reportCounts.waiting;
    const degraded = totalFailed > 50 || totalWaiting > 1000;

    res.status(degraded ? 503 : 200).json({
      status: degraded ? 'degraded' : 'ok',
      queues: {
        emails: emailCounts,
        reports: reportCounts,
      },
    });
  } catch (err) {
    res.status(503).json({ status: 'error', message: err.message });
  }
});
```
BullMQ workers expose an isRunning() method, but if your worker runs in a separate process from your web server, you need another approach:
```javascript
// In your worker process — write a heartbeat to Redis
const { Worker } = require('bullmq');

const worker = new Worker('emails', processEmailJob, { connection });

worker.on('ready', () => {
  console.log('Worker ready');

  // Refresh the heartbeat every 30 seconds; the key expires after 120
  setInterval(async () => {
    await connection.set('worker:emails:heartbeat', Date.now(), 'EX', 120);
  }, 30000);
});

worker.on('error', (err) => {
  console.error('Worker error:', err);
});
```
```javascript
// In your health check — read the heartbeat
app.get('/health/queue', async (req, res) => {
  const heartbeat = await connection.get('worker:emails:heartbeat');
  const age = heartbeat ? Date.now() - parseInt(heartbeat, 10) : null;
  const workerAlive = age !== null && age < 90000; // 90-second threshold

  res.status(workerAlive ? 200 : 503).json({
    status: workerAlive ? 'ok' : 'degraded',
    worker_age_ms: age,
  });
});
```
BullMQ automatically marks jobs as stalled if a worker dies mid-processing:
```javascript
const { QueueEvents } = require('bullmq');

const queueEvents = new QueueEvents('emails', { connection });

queueEvents.on('stalled', ({ jobId }) => {
  console.error(`Job ${jobId} stalled — worker may have died`);
  // Send alert to your monitoring system
});

queueEvents.on('failed', ({ jobId, failedReason }) => {
  console.error(`Job ${jobId} failed: ${failedReason}`);
});
```
Celery is the standard queue processing library for Python. Monitoring requires a combination of the Celery inspection API and external health checks.
```python
import os

from celery import Celery
from flask import Flask, jsonify

app = Flask(__name__)
celery = Celery('tasks', broker=os.environ['REDIS_URL'])

@app.route('/health/queue')
def queue_health():
    try:
        # Check if any workers are responding
        inspect = celery.control.inspect(timeout=2)
        active = inspect.active()

        if not active:
            return jsonify({
                'status': 'degraded',
                'error': 'No workers responding'
            }), 503

        worker_count = len(active)
        total_active_jobs = sum(len(jobs) for jobs in active.values())

        return jsonify({
            'status': 'ok',
            'workers': worker_count,
            'active_jobs': total_active_jobs,
        })
    except Exception as e:
        return jsonify({'status': 'error', 'error': str(e)}), 503
```
You can also check queue depth directly — with the Redis broker, each Celery queue is stored as a Redis list:

```python
import os

import redis

r = redis.from_url(os.environ['REDIS_URL'])

@app.route('/health/queue')
def queue_health():
    queues = ['celery', 'emails', 'reports']
    depths = {}

    for queue in queues:
        depths[queue] = r.llen(queue)

    max_depth = max(depths.values()) if depths else 0
    status = 'degraded' if max_depth > 1000 else 'ok'

    return jsonify({
        'status': status,
        'queues': depths,
    }), 200 if status == 'ok' else 503
```
The heartbeat pattern works here too: schedule a task through Celery beat and alert when the value it writes goes stale.

```python
# tasks.py
import os
import time

import redis

@celery.task
def queue_heartbeat():
    r = redis.from_url(os.environ['REDIS_URL'])
    r.setex('queue:heartbeat', 300, int(time.time()))

# Celery beat schedule — runs the heartbeat task every 5 minutes
CELERYBEAT_SCHEDULE = {
    'queue-heartbeat': {
        'task': 'tasks.queue_heartbeat',
        'schedule': 300.0,
    },
}
```

```python
@app.route('/health/queue')
def queue_health():
    heartbeat = r.get('queue:heartbeat')
    if not heartbeat:
        return jsonify({'status': 'degraded', 'error': 'No heartbeat'}), 503

    age = int(time.time()) - int(heartbeat)
    if age > 600:  # 10 minutes
        return jsonify({'status': 'degraded', 'heartbeat_age': age}), 503

    return jsonify({'status': 'ok', 'heartbeat_age': age})
```
| Signal | Threshold | Severity |
|---|---|---|
| Worker not running | Immediate | P1 |
| Health endpoint returning 503 | Immediate | P1 |
| Queue depth growing | >1000 jobs | P2 |
| Job failure rate | >5% failure rate | P2 |
| Heartbeat missed | >10 minutes | P1 |
| Jobs stalled | Any | P2 |
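As an illustration, the table above can be encoded as data that your alerting glue evaluates on each check. The metric names and rule shape here are assumptions for the sketch, not any monitoring product's API:

```python
# Hypothetical rule table mirroring the alerting thresholds above.
# Each rule: (signal name, predicate over a metrics dict, severity).
ALERT_RULES = [
    ("worker_not_running", lambda m: not m["worker_running"], "P1"),
    ("health_endpoint_503", lambda m: m["health_status_code"] == 503, "P1"),
    ("queue_depth", lambda m: m["queue_depth"] > 1000, "P2"),
    ("failure_rate", lambda m: m["failure_rate"] > 0.05, "P2"),
    ("heartbeat_missed", lambda m: m["heartbeat_age_seconds"] > 600, "P1"),
    ("jobs_stalled", lambda m: m["stalled_jobs"] > 0, "P2"),
]

def triggered_alerts(metrics: dict) -> list:
    """Return (signal, severity) pairs for every rule that fires."""
    return [(name, sev) for name, pred, sev in ALERT_RULES if pred(metrics)]

metrics = {
    "worker_running": True,
    "health_status_code": 200,
    "queue_depth": 4200,       # over the 1000-job threshold
    "failure_rate": 0.02,
    "heartbeat_age_seconds": 45,
    "stalled_jobs": 0,
}
print(triggered_alerts(metrics))  # [('queue_depth', 'P2')]
```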
The most reliable approach is a dedicated /health/queue endpoint that your uptime monitor checks every minute. It encapsulates all the queue-specific logic, and you get the same alerting path as your main application uptime.
Domain Monitor monitors your health check endpoints — including queue-specific ones — from multiple global locations every minute. Point it at /health/queue alongside your main health check and you'll know immediately when workers go down, not when a user reports that their email never arrived. Create a free account.