
PostgreSQL is stable, capable, and widely trusted — but it has specific failure modes that catch developers off guard. Connection limits hit silently, table bloat accumulates, long-running transactions block other queries, and autovacuum falls behind under heavy write loads. None of these announce themselves loudly until something breaks.
Here's how to monitor the things that matter in a production Postgres setup.
Start with the fundamentals — can your application connect to Postgres?
# Python / Flask + SQLAlchemy (assumes `app` and `db` from Flask-SQLAlchemy)
from flask import jsonify
from sqlalchemy import text

@app.route('/health')
def health():
    try:
        with db.engine.connect() as conn:
            conn.execute(text('SELECT 1'))
        return jsonify({'status': 'ok', 'database': 'ok'}), 200
    except Exception as e:
        return jsonify({'status': 'degraded', 'database': str(e)}), 503
// Node.js / pg (assumes an Express `app` and a pg `Pool` configured elsewhere)
const express = require('express');
const { Pool } = require('pg');

const app = express();
const pool = new Pool(); // reads PG* environment variables by default

app.get('/health', async (req, res) => {
  try {
    await pool.query('SELECT 1');
    res.json({ status: 'ok', database: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'degraded', database: err.message });
  }
});
Point your uptime monitor at this endpoint. When Postgres is down or unreachable, the health check returns 503 and you're alerted immediately.
PostgreSQL has a hard max_connections limit (default 100). When it's hit, new connections are rejected with FATAL: sorry, too many clients already — and anything that needs a fresh connection fails immediately.
Check current connection status:
-- Current vs maximum connections
SELECT
  max_conn,
  used,
  ROUND(used::numeric / max_conn * 100, 1) AS usage_pct
FROM
  (SELECT setting::int AS max_conn FROM pg_settings WHERE name = 'max_connections') mc,
  (SELECT COUNT(*) AS used FROM pg_stat_activity) ua;

-- Breakdown by state
SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;

-- Long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle' AND query_start < NOW() - INTERVAL '5 minutes'
ORDER BY duration DESC;
Alert when connection usage exceeds 80%.
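That 80% check is easy to script. Below is a minimal sketch, assuming psycopg2 is installed and that `check_connection_usage`, the DSN string, and the print-based alert are all placeholders for your own monitoring agent:

```python
USAGE_THRESHOLD_PCT = 80.0

def usage_pct(used: int, max_conn: int) -> float:
    """Connection usage as a percentage of max_connections."""
    return round(used / max_conn * 100, 1)

def check_connection_usage(dsn: str) -> bool:
    """Query current usage and return True when it crosses the threshold."""
    import psycopg2  # assumption: psycopg2 installed; any Postgres driver works
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT setting::int FROM pg_settings WHERE name = 'max_connections'")
        max_conn = cur.fetchone()[0]
        cur.execute("SELECT COUNT(*) FROM pg_stat_activity")
        used = cur.fetchone()[0]
    pct = usage_pct(used, max_conn)
    if pct >= USAGE_THRESHOLD_PCT:
        print(f"ALERT: connection usage at {pct}% ({used}/{max_conn})")
    return pct >= USAGE_THRESHOLD_PCT

# Run from cron or a monitoring agent, e.g.:
#   check_connection_usage("dbname=yourdb user=monitor")
```

Note that pg_stat_activity includes background workers, so the count is slightly conservative — which is the right direction for an alert.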
If you're hitting connection limits, PgBouncer is the standard solution. It sits between your application and Postgres, pooling connections so hundreds of application threads share a smaller number of actual Postgres connections.
# pgbouncer.ini
[databases]
yourdb = host=127.0.0.1 port=5432 dbname=yourdb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
Monitor PgBouncer alongside Postgres:
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
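The signal to watch in SHOW POOLS output is clients queuing for a server connection. A sketch of that check — the row dicts below stub out real SHOW POOLS columns (cl_waiting, maxwait); fetching them over port 6432 with your driver of choice is left out:

```python
def saturated_pools(rows):
    """Return pools where clients are waiting, i.e. the pool is exhausted."""
    return [r for r in rows if r.get("cl_waiting", 0) > 0 or r.get("maxwait", 0) > 0]

# Stubbed SHOW POOLS rows for illustration:
pools = [
    {"database": "yourdb", "cl_active": 25, "cl_waiting": 0, "maxwait": 0},
    {"database": "yourdb", "cl_active": 25, "cl_waiting": 7, "maxwait": 3},
]
```

A persistently nonzero cl_waiting or maxwait means default_pool_size is too small for your workload, or queries are holding server connections too long.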
Enable pg_stat_statements for query-level performance tracking:
-- Enable pg_stat_statements (in postgresql.conf)
-- shared_preload_libraries = 'pg_stat_statements'
-- After restart, create extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 slowest queries by total execution time
SELECT
  round(total_exec_time::numeric, 2) AS total_ms,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  calls,
  LEFT(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Also enable the slow query log:
# postgresql.conf
log_min_duration_statement = 1000 # Log queries over 1 second
log_checkpoints = on
log_connections = on
log_lock_waits = on
PostgreSQL uses MVCC (Multi-Version Concurrency Control) — updates and deletes leave dead tuples in tables. Autovacuum reclaims this space, but under heavy load it can fall behind, causing table bloat that degrades query performance.
Check for tables with excessive dead tuples:
SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
  last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
A dead_pct over 20% on frequently accessed tables indicates autovacuum isn't keeping up. You may need to tune autovacuum settings or run VACUUM ANALYZE manually.
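For an automated check, the same ratio and thresholds can be expressed as a small helper — a sketch, with the 10,000-tuple floor and 20% cutoff taken from the query and guidance above:

```python
def dead_pct(n_live_tup: int, n_dead_tup: int) -> float:
    """Dead tuples as a percentage of all tuples (mirrors the SQL's NULLIF guard)."""
    total = n_live_tup + n_dead_tup
    return round(n_dead_tup / total * 100, 1) if total else 0.0

def needs_vacuum(n_live_tup: int, n_dead_tup: int, threshold_pct: float = 20.0) -> bool:
    """Flag a table when it has a meaningful number of dead tuples
    and the dead fraction suggests autovacuum is falling behind."""
    return n_dead_tup > 10000 and dead_pct(n_live_tup, n_dead_tup) >= threshold_pct
```

Feed it rows from pg_stat_user_tables and page on any table that stays flagged across consecutive checks, rather than on a single reading.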
A transaction that's been open for hours can block autovacuum, hold locks, and prevent other queries from completing.
-- Transactions open for over 5 minutes
SELECT
  pid,
  now() - xact_start AS duration,
  state,
  LEFT(query, 100) AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > INTERVAL '5 minutes'
ORDER BY duration DESC;
Alert on transactions older than 30 minutes. Kill them if necessary:
-- Terminate a specific backend (use a pid from the query above)
SELECT pg_terminate_backend(12345);  -- replace 12345 with the offending pid
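If stale transactions are a recurring problem, the find-and-kill loop can be automated. A sketch, assuming psycopg2 and enough privilege to signal other backends (superuser or pg_signal_backend membership); `kill_long_transactions` and its dry_run default are illustrative names, not a standard tool:

```python
from datetime import timedelta

def over_cutoff(xact_age: timedelta, max_minutes: int = 30) -> bool:
    """True when an open transaction has outlived the cutoff."""
    return xact_age > timedelta(minutes=max_minutes)

def kill_long_transactions(dsn: str, max_minutes: int = 30, dry_run: bool = True):
    """Report (or, with dry_run=False, terminate) transactions older than the cutoff."""
    import psycopg2  # assumption: psycopg2 installed
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT pid, now() - xact_start AS age FROM pg_stat_activity "
            "WHERE xact_start IS NOT NULL"
        )
        for pid, age in cur.fetchall():
            if not over_cutoff(age, max_minutes):
                continue
            print(f"{'would terminate' if dry_run else 'terminating'} pid {pid}, open {age}")
            if not dry_run:
                cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```

Keep dry_run=True until you've watched the output for a while — terminating a backend rolls back its work, and a long transaction is sometimes a legitimate migration or batch job.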
If you run streaming replication, monitor lag from both the primary and the replica:
-- On the primary: check replication status
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

-- On the replica: check lag
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_delay;
Alert when replication_delay exceeds your threshold — typically 30 seconds to a few minutes depending on your application's tolerance for stale reads.
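The replica-side query returns an interval, which psycopg2 maps to a datetime.timedelta. A sketch of the threshold check — the 30-second default and the function name are illustrative:

```python
from datetime import timedelta

REPLICATION_DELAY_MAX = timedelta(seconds=30)  # tune to your staleness tolerance

def lag_exceeded(delay, threshold=REPLICATION_DELAY_MAX):
    """True when replay delay is beyond the acceptable threshold.

    delay is None when pg_last_xact_replay_timestamp() returns NULL,
    i.e. no transaction has been replayed yet — treat that as a separate
    alert condition rather than lag.
    """
    return delay is not None and delay > threshold
```

Note the delay also grows when the primary is simply idle (no new transactions to replay), so pair this with a write-activity check before paging anyone.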
PostgreSQL continuously writes table data, indexes, write-ahead log (WAL) segments, and temporary files to disk. A full disk causes immediate, hard failures.
# Check disk space
df -h /var/lib/postgresql
# Check Postgres data directory size
du -sh /var/lib/postgresql/*/main/
Alert at 80% disk usage. Pay particular attention to WAL accumulation — if a replication consumer falls behind, WAL files accumulate until they fill the disk.
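The df check translates directly to the standard library, so a monitoring script needs no extra dependencies. A sketch using the 80% threshold above; the path is whatever mount holds your data directory:

```python
import shutil

def disk_usage_pct(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return round(usage.used / usage.total * 100, 1)

def disk_alert(path: str, threshold_pct: float = 80.0) -> bool:
    """True when the mount holding `path` is at or above the threshold."""
    return disk_usage_pct(path) >= threshold_pct

# Usage: disk_alert("/var/lib/postgresql")
```

Note that shutil.disk_usage reports the whole filesystem, not the directory's own size — which is what you want for a full-disk alert.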
Database-specific monitoring catches the database problems. Application uptime monitoring catches what users actually experience.
Domain Monitor monitors your application health check endpoint from multiple global locations every minute. When PostgreSQL issues cause user-facing failures, you're alerted immediately — not five minutes later. Create a free account.
See how to monitor MySQL availability for comparison with MySQL monitoring approaches, and uptime monitoring best practices for the full monitoring picture.