
The database is almost always the bottleneck, and often the failure point. When MySQL goes down — or gets slow enough that it might as well be down — your entire application suffers. The challenge is that database problems rarely announce themselves with an obvious error. Instead, you see slow page loads, intermittent 500 errors, and timeouts that are difficult to trace back to their source without proper monitoring.
The first layer: is MySQL accepting connections at all?
Add a database connectivity check to your application's health endpoint:
```python
# Flask / Python
from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy import text

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'mysql://user:password@localhost/app'
db = SQLAlchemy(app)

@app.route('/health')
def health():
    try:
        db.session.execute(text('SELECT 1'))
        db_status = 'ok'
    except Exception as e:
        db_status = str(e)
    return jsonify({
        'status': 'ok' if db_status == 'ok' else 'degraded',
        'database': db_status
    }), 200 if db_status == 'ok' else 503
```
```php
// Laravel
Route::get('/health', function () {
    try {
        DB::select('SELECT 1');
        $db = 'ok';
    } catch (\Exception $e) {
        $db = $e->getMessage();
    }
    $status = $db === 'ok' ? 200 : 503;
    return response()->json([
        'status' => $db === 'ok' ? 'ok' : 'degraded',
        'database' => $db,
    ], $status);
});
```
Point your uptime monitor at this endpoint. A 503 response tells you MySQL is unreachable before users notice.
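If you want to run your own scheduled check instead of (or alongside) a hosted monitor, the polling logic is small. A minimal sketch using only the standard library; the URL is a placeholder and the verdict names are illustrative, not part of any monitoring API:

```python
# Minimal external poller for a /health endpoint like the one above.
# The URL is a placeholder -- point it at your own application.
import json
import urllib.error
import urllib.request

def classify_health(status_code, body):
    """Map an HTTP response to a monitoring verdict."""
    if status_code == 200 and body.get('database') == 'ok':
        return 'healthy'
    return 'degraded'

def poll(url='https://example.com/health', timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_health(resp.status, json.load(resp))
    except urllib.error.HTTPError:
        # A 503 from the endpoint means the database check failed
        return 'degraded'
    except (urllib.error.URLError, TimeoutError):
        # The application itself is unreachable
        return 'down'
```

Run it from cron every minute and alert on anything other than 'healthy'. Distinguishing 'degraded' (app up, database failing) from 'down' (app unreachable) speeds up diagnosis.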
For more granular monitoring, check MySQL directly:
```bash
# Quick connectivity test
mysqladmin -u monitor_user -p'password' -h 127.0.0.1 ping
# Returns "mysqld is alive" or fails with an error
```
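If you'd rather not install the MySQL client on the monitoring host, a TCP-level probe of port 3306 is a rough substitute. It's weaker than mysqladmin ping (it only proves something is accepting connections on the port, not that MySQL can execute queries), but it needs no credentials. A sketch:

```python
# TCP-level check: is anything accepting connections on the MySQL port?
# Weaker than "mysqladmin ping" -- no MySQL handshake is performed.
import socket

def port_open(host='127.0.0.1', port=3306, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeout, and unreachable host
        return False
```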
Create a dedicated read-only monitoring user with minimal permissions:
```sql
CREATE USER 'monitor'@'localhost' IDENTIFIED BY 'monitor_password';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'monitor'@'localhost';
FLUSH PRIVILEGES;
```
A database that's up but running slow queries is almost as bad as one that's down. Slow queries cause timeouts, connection pool exhaustion, and cascading failures.
```sql
-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1; -- Log queries taking over 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow-queries.log';

-- Also log queries that don't use indexes
SET GLOBAL log_queries_not_using_indexes = 'ON';
```
Make these permanent in /etc/mysql/mysql.conf.d/mysqld.cnf:
```ini
[mysqld]
slow_query_log = 1
long_query_time = 1
slow_query_log_file = /var/log/mysql/slow-queries.log
log_queries_not_using_indexes = 1
```
mysqldumpslow summarises the slow query log:
```bash
# Show top 10 slowest queries by total time
mysqldumpslow -s t -t 10 /var/log/mysql/slow-queries.log

# Show queries with the most occurrences
mysqldumpslow -s c -t 10 /var/log/mysql/slow-queries.log
```
pt-query-digest (Percona Toolkit) provides more detailed analysis including fingerprinting and statistics.
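The core idea behind fingerprinting is normalisation: strip out the literal values so that "the same query with different parameters" collapses into one entry. A rough illustration of the concept, not Percona's actual algorithm:

```python
# Rough sketch of query fingerprinting: normalise away literal values so
# identical query shapes group together. Illustrative only -- pt-query-digest's
# real normalisation handles many more cases.
import re
from collections import Counter

def fingerprint(sql):
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", '?', s)   # collapse string literals
    s = re.sub(r'\b\d+\b', '?', s)   # collapse numeric literals
    s = re.sub(r'\s+', ' ', s)       # normalise whitespace
    return s

def top_queries(queries, n=10):
    """Count occurrences of each fingerprint, most frequent first."""
    return Counter(fingerprint(q) for q in queries).most_common(n)
```

With this, SELECT * FROM users WHERE id = 42 and SELECT * FROM users WHERE id = 7 share a fingerprint, which is exactly why digest tools can tell you which query *shape* dominates your slow log rather than drowning you in individual instances.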
Connection pool exhaustion causes "Too many connections" errors that take your application down even while MySQL itself is running fine.
```sql
-- Check current connections vs maximum
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Threads_running';

-- Check connection usage percentage
SELECT
    @@max_connections AS max_connections,
    COUNT(*) AS current_connections,
    ROUND(COUNT(*) * 100 / @@max_connections, 1) AS usage_pct
FROM information_schema.processlist;
```
Alert when connection usage exceeds 80% of max_connections. At that point, you're at risk of exhaustion under any traffic spike.
Add a scheduled check to your monitoring:
```bash
#!/bin/bash
USAGE=$(mysql -u monitor -p'password' -e "SELECT ROUND(COUNT(*) * 100 / @@max_connections) FROM information_schema.processlist;" -s -N 2>/dev/null)
if [ "$USAGE" -gt 80 ]; then
    # Send alert
    echo "MySQL connection usage at ${USAGE}%" | mail -s "MySQL Connection Warning" [email protected]
fi
```
If you're running MySQL replication (primary-replica setup for read scaling or high availability), replication lag is a critical metric. High lag means your replicas are serving stale data.
```sql
-- On the replica
SHOW REPLICA STATUS\G

-- Key field: Seconds_Behind_Source (or Seconds_Behind_Master in older versions)
-- 0 = replica is current
-- High numbers = replica is falling behind
```
Alert on Seconds_Behind_Source exceeding your threshold — typically 30–60 seconds for most applications. For near-real-time applications, alert on anything above 5 seconds.
A replication script for scheduled monitoring:
```python
import pymysql

def alert(message):
    # Hook into your alerting system (email, Slack, PagerDuty, ...)
    print(message)

def check_replication_lag(host, user, password):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        cursor = conn.cursor(pymysql.cursors.DictCursor)
        cursor.execute("SHOW REPLICA STATUS")
        status = cursor.fetchone()
        if status is None:
            return None  # Not a replica
        lag = status.get('Seconds_Behind_Source', 0)
        running = status.get('Replica_SQL_Running', 'No')
        if running != 'Yes':
            alert(f"Replication SQL thread is not running on {host}")
        elif lag and lag > 60:
            alert(f"Replication lag is {lag}s on {host}")
        return lag
    finally:
        conn.close()
```
MySQL tables grow. Running out of disk space causes MySQL to crash or stop accepting writes immediately. Monitor disk usage on your MySQL data directory:
```bash
df -h /var/lib/mysql
```
Alert when disk usage exceeds 80%. Running out of space at 100% gives you no time to react.
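The same check is easy to script for scheduled monitoring. A sketch using only the standard library; the threshold and data directory path follow the advice above, and the print-based alert is a placeholder for your real alerting hook:

```python
# Scheduled disk-space check for the MySQL data directory.
# The alert mechanism here (print) is a placeholder.
import shutil

def usage_pct(used, total):
    """Percentage of capacity in use, rounded to one decimal."""
    return round(used * 100 / total, 1)

def check_disk(path='/var/lib/mysql', threshold=80):
    du = shutil.disk_usage(path)
    pct = usage_pct(du.used, du.total)
    if pct > threshold:
        print(f"ALERT: {path} is {pct}% full")  # replace with your alerting
    return pct
```

Run it from cron hourly. Checking percentage rather than absolute free space means the same script works across servers with different disk sizes.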
All the database-specific monitoring above is complementary to application-level uptime monitoring. The quickest way to know when database problems are affecting users is a health check endpoint that tests the connection.
Domain Monitor checks your application's health endpoint every minute. When MySQL goes down or becomes slow enough that queries time out, your health check returns 503 and you're alerted immediately. Create a free account.
For broader monitoring context, see uptime monitoring best practices and website monitoring checklist for developers.