
Fly.io is a compelling platform — it runs your Docker containers close to users globally, scales machines automatically, and handles a lot of infrastructure concerns for you. It's increasingly popular for applications that need low latency globally without the complexity of managing multi-region infrastructure yourself.
Like every hosting platform, though, it has blind spots. Fly's internal health checks and metrics tell you about your machines. External monitoring tells you what users actually experience.
Fly.io supports health checks configured in your fly.toml. There are two types:
TCP checks — verify that a port is accepting connections. Minimal overhead, but they don't confirm your application is responding correctly:

```toml
[[services.tcp_checks]]
grace_period = "1s"
interval = "15s"
restart_limit = 0
timeout = "2s"
```
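Conceptually, a TCP check is just a connection attempt with a timeout. A minimal Python sketch of the same idea (a hypothetical helper, not Fly's actual implementation):

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection opens within the timeout.

    This mirrors what a TCP health check verifies: the port accepts
    connections -- nothing about what the application actually does.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A process that is deadlocked but still holds its listening socket will pass this check, which is exactly why HTTP checks are preferable.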
HTTP checks — make an actual HTTP request and check the status code. Much more meaningful:

```toml
[[services.http_checks]]
interval = 10000 # milliseconds
grace_period = "5s"
method = "get"
path = "/health"
protocol = "http"
timeout = 2000
tls_skip_verify = false
[services.http_checks.headers]
X-Forwarded-Proto = "https"
```
Fly uses these checks to decide whether a machine is healthy and whether to route traffic to it. If your health check fails, Fly pulls that machine out of rotation.
The health check path in your fly.toml should return a meaningful signal:
Minimal:

```go
// Go
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	w.Write([]byte(`{"status":"ok"}`))
})
```

```python
# Python / Flask
@app.route('/health')
def health():
    return {'status': 'ok'}, 200
```
With dependency checks:

```python
from flask import jsonify
from sqlalchemy import text

@app.route('/health')
def health():
    checks = {}

    # Check database connectivity
    try:
        db.session.execute(text('SELECT 1'))
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)

    # Check Redis if used
    try:
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = str(e)

    all_ok = all(v == 'ok' for v in checks.values())
    status_code = 200 if all_ok else 503
    return jsonify({'status': 'ok' if all_ok else 'degraded', **checks}), status_code
```
Return 503 when dependencies are unhealthy — this signals to Fly to stop routing traffic to the machine.
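It's worth verifying the degraded path actually returns 503 before relying on it. A sketch using Flask's test client, with a stubbed db_ping standing in for a real database call (both names are hypothetical):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def db_ping():
    # Stub standing in for a real database call; it always fails here
    # so we can observe the degraded path.
    raise ConnectionError('database unreachable')

@app.route('/health')
def health():
    checks = {}
    try:
        db_ping()
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)
    all_ok = all(v == 'ok' for v in checks.values())
    status = 'ok' if all_ok else 'degraded'
    return jsonify({'status': status, **checks}), (200 if all_ok else 503)

response = app.test_client().get('/health')
print(response.status_code, response.get_json()['status'])  # 503 degraded
```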
Fly restarts machines that fail health checks. If your app crashes consistently, you'll see machines restarting in rapid succession (a "crash loop"). The fly logs command shows you what's happening:

```shell
fly logs --app your-app-name
```
Watch for repeated start/stop cycles. A machine that restarts every few minutes indicates a crash loop — the application starts, fails a health check, gets restarted.
You can also get machine events:

```shell
fly machine list --app your-app-name
fly machine status <machine-id> --app your-app-name
```
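If you want to script this, most fly commands support JSON output. A sketch that flags machines not in the started state (the --json flag and the exact field names are assumptions here; check `fly machine list --help` and a sample of the output for your CLI version):

```python
import json
import subprocess

def parse_unstarted(machines_json: str) -> list:
    """Return IDs of machines whose state is not 'started'.

    Assumes each machine object has 'id' and 'state' fields,
    as seen in typical `fly machine list --json` output.
    """
    machines = json.loads(machines_json)
    return [m['id'] for m in machines if m.get('state') != 'started']

def unstarted_machines(app_name: str) -> list:
    # Assumes the CLI supports JSON output for this subcommand.
    out = subprocess.run(
        ['fly', 'machine', 'list', '--app', app_name, '--json'],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_unstarted(out)
```

Run from cron or CI, a non-empty result is a cheap early warning that something is restarting or stuck.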
One of Fly's selling points is running your application in multiple regions. This creates a monitoring consideration: your app might be working fine in the US while a region in Europe is failing.
A single-location uptime monitor would miss a regional failure. You need monitoring from multiple locations that correspond to where your Fly machines are deployed.
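It also helps if your app identifies which region served a given response. Fly sets the FLY_REGION environment variable in each machine's runtime environment, so you can echo it from your health endpoint (a minimal Flask sketch):

```python
import os

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    # FLY_REGION is set by Fly.io in each machine's environment
    # (e.g. "iad", "fra"); "unknown" covers local development.
    return jsonify({
        'status': 'ok',
        'region': os.getenv('FLY_REGION', 'unknown'),
    }), 200
```

When a multi-location monitor reports a failure, the region field in recent successful responses makes it much easier to correlate which Fly region is misbehaving.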
Domain Monitor checks from multiple global locations simultaneously. If your Fly.io app is failing in one region but working in another, you'll see which locations are affected — giving you a much clearer picture than a single-point check. Create a free account to set up multi-location monitoring.
If your Fly app uses volumes for persistent storage, a volume filling up is a common failure mode. Fly doesn't alert you when volume usage is high.
Add disk usage to your health check:

```python
import shutil

from flask import jsonify

@app.route('/health')
def health():
    disk = shutil.disk_usage('/')
    disk_pct = disk.used / disk.total * 100

    status = 'ok'
    if disk_pct > 90:
        status = 'warning'
    if disk_pct > 95:
        status = 'critical'

    return jsonify({
        'status': status,
        'disk_used_pct': round(disk_pct, 1)
    }), 200 if status != 'critical' else 503
```
Missing secrets are a common cause of Fly deployment failures. Fly secrets are set via fly secrets set, and a missing required secret will crash your application at startup.
Validate required environment variables on startup:

```python
import os
import sys

REQUIRED_VARS = ['DATABASE_URL', 'SECRET_KEY', 'REDIS_URL']

missing = [v for v in REQUIRED_VARS if not os.getenv(v)]
if missing:
    print(f"ERROR: Missing required environment variables: {missing}", file=sys.stderr)
    sys.exit(1)
```
This turns a confusing runtime error into an immediate, obvious startup failure that Fly will report clearly in the logs.
Fly's health checks handle internal routing. External monitoring handles user-perspective availability. Both are needed:
| What breaks | Fly health check catches it | External monitor catches it |
|---|---|---|
| Application crash | Yes (stops routing traffic) | Yes (no response) |
| Database down | Yes (if health endpoint checks DB) | Yes (500 errors or no response) |
| Regional network issue | No | Yes (specific location fails) |
| Fly platform issue | No | Yes |
| Slow responses (not failing) | No | Yes (response time threshold) |
Use Fly's checks to manage traffic routing. Use Domain Monitor to stay informed about user-facing availability from the outside.
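The "slow but not failing" row deserves emphasis, since it's invisible to Fly's own checks. An external check is conceptually simple: request the URL, classify the result by both success and latency. A stdlib-only sketch (the threshold values are illustrative):

```python
import time
import urllib.request

def external_check(url: str, timeout: float = 5.0, slow_ms: float = 1000.0) -> str:
    """Classify a URL as 'up', 'slow', or 'down' from the outside."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            # urlopen raises for 4xx/5xx responses, so reaching here
            # means a successful status code.
            return 'slow' if elapsed_ms > slow_ms else 'up'
    except Exception:
        return 'down'
```

A hosted monitor does the same thing continuously, from multiple locations, with alerting on top — but this is the core loop.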
See multi-location uptime monitoring for why location-distributed checks matter, and how to set up uptime monitoring for a complete setup guide.