
Fly.io is a compelling platform — it runs your Docker containers close to users globally, scales machines automatically, and handles a lot of infrastructure concerns for you. It's increasingly popular for applications that need low latency globally without the complexity of managing multi-region infrastructure yourself.
Like every hosting platform, though, it has blind spots. Fly's internal health checks and metrics tell you about your machines. External monitoring tells you what users actually experience.
Fly.io supports health checks configured in your fly.toml. There are two types:
TCP checks — verify that a port is accepting connections. Minimal overhead, but they don't confirm your application is responding correctly:

```toml
[[services.tcp_checks]]
grace_period = "1s"
interval = "15s"
restart_limit = 0
timeout = "2s"
```
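Conceptually, a TCP check is just a connection attempt with a timeout. A minimal Python sketch of the same idea (a hypothetical helper, not Fly's actual implementation):

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection opens within the timeout.

    This mirrors what a TCP health check verifies: the port accepts
    connections -- nothing about what the application actually does.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A process that is deadlocked but still holds its listening socket will pass this check, which is exactly why HTTP checks are preferable.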
HTTP checks — make an actual HTTP request and check the status code. Much more meaningful:

```toml
[[services.http_checks]]
interval = 10000 # milliseconds
grace_period = "5s"
method = "get"
path = "/health"
protocol = "http"
timeout = 2000
tls_skip_verify = false
[services.http_checks.headers]
X-Forwarded-Proto = "https"
```
Fly uses these checks to decide whether a machine is healthy and whether to route traffic to it. If your health check fails, Fly pulls that machine out of rotation.
The health check path in your fly.toml should return a meaningful signal:
Minimal:

```go
// Go
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	w.Write([]byte(`{"status":"ok"}`))
})
```

```python
# Python / Flask
@app.route('/health')
def health():
    return {'status': 'ok'}, 200
```
With dependency checks:

```python
from flask import jsonify
from sqlalchemy import text

@app.route('/health')
def health():
    checks = {}

    # Check database connectivity
    try:
        db.session.execute(text('SELECT 1'))
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)

    # Check Redis if used
    try:
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = str(e)

    all_ok = all(v == 'ok' for v in checks.values())
    status_code = 200 if all_ok else 503
    return jsonify({'status': 'ok' if all_ok else 'degraded', **checks}), status_code
```
Return 503 when dependencies are unhealthy — this signals to Fly to stop routing traffic to the machine.
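It's worth verifying the degraded path actually returns 503 before relying on it. A sketch using Flask's test client, with a stubbed db_ping standing in for a real database call (both names are hypothetical):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def db_ping():
    # Stub standing in for a real database call; it always fails here
    # so we can observe the degraded path.
    raise ConnectionError('database unreachable')

@app.route('/health')
def health():
    checks = {}
    try:
        db_ping()
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)
    all_ok = all(v == 'ok' for v in checks.values())
    status = 'ok' if all_ok else 'degraded'
    return jsonify({'status': status, **checks}), (200 if all_ok else 503)

response = app.test_client().get('/health')
print(response.status_code, response.get_json()['status'])  # 503 degraded
```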
Fly restarts machines that fail health checks. If your app crashes consistently, you'll see machines restarting in rapid succession (a "crash loop"). The fly logs command shows you what's happening:

```shell
fly logs --app your-app-name
```
Watch for repeated start/stop cycles. A machine that restarts every few minutes indicates a crash loop — the application starts, fails a health check, gets restarted.
You can also get machine events:

```shell
fly machine list --app your-app-name
fly machine status <machine-id> --app your-app-name
```
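If you want to script this, most fly commands support JSON output. A sketch that flags machines not in the started state (the --json flag and the exact field names are assumptions here; check `fly machine list --help` and a sample of the output for your CLI version):

```python
import json
import subprocess

def parse_unstarted(machines_json: str) -> list:
    """Return IDs of machines whose state is not 'started'.

    Assumes each machine object has 'id' and 'state' fields,
    as seen in typical `fly machine list --json` output.
    """
    machines = json.loads(machines_json)
    return [m['id'] for m in machines if m.get('state') != 'started']

def unstarted_machines(app_name: str) -> list:
    # Assumes the CLI supports JSON output for this subcommand.
    out = subprocess.run(
        ['fly', 'machine', 'list', '--app', app_name, '--json'],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_unstarted(out)
```

Run from cron or CI, a non-empty result is a cheap early warning that something is restarting or stuck.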
One of Fly's selling points is running your application in multiple regions. This creates a monitoring consideration: your app might be working fine in the US while a region in Europe is failing.
A single-location uptime monitor would miss a regional failure. You need monitoring from multiple locations that correspond to where your Fly machines are deployed.
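It also helps if your app identifies which region served a given response. Fly sets the FLY_REGION environment variable in each machine's runtime environment, so you can echo it from your health endpoint (a minimal Flask sketch):

```python
import os

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    # FLY_REGION is set by Fly.io in each machine's environment
    # (e.g. "iad", "fra"); "unknown" covers local development.
    return jsonify({
        'status': 'ok',
        'region': os.getenv('FLY_REGION', 'unknown'),
    }), 200
```

When a multi-location monitor reports a failure, the region field in recent successful responses makes it much easier to correlate which Fly region is misbehaving.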
Domain Monitor checks from multiple global locations simultaneously. If your Fly.io app is failing in one region but working in another, you'll see which locations are affected — giving you a much clearer picture than a single-point check. Create a free account to set up multi-location monitoring.
If your Fly app uses volumes for persistent storage, a volume filling up is a common failure mode. Fly doesn't alert you when volume usage is high.
Add disk usage to your health check:

```python
import shutil

from flask import jsonify

@app.route('/health')
def health():
    disk = shutil.disk_usage('/')
    disk_pct = disk.used / disk.total * 100

    status = 'ok'
    if disk_pct > 90:
        status = 'warning'
    if disk_pct > 95:
        status = 'critical'

    return jsonify({
        'status': status,
        'disk_used_pct': round(disk_pct, 1)
    }), 200 if status != 'critical' else 503
```
Missing secrets are a common cause of Fly deployment failures. Fly secrets are set via fly secrets set, and a missing required secret will crash your application at startup.
Validate required environment variables on startup:

```python
import os
import sys

REQUIRED_VARS = ['DATABASE_URL', 'SECRET_KEY', 'REDIS_URL']

missing = [v for v in REQUIRED_VARS if not os.getenv(v)]
if missing:
    print(f"ERROR: Missing required environment variables: {missing}", file=sys.stderr)
    sys.exit(1)
```
This turns a confusing runtime error into an immediate, obvious startup failure that Fly will report clearly in the logs.
Fly's health checks handle internal routing. External monitoring handles user-perspective availability. Both are needed:
| What breaks | Fly health check catches it | External monitor catches it |
|---|---|---|
| Application crash | Yes (stops routing traffic) | Yes (no response) |
| Database down | Yes (if health endpoint checks DB) | Yes (500 errors or no response) |
| Regional network issue | No | Yes (specific location fails) |
| Fly platform issue | No | Yes |
| Slow responses (not failing) | No | Yes (response time threshold) |
Use Fly's checks to manage traffic routing. Use Domain Monitor to stay informed about user-facing availability from the outside.
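The "slow but not failing" row deserves emphasis, since it's invisible to Fly's own checks. An external check is conceptually simple: request the URL, classify the result by both success and latency. A stdlib-only sketch (the threshold values are illustrative):

```python
import time
import urllib.request

def external_check(url: str, timeout: float = 5.0, slow_ms: float = 1000.0) -> str:
    """Classify a URL as 'up', 'slow', or 'down' from the outside."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            # urlopen raises for 4xx/5xx responses, so reaching here
            # means a successful status code.
            return 'slow' if elapsed_ms > slow_ms else 'up'
    except Exception:
        return 'down'
```

A hosted monitor does the same thing continuously, from multiple locations, with alerting on top — but this is the core loop.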
See multi-location uptime monitoring for why location-distributed checks matter, and how to set up uptime monitoring for a complete setup guide.