
How to Monitor Fly.io Apps and Health Checks

Fly.io is a compelling platform: it runs your Docker containers close to users around the world, scales machines automatically, and handles many infrastructure concerns for you. It's increasingly popular for applications that need low latency worldwide without the complexity of managing multi-region infrastructure yourself.

Like every hosting platform, though, it has blind spots. Fly's internal health checks and metrics tell you about your machines. External monitoring tells you what users actually experience.

How Fly.io Health Checks Work

Fly.io supports health checks configured in your fly.toml. Two types:

TCP checks: Verify that a port is accepting connections. They add minimal overhead but don't confirm that your application is responding correctly:

[[services.tcp_checks]]
  grace_period = "1s"
  interval = "15s"
  restart_limit = 0
  timeout = "2s"

HTTP checks: Make an actual HTTP request and check the status code. Much more meaningful:

[[services.http_checks]]
  interval = 10000        # milliseconds
  grace_period = "5s"
  method = "get"
  path = "/health"
  protocol = "http"
  timeout = 2000          # milliseconds
  tls_skip_verify = false
  [services.http_checks.headers]
    X-Forwarded-Proto = "https"

Fly uses these checks to decide whether a machine is healthy and whether to route traffic to it. If your health check fails, Fly pulls that machine out of rotation.
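
To make the difference between the two check types concrete, here is a minimal sketch of what each one verifies, written as standalone Python rather than anything Fly itself runs. The function names tcp_check and http_check are illustrative, and the sketch assumes that a 2xx response counts as a pass.

```python
import socket
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Pass if the port accepts a TCP connection within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections and timeouts both end up here.
        return False


def http_check(url: str, timeout: float = 2.0) -> bool:
    """Pass only if a GET returns a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses (HTTPError subclasses URLError),
        # refused connections, and timeouts.
        return False
```

An application that accepts connections but answers every request with a 503 would pass tcp_check and fail http_check, which is why the HTTP variant is the more meaningful signal.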

Building a Meaningful Health Endpoint

The health check path in your fly.toml should return a meaningful signal:

Minimal:

// Go
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(http.StatusOK)
    w.Write([]byte(`{"status":"ok"}`))
})

# Python / Flask
from flask import Flask

app = Flask(__name__)

@app.route('/health')
def health():
    return {'status': 'ok'}, 200

With dependency checks:

from flask import jsonify
from sqlalchemy import text

# Assumes `app`, `db` (Flask-SQLAlchemy) and `redis_client` are created elsewhere.

@app.route('/health')
def health():
    checks = {}

    # Check database
    try:
        db.session.execute(text('SELECT 1'))
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)

    # Check Redis if used
    try:
        redis_client.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = str(e)

    # Healthy only if every dependency reported 'ok'
    all_ok = all(v == 'ok' for v in checks.values())
    status_code = 200 if all_ok else 503
    return jsonify({'status': 'ok' if all_ok else 'degraded', **checks}), status_code

Return 503 when dependencies are unhealthy. A failing status tells Fly to stop routing traffic to that machine.
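
One thing to watch with dependency checks is that the endpoint itself must answer within the check timeout set in fly.toml (2000 ms in the example above). A hung database connection could otherwise make the health check time out instead of returning a clear 503. The following is a rough sketch of one way to bound the checks with a shared deadline. The helper run_checks and its parameters are hypothetical and not part of Flask or Fly.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool so each health request does not create new threads.
_executor = ThreadPoolExecutor(max_workers=4)


def run_checks(checks, timeout_s=0.5):
    """Run all checks concurrently under one overall deadline.

    `checks` maps a name to a zero-argument callable that raises on failure.
    Returns (results, status_code), where results maps each name to
    'ok', 'timeout', or the error text.
    """
    deadline = time.monotonic() + timeout_s
    futures = {name: _executor.submit(fn) for name, fn in checks.items()}
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            future.result(timeout=remaining)
            results[name] = "ok"
        except FutureTimeout:
            # The worker thread keeps running; we only stop waiting for it.
            results[name] = "timeout"
        except Exception as exc:
            results[name] = str(exc)
    healthy = all(value == "ok" for value in results.values())
    return results, (200 if healthy else 503)
```

Inside the Flask handler this could be called as run_checks({'database': check_db, 'redis': redis_client.ping}), where check_db is a hypothetical function wrapping the SELECT 1 query. Keep timeout_s comfortably below the timeout configured in fly.toml.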

Fly.io Machine Restarts and Alerts

Fly restarts machines that fail health checks. If your app crashes consistently, you'll see machines restarting in rapid succession (a "crash loop"). The fly logs command shows you what's happening:

fly logs --app your-app-name

Watch for repeated start/stop cycles. A machine that restarts every few minutes indicates a crash loop — the application starts, fails a health check, gets restarted.
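
If you want to turn "restarts every few minutes" into a concrete rule, a sliding-window count is one simple option. The sketch below assumes you have already collected the machine start times yourself (for example by reading them off the logs). It does not parse Fly's log format, and the thresholds are arbitrary examples.

```python
from datetime import datetime, timedelta


def is_crash_loop(start_times, max_restarts=3, window=timedelta(minutes=10)):
    """Return True if more than `max_restarts` starts fall within any `window`."""
    times = sorted(start_times)
    left = 0
    for right, current in enumerate(times):
        # Shrink the window from the left until it spans at most `window`.
        while current - times[left] > window:
            left += 1
        if right - left + 1 > max_restarts:
            return True
    return False


# Example: four starts within six minutes counts as a crash loop.
base = datetime(2024, 1, 1, 12, 0)
starts = [base + timedelta(minutes=m) for m in (0, 2, 4, 6)]
print(is_crash_loop(starts))
```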

You can also list your machines and inspect an individual machine's status and recent events:

fly machine list --app your-app-name
fly machine status <machine-id> --app your-app-name

Multi-Region Monitoring

One of Fly's selling points is running your application in multiple regions. This creates a monitoring consideration: your app might be working fine in the US while a region in Europe is failing.

A single-location uptime monitor would miss a regional failure. You need monitoring from multiple locations that correspond to where your Fly machines are deployed.
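
The reasoning behind multi-location checks can be sketched in a few lines: compare the results from every probe location and distinguish a full outage from a regional one. The function name summarize and the location labels are illustrative only, and how the per-location results are gathered is left out.

```python
def summarize(results: dict[str, bool]) -> tuple[str, list[str]]:
    """Classify per-location check results.

    `results` maps a probe location to True (check passed) or False.
    Returns ('up' | 'regional' | 'down', sorted list of failing locations).
    """
    if not results:
        raise ValueError("no check results to summarize")
    failing = sorted(loc for loc, ok in results.items() if not ok)
    if not failing:
        return "up", []
    if len(failing) == len(results):
        return "down", failing
    return "regional", failing


# Example: only the European probe fails.
print(summarize({"us-east": True, "europe": False, "asia": True}))
```

A single probe in us-east would report this app as fully up, while the combined view flags a regional failure in europe.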

Domain Monitor checks from multiple global locations simultaneously. If your Fly.io app is failing in one region but working in another, you'll see which locations are affected — giving you a much clearer picture than a single-point check. Create a free account to set up multi-location monitoring.

Volume Mounts and Persistent Storage

If your Fly app uses volumes for persistent storage, a volume filling up is a common failure mode. Fly doesn't alert you when volume usage is high.

Add disk usage to your health check:

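import shutil

from flask import jsonify

# Path where the volume is mounted (the `destination` in the [mounts] section
# of fly.toml). '/data' is an assumed example. Checking '/' would measure the
# root filesystem instead of the volume.
VOLUME_PATH = '/data'

@app.route('/health')
def health():
    disk = shutil.disk_usage(VOLUME_PATH)
    disk_pct = disk.used / disk.total * 100

    status = 'ok'
    if disk_pct > 90:
        status = 'warning'
    if disk_pct > 95:
        status = 'critical'

    # Only 'critical' fails the check; 'warning' is informational.
    return jsonify({
        'status': status,
        'disk_used_pct': round(disk_pct, 1)
    }), 200 if status != 'critical' else 503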

Secrets and Environment Variables

Missing secrets are a common cause of Fly deployment failures. Fly secrets are set via fly secrets set, and a missing required secret will crash your application at startup.

Validate required environment variables on startup:

import os, sys

REQUIRED_VARS = ['DATABASE_URL', 'SECRET_KEY', 'REDIS_URL']
missing = [v for v in REQUIRED_VARS if not os.getenv(v)]

if missing:
    print(f"ERROR: Missing required environment variables: {missing}", file=sys.stderr)
    sys.exit(1)

This turns a confusing runtime error into an immediate, obvious startup failure that Fly will report clearly in the logs.

Combining Fly Health Checks With External Monitoring

Fly's health checks handle internal routing. External monitoring handles user-perspective availability. Both are needed:

| What breaks | Fly health check catches it | External monitor catches it |
| --- | --- | --- |
| Application crash | Yes (stops routing traffic) | Yes (no response) |
| Database down | Yes (if health endpoint checks DB) | Yes (500 errors or no response) |
| Regional network issue | No | Yes (specific location fails) |
| Fly platform issue | No | Yes |
| Slow responses (not failing) | No | Yes (response time threshold) |

Use Fly's checks to manage traffic routing. Use Domain Monitor to stay informed about user-facing availability from the outside.
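
The "slow responses" case is worth spelling out, because a status-code check alone treats a 200 that takes several seconds as healthy. Here is a minimal sketch of how an external check might classify a single probe result. The function name classify and the one-second threshold are assumptions for illustration, not how any particular monitoring product works.

```python
from typing import Optional


def classify(status_code: Optional[int], elapsed_s: float,
             slow_after_s: float = 1.0) -> str:
    """Classify one probe result as 'up', 'slow', or 'down'.

    `status_code` is None when no response arrived at all.
    """
    if status_code is None or not 200 <= status_code < 400:
        return "down"
    if elapsed_s > slow_after_s:
        # Successful response, but too slow to count as healthy for users.
        return "slow"
    return "up"


print(classify(200, 0.2))   # fast success
print(classify(200, 3.0))   # success, but over the threshold
print(classify(503, 0.1))   # error response
```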

Also in This Series

See multi-location uptime monitoring for why location-distributed checks matter, and how to set up uptime monitoring for a complete setup guide.
