
Microservices Monitoring: How to Track Uptime Across Distributed Systems

Microservices architectures have real benefits — independent deployments, technology flexibility, team autonomy. But they also introduce a monitoring challenge that monoliths never had: you now have dozens or hundreds of things that can break independently, and when one breaks, the effects ripple through the entire system in ways that are hard to predict.

A single database connection pool exhausted in one service can bring down a checkout flow that touches four other services. A slow external API call in a payment service can block threads in an order service, which causes the product service to time out, which makes your homepage load slowly. The chain of failures is invisible unless you're monitoring at every link.

This guide covers what microservices monitoring actually requires and how to implement it effectively.

The Core Challenge: Distributed Failure

In a monolith, when something breaks, you look at one application, one set of logs, one set of metrics. In microservices:

  • Hidden error sources — users see a 500 error on the frontend, but the actual failure is in a downstream service three hops away
  • Partial failures as the norm — some services may be degraded while others work fine
  • Cascading failures — one slow service causes backpressure across dependent services
  • Version mismatches — services deploy at different times, and API contract changes can silently break integrations

Effective microservices monitoring means treating the system as the unit of observation, not individual services.

Every Service Needs a Health Endpoint

The most fundamental requirement for microservices monitoring is that every service exposes a /health endpoint. This endpoint should:

  1. Return 200 OK if the service is healthy
  2. Return 503 Service Unavailable if the service is degraded or unable to serve traffic
  3. Include information about the service's own health and the health of its critical dependencies

A well-designed health endpoint:

GET /health

```json
{
  "status": "healthy",
  "service": "order-service",
  "version": "2.4.1",
  "dependencies": {
    "database": "healthy",
    "payment-service": "healthy",
    "inventory-service": "degraded"
  },
  "uptime_seconds": 86400
}
```

This gives you immediate visibility into dependency health without digging through logs.
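A handler that produces this payload can be sketched in framework-agnostic Python. The service name, version, and check functions here are placeholders — the point is the shape of the logic: run each dependency check, decide an overall status, and pick the HTTP code from it.

```python
import time

START_TIME = time.monotonic()

def build_health_response(dependency_checks):
    """Run each dependency check and build the /health payload.

    dependency_checks maps a dependency name to a function that
    returns "healthy", "degraded", or "unhealthy".
    """
    dependencies = {name: check() for name, check in dependency_checks.items()}

    # Any unhealthy dependency fails the check with a 503; degraded
    # dependencies are reported in the payload but still return 200.
    if any(state == "unhealthy" for state in dependencies.values()):
        status, http_code = "unhealthy", 503
    elif any(state == "degraded" for state in dependencies.values()):
        status, http_code = "degraded", 200
    else:
        status, http_code = "healthy", 200

    body = {
        "status": status,
        "service": "order-service",   # illustrative name
        "version": "2.4.1",           # illustrative version
        "dependencies": dependencies,
        "uptime_seconds": int(time.monotonic() - START_TIME),
    }
    return http_code, body
```

Wire this into whatever HTTP framework your service uses; the policy of which dependencies count as critical (503-worthy) versus merely degraded is a per-service decision.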

Shallow vs Deep Health Checks

Some teams distinguish between two types of health checks:

  • Shallow (liveness) check — is the process running and able to accept connections? Used by orchestrators to decide whether to restart.
  • Deep (readiness) check — are all dependencies healthy and is the service ready to handle traffic? Used by load balancers to decide whether to route traffic.

In Kubernetes terms, this maps to livenessProbe and readinessProbe.
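The two checks are best served from separate endpoints so orchestrators and load balancers can probe them independently. A minimal sketch (endpoint behavior only; the dependency check is injected as a placeholder):

```python
def liveness():
    # Shallow check: if this code runs at all, the process is alive.
    # An orchestrator restarts the container when this stops returning 200.
    return 200, {"status": "alive"}

def readiness(dependencies_healthy):
    # Deep check: only report ready when the service can actually do work.
    # A load balancer stops routing traffic on a non-200 response.
    if dependencies_healthy():
        return 200, {"status": "ready"}
    return 503, {"status": "not ready"}
```

Keeping liveness shallow matters: if it also checked dependencies, a downstream outage could trigger restart loops across every dependent service.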

External Monitoring for Microservices

For HTTP-exposed services — APIs, web frontends, gateways — external uptime monitoring is essential. It checks your service from outside your cluster, verifying the full network path is working.

With Domain Monitor, you can add monitors for each of your critical services:

  • API Gateway — the entry point for external traffic
  • Public-facing services — any service with an external URL
  • Health aggregation endpoints — if you have a service that aggregates health from all downstream services

External monitoring catches problems that internal health checks miss:

  • Kubernetes ingress misconfiguration
  • Load balancer failures
  • SSL certificate expiry
  • DNS failures
  • Network routing issues

For more on why external monitoring matters, see what is website monitoring.

Distributed Tracing: Following Requests Across Services

When a request fails in a microservices system, you need to know which service caused the failure and why. This is what distributed tracing is for.

Distributed tracing works by attaching a unique trace ID to each incoming request. As the request flows through services, each service:

  1. Reads the trace ID from incoming headers
  2. Creates a span recording what it did and how long it took
  3. Passes the trace ID to any downstream service calls

The result is a complete picture of everything that happened during a single request, across all services.
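The three propagation steps can be sketched without any tracing library. The `X-Trace-Id` header name, service name, and span shape here are simplified stand-ins for what a real tracer (e.g. OpenTelemetry) manages for you:

```python
import time
import uuid

def handle_request(headers, downstream_call):
    """Toy tracing middleware for one service hop."""
    # 1. Read the trace ID from incoming headers, or start a new trace.
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex

    start = time.monotonic()
    # ... the service's real handler work would run here ...

    # 3. Pass the trace ID along on any downstream service calls.
    downstream_call({"X-Trace-Id": trace_id})

    # 2. Create a span recording what this service did and how long it took.
    span = {
        "trace_id": trace_id,
        "service": "order-service",   # illustrative name
        "duration_ms": (time.monotonic() - start) * 1000,
    }
    return span
```

Because every span carries the same trace ID, a trace backend can stitch spans from all services into one end-to-end timeline for the request.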

Popular distributed tracing tools include:

  • Jaeger — open source, originally from Uber
  • Zipkin — open source, originated at Twitter
  • OpenTelemetry — the vendor-neutral standard for instrumentation, now broadly adopted

OpenTelemetry is particularly important — it's a vendor-neutral instrumentation standard that lets you collect traces, metrics, and logs once and send them anywhere. Most major monitoring vendors now support OpenTelemetry.

Service Mesh Monitoring

A service mesh (like Istio, Linkerd, or Consul Connect) is a dedicated infrastructure layer that handles service-to-service communication. Service meshes provide:

  • Automatic mTLS — encrypted and authenticated service-to-service traffic
  • Traffic observability — every service call is observed at the mesh level
  • Circuit breakers — automatically stop sending traffic to unhealthy services
  • Retries and timeouts — consistent retry/timeout policies across all services

From a monitoring perspective, a service mesh gives you free observability — you see success rates, latency, and traffic volume for every service-to-service connection without instrumenting your application code.

If you're running Kubernetes and have the complexity to justify it, a service mesh dramatically simplifies microservices monitoring.

Circuit Breakers and What They Tell You

The circuit breaker pattern prevents cascading failures by "opening" (stopping requests to) a service that's repeatedly failing. When a circuit breaker opens, requests fail fast instead of waiting for a timeout.

Monitoring circuit breaker state is crucial for microservices:

  • Closed — normal operation, requests flowing through
  • Open — service is failing, requests are being rejected immediately
  • Half-open — testing whether the service has recovered

A circuit breaker opening is a signal that you should investigate. It means something downstream is degraded, and your system is protecting itself.

Libraries like Resilience4j (Java, the recommended successor to Netflix's now-retired Hystrix), Polly (.NET), and opossum (Node.js) implement the circuit breaker pattern.
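The three states map directly onto a small state machine. A minimal sketch — thresholds and timings are illustrative, and real libraries add thread safety, metrics, and fallback handling:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"  # allow one trial request through
        return "open"

    def call(self, func):
        if self.state == "open":
            # Fail fast instead of waiting for a timeout downstream.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "half-open":
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # A success (including a half-open trial) closes the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

The monitoring hook is the `state` property: export it as a metric and alert when any breaker leaves "closed".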

Monitoring Upstream and Downstream Dependencies

Every microservice has dependencies — services it calls and services that call it. Map these dependencies and monitor each link:

What to monitor for each dependency:

  • Error rate — what % of calls to this dependency are failing?
  • Latency — is the dependency getting slower?
  • Availability — is the dependency responding at all?
  • Throughput — is traffic to this dependency unusually high or low?

Unusually low traffic to a downstream service can be as significant as high error rates — it might mean no requests are reaching it due to a routing problem.
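All four signals can be derived from a simple window of recorded call outcomes per dependency. A sketch — a production system would use a metrics library with proper rolling windows rather than raw lists:

```python
def dependency_stats(calls):
    """Summarize a window of recorded calls to one dependency.

    calls is a list of (success: bool, latency_ms: float) tuples
    collected over the observation window.
    """
    if not calls:
        # Zero traffic is itself a signal worth alerting on:
        # it may mean a routing problem upstream.
        return {"throughput": 0, "error_rate": None, "avg_latency_ms": None}
    failures = sum(1 for ok, _ in calls if not ok)
    return {
        "throughput": len(calls),
        "error_rate": failures / len(calls),
        "avg_latency_ms": sum(ms for _, ms in calls) / len(calls),
    }
```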

For monitoring third-party external dependencies specifically, see how to monitor third-party services.

Aggregated Health and Status Pages

With many services, you need an aggregated view of system health. This might be:

  1. Internal dashboard — a tool like Grafana showing health status for all services
  2. Public status page — a page that tells users whether the system is operational

For customer-facing applications, a public status page is essential. When microservices issues cause visible problems, customers shouldn't be left wondering if your product is down. Read about public status pages and status page alternatives.
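A minimal aggregator only needs to poll each service's /health endpoint and roll the results up. In this sketch the fetch function is injected so the example stays self-contained; a real version would make HTTP calls with short timeouts and treat timeouts as unhealthy:

```python
def aggregate_health(service_urls, fetch_status):
    """Poll each service and compute an overall system status.

    fetch_status(url) should return "healthy", "degraded", or
    "unhealthy"; exceptions (timeouts, connection errors) are
    treated as "unhealthy".
    """
    results = {}
    for name, url in service_urls.items():
        try:
            results[name] = fetch_status(url)
        except Exception:
            results[name] = "unhealthy"
    if any(s == "unhealthy" for s in results.values()):
        overall = "major_outage"
    elif any(s == "degraded" for s in results.values()):
        overall = "degraded"
    else:
        overall = "operational"
    return {"overall": overall, "services": results}
```

The same roll-up logic can feed both an internal dashboard and the data behind a public status page.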

Key Metrics for Each Microservice

Use the RED Method for service-level metrics:

  • R — Rate: requests per second
  • E — Errors: error rate (% of requests failing)
  • D — Duration: latency distribution (P50, P95, P99)

These three metrics, tracked for every service, give you a solid foundation for understanding service health and detecting degradation.
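Given a window of request records, the RED numbers fall out directly. A sketch using only the standard library (percentiles via `statistics.quantiles`; it needs at least two samples):

```python
import statistics

def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, Duration from one observation window.

    requests is a list of (status_code: int, latency_ms: float);
    window_seconds is the length of the window the requests cover.
    """
    latencies = [ms for _, ms in requests]
    errors = sum(1 for code, _ in requests if code >= 500)
    # quantiles(n=100) returns 99 cut points:
    # index 49 is P50, index 94 is P95, index 98 is P99.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Tracking the full latency distribution rather than the mean is the point of the D: a healthy P50 can hide a P99 that has quietly tripled.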

Checklist: Microservices Monitoring Setup

  • Every service has a /health endpoint (shallow + deep checks)
  • External monitoring on all public-facing services and API gateways
  • Distributed tracing implemented (OpenTelemetry recommended)
  • Service dependency map documented and monitored
  • Circuit breakers in place for critical dependencies
  • RED metrics (Rate, Errors, Duration) tracked for each service
  • Aggregated health dashboard available for on-call engineers
  • Public status page for customer-facing incidents
  • Alert routing: right alerts go to the right teams

For handling incidents when they occur, see the website incident response plan guide.

Wrapping Up

Monitoring microservices is not harder than monitoring monoliths — it's different. The same principles apply (know when things break, understand why, fix them fast), but the tooling and patterns are different.

Start with the fundamentals: health endpoints on every service, external monitoring on all public endpoints, and structured logging. Add distributed tracing and service mesh visibility as your system grows. The goal is always the same: know about problems before your users do.

Domain Monitor provides the external monitoring layer — the piece that sees your services the way users do. Add your critical endpoints today and build from there.

