
Microservices Monitoring: How to Track Uptime Across Distributed Systems

Microservices architectures have real benefits — independent deployments, technology flexibility, team autonomy. But they also introduce a monitoring challenge that monoliths never had: you now have dozens or hundreds of things that can break independently, and when one breaks, the effects ripple through the entire system in ways that are hard to predict.

A single database connection pool exhausted in one service can bring down a checkout flow that touches four other services. A slow external API call in a payment service can block threads in an order service, which causes the product service to time out, which makes your homepage load slowly. The chain of failures is invisible unless you're monitoring at every link.

This guide covers what microservices monitoring actually requires and how to implement it effectively.

The Core Challenge: Distributed Failure

In a monolith, when something breaks, you look at one application, one set of logs, one set of metrics. In microservices:

  • Hidden error sources — users see a 500 error on the frontend, but the actual failure is in a downstream service three hops away
  • Partial failures as the norm — some services may be degraded while others work fine
  • Cascading failures — one slow service causes backpressure across dependent services
  • Version mismatches — services deploy at different times, and API contract changes can silently break integrations

Effective microservices monitoring means treating the system as the unit of observation, not individual services.

Every Service Needs a Health Endpoint

The most fundamental requirement for microservices monitoring is that every service exposes a /health endpoint. This endpoint should:

  1. Return 200 OK if the service is healthy
  2. Return 503 Service Unavailable if the service is degraded or unable to serve traffic
  3. Include information about the service's own health and the health of its critical dependencies

A well-designed health endpoint:

GET /health

```json
{
  "status": "healthy",
  "service": "order-service",
  "version": "2.4.1",
  "dependencies": {
    "database": "healthy",
    "payment-service": "healthy",
    "inventory-service": "degraded"
  },
  "uptime_seconds": 86400
}
```

This gives you immediate visibility into dependency health without digging through logs.
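A handler that produces this payload can be sketched in framework-agnostic Python. The service name, version, and check functions here are placeholders — the point is the shape of the logic: run each dependency check, decide an overall status, and pick the HTTP code from it.

```python
import time

START_TIME = time.monotonic()

def build_health_response(dependency_checks):
    """Run each dependency check and build the /health payload.

    dependency_checks maps a dependency name to a function that
    returns "healthy", "degraded", or "unhealthy".
    """
    dependencies = {name: check() for name, check in dependency_checks.items()}

    # Any unhealthy dependency fails the check with a 503; degraded
    # dependencies are reported in the payload but still return 200.
    if any(state == "unhealthy" for state in dependencies.values()):
        status, http_code = "unhealthy", 503
    elif any(state == "degraded" for state in dependencies.values()):
        status, http_code = "degraded", 200
    else:
        status, http_code = "healthy", 200

    body = {
        "status": status,
        "service": "order-service",   # illustrative name
        "version": "2.4.1",           # illustrative version
        "dependencies": dependencies,
        "uptime_seconds": int(time.monotonic() - START_TIME),
    }
    return http_code, body
```

Wire this into whatever HTTP framework your service uses; the policy of which dependencies count as critical (503-worthy) versus merely degraded is a per-service decision.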

Shallow vs Deep Health Checks

Some teams distinguish between two types of health checks:

  • Shallow (liveness) check — is the process running and able to accept connections? Used by orchestrators to decide whether to restart.
  • Deep (readiness) check — are all dependencies healthy and is the service ready to handle traffic? Used by load balancers to decide whether to route traffic.

In Kubernetes terms, this maps to livenessProbe and readinessProbe.
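The two checks are best served from separate endpoints so orchestrators and load balancers can probe them independently. A minimal sketch (endpoint behavior only; the dependency check is injected as a placeholder):

```python
def liveness():
    # Shallow check: if this code runs at all, the process is alive.
    # An orchestrator restarts the container when this stops returning 200.
    return 200, {"status": "alive"}

def readiness(dependencies_healthy):
    # Deep check: only report ready when the service can actually do work.
    # A load balancer stops routing traffic on a non-200 response.
    if dependencies_healthy():
        return 200, {"status": "ready"}
    return 503, {"status": "not ready"}
```

Keeping liveness shallow matters: if it also checked dependencies, a downstream outage could trigger restart loops across every dependent service.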

External Monitoring for Microservices

For HTTP-exposed services — APIs, web frontends, gateways — external uptime monitoring is essential. It checks your service from outside your cluster, verifying the full network path is working.

With Domain Monitor, you can add monitors for each of your critical services:

  • API Gateway — the entry point for external traffic
  • Public-facing services — any service with an external URL
  • Health aggregation endpoints — if you have a service that aggregates health from all downstream services

External monitoring catches problems that internal health checks miss:

  • Kubernetes ingress misconfiguration
  • Load balancer failures
  • SSL certificate expiry
  • DNS failures
  • Network routing issues

For more on why external monitoring matters, see what is website monitoring.

Distributed Tracing: Following Requests Across Services

When a request fails in a microservices system, you need to know which service caused the failure and why. This is what distributed tracing is for.

Distributed tracing works by attaching a unique trace ID to each incoming request. As the request flows through services, each service:

  1. Reads the trace ID from incoming headers
  2. Creates a span recording what it did and how long it took
  3. Passes the trace ID to any downstream service calls

The result is a complete picture of everything that happened during a single request, across all services.
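The three propagation steps can be sketched without any tracing library. The `X-Trace-Id` header name, service name, and span shape here are simplified stand-ins for what a real tracer (e.g. OpenTelemetry) manages for you:

```python
import time
import uuid

def handle_request(headers, downstream_call):
    """Toy tracing middleware for one service hop."""
    # 1. Read the trace ID from incoming headers, or start a new trace.
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex

    start = time.monotonic()
    # ... the service's real handler work would run here ...

    # 3. Pass the trace ID along on any downstream service calls.
    downstream_call({"X-Trace-Id": trace_id})

    # 2. Create a span recording what this service did and how long it took.
    span = {
        "trace_id": trace_id,
        "service": "order-service",   # illustrative name
        "duration_ms": (time.monotonic() - start) * 1000,
    }
    return span
```

Because every span carries the same trace ID, a trace backend can stitch spans from all services into one end-to-end timeline for the request.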

Popular distributed tracing tools include:

  • Jaeger — open source, originally from Uber
  • Zipkin — open source, originated at Twitter
  • OpenTelemetry — the vendor-neutral standard for instrumentation, now broadly adopted

OpenTelemetry is particularly important — it's a vendor-neutral instrumentation standard that lets you collect traces, metrics, and logs once and send them anywhere. Most major monitoring vendors now support OpenTelemetry.

Service Mesh Monitoring

A service mesh (like Istio, Linkerd, or Consul Connect) is a dedicated infrastructure layer that handles service-to-service communication. Service meshes provide:

  • Automatic mTLS — encrypted and authenticated service-to-service traffic
  • Traffic observability — every service call is observed at the mesh level
  • Circuit breakers — automatically stop sending traffic to unhealthy services
  • Retries and timeouts — consistent retry/timeout policies across all services

From a monitoring perspective, a service mesh gives you free observability — you see success rates, latency, and traffic volume for every service-to-service connection without instrumenting your application code.

If you're running Kubernetes and have the complexity to justify it, a service mesh dramatically simplifies microservices monitoring.

Circuit Breakers and What They Tell You

The circuit breaker pattern prevents cascading failures by "opening" (stopping requests to) a service that's repeatedly failing. When a circuit breaker opens, requests fail fast instead of waiting for a timeout.

Monitoring circuit breaker state is crucial for microservices:

  • Closed — normal operation, requests flowing through
  • Open — service is failing, requests are being rejected immediately
  • Half-open — testing whether the service has recovered

A circuit breaker opening is a signal that you should investigate. It means something downstream is degraded, and your system is protecting itself.

Libraries like Resilience4j (Java, the recommended successor to Netflix's now-retired Hystrix), Polly (.NET), and opossum (Node.js) implement the circuit breaker pattern.
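The three states map directly onto a small state machine. A minimal sketch — thresholds and timings are illustrative, and real libraries add thread safety, metrics, and fallback handling:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"  # allow one trial request through
        return "open"

    def call(self, func):
        if self.state == "open":
            # Fail fast instead of waiting for a timeout downstream.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "half-open":
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # A success (including a half-open trial) closes the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

The monitoring hook is the `state` property: export it as a metric and alert when any breaker leaves "closed".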

Monitoring Upstream and Downstream Dependencies

Every microservice has dependencies — services it calls and services that call it. Map these dependencies and monitor each link:

What to monitor for each dependency:

  • Error rate — what % of calls to this dependency are failing?
  • Latency — is the dependency getting slower?
  • Availability — is the dependency responding at all?
  • Throughput — is traffic to this dependency unusually high or low?

Unusually low traffic to a downstream service can be as significant as high error rates — it might mean no requests are reaching it due to a routing problem.
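All four signals can be derived from a simple window of recorded call outcomes per dependency. A sketch — a production system would use a metrics library with proper rolling windows rather than raw lists:

```python
def dependency_stats(calls):
    """Summarize a window of recorded calls to one dependency.

    calls is a list of (success: bool, latency_ms: float) tuples
    collected over the observation window.
    """
    if not calls:
        # Zero traffic is itself a signal worth alerting on:
        # it may mean a routing problem upstream.
        return {"throughput": 0, "error_rate": None, "avg_latency_ms": None}
    failures = sum(1 for ok, _ in calls if not ok)
    return {
        "throughput": len(calls),
        "error_rate": failures / len(calls),
        "avg_latency_ms": sum(ms for _, ms in calls) / len(calls),
    }
```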

For monitoring third-party external dependencies specifically, see how to monitor third-party services.

Aggregated Health and Status Pages

With many services, you need an aggregated view of system health. This might be:

  1. Internal dashboard — a tool like Grafana showing health status for all services
  2. Public status page — a page that tells users whether the system is operational

For customer-facing applications, a public status page is essential. When microservices issues cause visible problems, customers shouldn't be left wondering if your product is down. Read about public status pages and status page alternatives.
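A minimal aggregator only needs to poll each service's /health endpoint and roll the results up. In this sketch the fetch function is injected so the example stays self-contained; a real version would make HTTP calls with short timeouts and treat timeouts as unhealthy:

```python
def aggregate_health(service_urls, fetch_status):
    """Poll each service and compute an overall system status.

    fetch_status(url) should return "healthy", "degraded", or
    "unhealthy"; exceptions (timeouts, connection errors) are
    treated as "unhealthy".
    """
    results = {}
    for name, url in service_urls.items():
        try:
            results[name] = fetch_status(url)
        except Exception:
            results[name] = "unhealthy"
    if any(s == "unhealthy" for s in results.values()):
        overall = "major_outage"
    elif any(s == "degraded" for s in results.values()):
        overall = "degraded"
    else:
        overall = "operational"
    return {"overall": overall, "services": results}
```

The same roll-up logic can feed both an internal dashboard and the data behind a public status page.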

Key Metrics for Each Microservice

Use the RED Method for service-level metrics:

  • R — Rate: requests per second
  • E — Errors: error rate (% of requests failing)
  • D — Duration: latency distribution (P50, P95, P99)

These three metrics, tracked for every service, give you a solid foundation for understanding service health and detecting degradation.
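Given a window of request records, the RED numbers fall out directly. A sketch using only the standard library (percentiles via `statistics.quantiles`; it needs at least two samples):

```python
import statistics

def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, Duration from one observation window.

    requests is a list of (status_code: int, latency_ms: float);
    window_seconds is the length of the window the requests cover.
    """
    latencies = [ms for _, ms in requests]
    errors = sum(1 for code, _ in requests if code >= 500)
    # quantiles(n=100) returns 99 cut points:
    # index 49 is P50, index 94 is P95, index 98 is P99.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Tracking the full latency distribution rather than the mean is the point of the D: a healthy P50 can hide a P99 that has quietly tripled.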

Checklist: Microservices Monitoring Setup

  • Every service has a /health endpoint (shallow + deep checks)
  • External monitoring on all public-facing services and API gateways
  • Distributed tracing implemented (OpenTelemetry recommended)
  • Service dependency map documented and monitored
  • Circuit breakers in place for critical dependencies
  • RED metrics (Rate, Errors, Duration) tracked for each service
  • Aggregated health dashboard available for on-call engineers
  • Public status page for customer-facing incidents
  • Alert routing: right alerts go to the right teams

For handling incidents when they occur, see the website incident response plan guide.

Wrapping Up

Monitoring microservices is not harder than monitoring monoliths — it's different. The same principles apply (know when things break, understand why, fix them fast), but the tooling and patterns are different.

Start with the fundamentals: health endpoints on every service, external monitoring on all public endpoints, and structured logging. Add distributed tracing and service mesh visibility as your system grows. The goal is always the same: know about problems before your users do.

Domain Monitor provides the external monitoring layer — the piece that sees your services the way users do. Add your critical endpoints today and build from there.

