
Microservices trade the complexity of a monolith for the complexity of distribution. A monolith fails in one place; microservices can fail in many places simultaneously, fail partially, or fail in ways that cascade slowly through the system over minutes.
Monitoring a microservices architecture requires more effort and more layers than monitoring a single application — but the reward is more targeted visibility, faster root cause identification, and the ability to isolate failures to specific services.
In a distributed system, some services will be unhealthy at any given moment. The question is whether the failure is isolated or cascading. A payment service timing out at p99 may be acceptable if your retry logic handles it gracefully — or it may be the early signal of a full outage if the queue is building up.
Monitoring must distinguish between these two cases: an isolated degradation the system absorbs, and a cascading failure that demands immediate response.
In a microservices environment, services are dynamically provisioned, scaled, and replaced. IP addresses change. New service instances appear and disappear. Static monitoring configurations cannot keep up.
Modern monitoring for microservices requires integration with service discovery (Kubernetes, Consul, Eureka) or automatic service detection.
Microservices communicate over networks. Network partitions, DNS failures, and TLS handshake issues between services can cause partial failures that are invisible from outside the cluster. These require internal service-to-service health checks.
Every service should expose a health endpoint. The pattern is consistent across services but contains service-specific checks.
Kubernetes (and other orchestrators) distinguish between two health check types:
Liveness probe — Is this service alive and should it continue running? If the liveness probe fails, Kubernetes restarts the container.
```http
GET /health/live

{
  "status": "ok"
}
```
This should be simple and fast — just confirm the process is running and responsive. Do not check external dependencies here; a database failure should not cause a restart loop.
Readiness probe — Is this service ready to receive traffic? If the readiness probe fails, Kubernetes removes the pod from the load balancer.
```http
GET /health/ready

{
  "status": "ok",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "downstream_payment_service": "ok"
  }
}
```
Readiness probes should check dependencies. A service that cannot reach its database should not receive requests.
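A readiness handler typically fans out to its dependency checks in parallel with a timeout, so one hung dependency cannot stall the probe. A framework-free sketch (the async check functions are placeholders for real database and Redis pings):

```javascript
// Run named async dependency checks in parallel; a rejection or timeout
// marks that dependency as failed and the overall status as "degraded".
async function evaluateReadiness(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const results = await Promise.all(
    names.map((name) =>
      Promise.race([
        checks[name]().then(() => 'ok'),
        new Promise((resolve) => {
          setTimeout(() => resolve('timeout'), timeoutMs).unref();
        }),
      ]).catch(() => 'failed')
    )
  );
  const body = { status: 'ok', checks: {} };
  names.forEach((name, i) => {
    body.checks[name] = results[i];
    if (results[i] !== 'ok') body.status = 'degraded';
  });
  return body;
}
```

An HTTP route wrapping this returns 503 whenever the body's status is not `ok`, which is exactly the contract Kubernetes expects from a readiness probe.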
External monitoring tools should check a deeper health endpoint that validates end-to-end service functionality:
```http
GET /health/deep

{
  "status": "ok",
  "service": "order-service",
  "version": "2.4.1",
  "uptime_seconds": 142800,
  "checks": {
    "database": {"status": "ok", "latency_ms": 4},
    "redis": {"status": "ok", "latency_ms": 1},
    "payment_service": {"status": "ok", "latency_ms": 23},
    "inventory_service": {"status": "ok", "latency_ms": 18}
  }
}
```
Domain Monitor can monitor these endpoints, alerting when any check returns non-200 or when the response body indicates a dependency failure.
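That alerting rule can be expressed as a small predicate over the HTTP status and response body. A sketch (field names match the example payload above; how your monitoring tool applies such a rule will vary):

```javascript
// Decide whether a /health/deep response should raise an alert:
// any non-200 status, a non-"ok" top-level status, or any failing check.
function shouldAlert(httpStatus, body) {
  if (httpStatus !== 200) return true;
  if (!body || body.status !== 'ok') return true;
  return Object.values(body.checks || {}).some((c) => c.status !== 'ok');
}
```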
Despite all internal monitoring, external availability monitoring remains essential. Internal health checks tell you what the services think about themselves. External monitoring tells you what users actually experience.
Configure external HTTP monitors for your public-facing entry points: the API gateway, key user-facing pages, and critical API routes.
The API gateway is particularly important in microservices. Even if all individual services are healthy, a misconfigured gateway, expired certificate, or routing change can make the entire system unreachable. External monitoring catches this when internal checks cannot.
If you use a service mesh (Istio, Linkerd, Consul Connect), you gain automatic collection of inter-service latency, error rates, and request volume without instrumenting each service.
Service mesh observability complements, rather than replaces, external availability monitoring. The service mesh sees only intra-cluster traffic; external monitoring sees the user perspective.
Circuit breakers prevent cascading failures by stopping requests to a failing service. Monitor circuit breaker states (closed, open, half-open) across your services.
An open circuit breaker is an incident signal. Alert when any critical service's circuit breaker opens.
```javascript
// Express health endpoint reporting circuit breaker state
app.get('/health/ready', (req, res) => {
  const paymentBreaker = circuitBreaker.getState('payment-service');
  if (paymentBreaker === 'OPEN') {
    return res.status(503).json({
      status: 'degraded',
      circuit_breakers: { payment_service: 'open' }
    });
  }
  res.json({ status: 'ok' });
});
```
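The `circuitBreaker.getState` call above assumes a breaker implementation already exists. As an illustrative sketch only (thresholds and the injectable clock are simplified for clarity), a minimal breaker tracks consecutive failures and moves between CLOSED, OPEN, and HALF_OPEN:

```javascript
// Minimal circuit breaker: opens after `threshold` consecutive failures,
// moves to HALF_OPEN once `resetMs` has elapsed, and closes on a success.
class CircuitBreaker {
  constructor({ threshold = 5, resetMs = 30000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.now = now;       // injectable clock, useful for testing
    this.failures = 0;
    this.openedAt = null;
  }
  getState() {
    if (this.openedAt === null) return 'CLOSED';
    return this.now() - this.openedAt >= this.resetMs ? 'HALF_OPEN' : 'OPEN';
  }
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

Production breakers (e.g. the opossum library for Node) add request wrapping, half-open trial limits, and event hooks, but the state machine is the part worth exporting to your health endpoint.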
When a microservices incident occurs, distributed tracing tells you where in the call chain the failure originated. Tools like Jaeger, Zipkin, Tempo, or Datadog APM trace requests across service boundaries.
Combine distributed tracing with your external monitoring alerts.
External monitoring is the alarm; distributed tracing is the diagnosis. See our guide on what application performance monitoring is for a deeper look at APM in this context.
For microservices at scale, individual service monitors become unwieldy. Consider aggregating health data into a single rollup endpoint that reports the status of every service.
External monitoring should check both individual services and the aggregated health endpoint.
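An aggregator can fan out to each service's health endpoint and roll the results up into one response. A sketch (the service list is hypothetical, and the fetch function is injectable so the sketch does not depend on live services):

```javascript
// Query each service's health endpoint and aggregate into one summary.
// `fetchHealth` takes a URL and resolves to { status } or rejects.
async function aggregateHealth(services, fetchHealth) {
  const entries = await Promise.all(
    Object.entries(services).map(async ([name, url]) => {
      try {
        const { status } = await fetchHealth(url);
        return [name, status];
      } catch {
        return [name, 'unreachable'];
      }
    })
  );
  const unhealthy = entries.filter(([, s]) => s !== 'ok').map(([name]) => name);
  return {
    status: unhealthy.length === 0 ? 'ok' : 'degraded',
    services: Object.fromEntries(entries),
    unhealthy,
  };
}
```

Serving this from a dedicated rollup endpoint gives external monitors one URL that reflects the whole system while still naming the specific services that are unhealthy.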
Know your service dependency graph. When service A fails, which services depend on it and will be affected? Draw or maintain a dependency map and use it to predict which services a failure will affect and to prioritize alerts accordingly.
See how to monitor third-party API dependencies for handling external service dependencies within this model.
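Given a dependency map, the blast radius of a failing service is a reverse-graph traversal. A sketch (the example graph in the test is hypothetical):

```javascript
// deps maps each service to the services it depends on.
// affectedBy returns every service that transitively depends on `failed`.
function affectedBy(deps, failed) {
  // Invert the graph: for each dependency, record who depends on it.
  const dependents = {};
  for (const [svc, needs] of Object.entries(deps)) {
    for (const dep of needs) {
      (dependents[dep] = dependents[dep] || []).push(svc);
    }
  }
  // Breadth-first walk outward from the failed service.
  const affected = new Set();
  const queue = [failed];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const svc of dependents[current] || []) {
      if (!affected.has(svc)) {
        affected.add(svc);
        queue.push(svc);
      }
    }
  }
  return [...affected].sort();
}
```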
| Layer | Tool | What It Covers |
|---|---|---|
| External availability | Domain Monitor | User-perspective HTTP, SSL, domain |
| Container orchestration | Kubernetes health probes | Pod liveness and readiness |
| Service mesh | Istio/Linkerd metrics | Inter-service latency and errors |
| APM | Datadog, New Relic, Jaeger | Distributed tracing, error rates |
| Logs | ELK, Loki, Datadog Logs | Error and event logs per service |
| Alerting | PagerDuty, Opsgenie | Escalation and on-call management |
Start with external availability monitoring and Kubernetes health probes. Layer in APM and distributed tracing as your service count and traffic grow.
Monitor your microservices API gateway and key endpoints externally at Domain Monitor — detect user-facing failures regardless of what internal monitoring reports.