
Monitoring Microservices: Uptime and Health Across Distributed Services

Microservices trade the complexity of a monolith for the complexity of distribution. A monolith fails in one place; microservices can fail in many places simultaneously, fail partially, or fail in ways that cascade slowly through the system over minutes.

Monitoring a microservices architecture requires more effort and more layers than monitoring a single application — but the reward is more targeted visibility, faster root cause identification, and the ability to isolate failures to specific services.

The Microservices Monitoring Challenge

Partial Failures Are Normal

In a distributed system, some services will be unhealthy at any given moment. The question is whether the failure is isolated or cascading. A payment service timing out at p99 may be acceptable if your retry logic handles it gracefully — or it may be the early signal of a full outage if the queue is building up.

Monitoring must distinguish between:

  • Isolated failures — one service is degraded, the rest are healthy
  • Cascading failures — one service's failure is causing downstream services to fail
  • Full outages — multiple services are down simultaneously (often a shared dependency like the database or service mesh)

Service Discovery Complexity

In a microservices environment, services are dynamically provisioned, scaled, and replaced. IP addresses change. New service instances appear and disappear. Static monitoring configurations cannot keep up.

Modern monitoring for microservices requires integration with service discovery (Kubernetes, Consul, Eureka) or automatic service detection.

Network Failures Between Services

Microservices communicate over networks. Network partitions, DNS failures, and TLS handshake issues between services can cause partial failures that are invisible from outside the cluster. These require internal service-to-service health checks.

Structuring Health Endpoints for Microservices

Every service should expose a health endpoint. The pattern is consistent across services but contains service-specific checks.

Liveness vs Readiness

Kubernetes (and other orchestrators) distinguish between two health check types:

Liveness probe — Is this service alive and should it continue running? If the liveness probe fails, Kubernetes restarts the container.

GET /health/live

{
  "status": "ok"
}

This should be simple and fast — just confirm the process is running and responsive. Do not check external dependencies here; a database failure should not cause a restart loop.

Readiness probe — Is this service ready to receive traffic? If the readiness probe fails, Kubernetes removes the pod from the load balancer.

GET /health/ready

{
  "status": "ok",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "downstream_payment_service": "ok"
  }
}

Readiness probes should check dependencies. A service that cannot reach its database should not receive requests.
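One sketch of that pattern: each dependency check races a short timeout so a hung dependency cannot hang the probe itself. The ping functions are hypothetical placeholders for real client calls (e.g. `SELECT 1` against the database, `PING` against Redis):

```javascript
// Race a dependency check against a timeout so the probe stays fast.
async function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// checks: { name: asyncPingFn }. Any failure or timeout -> 503,
// so the orchestrator pulls this instance out of the load balancer.
async function readiness(checks, timeoutMs = 500) {
  const results = {};
  let healthy = true;
  for (const [name, ping] of Object.entries(checks)) {
    try {
      await withTimeout(ping(), timeoutMs);
      results[name] = 'ok';
    } catch {
      results[name] = 'failed';
      healthy = false;
    }
  }
  return {
    httpStatus: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'unavailable', checks: results },
  };
}
```

Wire `readiness()` into the `/health/ready` route of whatever framework you use; the 503 on failure is what the orchestrator keys off.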

Deep Health Endpoints for External Monitoring

External monitoring tools should check a deeper health endpoint that validates end-to-end service functionality:

GET /health/deep

{
  "status": "ok",
  "service": "order-service",
  "version": "2.4.1",
  "uptime_seconds": 142800,
  "checks": {
    "database": {"status": "ok", "latency_ms": 4},
    "redis": {"status": "ok", "latency_ms": 1},
    "payment_service": {"status": "ok", "latency_ms": 23},
    "inventory_service": {"status": "ok", "latency_ms": 18}
  }
}

Domain Monitor can monitor these endpoints, alerting when any check returns non-200 or when the response body indicates a dependency failure.

External Monitoring: The User Perspective

Despite all internal monitoring, external availability monitoring remains essential. Internal health checks tell you what the services think about themselves. External monitoring tells you what users actually experience.

Configure external HTTP monitors for:

  • API gateway or load balancer endpoint — the user-facing entry point
  • Key API endpoints — authentication, core business operations
  • Health aggregator — a single endpoint that rolls up all service health into one response

The API gateway is particularly important in microservices. Even if all individual services are healthy, a misconfigured gateway, expired certificate, or routing change can make the entire system unreachable. External monitoring catches this when internal checks cannot.

Service Mesh Observability

If you use a service mesh (Istio, Linkerd, Consul Connect), you gain automatic collection of:

  • Request rate, error rate, and latency for every service-to-service call
  • Circuit breaker states
  • Retry counts and success rates

Service mesh observability is complementary to, not a replacement for, external availability monitoring. The service mesh sees only intra-cluster traffic; external monitoring sees the user perspective.

Circuit Breaker Monitoring

Circuit breakers prevent cascading failures by stopping requests to a failing service. Monitor circuit breaker states:

  • Closed — service is healthy, requests are passing through
  • Open — service is failing, requests are short-circuited (failing fast)
  • Half-open — service is recovering, test requests are being tried

An open circuit breaker is an incident signal. Alert when any critical service's circuit breaker opens.

// Express health endpoint reporting circuit breaker state
app.get('/health/ready', (req, res) => {
  const paymentBreaker = circuitBreaker.getState('payment-service');

  if (paymentBreaker === 'OPEN') {
    return res.status(503).json({
      status: 'degraded',
      circuit_breakers: { payment_service: 'open' }
    });
  }

  res.json({ status: 'ok' });
});

Distributed Tracing for Root Cause Analysis

When a microservices incident occurs, distributed tracing tells you where in the call chain the failure originated. Tools like Jaeger, Zipkin, Tempo, or Datadog APM trace requests across service boundaries.

Combine distributed tracing with your external monitoring alerts:

  1. External monitor fires — API endpoint returning 503
  2. Open APM dashboard — look at traces for that endpoint during the outage window
  3. Trace shows: order-service → payment-service call timing out at 30 seconds
  4. Root cause identified: payment-service database connection pool exhausted

External monitoring is the alarm; distributed tracing is the diagnosis. See what is application performance monitoring for a deeper look at APM in this context.

Aggregated Health Dashboard

For microservices at scale, individual service monitors become unwieldy. Consider aggregating health data:

  • Kubernetes dashboard — Pod health, restart counts, resource utilisation
  • Service health aggregator — A service that polls all service health endpoints and exposes a combined status
  • Status page — A public or internal page showing the health of major service groups

External monitoring should check both individual services and the aggregated health endpoint.
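A health aggregator is a small amount of code. The sketch below fans out to each service's readiness endpoint and rolls the results into one combined status; the service names and URLs are illustrative, and `fetch` with `AbortSignal.timeout` assumes Node 18+:

```javascript
// Roll per-service results into one status the external monitor can check.
function combine(results) {
  // results: { serviceName: 'ok' | 'failed' }
  const failed = Object.entries(results)
    .filter(([, status]) => status !== 'ok')
    .map(([name]) => name);
  return {
    status: failed.length === 0 ? 'ok' : 'degraded',
    failed,
    services: results,
  };
}

// Poll all endpoints in parallel; a slow service counts as failed.
async function pollAll(endpoints, timeoutMs = 2000) {
  const results = {};
  await Promise.all(Object.entries(endpoints).map(async ([name, url]) => {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
      results[name] = res.ok ? 'ok' : 'failed';
    } catch {
      results[name] = 'failed';
    }
  }));
  return combine(results);
}

// Hypothetical endpoint map:
// pollAll({ orders: 'http://order-service/health/ready',
//           payments: 'http://payment-service/health/ready' });
```

Expose `combine`'s output on a single endpoint and return 503 when status is degraded, so one external monitor covers the whole fleet.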

Dependency Mapping

Know your service dependency graph. When service A fails, which services depend on it and will be affected? Draw or maintain a dependency map and use it to:

  • Prioritise monitoring (services with many dependents are highest priority)
  • Understand blast radius when an incident occurs
  • Design circuit breakers and fallbacks appropriately

See how to monitor third-party API dependencies for handling external service dependencies within this model.
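Once the dependency map exists as data, blast radius is a graph traversal. A sketch, assuming a simple "depends on" map (the service names below are illustrative):

```javascript
// Given a "depends on" map, find every service that transitively
// depends on the failed one, via BFS over the reversed edges.
function blastRadius(dependsOn, failedService) {
  // Build reverse edges: dependency -> services that depend on it
  const dependents = {};
  for (const [svc, deps] of Object.entries(dependsOn)) {
    for (const dep of deps) {
      (dependents[dep] = dependents[dep] || []).push(svc);
    }
  }
  const affected = new Set();
  const queue = [failedService];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const svc of dependents[current] || []) {
      if (!affected.has(svc)) {
        affected.add(svc);
        queue.push(svc);
      }
    }
  }
  return [...affected].sort();
}

// blastRadius({ checkout: ['payments'], payments: ['database'], search: [] },
//             'database')  // -> ['checkout', 'payments']
```

The same reverse-edge count also gives you monitoring priority: services with the most transitive dependents deserve the tightest alerting.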

Practical Monitoring Stack for Microservices

| Layer | Tool | What It Covers |
| --- | --- | --- |
| External availability | Domain Monitor | User-perspective HTTP, SSL, domain |
| Container orchestration | Kubernetes health probes | Pod liveness and readiness |
| Service mesh | Istio/Linkerd metrics | Inter-service latency and errors |
| APM | Datadog, New Relic, Jaeger | Distributed tracing, error rates |
| Logs | ELK, Loki, Datadog Logs | Error and event logs per service |
| Alerting | PagerDuty, Opsgenie | Escalation and on-call management |

Start with external availability monitoring and Kubernetes health probes. Layer in APM and distributed tracing as your service count and traffic grow.


Monitor your microservices API gateway and key endpoints externally at Domain Monitor — detect user-facing failures regardless of what internal monitoring reports.
