
Microservices trade the complexity of a monolith for the complexity of distribution. A monolith fails in one place; microservices can fail in many places simultaneously, fail partially, or fail in ways that cascade slowly through the system over minutes.
Monitoring a microservices architecture requires more effort and more layers than monitoring a single application — but the reward is more targeted visibility, faster root cause identification, and the ability to isolate failures to specific services.
In a distributed system, some services will be unhealthy at any given moment. The question is whether the failure is isolated or cascading. A payment service timing out at p99 may be acceptable if your retry logic handles it gracefully — or it may be the early signal of a full outage if the queue is building up.
Monitoring must distinguish between these two cases: an isolated degradation the system absorbs, and a cascading failure that demands immediate response.
In a microservices environment, services are dynamically provisioned, scaled, and replaced. IP addresses change. New service instances appear and disappear. Static monitoring configurations cannot keep up.
Modern monitoring for microservices requires integration with service discovery (Kubernetes, Consul, Eureka) or automatic service detection.
Microservices communicate over networks. Network partitions, DNS failures, and TLS handshake issues between services can cause partial failures that are invisible from outside the cluster. These require internal service-to-service health checks.
Every service should expose a health endpoint. The pattern is consistent across services but contains service-specific checks.
Kubernetes (and other orchestrators) distinguish between two health check types:
Liveness probe — Is this service alive and should it continue running? If the liveness probe fails, Kubernetes restarts the container.
```http
GET /health/live

{
  "status": "ok"
}
```
This should be simple and fast — just confirm the process is running and responsive. Do not check external dependencies here; a database failure should not cause a restart loop.
Readiness probe — Is this service ready to receive traffic? If the readiness probe fails, Kubernetes removes the pod from the load balancer.
```http
GET /health/ready

{
  "status": "ok",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "downstream_payment_service": "ok"
  }
}
```
Readiness probes should check dependencies. A service that cannot reach its database should not receive requests.
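A readiness handler typically fans out to its dependency checks in parallel with a timeout, so one hung dependency cannot stall the probe. A framework-free sketch (the async check functions are placeholders for real database and Redis pings):

```javascript
// Run named async dependency checks in parallel; a rejection or timeout
// marks that dependency as failed and the overall status as "degraded".
async function evaluateReadiness(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const results = await Promise.all(
    names.map((name) =>
      Promise.race([
        checks[name]().then(() => 'ok'),
        new Promise((resolve) => {
          setTimeout(() => resolve('timeout'), timeoutMs).unref();
        }),
      ]).catch(() => 'failed')
    )
  );
  const body = { status: 'ok', checks: {} };
  names.forEach((name, i) => {
    body.checks[name] = results[i];
    if (results[i] !== 'ok') body.status = 'degraded';
  });
  return body;
}
```

An HTTP route wrapping this returns 503 whenever the body's status is not `ok`, which is exactly the contract Kubernetes expects from a readiness probe.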
External monitoring tools should check a deeper health endpoint that validates end-to-end service functionality:
```http
GET /health/deep

{
  "status": "ok",
  "service": "order-service",
  "version": "2.4.1",
  "uptime_seconds": 142800,
  "checks": {
    "database": {"status": "ok", "latency_ms": 4},
    "redis": {"status": "ok", "latency_ms": 1},
    "payment_service": {"status": "ok", "latency_ms": 23},
    "inventory_service": {"status": "ok", "latency_ms": 18}
  }
}
```
Domain Monitor can monitor these endpoints, alerting when any check returns non-200 or when the response body indicates a dependency failure.
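That alerting rule can be expressed as a small predicate over the HTTP status and response body. A sketch (field names match the example payload above; how your monitoring tool applies such a rule will vary):

```javascript
// Decide whether a /health/deep response should raise an alert:
// any non-200 status, a non-"ok" top-level status, or any failing check.
function shouldAlert(httpStatus, body) {
  if (httpStatus !== 200) return true;
  if (!body || body.status !== 'ok') return true;
  return Object.values(body.checks || {}).some((c) => c.status !== 'ok');
}
```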
Despite all internal monitoring, external availability monitoring remains essential. Internal health checks tell you what the services think about themselves. External monitoring tells you what users actually experience.
Configure external HTTP monitors for your public-facing entry points: the API gateway, key user-facing pages, and critical API routes.
The API gateway is particularly important in microservices. Even if all individual services are healthy, a misconfigured gateway, expired certificate, or routing change can make the entire system unreachable. External monitoring catches this when internal checks cannot.
If you use a service mesh (Istio, Linkerd, Consul Connect), you gain automatic collection of inter-service latency, error rates, and request volume without instrumenting each service.
Service mesh observability complements, rather than replaces, external availability monitoring. The service mesh sees only intra-cluster traffic; external monitoring sees the user perspective.
Circuit breakers prevent cascading failures by stopping requests to a failing service. Monitor circuit breaker states (closed, open, half-open) across your services.
An open circuit breaker is an incident signal. Alert when any critical service's circuit breaker opens.
```javascript
// Express health endpoint reporting circuit breaker state
app.get('/health/ready', (req, res) => {
  const paymentBreaker = circuitBreaker.getState('payment-service');
  if (paymentBreaker === 'OPEN') {
    return res.status(503).json({
      status: 'degraded',
      circuit_breakers: { payment_service: 'open' }
    });
  }
  res.json({ status: 'ok' });
});
```
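The `circuitBreaker.getState` call above assumes a breaker implementation already exists. As an illustrative sketch only (thresholds and the injectable clock are simplified for clarity), a minimal breaker tracks consecutive failures and moves between CLOSED, OPEN, and HALF_OPEN:

```javascript
// Minimal circuit breaker: opens after `threshold` consecutive failures,
// moves to HALF_OPEN once `resetMs` has elapsed, and closes on a success.
class CircuitBreaker {
  constructor({ threshold = 5, resetMs = 30000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.now = now;       // injectable clock, useful for testing
    this.failures = 0;
    this.openedAt = null;
  }
  getState() {
    if (this.openedAt === null) return 'CLOSED';
    return this.now() - this.openedAt >= this.resetMs ? 'HALF_OPEN' : 'OPEN';
  }
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

Production breakers (e.g. the opossum library for Node) add request wrapping, half-open trial limits, and event hooks, but the state machine is the part worth exporting to your health endpoint.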
When a microservices incident occurs, distributed tracing tells you where in the call chain the failure originated. Tools like Jaeger, Zipkin, Tempo, or Datadog APM trace requests across service boundaries.
Combine distributed tracing with your external monitoring alerts.
External monitoring is the alarm; distributed tracing is the diagnosis. See our guide on what application performance monitoring is for a deeper look at APM in this context.
For microservices at scale, individual service monitors become unwieldy. Consider aggregating health data into a single rollup endpoint that reports the status of every service.
External monitoring should check both individual services and the aggregated health endpoint.
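An aggregator can fan out to each service's health endpoint and roll the results up into one response. A sketch (the service list is hypothetical, and the fetch function is injectable so the sketch does not depend on live services):

```javascript
// Query each service's health endpoint and aggregate into one summary.
// `fetchHealth` takes a URL and resolves to { status } or rejects.
async function aggregateHealth(services, fetchHealth) {
  const entries = await Promise.all(
    Object.entries(services).map(async ([name, url]) => {
      try {
        const { status } = await fetchHealth(url);
        return [name, status];
      } catch {
        return [name, 'unreachable'];
      }
    })
  );
  const unhealthy = entries.filter(([, s]) => s !== 'ok').map(([name]) => name);
  return {
    status: unhealthy.length === 0 ? 'ok' : 'degraded',
    services: Object.fromEntries(entries),
    unhealthy,
  };
}
```

Serving this from a dedicated rollup endpoint gives external monitors one URL that reflects the whole system while still naming the specific services that are unhealthy.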
Know your service dependency graph. When service A fails, which services depend on it and will be affected? Draw or maintain a dependency map and use it to predict which services a failure will affect and to prioritize alerts accordingly.
See how to monitor third-party API dependencies for handling external service dependencies within this model.
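Given a dependency map, the blast radius of a failing service is a reverse-graph traversal. A sketch (the example graph in the test is hypothetical):

```javascript
// deps maps each service to the services it depends on.
// affectedBy returns every service that transitively depends on `failed`.
function affectedBy(deps, failed) {
  // Invert the graph: for each dependency, record who depends on it.
  const dependents = {};
  for (const [svc, needs] of Object.entries(deps)) {
    for (const dep of needs) {
      (dependents[dep] = dependents[dep] || []).push(svc);
    }
  }
  // Breadth-first walk outward from the failed service.
  const affected = new Set();
  const queue = [failed];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const svc of dependents[current] || []) {
      if (!affected.has(svc)) {
        affected.add(svc);
        queue.push(svc);
      }
    }
  }
  return [...affected].sort();
}
```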
| Layer | Tool | What It Covers |
|---|---|---|
| External availability | Domain Monitor | User-perspective HTTP, SSL, domain |
| Container orchestration | Kubernetes health probes | Pod liveness and readiness |
| Service mesh | Istio/Linkerd metrics | Inter-service latency and errors |
| APM | Datadog, New Relic, Jaeger | Distributed tracing, error rates |
| Logs | ELK, Loki, Datadog Logs | Error and event logs per service |
| Alerting | PagerDuty, Opsgenie | Escalation and on-call management |
Start with external availability monitoring and Kubernetes health probes. Layer in APM and distributed tracing as your service count and traffic grow.
Monitor your microservices API gateway and key endpoints externally at Domain Monitor — detect user-facing failures regardless of what internal monitoring reports.