
Observability for Microservices: Logs, Metrics, and Traces Done Right

Mohakdeep Singh|May 28, 2025|9 min read

The Observability Challenge

Monolithic applications were relatively simple to debug: one process, one log file, one set of metrics. Microservices shatter this simplicity. A single user request might traverse 10 services, 3 databases, and 2 message queues before returning a response.

When something breaks -- and it will -- you need to answer "what happened and why" in minutes, not hours. That requires observability: the ability to understand a system's internal state from its external outputs.

The Three Pillars

Logs: What Happened

Logs record discrete events: errors, state changes, and significant actions.

Structured logging is mandatory. Plain-text logs are nearly useless at scale. Every log entry should be a structured JSON object with consistent fields:

  - timestamp (ISO 8601)
  - level (debug, info, warn, error)
  - service name
  - trace ID and span ID (for correlation)
  - message
  - relevant context (user ID, request ID, resource identifiers)
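The field list above can be sketched as a minimal JSON formatter using only the Python standard library. The service name and the way trace context is attached (via `extra`) are illustrative choices, not a prescribed scheme:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent fields."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "checkout-service",  # illustrative service name
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Trace context is attached per record via `extra`, so every line
# a service emits can be correlated with its trace later.
logger.info("order created", extra={"trace_id": "abc123", "span_id": "def456"})
```

Because every entry is machine-parseable, the aggregation platform can index on `trace_id` or `level` instead of grepping free text.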

Log aggregation: Ship logs from all services to a centralized platform:

  - ELK Stack (Elasticsearch, Logstash, Kibana): Self-hosted; powerful but operationally heavy
  - Grafana Loki: Lightweight, cost-effective, excellent for Kubernetes environments
  - Cloud-native: AWS CloudWatch Logs, Azure Monitor, GCP Cloud Logging

Metrics: How Is It Performing

Metrics are numerical measurements collected at regular intervals.

The RED Method for Services:

  - Rate: Requests per second
  - Errors: Error rate as a percentage of total requests
  - Duration: Latency distribution (P50, P95, P99)
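In practice a metrics backend computes these for you, but a minimal sketch of the RED calculation over a window of request samples makes the definitions concrete (the tuple shape and function name are ours, not from any particular library):

```python
import statistics

def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, and Duration from one observation window.

    `requests` is a list of (latency_ms, is_error) samples seen in the window.
    """
    total = len(requests)
    errors = sum(1 for _, is_error in requests if is_error)
    latencies = sorted(lat for lat, _ in requests)
    # quantiles(n=100) yields the 1st..99th percentile cut points
    pct = statistics.quantiles(latencies, n=100)
    return {
        "rate_rps": total / window_seconds,
        "error_pct": 100.0 * errors / total,
        "p50_ms": pct[49],
        "p95_ms": pct[94],
        "p99_ms": pct[98],
    }
```

Note that Duration is reported as a distribution (P50/P95/P99), never an average: a handful of very slow requests can hide entirely inside a healthy-looking mean.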

The USE Method for Infrastructure:

  - Utilization: CPU, memory, disk, network usage
  - Saturation: Queue depths, thread pool usage, connection pool exhaustion
  - Errors: Hardware errors, OOM kills, disk failures

Metrics stack:

  - Prometheus: Industry standard for metrics collection in Kubernetes environments
  - Grafana: Visualization and dashboarding
  - Alertmanager: Alert routing and deduplication

Alerting best practices:

  - Alert on symptoms (high error rate, high latency), not causes (high CPU)
  - Use multi-window, multi-burn-rate alerts to reduce false positives
  - Every alert should have a runbook link describing what to do
  - Page only for customer-impacting issues; use tickets for everything else

Traces: Why Did It Break

Distributed tracing follows a request across service boundaries, showing you the complete journey and where time is spent.

How it works:

  1. The first service creates a trace ID and a root span
  2. Each subsequent service creates a child span, inheriting the trace ID
  3. Spans record timing, service name, operation, and status
  4. The complete trace is assembled from all spans sharing the same trace ID
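The steps above can be sketched with plain dictionaries; real tracers use the same shape with binary IDs and wire propagation (the field names here are illustrative, loosely modeled on common span formats):

```python
import time
import uuid

def new_span(operation, service, parent=None):
    """Create a span; a child span inherits its parent's trace_id."""
    return {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_span_id": parent["span_id"] if parent else None,
        "service": service,
        "operation": operation,
        "start": time.time(),
    }

# Step 1: the edge service starts the trace with a root span
root = new_span("GET /checkout", "api-gateway")
# Step 2: each downstream call creates a child span sharing the trace_id
child = new_span("reserve_inventory", "inventory-service", parent=root)
# Step 4: the tracing backend groups spans by trace_id to rebuild the tree
assert child["trace_id"] == root["trace_id"]
```

In a real system the trace ID and parent span ID travel between services in request headers (e.g. the W3C `traceparent` header), which is what lets independently emitted spans be stitched back together.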

Tracing tools:

  - Jaeger: Open-source CNCF project, excellent for Kubernetes
  - Tempo (Grafana): Cost-effective trace storage, integrates with Grafana
  - Cloud-native: AWS X-Ray, Azure Application Insights, GCP Cloud Trace

Putting It All Together

Correlation Is Key

The real power of observability comes from correlating across pillars:

  - Click on a trace to see the logs from every service involved
  - Click on an error log to see the trace that produced it
  - View metrics dashboards filtered by a specific trace attribute

This requires consistent context propagation: every log, metric label, and trace span should share common identifiers (trace ID, service name, environment).
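What that correlation buys you can be shown with a toy query: given logs and spans that all carry a `trace_id` field, pulling everything for one trace is a simple filter-and-sort (the function and field names are ours for illustration):

```python
def telemetry_for_trace(trace_id, logs, spans):
    """Return every log line and span sharing one trace ID, ordered by time.

    This only works because every signal carries the same trace_id field --
    which is exactly what consistent context propagation buys you.
    """
    matching_logs = [l for l in logs if l.get("trace_id") == trace_id]
    matching_spans = [s for s in spans if s.get("trace_id") == trace_id]
    return sorted(matching_logs + matching_spans, key=lambda e: e["timestamp"])
```

Observability platforms implement the same idea at scale: the trace ID is the join key across otherwise separate log, metric, and trace stores.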

OpenTelemetry

OpenTelemetry (OTel) is the emerging standard for instrumentation:

  - Single SDK that produces logs, metrics, and traces
  - Vendor-neutral -- switch backends without changing application code
  - Auto-instrumentation for popular frameworks (Express, Spring, Django, Flask)
  - CNCF project with broad industry support

We strongly recommend standardizing on OpenTelemetry for all new services.
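As a sketch of what that setup looks like with the OTel Python SDK (the service name and span attribute are illustrative; the console exporter is for local experimentation):

```python
# Requires: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire the SDK once at process startup. Swapping ConsoleSpanExporter for an
# OTLP exporter changes the backend without touching application code --
# the vendor-neutrality point above.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative service name
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "ord-123")  # illustrative attribute
```

Auto-instrumentation packages go further, wrapping popular frameworks so most spans are created for you without manual `start_as_current_span` calls.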

Service Level Objectives (SLOs)

Define SLOs for your critical services:

  - Availability: 99.9% of requests return non-error responses
  - Latency: 95% of requests complete within 200ms
  - Correctness: 99.99% of transactions are processed accurately

Use error budgets to balance reliability with feature velocity. When the error budget is exhausted, prioritize reliability work over new features.
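The error-budget arithmetic is simple enough to sketch directly (the function name is ours):

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left in the current SLO window.

    1.0 means the budget is untouched; 0.0 or below means it is exhausted
    and reliability work should take priority over new features.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures

# e.g. a 99.9% availability SLO over 1,000,000 requests allows 1,000 failures
```

Framing reliability as a budget makes the trade-off explicit: shipping risky changes "spends" budget, and an exhausted budget is an objective, pre-agreed signal to slow down.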

Getting Started

  1. Week 1: Adopt structured logging across all services, ship to a centralized platform
  2. Week 2: Deploy Prometheus and Grafana, instrument RED metrics for top 5 services
  3. Week 3: Add distributed tracing with OpenTelemetry, connect to Jaeger or Tempo
  4. Week 4: Build dashboards correlating all three pillars, define SLOs for critical services

At Optivulnix, observability is a cornerstone of our DevSecOps practice. We help teams build observable systems that reduce mean time to detection and resolution. Contact us for a free observability maturity assessment.
