MODULE 18 OF 22 · PRACTICE AND STRATEGY

Observability and SRE

30 min read · 4 outcomes · Interactive quiz

By the end of this module you will be able to:

  • Distinguish the three pillars of observability: logs, metrics, and traces
  • Write a Service Level Objective (SLO) using the standard SLI/SLO/SLA template
  • Explain error budgets and how they govern reliability versus feature velocity trade-offs
  • Design the observability stack for a microservices system using OpenTelemetry

Real-world incident · March 2021

A 47-minute outage that took 40 minutes to diagnose because no one had distributed tracing.

In March 2021 a UK payments company experienced a latency spike affecting 8% of checkout transactions. The on-call engineer saw the alert: p99 latency had climbed from 180 milliseconds to 4,200 milliseconds. Individual service metrics showed all services reporting normal error rates. Individual service logs showed no errors.

The investigation took 40 minutes. Without distributed tracing, the engineer had to manually correlate request IDs across five separate log streams, filtering by timestamp windows that overlapped imperfectly across services. The root cause was a database connection pool exhaustion in the inventory service, causing 3,800 milliseconds of wait time per request at that point in the chain.

Two weeks later the team deployed OpenTelemetry (the Cloud Native Computing Foundation, or CNCF, standard for distributed instrumentation) across all services. In the next comparable incident, the root cause was identified in 90 seconds: one click on the trace waterfall showed the inventory service span taking 3,800 milliseconds while all other spans were under 50 milliseconds.

When a request crosses five microservices and fails, which service caused the problem? Without distributed tracing, finding the answer requires reading five separate log streams and manually correlating timestamps. With tracing, you see the entire request journey in a single view.

With the learning outcomes established, this module begins by examining the three pillars of observability in depth.

18.1 The three pillars of observability

Observability is the ability to infer the internal state of a system from its external outputs. A system is observable if, when something goes wrong, you can understand what happened, where, and why, without deploying new code to add more logging. The three pillars are logs, metrics, and traces.

Logs are timestamped records of discrete events. They answer the question "what happened at this exact time?" They are best for debugging specific errors and producing audit trails. Tools include Elasticsearch, Grafana Loki, and Splunk.

Metrics are numeric measurements aggregated over time. They answer "how is this service behaving over the last hour?" They power dashboards, alerting, and trend analysis. Tools include Prometheus, Datadog, and Amazon CloudWatch.

Traces are end-to-end records of a request's journey across services. They answer "which service in the call chain caused this latency?" They are indispensable in microservices where a single user request may cross five or more services. Tools include Jaeger, Zipkin, and Grafana Tempo.

These pillars complement each other. Metrics alert you that something is wrong. Traces show you where in the request path the problem is. Logs show you the specific error at that location.

Observability lets us understand a system from the outside, by asking questions about it without knowing its inner workings.

Majors, C., Fong-Jones, L., Miranda, G. (2022). Observability Engineering. O'Reilly Media, Chapter 1

The distinction is between monitoring (checking known failure modes) and observability (being able to ask arbitrary questions about system state). A monitored system can tell you when a defined threshold is exceeded. An observable system lets you diagnose failures that were not anticipated when the alerts were written.

With an understanding of the three pillars of observability in place, the discussion can now turn to structured logging and the RED method, which build directly on these foundations.

18.2 Structured logging and the RED method

Structured logging emits machine-parseable records (typically JSON) rather than unstructured text strings. A log line that reads "ERROR: Payment failed for order ord-9f8e7d6c" cannot be queried programmatically. A structured log with fields order_id, failure_reason, and trace_id can be filtered, aggregated, and correlated across services.

The trace_id field is the critical link between pillars. It connects a structured log event to the distributed trace that spans the full request. When you see an error in a log, the trace ID takes you directly to the trace waterfall showing every service that request touched.
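
A minimal sketch of the idea, using only the Python standard library (the field values, including the trace ID, are illustrative):

    import json
    import logging
    import sys

    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    log = logging.getLogger("payment-service")

    def log_event(level, message, **fields):
        # One JSON object per line: every field becomes queryable in the log store.
        log.log(level, json.dumps({"message": message, **fields}))

    # The trace_id field links this event to the distributed trace for the request.
    log_event(
        logging.ERROR,
        "Payment failed",
        order_id="ord-9f8e7d6c",
        failure_reason="card_declined",  # illustrative value
        trace_id="4bf92f3577b34da6a3ce929d0e0e4736",  # illustrative value
    )

Unlike the unstructured string, this event can be filtered by failure_reason, aggregated by order_id, or joined to its trace via trace_id.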

The RED method (introduced by Tom Wilkie at Weaveworks in 2017) defines three metrics that capture the health of any service entry point. Rate is requests per second. Errors is the fraction of requests returning errors. Duration is the distribution of request latencies, typically expressed as p50, p95, and p99 percentiles. Instrument every service entry point with RED metrics to form the basis of Service Level Objectives (SLOs).
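
A sketch of RED instrumentation for a single endpoint, assuming the prometheus_client library; the metric names, histogram buckets, and the process() handler are illustrative choices, not prescribed by the method:

    import time
    from prometheus_client import Counter, Histogram, start_http_server

    # Rate and Errors: count every request, labelled by outcome.
    REQUESTS = Counter("checkout_requests_total", "Checkout requests", ["status"])
    # Duration: a histogram from which p50/p95/p99 are derived at query time.
    LATENCY = Histogram(
        "checkout_request_duration_seconds",
        "Checkout request latency",
        buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
    )

    def process(request):
        return {"status": "ok"}  # stand-in for real checkout logic

    def handle_checkout(request):
        start = time.perf_counter()
        try:
            result = process(request)
            REQUESTS.labels(status="2xx").inc()
            return result
        except Exception:
            REQUESTS.labels(status="5xx").inc()
            raise
        finally:
            LATENCY.observe(time.perf_counter() - start)

    start_http_server(8000)  # exposes /metrics for Prometheus to scrape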

Common misconception

High error rates in logs mean the service is down.

Error rates in logs reflect errors that reached the logging layer, which is a subset of all failures. Network timeouts that are never received by the server, silent data corruption, and degraded performance without errors all cause user impact without appearing as log errors. The RED method captures errors from the client perspective (how many requests returned a 5xx status) rather than the server log perspective. Both are needed.

With an understanding of structured logging and the RED method in place, the discussion can now turn to distributed tracing with OpenTelemetry, which builds directly on these foundations.

18.3 Distributed tracing with OpenTelemetry

A single user request crossing five services generates five separate log streams with no automatic connection between them. Distributed tracing solves this by adding a trace ID to the first incoming request and propagating it through every downstream call. Each service adds a span (a timed record of its portion of the work) to the trace. The result is a waterfall diagram showing the full request journey, with each service's contribution displayed as a bar.

OpenTelemetry, published by the CNCF in 2019, is the vendor-neutral standard for instrumenting applications across all three pillars. It provides Software Development Kits (SDKs) for 12 or more languages and exporters to all major observability platforms. Adopting OpenTelemetry prevents lock-in to a single vendor's instrumentation library: the same code can export to Jaeger, Datadog, or Honeycomb by changing the exporter configuration.
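
A minimal sketch of span creation and context propagation with the OpenTelemetry Python SDK (service, span, and function names are illustrative; the console exporter stands in for a real backend):

    from opentelemetry import trace
    from opentelemetry.propagate import inject
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Wire up the SDK once at process startup.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("checkout-service")

    def call_inventory_service(order_id, headers):
        # Stand-in for an HTTP call; the headers carry the trace context downstream.
        pass

    def checkout(order_id):
        with tracer.start_as_current_span("checkout") as span:
            span.set_attribute("order.id", order_id)
            # inject() adds a W3C traceparent header so the downstream service's
            # spans join the same trace and appear in the same waterfall.
            headers = {}
            inject(headers)
            call_inventory_service(order_id, headers)

    checkout("ord-9f8e7d6c")

Each downstream service repeats the pattern: extract the incoming context, start its own child span, and inject the context into outgoing calls, so every hop lands in the same trace.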

In the opening incident, the trace waterfall would have shown the inventory service span at 3,800 milliseconds with a single click. The 40-minute investigation would have taken 90 seconds. That difference is the ROI of distributed tracing.

OpenTelemetry provides a single, open standard for capturing and exporting telemetry data. It removes the need to choose between vendor-specific instrumentation libraries.

OpenTelemetry documentation (2024). What is OpenTelemetry? opentelemetry.io

Before OpenTelemetry, teams had to choose between vendor lock-in (Datadog agent, New Relic SDK) or writing their own instrumentation. OpenTelemetry decouples instrumentation (adding trace, metric, and log collection to your code) from the backend (where data is stored and visualised). Switch backends by changing exporter configuration, not by re-instrumenting your services.
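
To make the decoupling concrete: reusing the sketch from earlier in this section, pointing the same instrumentation at an OTLP-capable backend is a configuration change only (the collector endpoint is an assumed example; the opentelemetry-exporter-otlp package provides the exporter):

    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Same spans, different destination: send traces to an OpenTelemetry Collector
    # (or any OTLP endpoint exposed by Jaeger, Tempo, Datadog, Honeycomb, etc.).
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
        )
    )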

With an understanding of distributed tracing with OpenTelemetry in place, the discussion can now turn to SRE, SLOs, and error budgets, which build directly on these foundations.

18.4 SRE: SLOs and error budgets

Site Reliability Engineering (SRE) is a discipline originating at Google that applies engineering principles to operations. Its central mechanism is the Service Level Objective (SLO). An SLO is a target for a service's reliability expressed as a ratio of successful events over a time window.

Three terms form the hierarchy. A Service Level Indicator (SLI) is the metric you measure, for example the fraction of payment requests that return a 2xx status within 500 milliseconds. An SLO is the target for that SLI, for example 99.9% of payment requests succeed within 500 milliseconds over a rolling 30-day window. A Service Level Agreement (SLA) is a contractual commitment with penalties, for example 99.9% availability with a 10% credit on the next invoice if the target is missed.

The error budget is the allowed failure quantity implied by the SLO. An SLO of 99.9% means 0.1% of requests may fail. Over a 30-day month that is 43.2 minutes of allowed failure time. This number governs the trade-off between reliability work and feature velocity: if the error budget is mostly consumed, the team pauses feature deployments until the window rolls over and budget is replenished.
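
The arithmetic is simple enough to check directly; a worked sketch using the figures in this section:

    def error_budget_minutes(slo, window_days=30):
        # Allowed full-outage minutes per window implied by an availability SLO.
        window_minutes = window_days * 24 * 60
        return (1.0 - slo) * window_minutes

    print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes for a 99.9% SLO
    print(round(error_budget_minutes(0.9999), 1))  # 4.3 minutes for four nines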

If a product team has an error budget that has not been used up, they are free to take risks. If the error budget has been used up, the SRE team can veto releases, or the team itself can determine that they need to slow down.

Beyer, B., Jones, C., Petoff, J., Murphy, N.R. (2016). Site Reliability Engineering. O'Reilly Media, Chapter 3: Embracing Risk

Error budgets replace subjective arguments about reliability versus speed with a quantitative mechanism. 'Can we deploy this risky feature?' becomes 'Do we have error budget to absorb a bad deployment?' The answer comes from a number, not a negotiation. This is the governance contribution of SRE: engineering discipline applied to the release process.

Common misconception

Setting a four-nines (99.99%) SLO shows engineering rigour.

A 99.99% SLO allows roughly 4.3 minutes of failure per 30-day month. If your service currently achieves 99.5%, every deployment risks consuming months of budget in a single incident. Set SLOs just above your current measured baseline and tighten them incrementally as reliability improves. An SLO that cannot be met does not drive reliability; it drives gaming of the measurement.

18.5 Check your understanding

A payments team's Prometheus dashboard shows p99 latency spiking to 4,000ms at 15:30. Individual service logs show no errors. Which observability pillar would most quickly identify whether the latency is internal to the payment service or caused by a downstream dependency?

A payment API has an SLO of 99.9% successful requests over a rolling 30-day window. An incident at 15:30 caused 0.08% of requests over 45 minutes to fail. How much of the monthly error budget was consumed?

What does OpenTelemetry's vendor-neutral design allow a team to do?

A team sets an SLO of 99.999% (five nines) for their internal analytics dashboard. The dashboard currently achieves 98.5% availability. What is the problem with this SLO?

With the three pillars of observability and the SRE reliability framework established, the interactive tool below lets you explore how logs, traces, metrics, and error budgets work together in a live system context.

Explore the concepts interactively

Use this interactive diagram to explore the concepts discussed in this module. Click on elements to see how they relate to each other and to the patterns covered above.


Key takeaways

  • Logs (events), metrics (aggregated numbers), and traces (request journeys) serve different diagnostic purposes. Metrics alert you that something is wrong; traces show you where; logs show you what happened.
  • Structured logging with trace IDs links individual log events to the distributed traces they belong to, enabling rapid incident diagnosis.
  • The RED method (Rate, Errors, Duration) provides three metrics that together capture the health of any service entry point and form the basis for SLOs.
  • SLOs define reliability targets. Error budgets are the allowed failure quantity implied by an SLO. They provide a quantitative mechanism for balancing reliability work against feature deployments.
  • OpenTelemetry is the CNCF vendor-neutral standard for instrumentation. Adopt it to avoid lock-in to any single observability platform.

Standards and sources cited in this module

  1. Beyer, B. et al. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media

    Chapter 3: Embracing Risk; Chapter 4: Service Level Objectives

    The foundational SRE text. Introduces SLOs, error budgets, and the reliability engineering practice. Quoted in Section 18.4 for the error budget governance mechanism.

  2. OpenTelemetry documentation. opentelemetry.io, 2024

    What is OpenTelemetry; Concepts: Signals, Traces, Metrics, Logs

    The CNCF standard for distributed instrumentation. Cited in Section 18.3 for the vendor-neutral design and the decoupling of instrumentation from backend.

  3. Majors, C., Fong-Jones, L., Miranda, G. (2022). Observability Engineering. O'Reilly Media

    Chapter 1: What is Observability?

    The most current practical guide to building observable systems. Quoted in Section 18.1 for the definition of observability as the ability to ask arbitrary questions about system state.

What comes next: Observable systems reveal problems. Deployment strategies control how much damage a problem can cause. Module 19 covers blue-green, canary, and rolling deployments, feature flags as a release mechanism, and the principle that deployment and release should be separate decisions.

Module 18 of 22 in Practice and Strategy