Operations, Observability, and SRE
By the end of this module you will be able to:
- Explain SRE principles including error budgets, and distinguish between SLIs, SLOs, and SLAs with concrete examples
- Describe the observability pillars - metrics, logs, and distributed traces - and explain why all three are required for effective diagnosis of distributed system failures
- Apply the principle of blameless post-mortems and describe how chaos engineering reduces production risk through controlled failure injection

Real-world innovation · Google · SRE model inception (2003)
What happens when you give an operations team an error budget?
In 2003, Google had a reliability problem that almost every technology company eventually faces. Software engineers built and shipped features as fast as possible. Operations teams deployed them and kept the lights on. When things broke, each blamed the other. The incentives were misaligned: developers wanted change, operators wanted stability.
Ben Treynor Sloss created Site Reliability Engineering by hiring software engineers into operations roles and giving them a specific constraint: an error budget. If a service had a target of 99.9% availability, the error budget was 0.1% of time: roughly 8.7 hours per year, or about 2.2 hours per quarter. Teams could spend that budget on risk: on fast deployments, on experiments, on features. But once the budget was exhausted, deployments froze until reliability recovered.
By 2023, Google's search had 99.999% availability, roughly 5 minutes of downtime per year. The error budget transformed the reliability conversation from blame to shared accountability. A team that never uses its error budget is being too conservative; a team that exhausts it is moving too fast. The number created a common language between engineering and operations for the first time.
Does freedom to fail actually make systems more reliable?
With the learning outcomes established, this module begins by examining SRE principles: error budgets and the SLI/SLO/SLA hierarchy in depth.
11.1 SRE Principles: Error Budgets and the SLI/SLO/SLA Hierarchy
Site Reliability Engineering (SRE) is a discipline, not a tool. Google's SRE Book, published in 2016 and available free online, defines SRE as "what happens when you ask a software engineer to design an operations function." The key insight is that reliability is a software engineering problem, not purely an operations problem.
The SLI/SLO/SLA hierarchy provides a precise vocabulary for reliability targets. A Service Level Indicator (SLI) is a specific, measurable property of a service. Common SLIs include request success rate (the percentage of HTTP requests that return a non-5xx response), latency at a percentile (for example, P99 latency in milliseconds), and throughput (requests per second). An SLI is a number you can measure; the target for that number belongs to the next level of the hierarchy.
A Service Level Objective (SLO) is a target value for an SLI. "99.9% of requests will return a non-5xx response in a rolling 30-day window" is an SLO. SLOs are internal commitments that drive engineering decisions. They are set by product and engineering teams.
A Service Level Agreement (SLA) is a contractual commitment to an external party, typically a customer, with defined consequences for breach. An SLA is usually a weaker target than the SLO: if the SLO is 99.9%, the SLA might be 99.5%, giving the team a buffer before contractual penalties apply. AWS, Google Cloud, and Azure publish SLAs for their services with specific credit structures for availability failures.
The error budget is derived from the SLO. If the SLO is 99.9% availability over 30 days (43,200 minutes), the error budget is 0.1% of 43,200 minutes, which is 43.2 minutes. This is the amount of downtime the team is allowed per month. The budget creates accountability: when it is spent, deployments halt.
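The arithmetic is simple enough to express directly. A minimal sketch in Python (the function name and the spent-downtime figure are ours, for illustration):

    def error_budget_minutes(slo: float, window_days: int = 30) -> float:
        """Minutes of allowed downtime for an availability SLO over a window."""
        return (1 - slo) * window_days * 24 * 60

    budget = error_budget_minutes(0.999)   # 43.2 minutes per 30 days
    spent = 12.0                           # downtime observed so far (example)
    print(f"budget: {budget:.1f} min, remaining: {budget - spent:.1f} min")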
The error budget changes the question from "was there downtime?" to "how much of our downtime budget have we spent, and are we spending it on things worth the risk?" This reframes reliability as a resource to be managed, not a metric to be minimised at all costs.
SRE gives us a framework for setting and spending reliability budgets. To know when the budget is being spent, teams need visibility into what the system is actually doing. Section 11.2 covers the observability signals - metrics, logs, and traces - that make distributed systems diagnosable.
11.2 The Three Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. The term was popularised in the software context by Charity Majors (co-founder of Honeycomb) and the OpenTelemetry project. Observability differs from monitoring: monitoring watches known failure modes; observability enables investigation of unknown failures.
The three pillars are metrics, logs, and distributed traces. Each answers a different question, and all three are required for effective diagnosis of failures in distributed systems.
- Metrics: time-series numerical data aggregated over time. Request rate, error rate, and latency percentiles (P50, P95, P99) are the canonical metrics for any service, collectively called the RED method (Rate, Errors, Duration). Metrics answer: "is something wrong right now?"
- Logs: timestamped records of discrete events. A log entry records that something happened at a specific time with specific context. Logs answer: "what happened?" Structured logging (JSON format) makes logs searchable and parseable by tools such as Elastic, Splunk, and Datadog; a minimal sketch follows this list.
- Distributed traces: follow a single request through all the services it touches, recording how long each service spent. Traces answer: "where did this specific request get slow or fail?" In a system with 30 microservices, metrics tell you P99 latency is elevated; only a trace identifies the slowdown as occurring in the database call inside the payment service.
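To make the logs pillar concrete, here is a minimal structured-logging sketch in Python. The field names (service, order_id, and so on) are hypothetical, and a real service would use a logging library rather than print:

    import json
    import time

    # One JSON object per event: log platforms can then index and query
    # individual fields instead of regex-parsing free text.
    def log_event(level: str, message: str, **fields) -> None:
        record = {"ts": time.time(), "level": level, "msg": message, **fields}
        print(json.dumps(record))

    # Hypothetical event from a payment service
    log_event("error", "payment failed",
              service="payment-service", order_id="A-1042",
              http_status=502, latency_ms=1840)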
“An error budget is the tool SRE uses to balance the risk of unavailability with the goal of innovation.”
Google SRE Book, Beyer et al. - Chapter 3: Embracing Risk
This framing is critical: the error budget is not a failure allowance; it is a risk investment vehicle. A team that spends zero of its error budget is investing zero in innovation. A team that exhausts its budget is investing too much. The budget creates the discipline of treating reliability as a scarce resource to be allocated, not a binary pass-or-fail outcome.
Common misconception
“99.9% uptime is good enough.”
99.9% availability means 8.7 hours of downtime per year. For retail banking, 99.9% allows roughly 43 minutes of downtime per month, which can mean millions in lost transactions during peak periods. UK financial services regulators expect 99.99% for critical services (52 minutes of downtime per year). NHS systems handling urgent care typically require 99.995%. The correct SLO depends entirely on the business cost of downtime, not a general standard.
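The arithmetic behind these figures can be checked directly:

    # Downtime allowed at common availability targets
    # (30-day month, 365-day year)
    for slo in (0.999, 0.9999, 0.99995):
        per_month_min = (1 - slo) * 30 * 24 * 60
        per_year_min = (1 - slo) * 365 * 24 * 60
        print(f"{slo:.3%}: {per_month_min:5.1f} min/month, "
              f"{per_year_min:6.1f} min/year")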
Metrics, logs, and traces are the what of observability. OpenTelemetry is the standard that defines how teams instrument their code to produce them consistently across services and vendors. Section 11.3 covers OpenTelemetry in practice.
11.3 OpenTelemetry: The Standard for Instrumentation
OpenTelemetry (OTel) is a Cloud Native Computing Foundation (CNCF) project that provides a vendor-neutral standard for collecting metrics, logs, and traces. It reached general availability for traces in 2021 and for metrics in 2023. The OpenTelemetry Collector is an agent that receives telemetry from applications, processes it, and exports it to any compatible backend (Jaeger, Prometheus, Elastic, Datadog, Grafana, and others).
Before OpenTelemetry, each observability vendor had its own SDK, requiring application code to import vendor-specific libraries. Switching from one observability platform to another required code changes across every service. OpenTelemetry separates instrumentation (in the application code) from the backend (where data is stored and queried). Changing the backend requires only a configuration change to the collector.
The OTel semantic conventions define standard attribute names for common operations: http.method, db.system, and service.name. This standardisation means that traces from a Python service and a Go service can be correlated without custom parsing, enabling end-to-end trace visualisation across polyglot systems.
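As an illustration, this is roughly what minimal manual instrumentation looks like with the OpenTelemetry Python SDK. It is a sketch, not a production setup; the service and span names are hypothetical:

    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # service.name follows the OTel semantic conventions; the value is hypothetical.
    resource = Resource.create({"service.name": "payment-service"})

    # Exporting to the console here; pointing at a collector or any other
    # backend is a configuration change, not a code change.
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("payment-service")

    with tracer.start_as_current_span("charge-card") as span:
        span.set_attribute("http.method", "POST")    # semantic-convention names
        span.set_attribute("db.system", "postgresql")
        # ... call the database and downstream services here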
“OpenTelemetry provides a single, open-source standard for collecting telemetry data. It is language-agnostic and vendor-neutral.”
OpenTelemetry Project, CNCF - opentelemetry.io/docs
The significance of 'vendor-neutral' in this context is concrete: before OpenTelemetry, adopting an observability platform meant committing to that vendor's SDK in every service. OpenTelemetry ended this lock-in by providing a standard data format and collection pipeline. The choice of observability backend becomes a configuration decision, not an architectural commitment.
Instrumentation produces the data. Incident management is what happens when that data shows something going wrong. Section 11.4 covers structured incident response and the blameless post-mortem culture that turns failures into improvements.
11.4 Incident Management and Blameless Post-Mortems
Incident management is the process of detecting, responding to, mitigating, and learning from service disruptions. The incident management lifecycle has five stages:
- Detection: alerting fires based on SLO violations or anomalies
- Triage: assessing severity and blast radius
- Mitigation: restoring service, not necessarily fixing the root cause
- Resolution: addressing the root cause
- Post-mortem: learning from the incident to prevent recurrence
PagerDuty and OpsGenie are the dominant incident alerting platforms, routing alerts to on-call engineers based on severity and schedule. Well-designed alert policies fire on symptoms (high error rate, elevated P99 latency) rather than causes (CPU usage), reducing alert fatigue. A team receiving 200 alerts per day has no alerts; it has noise.
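To illustrate alerting on symptoms, here is a hedged sketch of an SLO burn-rate check; the function is ours, not any vendor's API, and the 14.4 threshold is the fast-burn multiplier suggested in Google's SRE Workbook:

    SLO_TARGET = 0.999  # 99.9% success-rate SLO

    def should_page(total: int, failed: int, threshold: float = 14.4) -> bool:
        """Page when the error-budget burn rate exceeds the fast-burn threshold.

        A burn rate of 1.0 spends the budget exactly on schedule; at 14.4,
        one hour at this rate consumes about 2% of a 30-day budget.
        """
        if total == 0:
            return False
        error_rate = failed / total
        burn_rate = error_rate / (1 - SLO_TARGET)
        return burn_rate > threshold

    print(should_page(100_000, 250))    # 0.25% errors -> burn rate 2.5 -> False
    print(should_page(100_000, 2_000))  # 2% errors -> burn rate 20  -> True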
Blameless post-mortems (also called incident reviews) are a practice originating at Google and popularised by Etsy. The principle is that incidents are caused by system conditions, not individual errors. A post-mortem documents what happened, when, what the impact was, what actions were taken, what the contributing factors were, and what specific actions will prevent recurrence.
The UK's National Cyber Security Centre (NCSC) guidance on operational resilience recommends that organisations define "important business services" and set tolerance levels for disruption, aligning with Bank of England and FCA operational resilience requirements introduced in 2022 for financial services.

Incident management responds to failures after they occur. Chaos engineering proactively introduces controlled failures to verify that the system behaves correctly before real failures cause production incidents. Section 11.5 covers chaos engineering practice.
11.5 Chaos Engineering
Chaos engineering is the practice of deliberately injecting failures into a production or production-like system to identify weaknesses before they cause incidents. The discipline was formalised by Netflix engineers with the publication of the Principles of Chaos Engineering in 2014 and the open-sourcing of their Chaos Monkey tool.
Netflix operates across Amazon Web Services and must tolerate the failure of individual AWS availability zones. Chaos Monkey randomly terminates virtual machine instances during business hours, forcing Netflix's engineering teams to build services that are resilient to instance failure. The logic is: if instances fail randomly in production, engineering teams have a strong incentive to build fault-tolerant services. If instances only fail during planned maintenance windows, the incentive is weaker.
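A Chaos-Monkey-style termination loop can be sketched in a few lines. This is illustrative, not Netflix's implementation; it assumes boto3 with AWS credentials configured, and the chaos-opt-in tag is hypothetical:

    import random

    import boto3  # assumes AWS credentials and permissions are configured

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Only instances explicitly opted in to chaos experiments are candidates.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]

    if candidates:
        victim = random.choice(candidates)
        ec2.terminate_instances(InstanceIds=[victim])
        print(f"chaos: terminated {victim}")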
Gremlin is the leading commercial chaos engineering platform, providing controlled failure injection across CPU, memory, network, and infrastructure layers with automatic blast-radius limiting and rollback. GameDays are structured chaos engineering exercises in which teams run failure scenarios against production systems with full organisation awareness, learning how systems and teams respond under controlled conditions.
The NCSC's Resilience guidance recommends that organisations in critical national infrastructure sectors (energy, finance, healthcare, transport) conduct regular resilience testing, including simulated outages, to validate their recovery capabilities against stated tolerance levels.
Common misconception
“Logging everything solves observability.”
Logging everything creates storage costs and signal-to-noise problems that make diagnosis slower, not faster. Observability requires three signals: metrics to detect problems (low cardinality, fast to query), logs to understand what happened at an event level (high cardinality, expensive), and traces to diagnose where in a distributed call chain the problem occurred. Without traces, debugging a latency spike across 30 microservices from logs alone can take hours. All three signals are required; maximising one does not substitute for the others.

Scenario exercises
A payment service has a 99.95% availability SLO over a 30-day period (43,200 minutes). The error budget is 0.05% of 43,200 minutes (21.6 minutes). Three weeks into the month, the team has used 19 minutes of downtime. The product manager wants to push a significant new feature that carries a 10% risk of a 15-minute outage. Should they proceed?
A P99 latency alert fires at 3:00am for a service that processes 2,000 requests per second across 15 microservices. The on-call engineer has metrics (elevated P99) but no logs above DEBUG level and no distributed traces configured. What is the most likely outcome, and what should have been configured to prevent it?
Netflix uses Chaos Monkey to randomly terminate production VM instances during business hours. A new engineer argues this is irresponsible. What is the strongest counter-argument?
Key takeaways
- SLIs measure specific service properties; SLOs set internal targets for SLIs; SLAs are contractual commitments to customers based on SLOs, typically set lower than the SLO to create a buffer before penalties apply.
- Error budgets transform reliability from a binary measure into a resource that teams can spend on risk, creating a shared language between engineering and operations about how much change is safe in a given period.
- The three observability pillars are metrics (is something wrong?), logs (what happened?), and distributed traces (where in the call chain did it fail?); all three are required for diagnosis of distributed system failures.
- OpenTelemetry is the CNCF standard for vendor-neutral instrumentation, separating telemetry collection from the observability backend and ending the lock-in of vendor-specific SDKs in application code.
- Blameless post-mortems treat incidents as system failures, not individual errors; the output is specific remediation actions, not individual accountability.
- Chaos engineering makes failure a planned, observable, learnable event by injecting controlled failures into production systems, building genuine resilience rather than assumed resilience.
Standards and sources cited in this module
Beyer, B. et al., Site Reliability Engineering
O'Reilly Media, 2016. Free at sre.google/sre-book
Primary source for SRE principles, error budget definition, and SLI/SLO/SLA hierarchy. Referenced throughout Sections 11.1 and 11.4.
OpenTelemetry Project, CNCF
opentelemetry.io
Primary source for OpenTelemetry specification, semantic conventions, and collector documentation. Referenced in Section 11.3.
Principles of Chaos Engineering
principlesofchaos.org
Foundational reference for chaos engineering practice, first published by Netflix engineers in 2014. Referenced in Section 11.5.
NCSC, Operational Resilience Guidance for CNI Sectors
ncsc.gov.uk/guidance/operational-resilience
UK government guidance on resilience testing requirements for critical national infrastructure. Referenced in Sections 11.4 and 11.5.
FCA and PRA, Operational Resilience Policy Statement
PS21/3, March 2021
UK financial services regulatory requirement for important business service definitions and impact tolerances. Referenced in Section 11.4.
Applied digitalisation is complete. You can build, measure, integrate, plan, and operate digital systems. Stage 3 moves to enterprise strategy: roadmaps, operating models, governance, and the decisions that determine whether digital capability translates to organisational advantage.