Foundations · Module 6

Operations

Run systems reliably in production using Google SRE principles, the four golden signals, the observability triad (logs, metrics, and traces), and effective incident response.

27 min · 4 outcomes · Software Architecture Foundations

Why this matters

Operations are part of the architecture.

What you will be able to do

  • Explain operations in your own words and apply it to a realistic scenario.
  • Explain why operations is feedback: signals should trigger action, not only fill dashboards.
  • Check the assumption "Signals reflect user impact" and explain what changes if it is false.
  • Check the assumption "Runbooks exist" and explain what changes if it is false.

Before you begin

  • No previous technical background required
  • Read the section explanation before using tools

Common ways people get this wrong

  • Alert fatigue. Too many alerts teach people to ignore them.
  • No learning loop. If incidents do not improve systems, you repeat them.

Operations are part of the architecture. We design for evidence, response, and learning, not only uptime.

Mental model

Operate with signals

Operations is feedback. Signals should trigger action, not only fill dashboards.

  1. System
  2. Signals
  3. Alerts
  4. Runbook
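
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a real alerting system: the `Signal` class, the `RUNBOOKS` mapping, the file paths, and the thresholds are all hypothetical names chosen for the example. It shows each breaching signal producing an alert that points to a runbook, and makes a missing runbook visible instead of silent.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str         # what is being measured
    value: float      # current reading taken from the system
    threshold: float  # level above which someone must act


# Hypothetical mapping from signal name to the runbook that handles it.
RUNBOOKS = {
    "error_rate": "runbooks/high-error-rate.md",
    "p95_latency_ms": "runbooks/slow-responses.md",
}


def evaluate(signals):
    """Return (signal name, runbook) for every signal above its threshold."""
    actions = []
    for s in signals:
        if s.value > s.threshold:
            # A breach with no runbook is itself a finding worth recording.
            runbook = RUNBOOKS.get(s.name, "NO RUNBOOK: escalate and write one")
            actions.append((s.name, runbook))
    return actions


signals = [
    Signal("error_rate", 0.05, 0.01),      # breaching, runbook exists
    Signal("p95_latency_ms", 120, 300),    # healthy, no alert
    Signal("queue_depth", 900, 500),       # breaching, no runbook
]
for name, runbook in evaluate(signals):
    print(f"ALERT {name} -> {runbook}")
```

In this sketch a healthy signal produces no output at all, which is the point: a signal only reaches a person when there is something to do about it.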

Assumptions to keep in mind

  • Signals reflect user impact. Signals that ignore user impact hide real failures.
  • Runbooks exist. If you have no runbook, incidents become improvisation.

Failure modes to notice

  • Alert fatigue. Too many alerts teach people to ignore them.
  • No learning loop. If incidents do not improve systems, you repeat them.
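
One common way to reduce alert noise is to page only when a breach is sustained rather than on every single spike. The sketch below assumes a hypothetical `should_page` helper and an arbitrary default of three consecutive samples; both are illustrative choices, not recommended values.

```python
def should_page(samples, threshold, sustained=3):
    """Page only if the most recent `sustained` samples all exceed the threshold."""
    if len(samples) < sustained:
        return False  # not enough evidence yet
    return all(value > threshold for value in samples[-sustained:])


# A single spike does not page; a sustained breach does.
print(should_page([0.1, 0.9, 0.1, 0.2], threshold=0.5))
print(should_page([0.1, 0.9, 0.8, 0.7], threshold=0.5))
```

The trade-off is detection delay: a longer `sustained` window means fewer false pages but a slower response to a real incident.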

Check yourself

Quick check. Operations

Name the four golden signals.

Latency, traffic, errors, and saturation.
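
As a rough sketch, the four signals can be computed from one window of request records. The `Request` class, the `golden_signals` function, and its parameters (`window_s`, `in_flight`, `capacity`) are hypothetical names for this example. Saturation is modelled here as in-flight requests divided by capacity, which is only one of several possible definitions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarise one observation window as the four golden signals."""
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95: the value at position ceil(0.95 * n), using integer maths.
    rank = (95 * n + 99) // 100
    p95 = durations[rank - 1] if n else 0.0
    return {
        "latency_p95_ms": p95,                                            # latency
        "traffic_rps": n / window_s,                                      # traffic
        "error_rate": sum(not r.ok for r in requests) / n if n else 0.0,  # errors
        "saturation": in_flight / capacity,                               # saturation
    }


# 20 requests in a 10 second window, durations 10..200 ms, the slowest one failed.
window = [Request(duration_ms=10.0 * i, ok=(i != 20)) for i in range(1, 21)]
print(golden_signals(window, window_s=10, in_flight=8, capacity=10))
```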

What is an SLO?

A service level objective: a target for service behaviour that matches user experience, such as a p95 latency threshold or a maximum error rate.
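
A minimal sketch of checking a latency SLO, assuming the objective is phrased as "at least a target fraction of requests complete within a threshold". The function names, the 300 ms threshold, and the targets are illustrative assumptions, not values from this module.

```python
def slo_compliance(durations_ms, threshold_ms):
    """Fraction of requests that completed at or under the latency threshold."""
    if not durations_ms:
        return 1.0  # no traffic means nothing violated the objective
    good = sum(d <= threshold_ms for d in durations_ms)
    return good / len(durations_ms)


def meets_slo(compliance, target):
    """True when the measured compliance reaches the target."""
    return compliance >= target


# 19 fast requests and 1 slow one, measured against a 300 ms threshold.
durations = [100.0] * 19 + [900.0]
compliance = slo_compliance(durations, threshold_ms=300)
print(compliance, meets_slo(compliance, target=0.95), meets_slo(compliance, target=0.99))
```

The same measurements pass one target and fail a stricter one, which is why the target itself is a design decision that should be tied to what users notice.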

What is the difference between logs, metrics, and traces?

Logs are event records, metrics are numeric time series, and traces show how a request flows across services.
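
The difference shows up in the shape of the data. The sketch below uses a hypothetical in-memory `Telemetry` class (not a real library) to record one request three ways: as a log event, as a timestamped numeric sample, and as spans grouped under a shared trace ID.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    duration_ms: float


@dataclass
class Telemetry:
    logs: list = field(default_factory=list)     # event records
    metrics: dict = field(default_factory=dict)  # name -> [(timestamp, value), ...]
    traces: dict = field(default_factory=dict)   # trace_id -> [Span, ...]

    def log(self, trace_id, message):
        self.logs.append({"trace_id": trace_id, "message": message})

    def observe(self, name, timestamp, value):
        self.metrics.setdefault(name, []).append((timestamp, value))

    def span(self, trace_id, service, duration_ms):
        self.traces.setdefault(trace_id, []).append(Span(service, duration_ms))


t = Telemetry()
# Trace: the same request passing through two services.
t.span("trace-1", "gateway", 12.0)
t.span("trace-1", "orders", 40.0)
# Metric: one numeric sample in a time series.
t.observe("request_duration_ms", timestamp=0, value=52.0)
# Log: a discrete event, linked back to the trace by its ID.
t.log("trace-1", "order created")

print([s.service for s in t.traces["trace-1"]])
```

Carrying the trace ID in the log record is what lets you move from a slow trace to the events that explain it.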

Scenario. Latency spikes but errors stay low. What do you check next?

Saturation signals, downstream dependencies, and recent deploys. High latency often shows pressure before failure.
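
The triage order in this answer can be written down as a small checklist function. Everything here is an assumption made for illustration: the function name, its parameters, and the cut-offs (80% saturation, twice the baseline latency, 60 minutes since deploy) are placeholders that a real team would replace with its own values.

```python
def next_checks(saturation, dependency_p95_ms, dependency_baseline_ms,
                minutes_since_deploy):
    """Return findings worth investigating when latency is high but errors are low."""
    findings = []
    if saturation >= 0.8:  # assumed cut-off
        findings.append("saturation: a resource is near capacity")
    if dependency_p95_ms > 2 * dependency_baseline_ms:  # assumed cut-off
        findings.append("dependency: downstream is slower than its baseline")
    if minutes_since_deploy <= 60:  # assumed cut-off
        findings.append("deploy: recent change, review it and consider rollback")
    return findings


for finding in next_checks(saturation=0.92, dependency_p95_ms=150,
                           dependency_baseline_ms=120, minutes_since_deploy=15):
    print(finding)
```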

What is a runbook for?

A calm, repeatable response guide for common incidents, including containment and rollback steps.

Artefact and reflection

Artefact

A short module note with one key definition and one practical example

Reflection

Where in your work would being able to explain operations in your own words, and apply it to a realistic scenario, change a decision, and what evidence would make you trust that change?

Optional practice

SRE principles, golden signals, observability, and incident response