Foundations · Module 6
Operations
Run systems reliably in production with Google SRE principles, the four golden signals, the observability triad, and effective incident response.
Why this matters
Operations are part of the architecture.
What you will be able to do
- 1 Explain operations in your own words and apply it to a realistic scenario.
- 2 Explain why operations is feedback: signals should trigger action, not only fill dashboards.
- 3 Check the assumption "Signals reflect user impact" and explain what changes if it is false.
- 4 Check the assumption "Runbooks exist" and explain what changes if it is false.
Before you begin
- No previous technical background required
- Read the section explanation before using tools
Common ways people get this wrong
- Alert fatigue. Too many alerts teach people to ignore them.
- No learning loop. If incidents do not improve systems, you repeat them.
Operations are part of the architecture. We design for evidence, response, and learning, not only uptime.
Mental model
Operate with signals
Operations is feedback. Signals should trigger action, not only dashboards.
- 1 System
- 2 Signals
- 3 Alerts
- 4 Runbook
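The flow from system to signals to alerts to runbook can be sketched in a few lines of Python. This is a minimal illustration, not a real alerting tool: `AlertRule`, `evaluate`, the signal names, the threshold, and the runbook identifier are all hypothetical. The point it tries to show is that an alert carries the action to take, not only the number that triggered it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: a signal name, a threshold, and the runbook to follow."""
    signal: str
    threshold: float
    runbook: str  # hypothetical runbook identifier, not a real location


def evaluate(rule: AlertRule, signals: dict[str, float]) -> Optional[str]:
    """Return an actionable alert message if the signal breaches its threshold."""
    value = signals.get(rule.signal)
    if value is None or value <= rule.threshold:
        return None  # nothing to act on, so nobody is interrupted
    return f"{rule.signal}={value} exceeds {rule.threshold}; follow runbook {rule.runbook}"


# Illustrative values only.
rule = AlertRule(signal="error_rate", threshold=0.01, runbook="runbooks/high-error-rate")
alert = evaluate(rule, {"error_rate": 0.05, "latency_p95_ms": 180.0})
quiet = evaluate(rule, {"error_rate": 0.002})
print(alert)
```

In this sketch a rule without a runbook cannot be constructed, which mirrors the assumption that every alert should lead to a known response.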
Assumptions to keep in mind
- Signals reflect user impact. Signals that ignore user impact hide real failures.
- Runbooks exist. If you have no runbook, incidents become improvisation.
Failure modes to notice
- Alert fatigue. Too many alerts teach people to ignore them.
- No learning loop. If incidents do not improve systems, you repeat them.
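One common way to reduce alert fatigue is to page only when a breach persists, rather than on every brief spike. The sketch below assumes that approach; it is one option among several, and the function name `should_page`, the threshold, and the sample values are hypothetical.

```python
def should_page(samples: list[float], threshold: float, sustain: int) -> bool:
    """Page only if the most recent `sustain` samples all exceed the threshold."""
    if sustain <= 0 or len(samples) < sustain:
        return False  # not enough evidence yet
    return all(sample > threshold for sample in samples[-sustain:])


# Illustrative error-rate samples, oldest first.
brief_spike = [0.2, 0.9, 0.1]
sustained_breach = [0.2, 0.9, 0.8]

print(should_page(brief_spike, threshold=0.5, sustain=2))
print(should_page(sustained_breach, threshold=0.5, sustain=2))
```

The trade-off is detection delay: a larger `sustain` value means fewer noisy pages but a slower response to real incidents.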
Check yourself
Quick check. Operations
Name the four golden signals.
Latency, traffic, errors, and saturation.
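As a rough illustration, the four signals can be derived from a window of request records. The sketch below makes several assumptions: the `Request` type, treating status codes of 500 and above as errors, and measuring saturation as in-flight requests divided by a concurrency limit are illustrative choices, not a standard.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP-style status code


def golden_signals(requests: list[Request], window_s: float,
                   in_flight: int, capacity: int) -> dict[str, float]:
    """Summarise one observation window as latency, traffic, errors, saturation."""
    count = len(requests)
    failures = sum(1 for r in requests if r.status >= 500)  # assumed error definition
    return {
        "latency_median_ms": statistics.median(r.duration_ms for r in requests) if count else 0.0,
        "traffic_rps": count / window_s,
        "error_rate": failures / count if count else 0.0,
        "saturation": in_flight / capacity,  # assumed: share of concurrency limit in use
    }


# Illustrative window: four requests observed over two seconds.
window = [Request(100.0, 200), Request(200.0, 200), Request(300.0, 500), Request(400.0, 200)]
signals = golden_signals(window, window_s=2.0, in_flight=8, capacity=10)
print(signals)
```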
What is an SLO?
A service level objective: a measurable target for service behaviour that reflects user experience, such as p95 latency staying under a threshold or the error rate staying below a limit.
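To make the idea concrete, here is a minimal sketch that checks a window of requests against two SLO targets. The function names, the nearest-rank percentile method, and the target values are assumptions chosen for illustration; real monitoring systems may compute percentiles differently.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integer maths
    return ordered[rank - 1]


def check_slo(latencies_ms: list[float], failed: int,
              p95_target_ms: float, error_rate_target: float) -> dict[str, bool]:
    """Compare observed p95 latency and error rate with their targets."""
    p95 = percentile(latencies_ms, 95)
    error_rate = failed / len(latencies_ms)
    return {
        "latency_ok": p95 <= p95_target_ms,
        "errors_ok": error_rate <= error_rate_target,
    }


# Illustrative window: 20 requests, one slow and one failed.
latencies = [120.0] * 19 + [900.0]
result = check_slo(latencies, failed=1, p95_target_ms=300.0, error_rate_target=0.01)
print(result)
```

With these sample values the latency target is met, because the single slow request falls outside the 95th percentile, while the error-rate target is missed.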
What is the difference between logs, metrics, and traces?
Logs are event records, metrics are numeric time series, and traces show how a request flows across services.
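The three shapes can be seen side by side by recording one request in all three forms. This sketch uses only the standard library; the event names, service names, and the `handle_checkout` function are hypothetical, and the in-memory counter stands in for a value that a metrics system would sample over time.

```python
import json
import uuid
from collections import Counter

metrics: Counter = Counter()  # metric: a number that is aggregated and sampled over time
logs: list[str] = []          # log: discrete, timestampable event records
spans: list[dict] = []        # trace: spans that share one trace_id across services


def handle_checkout(trace_id: str) -> None:
    """Record one hypothetical checkout request as a metric, a log line, and spans."""
    metrics["checkout_requests_total"] += 1
    logs.append(json.dumps({"event": "checkout_started", "trace_id": trace_id}))
    spans.append({"trace_id": trace_id, "service": "web",
                  "name": "POST /checkout", "parent": None})
    spans.append({"trace_id": trace_id, "service": "payments",
                  "name": "charge_card", "parent": "POST /checkout"})


trace_id = uuid.uuid4().hex
handle_checkout(trace_id)

print(metrics["checkout_requests_total"])
print(logs[0])
print([span["service"] for span in spans])
```

Putting the trace identifier in the log line is a design choice that lets a responder move from a single log event to the full request path.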
Scenario. Latency spikes but errors stay low. What do you check next?
Saturation signals, downstream dependencies, and recent deploys. High latency often shows pressure before failure.
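The triage order in this answer can be written down as a small check. Everything specific here is an assumption for illustration: the `Snapshot` fields, the 80% utilisation threshold, the "twice the baseline" rule for dependencies, and the 30-minute deploy window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    cpu_utilisation: float               # 0.0 to 1.0
    dependency_p95_ms: dict[str, float]  # p95 latency per downstream dependency
    minutes_since_deploy: float


def latency_suspects(now: Snapshot, baseline: Snapshot) -> list[str]:
    """List likely causes of a latency spike, in the order they are checked."""
    suspects: list[str] = []
    if now.cpu_utilisation >= 0.8:  # assumed saturation threshold
        suspects.append("saturation")
    for name, p95 in now.dependency_p95_ms.items():
        # Assumed rule: a dependency is suspect if its p95 more than doubled.
        if p95 > 2 * baseline.dependency_p95_ms.get(name, p95):
            suspects.append(f"dependency:{name}")
    if now.minutes_since_deploy <= 30:  # assumed window for "recent"
        suspects.append("recent deploy")
    return suspects


# Illustrative values only.
baseline = Snapshot(0.40, {"db": 20.0, "cache": 2.0}, minutes_since_deploy=600.0)
current = Snapshot(0.92, {"db": 95.0, "cache": 2.1}, minutes_since_deploy=12.0)
suspects = latency_suspects(current, baseline)
print(suspects)
```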
What is a runbook for?
A calm, repeatable response guide for common incidents, including containment and rollback steps.
Artefact and reflection
Artefact
A short module note with one key definition and one practical example
Reflection
Where in your work would treating operations as part of the architecture change a decision, and what evidence would make you trust that change?
Optional practice
SRE principles, golden signals, observability, and incident response