Practice capstone: when dashboards look healthy and users disagree
By the end of this module you will be able to:
- Apply layered elimination to a realistic scenario in which infrastructure metrics appear healthy while users report degraded service
- Choose appropriate signals at each diagnostic step and explain why the chosen signal answers the question at that layer
- Write a concise, technically accurate diagnosis note that records observables, the layer of failure, and the evidence used to reach that conclusion

Real-world incident · June 2019
Google Cloud networking: external users down, internal monitoring says all clear
On June 2, 2019, Google experienced a significant networking incident affecting approximately 20 percent of its internet-facing capacity in the US. Several Google Cloud services, including YouTube, Google Drive, and Gmail, were degraded or unavailable for external users for approximately four hours. The root cause was a configuration change that inadvertently reduced the capacity of backbone traffic engineering systems.
The incident became a case study in the limitations of internal monitoring. Health checks and service monitors running on Google's internal network, which used internal routing paths unaffected by the capacity reduction, reported healthy results. External synthetic monitors, which tested connectivity from points outside Google's network, revealed the degradation. Teams had to reconcile conflicting signals: internal checks said pass, external users and external monitors said fail.
The diagnostic lesson is the same as the one that opened Module 18: a monitoring system that shares the failure domain of the service it monitors cannot reliably detect that failure. When dashboards and user reports disagree, both are data. The discipline is to understand which signals traverse which paths and reason about what each can and cannot see.
If every dashboard is green and every health check is passing, how do you investigate a user complaint that the service is unreachable?
21.1 The scenario: green dashboards, unhappy users
Module 20 closed the security half of this course. This capstone pulls together all three stages: the layered model from Foundations, the protocol behaviour from Applied, and the controls and signals from Practice-Strategy. The scenario is designed to require reasoning from all of them.
The scenario: it is 14:30 on a Tuesday. Your on-call phone rings. A product team reports that the checkout flow on your e-commerce platform is returning errors for a subset of users; roughly 30 percent of checkout attempts are failing with a 503 status. Your infrastructure dashboards show CPU utilisation at 18 percent across all application servers, network bandwidth at 12 percent of capacity, and no firewall drop alerts. The application server health checks are passing. The load balancer shows all backend instances as healthy.
This is not a resource exhaustion problem at the infrastructure layer. Something else is wrong. What you observe is a symptom. The question is: at which layer, and in which component, is the actual failure?
The dashboards are not lying. They are accurately reporting that CPU, bandwidth, and health checks are fine. The health checks are measuring what the load balancer can see. The load balancer cannot see what the application server is doing after it accepts a connection. Something beyond the health check boundary is failing.
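The health check boundary can be made concrete with a sketch (hypothetical Python; the function names and dependency list are illustrative, not part of the scenario's real stack). A port-level check confirms only that the server accepts connections; a dependency-aware check also probes the upstreams the request path needs, and would have failed in this scenario.

```python
import socket

def shallow_health_ok() -> bool:
    """What a load-balancer port check confirms: the process is up
    and accepting connections. Nothing more."""
    return True  # reached only if the server answered at all

def deep_health_ok(dependencies: list[tuple[str, int]],
                   timeout: float = 2.0) -> bool:
    """Also verify each upstream dependency completes a TCP handshake.
    Run from inside the application tier, this shares the application's
    network path, so it sees what the shallow check cannot."""
    for host, port in dependencies:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            return False
    return True
```

In the scenario, a deep check configured with `[("payment-service", 8080)]` would have started failing at 14:08, when the security group rule was removed, long before the first user complaint.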
21.2 Layered elimination: working down from the symptom
The symptom is a 503 response on checkout. HTTP 503 means "Service Unavailable": the server understood the request but cannot fulfil it right now, typically because a dependency is unavailable or a resource limit is reached. This is a Layer 7 signal. Start there, then work toward the cause.
Layer 7 question: What is the application server's error log saying? Application logs are the right signal here: they record the event with context. If the logs show "connection pool exhausted" or "upstream timeout from payment-service," the layer of failure is now identified as an application dependency. The application server itself is healthy; it cannot reach something it needs.
Assume the logs show: "upstream connect error or disconnect/reset before headers, upstream: payment-service:8080." The application is returning 503 because it cannot reach the payment service. This is no longer a question about the application server; it is a question about connectivity from the application tier to the payment service.
Layer 4 question: Can the application server establish a TCP connection to the payment service on port 8080? A simple telnet or nc (netcat) test from the application server to payment-service:8080 answers this. If TCP does not complete the handshake (connection refused, or timeout), the failure is at Layer 4 or below. If TCP connects but the payment service returns an error response, the failure is at Layer 7 within the payment service.
Assume TCP connection is refused. The application server cannot reach port 8080 on the payment service. The symptom has moved from the application tier to the connectivity between tiers.
21.3 Narrowing by routing, segmentation, and state
Connection refused means one of three things: the payment service process is not listening on port 8080; a firewall or security group is blocking the connection; or the IP address being dialled is incorrect. Check each in turn.
Is the payment service listening? On the payment service host, check which ports are listening: ss -tlnp or netstat -tlnp. If port 8080 is not in the output, the process is down or misconfigured. If it is listening, the issue is network-side.
Assume port 8080 is listening on the payment service. The process is running. The failure is in the network path between the application tier and the payment service. Check the segmentation: are these two services in the same network zone or in different zones separated by a firewall or security group? If they are in different zones, check the inter-zone firewall rules or security group rules for a rule permitting the application tier to reach the payment service on port 8080.
A security group audit reveals that a deployment one hour earlier changed the payment service's security group inbound rules. The rule permitting inbound TCP 8080 from the application tier security group was accidentally deleted. The payment service is running and healthy; the connection is being dropped by the security group before it reaches the service. Adding the rule back restores service within seconds.
“Network troubleshooting must be systematic and evidence-based. Each test should be designed to confirm or eliminate a specific hypothesis about where a failure is occurring.”
NIST SP 800-61 Rev. 2, Computer Security Incident Handling Guide, Section 3.2, Detection and Analysis
NIST SP 800-61 Rev. 2 (published August 2012) is NIST's guide to incident handling. Section 3.2 emphasises evidence-based hypothesis testing during analysis. The approach in this capstone, forming a hypothesis at each layer and testing it with a targeted observable, directly implements this principle.
21.4 Writing the diagnosis note
A diagnosis note is not an event log. It is a structured record that another engineer can read and validate. It covers: what was observed (the symptom and its scope); what was tested (the observables used at each step); what was found (the layer and component of failure); what was changed (the remediation action); and what confirmed resolution (the observable that showed the fix worked).
A well-written diagnosis note for this scenario would read: "At 14:30, approximately 30 percent of checkout requests returned 503. Application server logs showed upstream connection errors to payment-service:8080 from 14:12 onward. TCP connection test from app-server-1 to payment-service:8080 returned connection refused despite payment service listening on 8080 (confirmed via ss on the payment host). Security group audit identified that a deployment at 14:08 removed the inbound rule permitting TCP 8080 from app-tier-sg. Rule was restored at 14:52. Error rate returned to baseline within 90 seconds. No data loss confirmed; checkout transactions failing with 503 were not partially committed (payment service logs confirm no partial transactions)."
Notice what the note does not include: "the network was slow," "something went wrong with the deployment," or "it seems like maybe the firewall." Every statement in the note is an observable or a directly confirmed fact. This is what "technically disciplined under pressure" means.
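The five-part structure can be captured as a simple record. This is a sketch: the field names are ours, and the contents are taken from the scenario above.

```python
from dataclasses import dataclass

@dataclass
class DiagnosisNote:
    observed: str        # symptom and scope, with timestamps
    tested: list[str]    # observables checked at each step, in order
    found: str           # confirmed layer and component of failure
    changed: str         # remediation action taken
    confirmed: str       # observable that showed the fix worked

note = DiagnosisNote(
    observed="~30% of checkout requests returned 503 from 14:12",
    tested=[
        "app logs: upstream connect error to payment-service:8080",
        "TCP test app-server-1 -> payment-service:8080: connection refused",
        "ss -tlnp on payment host: port 8080 listening",
        "security group audit: inbound TCP 8080 rule removed at 14:08",
    ],
    found="security group blocking TCP 8080 from the app tier",
    changed="restored inbound TCP 8080 rule from app-tier-sg at 14:52",
    confirmed="error rate returned to baseline within 90 seconds",
)
```

Every field holds an observable or a confirmed fact; there is nowhere in the structure for "probably" or "it seems like," which is the point.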
Common misconception
“If CPU and bandwidth are healthy, there is no network problem.”
CPU and bandwidth measure resources consumed by traffic that is being processed. A firewall rule that drops connections before they reach the application server leaves CPU and bandwidth unaffected, because no processing occurs. Segmentation policy failures, routing problems, and DNS resolution failures all produce symptoms that look like application errors while infrastructure metrics remain healthy. Layered elimination must include connectivity tests, not just resource metrics.
At 09:15 your monitoring shows a 15 percent error rate on API requests. CPU is at 22 percent, bandwidth is at 8 percent of capacity. Application logs show 'DNS resolution failed for db.internal on 15% of requests.' What layer is failing, and what is your first diagnostic step?
You have fixed a production incident and are writing the diagnosis note. A colleague suggests adding: 'The firewall was probably misconfigured for a while before this was noticed.' Should this statement be included?
After restoring a security group rule, error rates return to baseline. You confirm the fix by watching the error rate metric for 90 seconds. A senior engineer asks: 'How do you know the partial transactions that failed during the incident did not leave the database in an inconsistent state?' Which type of signal do you check?
You are briefing management on the incident. They ask: 'Why did the health checks not catch this?' What is the accurate, concise explanation?
Key takeaways
- When dashboards show healthy and users report failures, both are data. Reconcile them by understanding which signals traverse which paths and what each can see.
- Layered elimination: identify the symptom's layer (HTTP 503), follow the error to its source (upstream connection failure), and test connectivity at each layer (TCP, DNS, routing, segmentation) until you reach the layer where the failure actually occurs.
- Connection refused with a listening process means a firewall, security group, or ACL is blocking the connection. Connection refused with no listener means the process is down.
- Health checks validate the boundary they are positioned at. A load balancer health check on port 443 confirms the application server is listening; it cannot confirm the application server can reach its dependencies.
- A diagnosis note records observables, the confirmed layer of failure, the remediation action, and the observable that confirmed resolution. It contains no speculation.
- TCP reliability ends at byte-stream delivery between endpoints. Application-level consistency (no partial transactions) requires application-level evidence, not network metrics.
Standards and sources cited in this module
NIST SP 800-61 Rev. 2, Computer Security Incident Handling Guide
Section 3.2, Detection and Analysis; Section 3.3, Containment, Eradication, and Recovery
The NIST incident handling framework (August 2012 revision). Section 3.2 defines the evidence-based analysis approach used in the diagnosis walkthrough, including the principle of testing specific hypotheses rather than conducting a general investigation.
Google Cloud Infrastructure Incident Report: 2 June 2019
Incident summary and contributing factors
Google's post-incident analysis of the June 2019 networking degradation used in the opening case study. Documents the conflict between internal health checks (reporting healthy) and external monitoring (reporting degradation) that motivates the module's central diagnostic challenge.
Site Reliability Engineering (the Google SRE book), Managing Incidents
Incident roles; Keeping a clear head; Communication
The SRE approach to incident management, including the discipline of recording observables rather than speculation. Directly supports section 21.4 on writing technically accurate diagnosis notes.
RFC 9293, Transmission Control Protocol (TCP)
Section 3.6, Closing a Connection; Section 3.4, Sequence Numbers
TCP's guarantee of reliable byte-stream delivery between endpoints is the basis for the Module 21 quiz question distinguishing network confirmation from application transaction semantics. Referenced here as the source of TCP's precise guarantee scope.
You have completed the full course. Return to the course overview for revision resources, the summary page, and the final assessment.
Module 21 of 21 · Practice-Strategy