Practice capstone: when dashboards look healthy and users disagree
By the end of this module you will be able to:
- Apply layered elimination to a realistic scenario in which infrastructure metrics appear healthy while users report degraded service
- Choose appropriate signals at each diagnostic step and explain why the chosen signal answers the question at that layer
- Write a concise, technically accurate diagnosis note that records observables, the layer of failure, and the evidence used to reach that conclusion

Real-world incident · June 2019
Google Cloud networking: external users down, internal monitoring says all clear
On June 2, 2019, Google experienced a significant networking incident affecting approximately 20 percent of its internet-facing capacity in the US. Several Google Cloud services, including YouTube, Google Drive, and Gmail, were degraded or unavailable for external users for approximately four hours. The root cause was a configuration change that inadvertently reduced the capacity of backbone traffic engineering systems.
The incident became a case study in the limitations of internal monitoring. Health checks and service monitors running on Google's internal network, which used internal routing paths unaffected by the capacity reduction, reported healthy results. External synthetic monitors, which tested connectivity from points outside Google's network, revealed the degradation. Teams had to reconcile conflicting signals: internal checks said pass, external users and external monitors said fail.
The diagnostic lesson is the same as the one that opened Module 18: a monitoring system that shares the failure domain of the service it monitors cannot reliably detect that failure. When dashboards and user reports disagree, both are data. The discipline is to understand which signals traverse which paths and reason about what each can and cannot see.
If every dashboard is green and every health check is passing, how do you investigate a user complaint that the service is unreachable?
21.1 The scenario: green dashboards, unhappy users
Module 20 closed the security half of this course. This capstone pulls together all three stages: the layered model from Foundations, the protocol behaviour from Applied, and the controls and signals from Practice-Strategy. The scenario is designed to require reasoning from all of them.
The scenario: it is 14:30 on a Tuesday. Your on-call phone rings. A product team reports that the checkout flow on your e-commerce platform is returning errors for a subset of users; roughly 30 percent of checkout attempts are failing with a 503 status. Your infrastructure dashboards show CPU utilisation at 18 percent across all application servers, network bandwidth at 12 percent of capacity, and no firewall drop alerts. The application server health checks are passing. The load balancer shows all backend instances as healthy.
This is not a resource exhaustion problem at the infrastructure layer. Something else is wrong. What you observe is a symptom. The question is: at which layer, and in which component, is the actual failure?
The dashboards are not lying. They are accurately reporting that CPU, bandwidth, and health checks are fine. The health checks are measuring what the load balancer can see. The load balancer cannot see what the application server is doing after it accepts a connection. Something beyond the health check boundary is failing.
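The health check boundary can be made concrete with a sketch (hypothetical Python; the function names and dependency list are illustrative, not part of the scenario's real stack). A port-level check confirms only that the server accepts connections; a dependency-aware check also probes the upstreams the request path needs, and would have failed in this scenario.

```python
import socket

def shallow_health_ok() -> bool:
    """What a load-balancer port check confirms: the process is up
    and accepting connections. Nothing more."""
    return True  # reached only if the server answered at all

def deep_health_ok(dependencies: list[tuple[str, int]],
                   timeout: float = 2.0) -> bool:
    """Also verify each upstream dependency completes a TCP handshake.
    Run from inside the application tier, this shares the application's
    network path, so it sees what the shallow check cannot."""
    for host, port in dependencies:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            return False
    return True
```

In the scenario, a deep check configured with `[("payment-service", 8080)]` would have started failing at 14:08, when the security group rule was removed, long before the first user complaint.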
21.2 Layered elimination: working down from the symptom
The symptom is a 503 response on checkout. HTTP 503 means "Service Unavailable": the server understood the request but cannot fulfil it right now, typically because a dependency is unavailable or a resource limit is reached. This is a Layer 7 signal. Start there, then work toward the cause.
Layer 7 question: What is the application server's error log saying? Application logs are the right signal here: they record the event with context. If the logs show "connection pool exhausted" or "upstream timeout from payment-service," the layer of failure is now identified as an application dependency. The application server itself is healthy; it cannot reach something it needs.
Assume the logs show: "upstream connect error or disconnect/reset before headers, upstream: payment-service:8080." The application is returning 503 because it cannot reach the payment service. This is no longer a question about the application server; it is a question about connectivity from the application tier to the payment service.
Layer 4 question: Can the application server establish a TCP connection to the payment service on port 8080? A simple telnet or nc (netcat) test from the application server to payment-service:8080 answers this. If TCP does not complete the handshake (connection refused, or timeout), the failure is at Layer 4 or below. If TCP connects but the payment service returns an error response, the failure is at Layer 7 within the payment service.
Assume TCP connection is refused. The application server cannot reach port 8080 on the payment service. The symptom has moved from the application tier to the connectivity between tiers.
21.3 Narrowing by routing, segmentation, and state
Connection refused means one of three things: the payment service process is not listening on port 8080; a firewall or security group is blocking the connection; or the IP address being dialled is incorrect. Check each in turn.
Is the payment service listening? On the payment service host, check which ports are listening: ss -tlnp or netstat -tlnp. If port 8080 is not in the output, the process is down or misconfigured. If it is listening, the issue is network-side.
Assume port 8080 is listening on the payment service. The process is running. The failure is in the network path between the application tier and the payment service. Check the segmentation: are these two services in the same network zone or in different zones separated by a firewall or security group? If they are in different zones, check the inter-zone firewall rules or security group rules for a rule permitting the application tier to reach the payment service on port 8080.
A security group audit reveals that a deployment one hour earlier changed the payment service's security group inbound rules. The rule permitting inbound TCP 8080 from the application tier security group was accidentally deleted. The payment service is running and healthy; the connection is being dropped by the security group before it reaches the service. Adding the rule back restores service within seconds.
“Network troubleshooting must be systematic and evidence-based. Each test should be designed to confirm or eliminate a specific hypothesis about where a failure is occurring.”
NIST SP 800-61 Rev. 2, Computer Security Incident Handling Guide, Section 3.2, Detection and Analysis
NIST SP 800-61 Rev. 2 (published August 2012) is NIST's guide to incident handling. Section 3.2 emphasises evidence-based hypothesis testing during analysis. The approach in this capstone, forming a hypothesis at each layer and testing it with a targeted observable, directly implements this principle.
21.4 Writing the diagnosis note
A diagnosis note is not an event log. It is a structured record that another engineer can read and validate. It covers: what was observed (the symptom and its scope); what was tested (the observables used at each step); what was found (the layer and component of failure); what was changed (the remediation action); and what confirmed resolution (the observable that showed the fix worked).
A well-written diagnosis note for this scenario would read: "At 14:30, approximately 30 percent of checkout requests returned 503. Application server logs showed upstream connection errors to payment-service:8080 from 14:12 onward. TCP connection test from app-server-1 to payment-service:8080 returned connection refused despite payment service listening on 8080 (confirmed via ss on the payment host). Security group audit identified that a deployment at 14:08 removed the inbound rule permitting TCP 8080 from app-tier-sg. Rule was restored at 14:52. Error rate returned to baseline within 90 seconds. No data loss confirmed; checkout transactions failing with 503 were not partially committed (payment service logs confirm no partial transactions)."
Notice what the note does not include: "the network was slow," "something went wrong with the deployment," or "it seems like maybe the firewall." Every statement in the note is an observable or a directly confirmed fact. This is what "technically disciplined under pressure" means.
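The five-part structure can be captured as a simple record. This is a sketch: the field names are ours, and the contents are taken from the scenario above.

```python
from dataclasses import dataclass

@dataclass
class DiagnosisNote:
    observed: str        # symptom and scope, with timestamps
    tested: list[str]    # observables checked at each step, in order
    found: str           # confirmed layer and component of failure
    changed: str         # remediation action taken
    confirmed: str       # observable that showed the fix worked

note = DiagnosisNote(
    observed="~30% of checkout requests returned 503 from 14:12",
    tested=[
        "app logs: upstream connect error to payment-service:8080",
        "TCP test app-server-1 -> payment-service:8080: connection refused",
        "ss -tlnp on payment host: port 8080 listening",
        "security group audit: inbound TCP 8080 rule removed at 14:08",
    ],
    found="security group blocking TCP 8080 from the app tier",
    changed="restored inbound TCP 8080 rule from app-tier-sg at 14:52",
    confirmed="error rate returned to baseline within 90 seconds",
)
```

Every field holds an observable or a confirmed fact; there is nowhere in the structure for "probably" or "it seems like," which is the point.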
Common misconception
“If CPU and bandwidth are healthy, there is no network problem.”
CPU and bandwidth measure resources consumed by traffic that is being processed. A firewall rule that drops connections before they reach the application server leaves CPU and bandwidth unaffected, because no processing occurs. Segmentation policy failures, routing problems, and DNS resolution failures all produce symptoms that look like application errors while infrastructure metrics remain healthy. Layered elimination must include connectivity tests, not just resource metrics.
At 09:15 your monitoring shows a 15 percent error rate on API requests. CPU is at 22 percent, bandwidth is at 8 percent of capacity. Application logs show 'DNS resolution failed for db.internal on 15% of requests.' What layer is failing, and what is your first diagnostic step?
You have fixed a production incident and are writing the diagnosis note. A colleague suggests adding: 'The firewall was probably misconfigured for a while before this was noticed.' Should this statement be included?
After restoring a security group rule, error rates return to baseline. You confirm the fix by watching the error rate metric for 90 seconds. A senior engineer asks: 'How do you know the partial transactions that failed during the incident did not leave the database in an inconsistent state?' Which type of signal do you check?
You are briefing management on the incident. They ask: 'Why did the health checks not catch this?' What is the accurate, concise explanation?
Key takeaways
- When dashboards show healthy and users report failures, both are data. Reconcile them by understanding which signals traverse which paths and what each can see.
- Layered elimination: identify the symptom's layer (HTTP 503), follow the error to its source (upstream connection failure), and test connectivity at each layer (TCP, DNS, routing, segmentation) until you reach the layer where the failure actually occurs.
- Connection refused with a listening process means a firewall, security group, or ACL is blocking the connection. Connection refused with no listener means the process is down.
- Health checks validate the boundary they are positioned at. A load balancer health check on port 443 confirms the application server is listening; it cannot confirm the application server can reach its dependencies.
- A diagnosis note records observables, the confirmed layer of failure, the remediation action, and the observable that confirmed resolution. It contains no speculation.
- TCP reliability ends at byte-stream delivery between endpoints. Application-level consistency (no partial transactions) requires application-level evidence, not network metrics.
Standards and sources cited in this module
NIST SP 800-61 Rev. 2, Computer Security Incident Handling Guide
Section 3.2, Detection and Analysis; Section 3.3, Containment, Eradication, and Recovery
The NIST incident handling framework (August 2012 revision). Section 3.2 defines the evidence-based analysis approach used in the diagnosis walkthrough, including the principle of testing specific hypotheses rather than conducting a general investigation.
Google Cloud Infrastructure Incident Report: 2 June 2019
Incident summary and contributing factors
Google's post-incident analysis of the June 2019 networking degradation used in the opening case study. Documents the conflict between internal health checks (reporting healthy) and external monitoring (reporting degradation) that motivates the module's central diagnostic challenge.
Site Reliability Engineering (the Google SRE book), Managing Incidents
Incident roles; Keeping a clear head; Communication
The SRE approach to incident management, including the discipline of recording observables rather than speculation. Directly supports section 21.4 on writing technically accurate diagnosis notes.
RFC 9293, Transmission Control Protocol (TCP)
Section 3.6, Closing a Connection; Section 3.4, Sequence Numbers
TCP's guarantee of reliable byte-stream delivery between endpoints is the basis for the Module 21 quiz question distinguishing network confirmation from application transaction semantics. Referenced here as the source of TCP's precise guarantee scope.
You have completed the full course. Return to the course overview for revision resources, the summary page, and the final assessment.
Module 21 of 21 · Practice-Strategy