Applied capstone: diagnose a slow website without guessing
By the end of this module you will be able to:
- Trace a complete multi-layer failure through DNS, TCP, TLS, and application behaviour
- Write a protocol-aware diagnosis note that names evidence, not just layer labels
- Identify weak explanations and reject them in favour of evidence-led conclusions
16.1 What this capstone tests
Modules 9 through 15 covered TCP, DNS, UDP, QUIC, routing, NAT, TLS, and a systematic troubleshooting method. This capstone applies all of them to a single scenario requiring layer-by-layer reasoning, protocol-aware language, and a four-part diagnosis note that could be handed to another engineer.
The scenario is deliberately realistic. Multiple symptoms are present. Some are misleading. The diagnosis requires ruling out plausible but wrong explanations and identifying the specific protocol failure with evidence.
16.2 The scenario
Context. Users in the London office report that checkout.company.com loads slowly or times out, but only since 09:00 today. The site works fine from the New York office. The development team says they deployed no code changes overnight. The infrastructure team says no server restarts occurred. You are the first person on-call to investigate.
Step 1: State the symptom precisely. Slow or timeout on checkout.company.com from London since 09:00. New York is unaffected. No application or server changes. The problem is environment-specific and time-bounded.
Step 2: Check DNS from the affected location. You run dig checkout.company.com @10.20.1.5 (the London DNS resolver) from a London server. The response comes back in 3800 ms. The same query against Google's 8.8.8.8 takes 45 ms. The A record returned is the same in both cases (93.184.216.50). DNS is resolving correctly but slowly from the London resolver.
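What dig actually sends is a small UDP packet in the DNS message format (RFC 1035 Section 4); the 3800 ms vs 45 ms comparison is the round-trip time of that packet against two resolvers. As an illustrative sketch, here is a minimal builder for the query packet, using only the standard library (the hostname and query ID are taken from the scenario; this is a teaching sketch, not a full DNS client):

```python
import struct

def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query (RFC 1035 Section 4).

    Timing the UDP round trip of this packet is what the 3800 ms vs
    45 ms comparison between resolvers measures.
    """
    # Header: ID, flags (0x0100 sets RD, requesting recursion),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN)
    question = qname + struct.pack(">HH", 1, 1)
    return header + question

packet = build_dns_query("checkout.company.com")
```

Sending this packet to 10.20.1.5 over UDP port 53 and timing the reply would reproduce the dig measurement without any external tool.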
Step 3: Test TCP connectivity. You run curl -v --connect-timeout 10 https://checkout.company.com from London. The DNS phase takes 3.8 seconds. After DNS resolves, TCP connects in 85 ms. TLS handshakes in 92 ms. The HTTP 200 response returns in 160 ms. Total: roughly 4.2 seconds, of which 3.8 seconds is DNS.
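The analytical move here is to break total latency into per-phase components and look at the dominant one. A minimal sketch, using the phase timings from this curl run (the phase names are illustrative labels, not curl output fields):

```python
def dominant_phase(timings_ms: dict) -> str:
    """Return the phase contributing the most latency."""
    return max(timings_ms, key=timings_ms.get)

# Phase timings from the London curl run in the scenario
timings = {
    "dns": 3800,           # resolution against 10.20.1.5
    "tcp_connect": 85,
    "tls_handshake": 92,
    "http_response": 160,
}
total_ms = sum(timings.values())   # ~4.1 s, dominated by DNS
```

Here `dominant_phase(timings)` returns "dns", and DNS accounts for over 90% of the total: one measurement points at the right function before any deeper investigation begins.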
Step 4: Isolate the DNS delay. The London resolver at 10.20.1.5 is adding 3.8 seconds to every lookup. You check whether this is specific to this domain. You run dig google.com @10.20.1.5 and get a 45 ms response (served from cache). You run dig checkout.company.com @10.20.1.5 a second time and get a 45 ms response. The TTL on the record is 300 seconds. The first lookup was slow; subsequent ones are fast. The resolver was not caching company.com records.
Step 5: Find the cache miss cause. You check the London resolver's configuration log. At 09:00, a resolver configuration update was deployed that flushed the DNS cache. Since 09:00, every first lookup for any company.com hostname has required a full recursive resolution from the authoritative server. The company's authoritative server is in the US. Each uncached lookup from London adds a full transatlantic round trip per delegation step, approximately 150 ms each. For a three-step resolution (root, .com TLD, authoritative), that is roughly 450 ms of round-trip time alone, before server processing, retries, or queueing delay.
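The cold-cache floor and the effect of the TTL on how often it is paid can both be made explicit with two lines of arithmetic. A sketch using the scenario's numbers (150 ms RTT and three delegation steps are the assumptions stated above):

```python
def cold_lookup_floor_ms(delegation_steps: int, rtt_ms: float) -> float:
    """Minimum resolution time for an uncached lookup: one round trip
    per delegation step, ignoring processing and retries."""
    return delegation_steps * rtt_ms

def cold_lookups_per_hour(ttl_seconds: int) -> float:
    """Upper bound on cold-cache lookups per hour for one record under
    steady traffic: one per TTL expiry."""
    return 3600 / ttl_seconds

# Three steps (root, .com TLD, authoritative) at ~150 ms transatlantic RTT
floor = cold_lookup_floor_ms(3, 150)        # 450 ms before any processing
# Raising the TTL from 300 s to 3600 s cuts expiries from 12/hour to 1/hour
before, after = cold_lookups_per_hour(300), cold_lookups_per_hour(3600)
```

The gap between the 450 ms floor and the measured 3.8 seconds suggests additional delay (retries, processing, or queueing) worth confirming in the follow-up, which is exactly what the "next test" in the diagnosis note is for.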
Step 6: Write the diagnosis note. Symptom: checkout.company.com loads slowly (4+ seconds) from London since 09:00. Other offices unaffected. First failed function: the London DNS resolver (10.20.1.5) is returning slow (3.8 second) responses for uncached company.com records after a 09:00 cache flush. DNS for other cached domains is fast (45 ms). TCP and TLS are normal once DNS resolves. Evidence: dig checkout.company.com @10.20.1.5 takes 3.8 seconds on first query, 45 ms on second query (TTL 300). The resolver cache was flushed at 09:00 per the configuration log. Next test: increase the TTL on company.com records from 300 seconds to 3600 seconds to reduce the frequency of cold-cache lookups, and confirm the resolver's configuration change was intentional.
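The four-part note can be treated as a fixed template so that no part is forgotten under pressure. A minimal sketch (the class and field names are my own, not part of the module's method; the field values are the scenario's):

```python
from dataclasses import dataclass

@dataclass
class DiagnosisNote:
    """Four-part diagnosis note, ready to hand to another engineer."""
    symptom: str
    first_failed_function: str
    evidence: str
    next_test: str

    def render(self) -> str:
        return (
            f"Symptom: {self.symptom}\n"
            f"First failed function: {self.first_failed_function}\n"
            f"Evidence: {self.evidence}\n"
            f"Next test: {self.next_test}"
        )

note = DiagnosisNote(
    symptom="checkout.company.com slow (4+ s) from London since 09:00",
    first_failed_function="London resolver (10.20.1.5) slow on uncached "
                          "company.com records after 09:00 cache flush",
    evidence="dig: 3.8 s first query, 45 ms second (TTL 300); "
             "cache flushed at 09:00 per configuration log",
    next_test="Raise company.com TTL 300 s -> 3600 s; confirm the "
              "resolver configuration change was intentional",
)
```

Rendering the note forces every field to be filled in; an empty `next_test` is a signal that the diagnosis is not finished.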
16.3 What the scenario demonstrates
Several plausible wrong explanations were available in this scenario. The TLS handshake could have been the problem (it was not; 92 ms is normal). The application server could have been slow (it was not; HTTP response was 160 ms). A routing change could have added latency (no; traceroute showed a normal path). The London firewall could have been blocking traffic (no; TCP connected fine).
Each of these would have been tested before reaching the correct answer, had the diagnosis started at the wrong layer. By starting at DNS and measuring resolution time, the 3.8 second anomaly was visible in the first test. One data point pointed at the right function. The remaining investigation confirmed it.
This is what good troubleshooting produces: a short path from symptom to evidence to conclusion, with discarded wrong explanations listed explicitly. Rejecting a weak explanation is not wasted work. It is part of the diagnosis.
The diagnosis is "London DNS resolver cache was flushed, causing 3.8-second cold lookups for company.com." Not "DNS is slow." Not "the network is acting up." One sentence, one specific mechanism, backed by one measurement.
16.4 A harder scenario: connection succeeds but responses are corrupted
A second scenario, for additional practice. Users report that an API endpoint returns garbled or truncated responses intermittently. The API is a JSON REST service over HTTPS.
What you observe. DNS resolution is normal. TCP connects successfully. TLS negotiates without error. The HTTP response arrives with a 200 status code. The response body is truncated: valid JSON begins, then stops mid-object. This happens on roughly 30% of requests.
The first failed function. This is not DNS (responses are correct). It is not the TCP connection (it established). It is not the TLS handshake (it completed). The HTTP response arrives but the body is wrong. The failure is at the application layer, specifically in how the response body is being delivered.
Narrowing with evidence. You capture traffic with tcpdump. You observe that the HTTP response arrives in two TCP segments. The first segment carries the headers and part of the body. The second segment is never delivered. The client sees a TCP FIN from the server before the body is complete. The server is closing the connection before sending the full response.
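One objective way to confirm truncation from the client side, independent of the packet capture, is to compare the declared Content-Length against the bytes actually received before the connection closed. A minimal sketch (the header dict and sample body are illustrative, not from a real capture):

```python
def is_truncated(headers: dict, body: bytes) -> bool:
    """A truncated response: the server declared a Content-Length but
    the connection closed (FIN) before that many body bytes arrived."""
    declared = headers.get("Content-Length")
    if declared is None:
        return False  # chunked or close-delimited: needs a different check
    return len(body) < int(declared)

# A 200 response whose JSON body stops mid-object, as in the scenario
headers = {"Content-Length": "2048"}
partial_body = b'{"orders": [{"id": 17, "status": "pa'
truncated = is_truncated(headers, partial_body)   # True
```

This check turns "the response looks garbled" into a measurable fact that can go straight into the evidence line of the diagnosis note.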
Investigation leads to the cause. The web server has a misconfigured request timeout: 500 ms. The application sometimes takes 600 ms to generate the response. When it does, the server's timeout fires and the connection is closed before the response body completes. The client receives a partial body and a premature TCP FIN.
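The observed ~30% failure rate follows directly from the fraction of requests whose generation time exceeds the 500 ms server timeout; this is the measurement the follow-up test asks for. A sketch with an illustrative sample of response times (the values are invented to match the scenario's rate, not real measurements):

```python
def truncation_rate(response_times_ms, server_timeout_ms):
    """Fraction of requests whose generation time exceeds the server's
    request timeout -- each of those is cut off mid-body."""
    slow = sum(1 for t in response_times_ms if t > server_timeout_ms)
    return slow / len(response_times_ms)

# Illustrative sample: 7 requests finish in time, 3 exceed the 500 ms timeout
samples = [420, 610, 380, 450, 640, 470, 390, 480, 590, 430]
rate = truncation_rate(samples, 500)   # 0.3, matching the observed ~30%
```

Measuring the real distribution and computing this rate confirms (or refutes) the timeout hypothesis before anyone changes a configuration value.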
The diagnosis note. Symptom: JSON API returns truncated responses on approximately 30% of requests. First failed function: the server closes the TCP connection before the response body completes; TCP FIN observed mid-body in packet capture. Evidence: tcpdump shows TCP FIN from server after partial body, HTTP status 200 arrives but response is incomplete. Next test: check the server's request timeout configuration and measure the distribution of application response times to confirm some requests exceed the timeout.
16.5 Common misconception
“The problem must be in the application code because the network is fine.”
“The network is fine” is a conclusion that requires evidence, not an assumption. In the first scenario above, the network connectivity was fine but the DNS resolver was misconfigured. In the second, TCP and TLS were fine but a timeout at the server caused connection termination that looked like a network issue. Every layer must be explicitly confirmed before being excluded.
16.6 What comes next
You have completed the Applied stage. You can now explain TCP, DNS, UDP, QUIC, routing, NAT, and TLS with standards-aligned language. You have a systematic method for diagnosing request failures, a set of tools matched to each layer, and a four-part format for communicating diagnosis to others.
The Practice and Strategy stage builds on this foundation. It covers how to place security controls at the layer where the risk actually forms, choose observability signals that explain behaviour rather than just recording it, and use segmentation and packet capture deliberately rather than as first guesses.
If any module in this Applied stage felt uncertain, revisit it before moving on. The Applied practice test (available from the course overview) covers all eight Applied modules. It will show which areas need a second pass. The Foundations stage practice test is also still available for the underlying vocabulary and protocol data units.
Key takeaways
- A slow website is a measurement problem before it is a configuration problem. Break down the time: DNS, TCP, TLS, application. The dominant component is where to look first.
- Packet capture is objective evidence. Server-side TCP FIN before the response completes is a server behaviour, independent of what a colleague can reproduce interactively.
- Reject weak explanations explicitly. 'The network is fine' is a conclusion requiring evidence. Each layer must be confirmed before being excluded.
- Escalate with a diagnosis note, not a symptom. The receiving team works faster with specific evidence and a clear next action than with a vague problem description.
Standards and sources cited in this module
RFC 9293, Transmission Control Protocol (TCP)
Section 3.6, Closing a Connection (FIN handling)
Defines TCP connection close behaviour. Referenced for the TCP FIN truncation scenario in Section 16.4.
RFC 1034 and RFC 1035, Domain Name System
RFC 1034 Section 3.7, Queries and responses; RFC 1035 Section 4, Messages
Referenced for the DNS resolution time analysis in the first scenario walkthrough.
CompTIA Network+ N10-009 Exam Objectives
Domain 5.0, Network Troubleshooting; Objective 5.5: Scenario-based troubleshooting
The capstone scenarios align with the scenario-based troubleshooting objective, testing the ability to isolate faults across layers with tool evidence.
Cisco CCNA 200-301 v1.1 Exam Topics
Section 6.0, Automation and Programmability; Troubleshooting in complex scenarios
Multi-step diagnosis scenarios align with CCNA's advanced troubleshooting requirements.
The Practice and Strategy stage begins with Module 17, which uses your layer knowledge to place security controls where the risk actually appears, not where it is convenient.
Module 16 of 21 · Applied stage complete