Module 15 of 21 · Applied

A repeatable method for troubleshooting requests


By the end of this module you will be able to:

  • Apply a layer-by-layer isolation method to any failing or slow request
  • Select the appropriate tool at each layer and interpret its output correctly
  • Write a short, actionable diagnosis note and know when to escalate

The critical skill

The difference between collecting symptoms and producing a diagnosis someone can act on

Troubleshooting without a method is symptom collection. You observe the failure, make a guess, try a fix, and check whether the symptom persists. If it does, you make a different guess. This works sometimes, by accident, but it is slow and destroys evidence. When the problem is intermittent or involves multiple people, it becomes actively harmful.

A method gives you the same information faster, with evidence that survives handoffs. Each step confirms one layer works or isolates one layer as the failure point. When you know which layer failed, you know which tool to use next and which team to escalate to. The network is not a black box. It is a series of independently testable functions.

Two engineers are both looking at the same failing service. One says 'the network is down.' The other says 'DNS resolves correctly, TCP to port 443 times out from VLAN 100 but not from VLAN 200, traceroute stops at the firewall at 10.0.0.1.' Which engineer has done diagnosis?

15.1 The method: isolate by layer

Every request goes through a predictable sequence of functions. DNS resolves a name to an IP address. The operating system opens a TCP connection (or QUIC connection for HTTP/3). TLS negotiates an encrypted channel. The application protocol (HTTP, SMTP, etc.) sends the actual request. Each step depends on the previous one.

When a request fails, the method is: identify the first function in the chain that did not succeed. Test that function specifically. If it fails, that is where the fault is. If it succeeds, move to the next function. Do not change anything until you have confirmed which function failed.

This sounds obvious. In practice, most failures to diagnose correctly happen because someone skipped to a layer they assumed was the problem, or changed two things at once and lost the ability to know which change made the difference.
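The isolation loop described above can be sketched in a few lines. This is a minimal illustration, not a diagnostic tool: the function name `first_failed_function` and the stand-in checks are hypothetical, and in real use each check would wrap one of the layer-specific tests from Section 15.2.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (name, test) pair; the test returns True when that function works.
Check = Tuple[str, Callable[[], bool]]


def first_failed_function(checks: List[Check]) -> Optional[str]:
    """Run checks in request-chain order and stop at the first failure.

    Returns the name of the first failed function, or None if all pass.
    Later checks are not run, because they depend on the failed one.
    """
    for name, test in checks:
        if not test():
            return name
    return None


# Stand-in checks for illustration only (hypothetical results).
chain: List[Check] = [
    ("DNS", lambda: True),
    ("TCP", lambda: False),
    ("TLS", lambda: True),
    ("HTTP", lambda: True),
]
failed = first_failed_function(chain)
```

Stopping at the first failure is the point of the design: a TLS or HTTP result gathered after TCP has already failed is noise, not evidence.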

15.2 Tools by layer

Each layer has specific tools that produce specific evidence. Using the wrong tool at the wrong layer produces noise, not signal.

Layers 1 and 2 (Physical and Data Link). Check physical connectivity, interface status, and error counters. On Linux, ip -s link show shows interface state along with packet and error counters, and ethtool eth0 shows the negotiated link speed and duplex (substitute your own interface name for eth0). On managed switches, check port statistics for CRC errors and input errors.

Layer 3 (Network). Use ping to test basic IP reachability. A successful ping proves the path to the IP address works at Layer 3. A failed ping may indicate a routing problem, a firewall blocking ICMP, or a host that is down, so a failed ping alone does not prove the host is unreachable. Use traceroute (or tracert on Windows) to trace the hop path. Traceroute shows where in the path packets stop, but be aware that ICMP rate limiting and asymmetric routing can make the picture incomplete.

Layer 7 (Application), DNS. Use dig domain.com or nslookup domain.com to query DNS resolution. Name a specific resolver to compare answers: dig domain.com @8.8.8.8 shows what Google's public resolver returns, which you can compare with the answer from your local resolver. Check for SERVFAIL, NXDOMAIN, or unexpected IP addresses.

Layer 4 (Transport). Use curl -v or telnet host port to test TCP connectivity to a specific port. A "Connection refused" means the host is reachable but nothing is listening on that port. A timeout means packets are being dropped without a reply: a firewall is blocking the port or the host is unreachable. A successful connection means TCP works; the problem (if any) is at a higher layer.

TLS layer. Use openssl s_client -connect host:443 to test TLS. The output shows the certificate chain, handshake result, cipher suite, and any TLS errors. Look for "Verify return code: 0 (ok)" for success and error messages like "certificate verify failed" or "handshake failure" for failures.

Application layer (HTTP). Use curl -v https://host/path for a complete end-to-end test. The verbose output shows DNS resolution, the TCP connection, TLS handshake details, HTTP headers, and the response code. A 200 means success. 4xx means a client-side request problem. 5xx means the server errored. A redirect loop shows as repeated 3xx responses.

Packet capture. Use tcpdump or Wireshark for full packet visibility when other tools are insufficient. Capture is intrusive and produces large amounts of data. Use it when you need to see exactly what is on the wire, for example to confirm retransmissions, zero-window events, or unexpected RST packets.
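The status-code reading described for curl -v above can be written down as a small lookup. This is a sketch for illustration; the function names `classify_status` and `has_redirect_loop` are hypothetical, and the loop check assumes you have already collected the sequence of URLs from the Location headers yourself.

```python
from typing import List


def classify_status(code: int) -> str:
    """Map an HTTP status code to the broad class used in this module."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client-side request problem"
    if 500 <= code < 600:
        return "server error"
    return "unexpected"


def has_redirect_loop(visited_urls: List[str]) -> bool:
    """Return True if any URL appears more than once in a redirect chain."""
    return len(set(visited_urls)) < len(visited_urls)
```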

15.3 Reading error messages correctly

Error messages from network tools are precise. "Connection refused" is not the same as "Connection timed out." "SERVFAIL" is not the same as "NXDOMAIN." Reading them accurately is part of the diagnosis.

Connection refused (TCP RST). The destination host is reachable. No service is listening on the target port, or a firewall on the host itself is actively refusing the connection. The host responded with a TCP RST.

Connection timed out. Packets are being dropped without a response. Either a firewall is silently dropping packets, the host is unreachable, or routing is blackholing traffic. No RST was received.
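The refused-versus-timed-out distinction is visible in code as two different exceptions. The sketch below uses Python's standard socket module; the function names `classify_connect_error` and `probe_tcp` and the result labels are illustrative choices, not part of any standard tool.

```python
import socket


def classify_connect_error(exc: BaseException) -> str:
    """Translate a connection exception into the vocabulary used above."""
    if isinstance(exc, socket.gaierror):
        # Name resolution failed before any TCP packet was sent.
        return "dns-failure"
    if isinstance(exc, ConnectionRefusedError):
        # A TCP RST came back: the host is reachable, the port is closed.
        return "refused"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        # No reply at all: packets were dropped somewhere on the path.
        return "timed-out"
    return "other"


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return classify_connect_error(exc)
```

A call such as probe_tcp("api.company.com", 443) from two different source hosts gives the kind of side-by-side evidence a diagnosis note needs.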

DNS SERVFAIL. The recursive resolver attempted to answer but encountered a failure, often because the authoritative server was unreachable or returned an error.

DNS NXDOMAIN. The authoritative server confirmed the domain does not exist. This is a definitive answer, not a failure to reach the server.
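dig reports the response code in the status field of its header line, so the SERVFAIL and NXDOMAIN cases can be told apart mechanically. The sketch below parses saved dig output; the function name `dig_status` is hypothetical and the sample header line is illustrative, not captured from a real query.

```python
import re
from typing import Optional

# dig prints a header such as ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: ..."
_STATUS_PATTERN = re.compile(r"status:\s*([A-Z]+)")

# Short readings of the statuses discussed in this section.
STATUS_MEANING = {
    "NOERROR": "query answered; check that the returned addresses are expected",
    "NXDOMAIN": "definitive answer: the name does not exist",
    "SERVFAIL": "the resolver could not complete the lookup",
}


def dig_status(dig_output: str) -> Optional[str]:
    """Extract the response status from dig output, or None if absent."""
    match = _STATUS_PATTERN.search(dig_output)
    return match.group(1) if match else None


# Illustrative header line only.
sample = ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4242"
status = dig_status(sample)
```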

TLS certificate verify failed. The certificate chain could not be validated. Common causes: missing intermediate certificate, expired certificate, hostname mismatch, or untrusted CA.
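A verification failure usually carries a reason string that points at one of the causes listed above. The sketch below uses Python's standard ssl module; the names `explain_verify_failure` and `check_tls` are hypothetical, and the substring table is an assumption, since the exact wording of these messages varies between OpenSSL versions.

```python
import socket
import ssl

# Assumed substrings of OpenSSL verification messages, mapped to the causes
# named in the text. Wording differs between OpenSSL releases.
_CAUSES = [
    ("certificate has expired", "expired certificate"),
    ("hostname mismatch", "hostname mismatch"),
    ("unable to get local issuer certificate",
     "missing intermediate certificate or untrusted CA"),
    ("self signed certificate", "untrusted CA (self-signed)"),
    ("self-signed certificate", "untrusted CA (self-signed)"),
]


def explain_verify_failure(message: str) -> str:
    """Map a verification message to a likely cause, or pass it through."""
    lowered = message.lower()
    for needle, cause in _CAUSES:
        if needle in lowered:
            return cause
    return "unclassified: " + message


def check_tls(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a verified TLS handshake and summarise the result.

    Connection-level errors (refused, timed out) are not caught here,
    because they belong to the transport layer, not to TLS.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        return explain_verify_failure(getattr(exc, "verify_message", str(exc)))
```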

15.4 Writing a diagnosis note

A diagnosis note has four parts. Symptom: what was observed, by whom, and when. First failed function: which layer failed, with the specific protocol and command output. One observable: the exact command and its relevant output. Next safe test: one specific action that would confirm or narrow the diagnosis.

Example. A developer reports an API endpoint is returning no response.

Symptom: POST https://api.company.com/orders returns no response after 30 seconds from the production application server (10.2.1.5).

First failed function: TCP connection to api.company.com:443 times out from 10.2.1.5. DNS resolves correctly (93.184.216.34). The failure is at Layer 4.

Observable: curl -v --connect-timeout 5 https://api.company.com from 10.2.1.5 returns "Connection timed out" after 5 seconds. The same command from the developer's laptop succeeds in 0.3 seconds.

Next safe test: Check firewall rules between VLAN 200 (where 10.2.1.5 lives) and the internet for outbound port 443. A recent change log entry may identify a new rule that dropped this path.

Anyone reading this note knows the symptom, where it fails, the evidence, and what to look at next. No expertise is required. No time is wasted checking the application code.
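If your team files notes in a ticketing system, the four-part structure can be enforced with a small template. This is one possible sketch; the class name `DiagnosisNote` and its field names are hypothetical, chosen to mirror the four parts described in this section.

```python
from dataclasses import dataclass


@dataclass
class DiagnosisNote:
    """The four parts of a diagnosis note, all required."""
    symptom: str
    first_failed_function: str
    observable: str
    next_safe_test: str

    def render(self) -> str:
        """Return the note as four labelled lines, in the standard order."""
        return "\n".join([
            f"Symptom: {self.symptom}",
            f"First failed function: {self.first_failed_function}",
            f"Observable: {self.observable}",
            f"Next safe test: {self.next_safe_test}",
        ])


# Values shortened from the worked example above.
note = DiagnosisNote(
    symptom="POST https://api.company.com/orders returns no response "
            "after 30 seconds from 10.2.1.5.",
    first_failed_function="TCP connection to api.company.com:443 times out "
                          "from 10.2.1.5 (Layer 4).",
    observable="curl -v --connect-timeout 5 https://api.company.com "
               "returns 'Connection timed out'.",
    next_safe_test="Check firewall rules for outbound port 443 from VLAN 200.",
)
text = note.render()
```

Because every field is required, a note cannot be created with the observable or the next test left out, which are the parts most often skipped under pressure.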

15.5 When to escalate

Escalate when you have confirmed which layer failed but do not have access to fix it. You have verified the fault is in the firewall configuration, but only the network team can change firewall rules. Escalate with the diagnosis note, not with the symptom. "Firewall between VLAN 200 and internet is blocking port 443, confirmed by traceroute and curl timeout" is an escalation. "The internet is down" is not.

Escalate when you have reached the limit of your visibility. You can see that traceroute stops at a specific hop but cannot tell whether that hop is a misconfigured router or a carrier network issue. The diagnosis note records what you can see and where the evidence ends.

Common misconception

Restart it and see if that fixes it.

Restarting a service without a diagnosis destroys the state evidence that would explain the fault. If the restart works, you have masked the problem and it will return. If it does not work, you have wasted time and changed the state in ways that may complicate further diagnosis. Restart after diagnosis, not instead of it. Use 'restart and observe' as a specific hypothesis test, not as a first guess.

15.6 Check your understanding

Users report a website is unreachable. You can ping the server IP but curl returns 'connection refused.' Which layer is the problem at?

You run 'dig api.example.com @8.8.8.8' and get a valid A record. You run 'dig api.example.com @10.1.1.5' (your company resolver) and get SERVFAIL. What does this tell you?

An engineer suspects a TLS issue. Which command provides the most useful evidence for TLS-specific diagnosis?

Key takeaways

  • The method: identify the first failed function in the request chain, test it specifically, confirm the fault before moving or changing anything.
  • Use tools at the layer of the suspected failure: ping and traceroute for Layer 3, curl and telnet for Layer 4, openssl s_client for TLS, dig for DNS, curl -v for the full stack.
  • Read error messages precisely. 'Connection refused' (RST) is different from 'connection timed out' (dropped). 'SERVFAIL' is different from 'NXDOMAIN'.
  • A good diagnosis note has four parts: symptom, first failed function with layer and evidence, one observable (command plus output), and one next safe test.

Standards and sources cited in this module

  1. RFC 1122, Requirements for Internet Hosts: Communication Layers

    Section 3.2, Applications on Internetworks

    The requirements for host networking behaviour underpin the layer-by-layer isolation method. Referenced for the protocol stack that tools must test.

  2. CompTIA Network+ N10-009 Exam Objectives

    Domain 5.0, Network Troubleshooting; Objectives 5.2 and 5.3: Tools and methodology

    The troubleshooting domain specifically tests systematic methodology and tool selection. This module aligns directly with these objectives.

  3. Cisco CCNA 200-301 v1.1 Exam Topics

    Section 6.0, Automation and Programmability; Troubleshooting network issues

    The CCNA troubleshooting objective includes ping, traceroute, and systematic isolation. The tools in Section 15.2 align with CCNA lab objectives.

You now have a method. Module 16, the Applied capstone, tests it end-to-end with multi-step scenarios where you diagnose a slow website without guessing, using evidence at each step.
