Which observability signals actually help
By the end of this module you will be able to:
- Distinguish logs, metrics, and distributed traces as signal types with different strengths and appropriate use cases
- Explain when to use SNMP polling, NetFlow/sFlow export, or application-level telemetry for a given network question
- Describe how OpenTelemetry (OTel) unifies signal collection across infrastructure and applications
- State why adding more signals without reducing noise typically worsens incident response rather than improving it

Real-world incident · October 2021
Facebook's six-hour outage: when the monitoring tools vanish with the network
On October 4, 2021, a routine maintenance command at Facebook took down the company's backbone network; Facebook's DNS servers, detecting the unhealthy network, withdrew all of Facebook's BGP (Border Gateway Protocol) route announcements from the global internet. The domains facebook.com, instagram.com, and whatsapp.com stopped resolving for approximately six hours, affecting roughly 3.5 billion users.
The diagnosis took longer than the fix. Facebook's internal monitoring and communication tools, including the dashboards and alerting systems engineers relied on, were hosted on the same infrastructure that had just become unreachable from the internet. Engineers physically travelled to data centres because remote access via VPN also traversed the broken paths. External synthetic monitors flagged the issue within minutes, but the internal tooling that would explain the cause was inaccessible.
The incident illustrated a critical property of observability that is easy to overlook: the signal pipeline itself must remain available during the incident it is meant to diagnose. Signals hosted on the same infrastructure as the affected service will fail at the same time as the service. Organisations that recover fastest from such incidents maintain a mix of internal and external signal sources, with alerting systems that run on separate, independently routed infrastructure.
Facebook engineers had metrics, logs, and dashboards. Why did it take hours to diagnose a BGP misconfiguration that removed the company from the internet?
18.1 The three pillars: logs, metrics, and traces
Module 17 placed security controls at the layer where threats form. This module applies the same discipline to observability: choosing signals based on the specific question you are trying to answer. The three foundational signal types are logs, metrics, and distributed traces, and they are not interchangeable.
A log is a timestamped record of a discrete event: a packet was dropped, a connection was refused, a user authenticated. Logs are high in detail and low in aggregation. They answer "what happened and when?" but become expensive to query at scale. A busy firewall can produce millions of log entries per hour; searching them during an incident under time pressure is difficult without pre-built indexes.
A metric is a numeric measurement sampled over time: packets per second, CPU utilisation, connection table fill percentage, error rate. Metrics are low in detail and high in aggregation. They answer "is this number changing?" and can be graphed and alerted on efficiently. The cost is loss of context: a spike in error rate tells you something is wrong but not which specific request failed or why.
A distributed trace follows a single request across multiple services, recording timing and outcome at each hop. Traces answer "where did this specific request spend its time?" They are essential for latency diagnosis in distributed systems but require instrumentation at every service that touches the request.
Logs tell you what happened. Metrics tell you how often and how fast. Traces tell you where a specific request went and how long each hop took. Choosing the wrong signal type does not mean the answer is unavailable; it means you are paying a higher cost to find it.
18.2 Network-specific signals: SNMP, NetFlow, and sFlow
Network devices produce their own distinct signal types. The Simple Network Management Protocol (SNMP) is the oldest and most widely supported. An SNMP manager polls managed devices at regular intervals, reading counters from the Management Information Base (MIB): interface utilisation, error counts, neighbour state, hardware status. SNMP traps provide the reverse direction: a device sends an unsolicited notification when a threshold is crossed.
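Turning raw SNMP counters into something usable involves one subtlety worth seeing: ifInOctets (IF-MIB) is a 32-bit counter that wraps at 2^32, so the delta between two polls must account for wrap-around. The sketch below shows the arithmetic only, with invented sample values; a real deployment would read the counters via an SNMP library rather than hard-coding them:

```python
# Deriving interface utilisation from two SNMP ifInOctets polls.
# ifInOctets is a Counter32: it wraps to zero at 2^32 on busy links.

COUNTER32_MAX = 2**32

def octet_rate(sample_a, sample_b, interval_s):
    """Bytes/sec between two Counter32 samples, handling one wrap."""
    delta = sample_b - sample_a
    if delta < 0:                  # counter wrapped between polls
        delta += COUNTER32_MAX
    return delta / interval_s

def utilisation_pct(rate_bytes_s, link_bps):
    """Convert a byte rate to percent utilisation of a link in bits/s."""
    return 100.0 * (rate_bytes_s * 8) / link_bps

# Two polls 300 s apart on a 1 Gbit/s link; the counter wrapped past 2^32:
rate = octet_rate(3_200_000_000, 410_000_000, 300)
print(round(utilisation_pct(rate, 1_000_000_000), 1))  # 4.0
```

On links faster than roughly 100 Mbit/s a Counter32 can wrap within a normal polling interval, which is why high-speed interfaces expose 64-bit ifHCInOctets counters instead.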
NetFlow is a flow export protocol developed by Cisco and standardised in RFC 3954. A network device observes traffic passing through it and produces flow records: source IP, destination IP, source port, destination port, protocol, byte count, and packet count for each distinct flow. NetFlow records are exported to a collector where they can be queried to answer questions like "which destinations is this server talking to?" or "what is consuming the most bandwidth on this link?" NetFlow does not capture packet content; it captures conversation metadata.
sFlow (RFC 3176) takes a different approach: it samples a fraction of packets (for example, one packet in every 1,000) and exports the sampled headers. This scales to higher link speeds where counting every flow would overwhelm the device, trading exact counts for statistical estimates. Both NetFlow and sFlow are more useful than SNMP for traffic pattern analysis because they show per-flow behaviour, not just interface totals.
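The kind of query a flow collector answers can be sketched directly. The records below are invented, with field names mirroring the 5-tuple plus byte count described above; the final lines show the scaling step that converts sFlow's sampled bytes into an estimate of true volume:

```python
from collections import defaultdict

# Answering "which destinations consume the most bandwidth?" from
# exported flow records (invented data for illustration).
flows = [
    {"src": "10.0.1.5", "dst": "203.0.113.7",  "proto": 6,  "bytes": 9_100_000},
    {"src": "10.0.1.5", "dst": "198.51.100.2", "proto": 6,  "bytes": 640_000},
    {"src": "10.0.2.9", "dst": "203.0.113.7",  "proto": 17, "bytes": 2_400_000},
]

bytes_by_dst = defaultdict(int)
for f in flows:
    bytes_by_dst[f["dst"]] += f["bytes"]

top = max(bytes_by_dst, key=bytes_by_dst.get)
print(top, bytes_by_dst[top])  # 203.0.113.7 11500000

# sFlow exports sampled packets, not complete flows: scale observed
# bytes by the sampling rate to estimate true volume. The result is a
# statistical estimate, not an exact count.
SAMPLING_RATE = 1000           # one packet in every 1,000 sampled
sampled_bytes = 14_200
estimated_bytes = sampled_bytes * SAMPLING_RATE
```

Note that SNMP interface counters could confirm the link is busy, but only per-flow data like this can say which conversation is responsible.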
“This document describes the NetFlow export format, version 9. The export format records information about flows observed at a network device.”
RFC 3954 - Section 1, Introduction
RFC 3954 standardised Cisco's NetFlow v9 format, which became the basis for IETF IPFIX (RFC 7011). The flow record structure defined here is what network analysis tools such as ntopng, Elastic, and Splunk ingest when they receive flow data from routers and switches.
18.3 OpenTelemetry: one collection framework for all three signals
OpenTelemetry (OTel) is a Cloud Native Computing Foundation (CNCF) project that defines a vendor-neutral API, SDK, and wire protocol for collecting logs, metrics, and traces. Before OTel, instrumenting an application meant integrating separate libraries for each signal type, each with its own format and export destination. Switching observability backends required re-instrumenting the application.
The OTel Collector is an agent that can receive telemetry in multiple formats (including Prometheus metrics, Jaeger traces, and OpenCensus data), process it, and export it to any supported backend. This means the application code calls OTel APIs; the routing decision (which backend receives the data) is a configuration decision, not a code change. For network engineers, the OTel semantic conventions define how network-related attributes (peer address, transport protocol, DNS query time) should be recorded in traces and metrics, making cross-service correlation possible.
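The "routing as configuration" point is easiest to see in a Collector config. The fragment below is a minimal sketch using standard component names (the otlp and prometheus receivers, the batch processor, the otlphttp exporter); the endpoint and scrape target are placeholders, and a real deployment would add authentication and resource limits:

```yaml
receivers:
  otlp:                 # applications push OTLP traces/metrics here
    protocols:
      grpc:
      http:
  prometheus:           # the Collector can also scrape Prometheus targets
    config:
      scrape_configs:
        - job_name: edge-routers
          static_configs:
            - targets: ["10.0.0.1:9100"]   # placeholder target

processors:
  batch:                # batch telemetry before export

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [otlphttp]
```

Swapping the backend means editing the exporters section and redeploying the Collector; application code that emits via the OTel API is untouched.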
18.4 Alert fatigue and choosing the right signal
Alert fatigue occurs when a system produces so many alerts that engineers stop treating them as meaningful signals. The typical cause is not a lack of metrics but a surplus of metrics without clear thresholds, combined with alerting on symptoms that have no direct remediation path. A CPU alert that fires every day because the system is working normally at high load is noise; it trains engineers to ignore alerts, which is exactly the condition that causes real incidents to go undetected.
The correct discipline is to start with the user-facing question: "Is the service responding within the agreed time?" Then trace backwards: which intermediate signal predicts or explains a user-visible failure? Alert on that signal, with a threshold that demands a human response when crossed. Infrastructure signals (disk, CPU) should be recorded as metrics for diagnosis but should not trigger alerts unless they sit on a direct causal path to a user-visible failure.
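The discipline above reduces to a small amount of logic: page on the user-facing symptom, record everything else for diagnosis. The sketch below is illustrative only; the threshold and metric names are invented, not from any alerting product:

```python
# Symptom-based paging: a human is paged only when the user-facing
# agreement is breached, never on infrastructure signals alone.

AGREED_P99_MS = 1000  # "responding within the agreed time" (assumed SLO)

def should_page(p99_latency_ms):
    """Page only when users are measurably affected."""
    return p99_latency_ms > AGREED_P99_MS

# Infrastructure signals: stored as metrics for diagnosis once a page
# fires, but crossing these values does not by itself page anyone.
diagnostic_metrics = {"cpu_pct": 91.0, "disk_pct": 78.0}

print(should_page(1450))  # True: users are affected, page a human
print(should_page(620))   # False: CPU at 91% alone stays a dashboard line
```

The same structure generalises: each alert rule should name the user-visible failure it predicts and the remediation it demands; a rule that can name neither is a candidate for deletion, not tuning.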
The Facebook outage also illustrated signal resilience: signals that run on the same infrastructure as the service they observe will fail alongside that service. External synthetic monitoring (scripted requests from independent vantage points) can detect user-facing failures even when all internal tooling is dark. Both are necessary.
Common misconception
“More metrics means better observability.”
More metrics without clear purpose increases noise and accelerates alert fatigue. Effective observability starts with the user-facing question you need to answer, then identifies the minimum set of signals that reliably predict or explain failures at that level. Adding metrics that are never actioned, or that alert without a clear remediation path, degrades incident response rather than improving it.
Your network team reports that bandwidth on an internet-facing link has been unusually high for two days. Which signal type would most efficiently tell you which destinations are consuming the bandwidth?
A distributed web service is intermittently slow. Error rates are normal and infrastructure metrics are healthy. Users report that specific pages take 8-12 seconds to load instead of under 1 second. Which signal type is most likely to reveal where the latency is being added?
An operations team has 47 different alert rules configured. Engineers acknowledge most alerts without investigating because so many fire during normal business hours. What is the primary cause, and what should be changed?
Key takeaways
- Logs, metrics, and distributed traces answer different questions. Choose the signal type based on the question, not on what is easiest to collect.
- SNMP gives interface and device-level counters. NetFlow gives per-flow conversation metadata. sFlow gives sampled packet headers at scale. Each suits a different network question.
- RFC 3954 standardised NetFlow v9; IETF IPFIX (RFC 7011) extended it. These are the formats network analysis tools ingest from routers and switches.
- OpenTelemetry provides a vendor-neutral API and SDK for collecting logs, metrics, and traces without re-instrumenting when you change backends.
- Alert fatigue is caused by alerts that fire without requiring human action. Fix it by auditing each alert against user-visible failure and remediation path, not by adding more tooling.
- Signal pipeline resilience matters: internal monitoring that runs on the same infrastructure as the service it observes will fail during the same incident. External synthetic monitoring provides independent visibility.
Standards and sources cited in this module
RFC 3954, Cisco Systems NetFlow Services Export Version 9
Section 1, Introduction; Section 3, Template FlowSet Format
The authoritative specification for NetFlow v9 format referenced in section 18.2. Defines the flow record structure, template mechanism, and export protocol that network analysis tools implement.
RFC 3176, InMon Corporation's sFlow: A Method for Monitoring Traffic in Switched and Routed Networks
Section 2, Overview; Section 3, sFlow Datagram
The specification for sFlow sampled flow monitoring referenced in section 18.2. Explains the sampling model that makes sFlow applicable to high-speed links where per-flow counting is impractical.
OpenTelemetry Specification, Semantic Conventions for Networking
Semantic Conventions: Network, Transport
Defines how network attributes (peer address, transport, DNS query duration) are recorded in OTel traces and metrics. Referenced in section 18.3 on cross-service correlation.
Facebook Engineering Blog: More details about the October 4 outage
Published October 4, 2021
Facebook's post-incident analysis of the BGP withdrawal that caused the six-hour global outage. Confirms that internal tooling ran on the affected infrastructure and that physical access was required to diagnose the issue. Used as the opening case study.
The Four Golden Signals; Symptoms Versus Causes
The Google SRE guidance on alert design, specifically the principle that alerts should be tied to user-facing symptoms with clear remediation paths. Supports the alert fatigue discussion in section 18.4.
Signals narrow the explanation space. Module 19 teaches how to use the sharpest signal of all, packet captures, safely and deliberately: scoping collection, writing filters, and respecting data minimisation.
Module 18 of 21 · Practice-Strategy