Which observability signals actually help
By the end of this module you will be able to:
- Distinguish logs, metrics, and distributed traces as signal types with different strengths and appropriate use cases
- Explain when to use SNMP polling, NetFlow/sFlow export, or application-level telemetry for a given network question
- Describe how OpenTelemetry (OTel) unifies signal collection across infrastructure and applications
- State why adding more signals without reducing noise typically worsens incident response rather than improving it

Real-world incident · October 2021
Facebook's six-hour outage: when the monitoring tools vanish with the network
On October 4, 2021, a routine maintenance command at Facebook took down the company's backbone network; Facebook's DNS servers, detecting the unhealthy network, withdrew all of Facebook's BGP (Border Gateway Protocol) route announcements from the global internet. The domains facebook.com, instagram.com, and whatsapp.com stopped resolving for approximately six hours, affecting roughly 3.5 billion users.
The diagnosis took longer than the fix. Facebook's internal monitoring and communication tools, including the dashboards and alerting systems engineers relied on, were hosted on the same infrastructure that had just become unreachable from the internet. Engineers physically travelled to data centres because remote access via VPN also traversed the broken paths. External synthetic monitors flagged the issue within minutes, but the internal tooling that would explain the cause was inaccessible.
The incident illustrated a critical property of observability that is easy to overlook: the signal pipeline itself must remain available during the incident it is meant to diagnose. Signals hosted on the same infrastructure as the affected service will fail at the same time as the service. Organisations that recover fastest from such incidents maintain a mix of internal and external signal sources, with alerting systems that run on separate, independently routed infrastructure.
Facebook engineers had metrics, logs, and dashboards. Why did it take hours to diagnose a BGP misconfiguration that removed the company from the internet?
18.1 The three pillars: logs, metrics, and traces
Module 17 placed security controls at the layer where threats form. This module applies the same discipline to observability: choosing signals based on the specific question you are trying to answer. The three foundational signal types are logs, metrics, and distributed traces, and they are not interchangeable.
A log is a timestamped record of a discrete event: a packet was dropped, a connection was refused, a user authenticated. Logs are high in detail and low in aggregation. They answer "what happened and when?" but become expensive to query at scale. A busy firewall can produce millions of log entries per hour; searching them during an incident under time pressure is difficult without pre-built indexes.
A metric is a numeric measurement sampled over time: packets per second, CPU utilisation, connection table fill percentage, error rate. Metrics are low in detail and high in aggregation. They answer "is this number changing?" and can be graphed and alerted on efficiently. The cost is loss of context: a spike in error rate tells you something is wrong but not which specific request failed or why.
A distributed trace follows a single request across multiple services, recording timing and outcome at each hop. Traces answer "where did this specific request spend its time?" They are essential for latency diagnosis in distributed systems but require instrumentation at every service that touches the request.
Logs tell you what happened. Metrics tell you how often and how fast. Traces tell you where a specific request went and how long each hop took. Choosing the wrong signal type does not mean the answer is unavailable; it means you are paying a higher cost to find it.
18.2 Network-specific signals: SNMP, NetFlow, and sFlow
Network devices produce their own distinct signal types. The Simple Network Management Protocol (SNMP) is the oldest and most widely supported. An SNMP manager polls managed devices at regular intervals, reading counters from the Management Information Base (MIB): interface utilisation, error counts, neighbour state, hardware status. SNMP traps provide the reverse direction: a device sends an unsolicited notification when a threshold is crossed.
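Turning raw SNMP counters into something usable involves one subtlety worth seeing: ifInOctets (IF-MIB) is a 32-bit counter that wraps at 2^32, so the delta between two polls must account for wrap-around. The sketch below shows the arithmetic only, with invented sample values; a real deployment would read the counters via an SNMP library rather than hard-coding them:

```python
# Deriving interface utilisation from two SNMP ifInOctets polls.
# ifInOctets is a Counter32: it wraps to zero at 2^32 on busy links.

COUNTER32_MAX = 2**32

def octet_rate(sample_a, sample_b, interval_s):
    """Bytes/sec between two Counter32 samples, handling one wrap."""
    delta = sample_b - sample_a
    if delta < 0:                  # counter wrapped between polls
        delta += COUNTER32_MAX
    return delta / interval_s

def utilisation_pct(rate_bytes_s, link_bps):
    """Convert a byte rate to percent utilisation of a link in bits/s."""
    return 100.0 * (rate_bytes_s * 8) / link_bps

# Two polls 300 s apart on a 1 Gbit/s link; the counter wrapped past 2^32:
rate = octet_rate(3_200_000_000, 410_000_000, 300)
print(round(utilisation_pct(rate, 1_000_000_000), 1))  # 4.0
```

On links faster than roughly 100 Mbit/s a Counter32 can wrap within a normal polling interval, which is why high-speed interfaces expose 64-bit ifHCInOctets counters instead.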
NetFlow is a flow export protocol developed by Cisco and standardised in RFC 3954. A network device observes traffic passing through it and produces flow records: source IP, destination IP, source port, destination port, protocol, byte count, and packet count for each distinct flow. NetFlow records are exported to a collector where they can be queried to answer questions like "which destinations is this server talking to?" or "what is consuming the most bandwidth on this link?" NetFlow does not capture packet content; it captures conversation metadata.
sFlow (RFC 3176) takes a different approach: it samples a fraction of packets (for example, one packet in every 1,000) and exports the sampled headers. This scales to higher link speeds where counting every flow would overwhelm the device, trading exact counts for statistical estimates. Both NetFlow and sFlow are more useful than SNMP for traffic pattern analysis because they show per-flow behaviour, not just interface totals.
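The kind of query a flow collector answers can be sketched directly. The records below are invented, with field names mirroring the 5-tuple plus byte count described above; the final lines show the scaling step that converts sFlow's sampled bytes into an estimate of true volume:

```python
from collections import defaultdict

# Answering "which destinations consume the most bandwidth?" from
# exported flow records (invented data for illustration).
flows = [
    {"src": "10.0.1.5", "dst": "203.0.113.7",  "proto": 6,  "bytes": 9_100_000},
    {"src": "10.0.1.5", "dst": "198.51.100.2", "proto": 6,  "bytes": 640_000},
    {"src": "10.0.2.9", "dst": "203.0.113.7",  "proto": 17, "bytes": 2_400_000},
]

bytes_by_dst = defaultdict(int)
for f in flows:
    bytes_by_dst[f["dst"]] += f["bytes"]

top = max(bytes_by_dst, key=bytes_by_dst.get)
print(top, bytes_by_dst[top])  # 203.0.113.7 11500000

# sFlow exports sampled packets, not complete flows: scale observed
# bytes by the sampling rate to estimate true volume. The result is a
# statistical estimate, not an exact count.
SAMPLING_RATE = 1000           # one packet in every 1,000 sampled
sampled_bytes = 14_200
estimated_bytes = sampled_bytes * SAMPLING_RATE
```

Note that SNMP interface counters could confirm the link is busy, but only per-flow data like this can say which conversation is responsible.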
“This document describes the NetFlow export format, version 9. The export format records information about flows observed at a network device.”
RFC 3954 - Section 1, Introduction
RFC 3954 standardised Cisco's NetFlow v9 format, which became the basis for IETF IPFIX (RFC 7011). The flow record structure defined here is what network analysis tools such as ntopng, Elastic, and Splunk ingest when they receive flow data from routers and switches.
18.3 OpenTelemetry: one collection framework for all three signals
OpenTelemetry (OTel) is a Cloud Native Computing Foundation (CNCF) project that defines a vendor-neutral API, SDK, and wire protocol for collecting logs, metrics, and traces. Before OTel, instrumenting an application meant integrating separate libraries for each signal type, each with its own format and export destination. Switching observability backends required re-instrumenting the application.
The OTel Collector is an agent that can receive telemetry in multiple formats (including Prometheus metrics, Jaeger traces, and OpenCensus data), process it, and export it to any supported backend. This means the application code calls OTel APIs; the routing decision (which backend receives the data) is a configuration decision, not a code change. For network engineers, the OTel semantic conventions define how network-related attributes (peer address, transport protocol, DNS query time) should be recorded in traces and metrics, making cross-service correlation possible.
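The "routing as configuration" point is easiest to see in a Collector config. The fragment below is a minimal sketch using standard component names (the otlp and prometheus receivers, the batch processor, the otlphttp exporter); the endpoint and scrape target are placeholders, and a real deployment would add authentication and resource limits:

```yaml
receivers:
  otlp:                 # applications push OTLP traces/metrics here
    protocols:
      grpc:
      http:
  prometheus:           # the Collector can also scrape Prometheus targets
    config:
      scrape_configs:
        - job_name: edge-routers
          static_configs:
            - targets: ["10.0.0.1:9100"]   # placeholder target

processors:
  batch:                # batch telemetry before export

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [otlphttp]
```

Swapping the backend means editing the exporters section and redeploying the Collector; application code that emits via the OTel API is untouched.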
18.4 Alert fatigue and choosing the right signal
Alert fatigue occurs when a system produces so many alerts that engineers stop treating them as meaningful signals. The typical cause is not a lack of metrics but a surplus of metrics without clear thresholds, combined with alerting on symptoms that have no direct remediation path. A CPU alert that fires every day because the system is working normally at high load is noise; it trains engineers to ignore alerts, which is exactly the condition that causes real incidents to go undetected.
The correct discipline is to start with the user-facing question: "Is the service responding within the agreed time?" Then trace backwards: which intermediate signal predicts or explains a user-visible failure? Alert on that signal, with a threshold that demands a human response when crossed. Infrastructure signals (disk, CPU) should be recorded as metrics for diagnosis but should not trigger alerts unless they sit on a direct causal path to a user-visible failure.
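The discipline above reduces to a small amount of logic: page on the user-facing symptom, record everything else for diagnosis. The sketch below is illustrative only; the threshold and metric names are invented, not from any alerting product:

```python
# Symptom-based paging: a human is paged only when the user-facing
# agreement is breached, never on infrastructure signals alone.

AGREED_P99_MS = 1000  # "responding within the agreed time" (assumed SLO)

def should_page(p99_latency_ms):
    """Page only when users are measurably affected."""
    return p99_latency_ms > AGREED_P99_MS

# Infrastructure signals: stored as metrics for diagnosis once a page
# fires, but crossing these values does not by itself page anyone.
diagnostic_metrics = {"cpu_pct": 91.0, "disk_pct": 78.0}

print(should_page(1450))  # True: users are affected, page a human
print(should_page(620))   # False: CPU at 91% alone stays a dashboard line
```

The same structure generalises: each alert rule should name the user-visible failure it predicts and the remediation it demands; a rule that can name neither is a candidate for deletion, not tuning.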
The Facebook outage also illustrated signal resilience: signals that run on the same infrastructure as the service they observe will fail alongside that service. External synthetic monitoring (scripted requests from independent vantage points) can detect user-facing failures even when all internal tooling is dark. Both are necessary.
Common misconception
“More metrics means better observability.”
More metrics without clear purpose increases noise and accelerates alert fatigue. Effective observability starts with the user-facing question you need to answer, then identifies the minimum set of signals that reliably predict or explain failures at that level. Adding metrics that are never actioned, or that alert without a clear remediation path, degrades incident response rather than improving it.
Your network team reports that bandwidth on an internet-facing link has been unusually high for two days. Which signal type would most efficiently tell you which destinations are consuming the bandwidth?
A distributed web service is intermittently slow. Error rates are normal and infrastructure metrics are healthy. Users report that specific pages take 8-12 seconds to load instead of under 1 second. Which signal type is most likely to reveal where the latency is being added?
An operations team has 47 different alert rules configured. Engineers acknowledge most alerts without investigating because so many fire during normal business hours. What is the primary cause, and what should be changed?
Key takeaways
- Logs, metrics, and distributed traces answer different questions. Choose the signal type based on the question, not on what is easiest to collect.
- SNMP gives interface and device-level counters. NetFlow gives per-flow conversation metadata. sFlow gives sampled packet headers at scale. Each suits a different network question.
- RFC 3954 standardised NetFlow v9; IETF IPFIX (RFC 7011) extended it. These are the formats network analysis tools ingest from routers and switches.
- OpenTelemetry provides a vendor-neutral API and SDK for collecting logs, metrics, and traces without re-instrumenting when you change backends.
- Alert fatigue is caused by alerts that fire without requiring human action. Fix it by auditing each alert against user-visible failure and remediation path, not by adding more tooling.
- Signal pipeline resilience matters: internal monitoring that runs on the same infrastructure as the service it observes will fail during the same incident. External synthetic monitoring provides independent visibility.
Standards and sources cited in this module
RFC 3954, Cisco Systems NetFlow Services Export Version 9
Section 1, Introduction; Section 3, Template FlowSet Format
The authoritative specification for NetFlow v9 format referenced in section 18.2. Defines the flow record structure, template mechanism, and export protocol that network analysis tools implement.
RFC 3176, InMon Corporation's sFlow: A Method for Monitoring Traffic in Switched and Routed Networks
Section 2, Overview; Section 3, sFlow Datagram
The specification for sFlow sampled flow monitoring referenced in section 18.2. Explains the sampling model that makes sFlow applicable to high-speed links where per-flow counting is impractical.
OpenTelemetry Specification, Semantic Conventions for Networking
Semantic Conventions: Network, Transport
Defines how network attributes (peer address, transport, DNS query duration) are recorded in OTel traces and metrics. Referenced in section 18.3 on cross-service correlation.
Facebook Engineering Blog: More details about the October 4 outage
Published October 4, 2021
Facebook's post-incident analysis of the BGP withdrawal that caused the six-hour global outage. Confirms that internal tooling ran on the affected infrastructure and that physical access was required to diagnose the issue. Used as the opening case study.
The Four Golden Signals; Symptoms Versus Causes
The Google SRE guidance on alert design, specifically the principle that alerts should be tied to user-facing symptoms with clear remediation paths. Supports the alert fatigue discussion in section 18.4.
Signals narrow the explanation space. Module 19 teaches how to use the sharpest signal of all, packet captures, safely and deliberately: scoping collection, writing filters, and respecting data minimisation.
Module 18 of 21 · Practice-Strategy