Resilience and Fault Tolerance
By the end of this module you will be able to:
- Explain the eight fallacies of distributed computing and their architectural implications
- Apply the circuit breaker pattern to prevent cascading failures in distributed systems
- Design bulkheads to limit the blast radius of a failing service
- Specify timeout and retry strategies appropriate to different failure modes

Real-world incident · August 19, 2008
Netflix loses a database. Three days with no DVD shipments. The architecture that emerged.
On August 19, 2008, a database corruption event at Netflix made it impossible to ship DVDs for three days. The entire Netflix operation ran on a single, centralised Oracle database. When that database failed, there was no fallback. The engineering team had no tools to diagnose, isolate, or recover from the failure in a timely way.
The incident triggered a multi-year re-architecture programme. Netflix moved from a monolithic application with a single database to a microservices architecture deployed across Amazon Web Services. But the more significant change was the engineering philosophy that emerged from the experience: assume that everything will fail, and design the system to continue operating in a degraded state when it does.
Netflix engineers built Hystrix, the circuit breaker library that became the industry standard for fault tolerance in Java services. They also built Chaos Monkey, a tool that randomly terminates production instances to verify that the system keeps working even when components are forcibly removed. The principle they articulated, later called "chaos engineering," was the same one that Peter Deutsch and colleagues had stated in 1994: the network is not reliable, and assuming otherwise produces systems that fail catastrophically rather than gracefully.
If your entire service depends on a single database and that database becomes unavailable, what happens to every user? What if you designed the system to assume the database would fail?
With the learning outcomes established, this module begins by examining the fallacies of distributed computing in depth.
14.1 The fallacies of distributed computing
In 1994, Peter Deutsch at Sun Microsystems documented a set of assumptions that developers routinely make about distributed systems and that are reliably false. These became known as the eight fallacies of distributed computing. Understanding them is the starting point for resilience engineering, because the failure modes that produce real production incidents are almost always a consequence of one of these assumptions being violated.
The most consequential fallacies for system design are: the assumption that the network is reliable (packets are lost, connections drop, switches fail); the assumption that latency is zero (cross-data-centre calls add 10 to 100 milliseconds, and any service call can become arbitrarily slow under load); and the assumption that the network topology does not change (services restart, scale in and out, and IP addresses change in dynamic cloud environments).
A system designed on these assumptions behaves correctly in development and staging, where the network is local and reliable, and fails unpredictably in production. Resilience engineering means designing the system to continue functioning at a reduced level of service when these assumptions are violated, rather than failing completely.
“The network is reliable. Latency is zero. Bandwidth is infinite. The network is secure. Topology does not change. There is one administrator. Transport cost is zero. The network is homogeneous. All of these are false.”
Peter Deutsch et al. - The Eight Fallacies of Distributed Computing. Sun Microsystems, 1994-1997
Deutsch's list is not a historical curiosity; every item maps directly to a real failure mode in modern cloud systems. Network partitions, latency spikes under load, dynamic Kubernetes pod scheduling, and multi-team infrastructure ownership are everyday realities. The fallacies are design constraints, not edge cases.
With an understanding of the fallacies of distributed computing in place, the discussion can now turn to circuit breakers, which build directly on these foundations.
14.2 Circuit breakers
When a downstream service becomes slow or unavailable, every request to it blocks a thread waiting for a response that may take 30 seconds to time out. If the calling service has 50 threads and 50 requests are blocked waiting for the failing service, no threads are available to handle any other requests, including requests that have nothing to do with the failing service. The entire calling service becomes unresponsive. This is cascading failure.
The circuit breaker pattern, named by Michael Nygard in Release It! (2007) and popularised by Netflix Hystrix, addresses this by monitoring the failure rate of calls to a downstream service and opening the circuit when failures exceed a threshold.
The circuit has three states. In the closed state, calls pass through normally and the failure rate is tracked. When the failure rate exceeds the threshold (for example, five failures in ten seconds), the circuit opens. In the open state, calls fail immediately without attempting the network call, returning a cached fallback or an explicit error. After a configurable timeout, the circuit enters the half-open state, allowing a single test request through. If the test request succeeds, the circuit closes. If it fails, the circuit opens again.
The circuit breaker converts a slow failure (waiting 30 seconds for a timeout) into a fast failure (returning an error immediately). This frees threads to serve other requests and allows the calling service to return a degraded response rather than blocking entirely.
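The sketch below illustrates the three-state logic in plain Java. It is not the Hystrix API and the class and method names are invented; it also trips on a count of consecutive failures rather than the windowed failure rate described above, purely to keep the example short.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal illustrative circuit breaker: closed -> open -> half-open, nothing more.
public class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before the circuit opens
    private final Duration openTimeout;   // how long to stay open before testing recovery
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openTimeout) {
        this.failureThreshold = failureThreshold;
        this.openTimeout = openTimeout;
    }

    public synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(openTimeout) < 0) {
                return fallback.get();       // open: fail fast, no network call is attempted
            }
            state = State.HALF_OPEN;         // timeout elapsed: let one test request through
        }
        try {
            T result = remoteCall.get();
            state = State.CLOSED;            // success closes the circuit
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;          // trip (or re-trip) the circuit
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}
```

A caller might wrap a downstream call as `breaker.call(() -> client.fetch(id), () -> cachedDefault)`, so that while the circuit is open the service still returns a degraded response instead of blocking.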
Common misconception
“Retrying a failed request is equivalent to a circuit breaker.”
Retries and circuit breakers solve different problems. Retries handle transient failures: a network glitch that resolves in milliseconds. Circuit breakers handle persistent failures: a service that has been degraded for minutes. Retrying a call to a service that has been failing for 60 seconds adds load to an already struggling service and blocks the caller longer. A circuit breaker detects the persistent failure and stops making calls entirely, giving the downstream service time to recover without being hammered with retries.
With an understanding of circuit breakers in place, the discussion can now turn to bulkheads, which build directly on these foundations.
14.3 Bulkheads
The bulkhead pattern takes its name from the compartmentalised hull sections of a ship. If one compartment floods, the bulkheads prevent water from reaching the others. In software, a bulkhead isolates failure by ensuring that a problem with one downstream dependency cannot exhaust the resources (threads, connections) used by all other operations.
The implementation is separate resource pools per downstream dependency. Rather than a single shared thread pool handling all outgoing calls, each downstream service gets its own pool. If the Payment service pool fills with blocked threads waiting for a slow response, the Account service pool is unaffected and continues to serve requests normally.
In practice, bulkheads are implemented using separate connection pools (limiting the number of concurrent HTTP connections to each downstream service independently), separate thread pools in thread-per-request models, or separate async worker pools in event loop models. The key parameter is the limit: each pool should be sized to the capacity of the downstream service under normal load, not to the maximum concurrency of the calling service.
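A minimal sketch of the idea, assuming a thread-per-request Java service: each downstream dependency gets its own bounded pool, so a backlog on one cannot consume the threads used by another. The class name, pool sizes, and queue depths here are illustrative, not a recommendation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative bulkheads: one bounded pool per downstream dependency.
public class Bulkheads {
    private static ExecutorService boundedPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),       // bounded queue: excess work is rejected
                new ThreadPoolExecutor.AbortPolicy());     // rather than buffered indefinitely
    }

    // Sized to each downstream service's capacity under normal load,
    // not to the calling service's maximum concurrency.
    private final ExecutorService paymentPool = boundedPool(10, 20);
    private final ExecutorService accountPool = boundedPool(20, 40);

    public Future<String> callPayment(Callable<String> call) {
        return paymentPool.submit(call);   // slow Payment responses can only exhaust this pool
    }

    public Future<String> callAccount(Callable<String> call) {
        return accountPool.submit(call);   // Account traffic is unaffected by a Payment backlog
    }
}
```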
With an understanding of bulkheads in place, the discussion can now turn to timeouts, retries, and graceful degradation, which build directly on these foundations.
14.4 Timeouts, retries, and graceful degradation
Every network call must have an explicit timeout. A call without a timeout blocks indefinitely if the downstream service never responds. In a microservices system, an indefinitely blocking call propagates up the call chain: the Checkout service calls the Inventory service, which calls the Pricing service, which has no timeout on its database call. A slow database query blocks the entire call chain.
Timeouts should be set at each layer independently, and the outer timeout should be longer than the inner one to allow for retries. A user-facing API should time out in 2 to 5 seconds; an internal service call in 500 milliseconds to 2 seconds; a database query in 100 to 500 milliseconds.
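As one illustration of making these budgets explicit, the JDK's built-in HTTP client separates the connection timeout from the per-request timeout; the internal URL and the exact durations below are assumptions chosen to match the ranges above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Illustrative only: no call is allowed to block indefinitely.
public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(500))    // give up quickly if the connection cannot be established
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://pricing.internal/quote"))  // hypothetical internal service
                .timeout(Duration.ofSeconds(2))            // inner budget, shorter than the user-facing 2-5 seconds
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```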
Retries are appropriate for transient failures: temporary network interruptions that resolve within milliseconds. Use exponential backoff with jitter, increasing the delay between each retry by a factor and adding random variation to prevent all retrying clients from hitting the downstream service simultaneously (the thundering herd problem). Do not retry non-idempotent operations without idempotency controls: retrying a payment creation request without an idempotency key risks charging the customer twice.
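A minimal sketch of exponential backoff with full jitter; the helper name and parameters are illustrative, and it assumes the wrapped call is safe to repeat, either because the operation is idempotent or because every attempt carries the same idempotency key.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Illustrative retry helper: exponential backoff with full jitter.
public class RetryWithBackoff {
    // Assumes maxAttempts >= 1 and that the supplied call is safe to repeat.
    public static <T> T retry(Supplier<T> call, int maxAttempts, long baseDelayMillis)
            throws InterruptedException {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt == maxAttempts) break;                            // no sleep after the final attempt
                long cap = baseDelayMillis * (1L << (attempt - 1));           // exponential: base, 2x, 4x, ...
                long sleep = ThreadLocalRandom.current().nextLong(cap + 1);   // full jitter spreads clients out
                Thread.sleep(sleep);
            }
        }
        throw lastFailure;
    }
}
```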
Graceful degradation means serving a reduced-quality response rather than failing completely. When the Recommendation service is unavailable, the product page shows popular items from cache instead of personalised recommendations. Define explicit fallback strategies for every external dependency before the dependency fails in production, not after.
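A sketch of that fallback, with an invented client interface standing in for the Recommendation service: the personalised call is attempted first, and any failure degrades to a cached list of popular items rather than an error page.

```java
import java.util.List;

// Illustrative graceful degradation: the names and caching strategy are assumptions.
public class ProductPage {
    interface RecommendationClient {
        List<String> personalisedFor(String userId);
    }

    private final RecommendationClient recommendations;
    private final List<String> cachedPopularItems;   // refreshed periodically, outside the request path

    public ProductPage(RecommendationClient recommendations, List<String> cachedPopularItems) {
        this.recommendations = recommendations;
        this.cachedPopularItems = cachedPopularItems;
    }

    public List<String> itemsFor(String userId) {
        try {
            return recommendations.personalisedFor(userId);   // preferred: personalised recommendations
        } catch (RuntimeException unavailable) {
            return cachedPopularItems;                        // degraded: generic but still useful
        }
    }
}
```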
“The stability patterns exist to prevent failures in one part of a system from cascading into other parts. They act as bulkheads, circuit breakers, and timeouts to protect threads and connections.”
Michael Nygard - Release It! Design and Deploy Production-Ready Software, 2nd edition, 2018
Nygard's book documented these patterns from real production incidents before they had names. The unifying principle is that failure is inevitable; the question is whether a failure in one component takes down the entire system or is contained. The patterns in this module (circuit breaker, bulkhead, timeout) all serve the same goal: containment.
A payment service calls three downstream services: fraud detection (50ms average), account service (30ms average), and notification service (200ms average). Under load, the notification service degrades to 10-second responses, causing the payment service to stop responding to all requests. Which two patterns most directly address this?
A team is implementing retries for an order placement API. The endpoint creates a new order in the database and charges the customer's card. What must be in place before retries can be safely enabled?
Which fallacy of distributed computing is most directly violated when a service call has no explicit timeout?
Netflix uses Chaos Monkey to randomly terminate production instances. What assumption does this practice directly test and disprove?
A service calls a downstream dependency that is responding with 50% errors. The circuit breaker trips to the Open state. A new request arrives. What happens?
Key takeaways
- The eight fallacies of distributed computing document the assumptions that make systems fragile. Design against them: the network is not reliable, latency is not zero, topology changes, and services will fail.
- Circuit breakers prevent cascading failures by converting slow failures into fast failures. Three states: closed (normal), open (fail fast), half-open (test recovery). Named by Nygard, popularised by Netflix Hystrix.
- Bulkheads isolate failures by giving each downstream dependency its own resource pool. A slow service can only exhaust its own pool, not all pools.
- Every network call must have an explicit timeout. Retries require exponential backoff with jitter and idempotency controls for non-idempotent operations.
- Design explicit fallback strategies for every external dependency before those dependencies fail in production. Graceful degradation means returning a reduced-quality response, not failing completely.
Standards and sources cited in this module
Nygard, M. Release It! Design and Deploy Production-Ready Software, 2nd edition. Pragmatic Bookshelf, 2018.
Part II: Stability
The source of the circuit breaker and bulkhead patterns. The quote in Section 14.4 is from Chapter 5. The Netflix 2008 incident and the patterns that emerged from it are drawn from the resilience engineering tradition Nygard established.
Deutsch, P. et al. The Eight Fallacies of Distributed Computing. Sun Microsystems, 1994-1997.
Full list
The foundational text on distributed computing assumptions. Quoted in Section 14.1 to establish the failure modes that resilience patterns are designed to handle. The fallacies are used throughout the module as the conceptual framework for why each pattern exists.
Netflix Technology Blog. Fault Tolerance in a High Volume, Distributed System. January 2012.
Full post
Netflix's own account of the Hystrix circuit breaker design and the production failure modes it addresses. Used in the opening story and to support the circuit breaker description in Section 14.2.
Principles of Chaos Engineering
Full document
The formal definition of chaos engineering, drawing directly from Netflix's Chaos Monkey practice described in the opening story. Referenced in Section 14.1 for the principle of designing systems to assume failure.
What comes next: Resilience handles failure. Scalability handles growth. Module 15 covers horizontal and vertical scaling, caching, sharding, load balancing, and the practical application of the CAP theorem - illustrated by Instagram’s architecture journey from 2 founders to 100 million users.
Module 14 of 22 in Applied