MODULE 22 OF 22 · PRACTICE AND STRATEGY

Chaos Engineering and Trade-Off Analysis

30 min read · 4 outcomes · Interactive quiz

By the end of this module you will be able to:

  • Explain chaos engineering principles and design a well-scoped chaos experiment
  • Identify the prerequisites required before running chaos experiments in production
  • Apply structured trade-off analysis to an architectural decision
  • Use the Architecture Trade-off Analysis Method (ATAM) framework to identify sensitivity and trade-off points

Real-world case · Netflix, 2011

Netflix ran Chaos Monkey in production for three years before competitors considered it safe to try in staging.

In 2011 Netflix was three years into its migration from datacentre to AWS. The engineering team had designed every service to survive the failure of any single instance. They had circuit breakers, automatic restarts, and load balancer health checks. They believed the system was resilient. Belief was not enough.

Yury Izrailevsky and Ariel Tseitlin published the Chaos Monkey announcement on the Netflix Tech Blog in July 2011. The tool randomly selected and terminated EC2 instances in the Netflix production environment during normal business hours, Monday through Friday. The logic was explicit: if the team was not confident the system could survive random instance termination on a Tuesday afternoon, they wanted to discover the weakness in a controlled experiment rather than during a real failure at 2am on a Saturday.

The Simian Army that followed added Chaos Gorilla (terminates an entire Availability Zone), Chaos Kong (simulates a full region failure), and Latency Monkey (injects artificial delays). By 2012 Netflix treated chaos experiments as routine engineering practice, and the approach was subsequently formalised as the Principles of Chaos Engineering and adopted industry-wide.

Netflix's Chaos Monkey randomly terminated production EC2 instances during business hours. The engineers believed their system was resilient to single-instance failures because they had designed it to be. Chaos Monkey was the test of that belief. What does it mean to test a belief about resilience, versus simply holding it?

The module begins by examining the principles that turn that kind of test into a repeatable discipline.

22.1 The principles of chaos engineering

Chaos engineering is the discipline of experimenting on a system in order to build confidence in its capability to withstand turbulent conditions in production. The distinction from testing is important: testing verifies that code does what it is supposed to do. Chaos engineering verifies that the system behaves as expected when infrastructure fails around it.

The Principles of Chaos Engineering, first published by Netflix engineers in 2015, define five principles:

  • Build a hypothesis around steady state: define a measurable normal state before injecting failures.
  • Vary real-world events: inject failures that mirror actual production failure modes, not artificial scenarios.
  • Run experiments in production: staging environments lack production traffic patterns and production dependencies.
  • Automate experiments to run continuously: manual chaos tests are one-off; automated experiments catch regressions.
  • Minimise blast radius: start with the smallest possible failure that would still reveal the weakness.

The fifth principle is the safety net. Terminating one instance is a smaller blast radius than terminating an entire Availability Zone. Start small. Expand scope only after the smaller experiment confirms the hypothesis.

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Basiri, A. et al. (2016) - Chaos Engineering. IEEE Software, 33(3):35-41

The word 'discipline' is deliberate. Chaos engineering is not randomly breaking things. It is a structured practice with hypotheses, defined steady states, controlled injection, observation, and conclusions. The Netflix paper that formalised the field explicitly rejected the characterisation of chaos as careless disruption. Every experiment has an abort condition.

With the principles established, the next question is how to design a chaos experiment that puts them into practice.

22.2 Designing a chaos experiment

Every chaos experiment follows the same five-step structure:

  • Define the steady state: a measurable description of normal system behaviour, expressed using SLIs from Module 18 (for example, 99.95% of payment requests succeed, p99 latency under 300 milliseconds).
  • State the hypothesis: a specific prediction about how the system will behave when the failure is injected.
  • Inject the failure: use a controlled mechanism with a defined scope and duration.
  • Observe: monitor the SLIs throughout the experiment.
  • Conclude: compare actual behaviour to the hypothesis.
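The five-step structure can be sketched as a small data model. This is an illustrative sketch only; the names (SteadyState, ChaosExperiment) are invented for this example and do not come from any real chaos tooling.

```python
# Illustrative sketch of the five-step experiment record described above.
# All class and field names are hypothetical, not from any real tool.
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Measurable definition of 'normal', expressed as SLI thresholds."""
    min_success_rate: float    # e.g. 0.9995 for 99.95% payment success
    max_p99_latency_ms: float  # e.g. 300


@dataclass
class ChaosExperiment:
    steady_state: SteadyState
    hypothesis: str            # specific prediction, stated before injection
    injection: str             # mechanism, scope, and duration of the failure
    abort_success_rate: float  # terminate early below this success rate
    abort_latency_ms: float    # terminate early above this p99 latency

    def should_abort(self, success_rate: float, p99_latency_ms: float) -> bool:
        """Abort condition: reverse the failure immediately if breached."""
        return (success_rate < self.abort_success_rate
                or p99_latency_ms > self.abort_latency_ms)
```

Writing the hypothesis and abort thresholds down as data, before injection, is what makes the later "conclude" step possible: the experiment's result is a comparison against values fixed in advance, not an after-the-fact judgement.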

Consider an experiment testing the payment service's dependency on Redis. The steady state is 99.95% payment success rate and p99 latency under 300 milliseconds. The hypothesis is that if Redis becomes unavailable, the payment service will fall back to database session lookups, success rate will remain above 99%, and latency will increase but stay under 1,000 milliseconds. The failure injected is blocking all Redis connections from the payment service for five minutes using iptables rules. The abort condition is triggered if the success rate drops below 95% or latency exceeds 5,000 milliseconds. The experiment is terminated and the block removed immediately if either abort condition is met.
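A minimal observation loop for an experiment like this might look as follows. The SLI getters and the block-removal callback are hypothetical placeholders for real monitoring queries and the iptables cleanup; the thresholds mirror the abort condition stated above.

```python
# Sketch of the observe-and-abort loop for the Redis experiment above.
# The callables passed in are stand-ins for real SLI queries and cleanup.
import time

ABORT_SUCCESS_RATE = 0.95   # abort if success rate drops below 95%
ABORT_P99_MS = 5000         # abort if p99 latency exceeds 5,000 ms
EXPERIMENT_SECONDS = 5 * 60 # five-minute experiment window


def run_observation(get_success_rate, get_p99_ms, remove_block,
                    poll_s=10, sleep=time.sleep):
    """Poll the SLIs throughout the window; reverse the injected failure
    immediately if either abort threshold is breached."""
    elapsed = 0
    while elapsed < EXPERIMENT_SECONDS:
        if (get_success_rate() < ABORT_SUCCESS_RATE
                or get_p99_ms() > ABORT_P99_MS):
            remove_block()  # abort: remove the network block right away
            return "aborted"
        sleep(poll_s)
        elapsed += poll_s
    remove_block()  # window complete: restore traffic, then conclude
    return "completed"
```

Note that the block is removed on both paths: an experiment that completes cleanly still ends with the failure reversed before the team compares observed behaviour to the hypothesis.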

A confirmed hypothesis builds confidence. A rejected hypothesis reveals a weakness that would have caused an outage under real failure conditions. Both outcomes are valuable; a rejected hypothesis is arguably more valuable because it reveals a weakness before it causes unplanned downtime.

Common misconception

Chaos experiments are safe to run without preparation because they are designed to be small.

Chaos experiments without defined steady states produce uninterpretable results: you cannot tell whether the hypothesis was confirmed or rejected if you did not define what normal looks like before injecting the failure. Without distributed tracing, you cannot diagnose failures during the experiment. Without abort conditions, a failed experiment becomes a production incident. The prerequisites must be in place before the first experiment runs.

A well-designed experiment is still unsafe to run without the right operational foundations; the next section covers those prerequisites.

22.3 Prerequisites before running chaos experiments

Chaos engineering on a system without good observability does not reveal weaknesses; it only creates unexplained problems. The following prerequisites must be in place before any production chaos experiment.

Defined SLIs and SLOs with real-time monitoring (from Module 18) are required to establish the steady state and to measure the impact of the failure injection. Distributed tracing is required to diagnose what is happening inside the system during the experiment. An on-call engineer must be monitoring the system throughout the experiment window, ready to trigger the abort condition manually if automated thresholds do not catch the problem quickly enough.

A documented abort condition (the specific SLI threshold at which the experiment terminates and the failure is reversed) must be defined before the experiment runs, not during it. Scheduled change freeze periods must exclude chaos experiments: running an experiment during peak trading hours without explicit approval and stakeholder notification is irresponsible. Stakeholders must know an experiment is running so that unexpected customer impact is not mistaken for a production incident.
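The checklist above could be enforced as a pre-flight gate before any experiment starts. This is a sketch under stated assumptions: the flag names are illustrative, and in practice each one would be wired to a real check (a monitoring API, the on-call rota, the change calendar).

```python
# Hypothetical pre-flight gate mirroring the prerequisites listed above.
# Each boolean flag stands in for a real automated or manual check.
def preflight_ok(has_slo_monitoring: bool,
                 has_distributed_tracing: bool,
                 on_call_engineer_present: bool,
                 abort_condition_documented: bool,
                 stakeholders_notified: bool,
                 in_change_freeze: bool) -> bool:
    """Refuse to start the experiment unless every prerequisite holds
    and the window does not fall inside a change freeze."""
    return (has_slo_monitoring
            and has_distributed_tracing
            and on_call_engineer_present
            and abort_condition_documented
            and stakeholders_notified
            and not in_change_freeze)
```

The point of the gate is that a single missing prerequisite blocks the run: there is no partial credit, because any one gap is enough to turn an experiment into an incident.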

Netflix began using Chaos Monkey as a way to make sure that we can handle these failures without any customer impact. After many years of using these tools, we have gotten to a point where engineers actually expect their services to be tested in this way.

Izrailevsky, Y. and Tseitlin, A. (2011) - The Netflix Simian Army. Netflix Tech Blog, July 2011

The phrase 'expect their services to be tested in this way' describes a cultural outcome, not just a technical one. Teams that have been running chaos experiments for years design for failure from the start because they know the experiment will come. The discipline changes how engineers think about resilience during development, not only during incident response.

The final outcome shifts from resilience testing to architectural decision-making: structured trade-off analysis with ATAM.

22.4 Trade-off analysis: ATAM

Architecture trade-off analysis ensures that significant architectural decisions are evaluated against quality attribute requirements before being implemented. The Architecture Trade-off Analysis Method (ATAM) was developed at the Software Engineering Institute (SEI) in the 1990s. It identifies three types of architectural elements: sensitivity points (where a change in the architecture significantly affects a quality attribute), trade-off points (where improving one quality attribute degrades another), and risks (decisions that may prove problematic given uncertain requirements).

Applying ATAM to the decision of synchronous versus asynchronous communication between the order service and the notification service: a synchronous call makes the order service's latency sensitive to the notification service's response time (a sensitivity point). Switching to asynchronous communication improves the order service's performance and availability but reduces consistency: notifications may be delivered seconds after the order completes (a trade-off point). The risk is that asynchronous delivery requires dead-letter queues and retry logic; if these are not implemented correctly, notifications are silently lost.
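The retry-plus-dead-letter pattern that the risk refers to can be sketched as follows. The `send` and `dead_letter` callables are assumed stand-ins for a real message broker client; the point of the sketch is the invariant that a notification is never dropped silently.

```python
# Sketch of retry with a dead-letter fallback for async notification
# delivery. send and dead_letter are hypothetical broker callbacks.
def deliver_with_retries(message, send, dead_letter, max_attempts=3):
    """Attempt delivery up to max_attempts times; on final failure, park
    the message in the dead-letter queue instead of losing it silently."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return True
        except Exception:
            if attempt == max_attempts:
                dead_letter(message)  # never drop the notification silently
    return False
```

A dead-lettered message is still a degraded outcome (the notification is late, pending manual or automated replay), but it is an observable one, which is exactly the property the trade-off analysis identifies as required for asynchronous delivery to be acceptable.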

For everyday decisions, a full ATAM exercise is not warranted. A structured comment in an Architecture Decision Record achieves the same purpose: document what you gain (the improvement), what you accept (the degraded quality attribute), and the sensitivity points (the thresholds at which the trade-off becomes problematic).
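An illustrative ADR fragment for the decision above (the service names and the backlog threshold are hypothetical):

```
Decision: asynchronous notification delivery from the order service

  • Gain: order service latency and availability no longer depend on the
    notification service's response time.
  • Accept: notifications may arrive seconds after order completion
    (weaker consistency).
  • Sensitivity points: a queue backlog large enough to delay notifications
    beyond the acceptable window; missing dead-letter handling, which would
    turn delayed notifications into silently lost ones.
```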

ATAM is first and foremost a risk-identification method. Architectural risks are architectural decisions that might compromise a system's quality attributes.

Bass, L., Clements, P., Kazman, R. (2021) - Software Architecture in Practice, 4th Edition. Addison-Wesley, Chapter 21

The framing as risk identification rather than optimisation is important. ATAM does not produce the 'optimal' architecture. It produces an explicit record of which quality attributes were prioritised, which were accepted as degraded, and where the risks lie. An architecture with documented trade-offs and known risks is more maintainable than an optimised architecture whose trade-offs were never examined.

22.5 Check your understanding

  1. A team wants to run their first chaos experiment: terminate all database instances for the user profile service for 5 minutes during business hours. What is wrong with this design?

  2. Write a hypothesis for a better-scoped version of the experiment: the user profile service, with one database replica terminated.

  3. In ATAM terminology, what is a trade-off point?

  4. Which of the five Principles of Chaos Engineering is violated by running the same chaos experiment manually once per quarter?

Key takeaways

  • Chaos engineering validates resilience assumptions by injecting controlled failures in production. It requires defined steady state, good observability, and abort conditions before it is safe to run.
  • Every chaos experiment follows a five-step cycle: define steady state, state hypothesis, inject failure, observe SLIs, conclude. A rejected hypothesis is valuable: it reveals a weakness before it causes unplanned downtime.
  • Prerequisites for chaos engineering: SLI/SLO monitoring, distributed tracing, on-call coverage during the experiment, documented abort conditions, and stakeholder notification. Missing any of these converts an experiment into an incident.
  • ATAM identifies sensitivity points (where architecture strongly affects a quality attribute) and trade-off points (where improving one quality attribute degrades another). Documenting these in Architecture Decision Records makes trade-offs explicit and auditable.
  • For everyday decisions, lightweight trade-off documentation in an ADR (what you gain, what you accept, sensitivity thresholds) achieves the purpose of ATAM without the full formal process.

Standards and sources cited in this module

  1. Principles of Chaos Engineering

    Five principles; community charter

    The five principles cited throughout Section 22.1 are drawn directly from this document, published by Netflix engineers. The foundational reference for the field.

  2. Basiri, A., Behnam, N., de Rooij, R. et al. (2016). Chaos Engineering. IEEE Software, 33(3):35-41

    Full paper

    The academic paper formalising Netflix's chaos engineering practice. Quoted in Section 22.1 for the definition of chaos engineering as a discipline. Provides the historical context for how Chaos Monkey evolved into a rigorous methodology.

  3. Bass, L., Clements, P., Kazman, R. (2021). Software Architecture in Practice, 4th ed. Addison-Wesley

    Chapter 21: Architecture Trade-off Analysis Method (ATAM)

    Defines ATAM and the quality attribute scenario approach to trade-off analysis. Quoted in Section 22.4 for the characterisation of ATAM as a risk-identification method.

  4. Izrailevsky, Y. and Tseitlin, A. (2011). The Netflix Simian Army. Netflix Tech Blog

    Original announcement post

    The original public description of Chaos Monkey and the Simian Army. Quoted in Section 22.3 for the cultural outcome of teams expecting their services to be tested.

What comes next: You have completed all 22 modules. Return to the course overview to review your progress, attempt the final assessment, or explore the interactive tools in the Software Architecture studio.

Module 22 of 22 in Practice and Strategy