Digital and Cloud Scale Architecture · Module 3

Resilience and performance under failure

Caching helps, but it creates new risks.

55 min · 3 outcomes · Software Development and Architecture · Advanced

Previously

Advanced patterns and distribution

This module

Resilience and performance under failure

Caching helps, but it creates new risks.

Next

Evolution and governance

Progress

Mark this module complete when you can explain it without rereading every paragraph.

Why this matters

A dependency slows down, callers retry, and load multiplies. Without deliberate resilience controls, a minor slowdown becomes an outage that users feel.

What you will be able to do

  1. Explain why timeouts and retries can increase harm
  2. Describe backpressure and graceful degradation in plain terms
  3. Define what you will monitor to spot saturation early

Before you begin

  • Comfort with earlier modules in this track
  • Ability to explain trade-offs and risks without jargon

Common ways people get this wrong

  • Cascading failure. One failure can spread quickly without isolation and backpressure.
  • Optimising the mean. Optimising the average can hide tail latency that users feel.

Main idea at a glance

Resilience mesh

Protect service calls with layered controls and clear fallback logic.

Stage 1

User request

User initiates a request that depends on downstream services.

Users expect requests to either work or fail cleanly, not hang.

Resilience control loop for dependency failure

Caching helps, but it creates new risks. Always decide where stale data is acceptable.
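The "where is stale data acceptable" decision can be made explicit in code. Below is a minimal sketch of a cache that serves fresh entries within a TTL, but deliberately falls back to a stale entry when the dependency call fails; the class name, the `fetch` callback, and the TTL value are illustrative assumptions, not a library API.

```python
import time

class StaleOkCache:
    """Cache that serves stale entries when the origin call fails.

    The explicit trade-off: readers may see data up to one outage old,
    but they never hang on a dead dependency.
    """
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, stored_at)

    def get(self, key, fetch):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry and now - entry[1] < self.ttl:
            return entry[0]          # fresh hit
        try:
            value = fetch(key)       # refresh from the dependency
        except Exception:
            if entry:
                return entry[0]      # dependency down: serve stale
            raise                    # nothing cached: fail cleanly
        self.store[key] = (value, now)
        return value
```

The useful property is the last branch: staleness is only accepted where something was cached before, and a cold-cache failure still surfaces as an error rather than invented data.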

Worked example. Retries turned a small outage into a full incident

A dependency slows down. Callers timeout and retry with no jitter. Load multiplies, queues fill, and what started as “a bit slow” becomes total failure. This is why resilience is a system property, not a library checkbox.
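The fix the worked example implies is a capped retry budget with jitter, so callers spread out instead of stampeding the dependency at the same instant. A minimal sketch, assuming only transient timeouts are worth retrying; the function name and delay values are illustrative.

```python
import random
import time

def retry_with_jitter(call, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry transient failures with capped, jittered exponential backoff.

    Full jitter (sleep a random time up to the backoff cap) desynchronises
    callers so a recovering dependency is not hit by a retry wave.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: fail, do not pile on
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))  # full jitter
```

Note what is *not* retried: anything other than the transient `TimeoutError` propagates immediately, which is the "only retry transient errors" rule from the mistakes list below.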

Common mistakes in resilience

Resilience mistakes that cause major incidents

Most cascading failures come from unbounded retry and missing load controls.

  1. Retrying every failure

    Only retry transient errors and always enforce retry budgets with jitter.

  2. Missing circuit breaker controls

    Open circuits quickly when dependency health drops to prevent full-system cascades.

  3. No backpressure strategy

    Use queue limits, admission control, or graceful degradation to survive overload safely.
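Mistake 2 above, missing circuit breaker controls, can be sketched in a few lines. This is a deliberately minimal breaker (consecutive-failure counting, one cooldown, one probe), not a production library; the threshold and cooldown values are illustrative assumptions.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    reject calls fast while open, probe again after a cooldown."""
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

The point of the open state is that rejected calls never touch the struggling dependency at all, which is exactly how the breaker stops a cascade.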

Verification. A resilience review in five questions

Resilience review checklist

Check these five controls before production release.

  1. Timeout and retry budget

    Define timeout boundaries and cap retries per request path.

  2. Idempotency guarantee

    Confirm retries cannot create duplicate side effects or data corruption.

  3. Safe fallback behaviour

    Specify how the user journey degrades when dependencies fail.

  4. Saturation detection

    Alert on queue growth, timeout ratio, and error budget burn rate.

  5. Rollback and recovery

    Document rollback triggers and the exact rollback command path.
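Item 4 of the checklist, saturation detection, needs a concrete signal. One cheap option is a sliding-window timeout ratio, sketched below; the class name, window size, and threshold are illustrative assumptions, and in practice you would feed this from your metrics pipeline rather than in-process counters.

```python
from collections import deque

class TimeoutRatioAlert:
    """Sliding-window timeout ratio as an early saturation signal.

    Alerts when the fraction of timed-out requests in the last N
    requests crosses a threshold, ideally before queues overflow.
    """
    def __init__(self, window=100, threshold=0.1):
        self.window = deque(maxlen=window)  # 1 = timed out, 0 = ok
        self.threshold = threshold

    def record(self, timed_out):
        self.window.append(1 if timed_out else 0)

    def should_alert(self):
        if not self.window:
            return False
        return sum(self.window) / len(self.window) >= self.threshold
```

A ratio is more robust than a raw count here: it stays meaningful whether the service handles ten requests a second or ten thousand.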

Reflection prompt

Where do timeouts or retries make things worse in your current system?

Mental model

Resilience and performance

Resilience is how you behave on bad days. Performance is how you behave on normal days.

  1. Load

  2. System

  3. Latency

  4. Errors

  5. Actions

Assumptions to keep in mind

  • Budgets exist. Budgets make trade-offs explicit and keep systems stable.
  • Degradation is designed. Degrade gracefully instead of failing catastrophically.
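"Degradation is designed" usually starts at admission: a bounded queue that sheds load explicitly instead of letting a backlog grow without limit. A minimal sketch using Python's standard `queue` module; the class and depth are illustrative assumptions.

```python
import queue

class AdmissionController:
    """Bounded-queue admission control: shed load at the door
    instead of letting an unbounded backlog grow."""
    def __init__(self, max_depth=100):
        self.q = queue.Queue(maxsize=max_depth)

    def admit(self, request):
        try:
            self.q.put_nowait(request)
            return True          # accepted for processing
        except queue.Full:
            return False         # shed: caller gets a fast "retry later"

    def take(self):
        return self.q.get_nowait()
```

Rejecting at admission is backpressure in its simplest form: the caller learns immediately that the system is saturated, rather than waiting in a queue that will never drain.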

Check yourself

Quick check. Resilience and scale

Why use circuit breakers

To stop failure cascades when dependencies are down.

What is backpressure

Slowing or shedding load to protect the system.

Why can retries be dangerous

They can multiply load during outages.

Where should caches sit

Close to reads that need speed but can accept staleness.

What is graceful degradation

Continuing with reduced features instead of total failure.

Why plan for scale early

Because traffic patterns rarely stay small.

What is a simple scaling model

Capacity equals throughput per node times number of nodes.

What should you monitor in scale tests

Latency, error rate, and saturation.
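The simple scaling model from the quick check (capacity equals throughput per node times number of nodes) is easy to apply, but worth hedging with a headroom factor, since real systems scale sublinearly. The function name, the headroom value, and the example numbers below are illustrative assumptions.

```python
def cluster_capacity(throughput_per_node, nodes, headroom=0.7):
    """Naive linear scaling model with a safety headroom factor.

    Real systems scale sublinearly (coordination overhead, shared
    dependencies), so treat the result as an upper bound for planning,
    not a promise.
    """
    return throughput_per_node * nodes * headroom

# Example: 200 req/s per node across 10 nodes, planned at 70%
# utilisation, gives a working capacity of 1400 req/s.
```

Planning at 70% rather than 100% is what leaves room for retries, deploys, and node loss on a bad day, tying the scaling model back to the resilience theme.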

Artefact and reflection

Artefact

A resilience checklist for one critical user journey

Reflection

Where in your work would explaining why timeouts and retries can increase harm change a decision, and what evidence would make you trust that change?

Optional practice

Adjust timeouts and retries to see the risk balance.

Source ISO/IEC/IEEE 42010:2022 architecture description standard
Source ISO/IEC 25010:2023 software quality model standard
Source C4 Model (reference framework for communicating architecture)
Source arc42 architecture documentation template (reference framework)