Digital and Cloud Scale Architecture · Module 3
Resilience and performance under failure
Caching helps, but it creates new risks.
Previously
Advanced patterns and distribution
This module
Resilience and performance under failure
Next
Evolution and governance
Progress
Mark this module complete when you can explain it without rereading every paragraph.
Why this matters
A dependency slows down, callers time out and retry, and load multiplies until a minor slowdown becomes an outage. Designing for bad days is what keeps normal days fast.
What you will be able to do
1. Explain why timeouts and retries can increase harm
2. Describe backpressure and graceful degradation in plain terms
3. Define what you will monitor to spot saturation early
Before you begin
- Comfort with earlier modules in this track
- Ability to explain trade-offs and risks without jargon
Common ways people get this wrong
- Cascading failure. One failure can spread quickly without isolation and backpressure.
- Optimising the mean. Optimising the average can hide tail latency that users feel.
Main idea at a glance
Resilience mesh
Protect service calls with layered controls and clear fallback logic.
Stage 1. User request
A user initiates a request that depends on downstream services. Users expect requests to either work or fail cleanly, not hang.
Resilience control loop for dependency failure
Caching helps, but it creates new risks. Always decide where stale data is acceptable.
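One way to make that decision explicit is to return a staleness flag from every cache read, so each call site chooses whether stale data is acceptable. This is a minimal in-process sketch; the function names, TTL, and stale window are illustrative assumptions, not values from this module:

```python
import time

# Illustrative windows: serve fresh within TTL_SECONDS; if the origin is
# down, serve stale (flagged) for up to STALE_OK_SECONDS, then fail.
TTL_SECONDS = 30
STALE_OK_SECONDS = 300

_cache: dict[str, tuple[str, float]] = {}  # key -> (value, fetched_at)

def fetch_from_origin(key: str) -> str:
    """Stand-in for a call to the real dependency."""
    return f"value-for-{key}"

def get(key: str) -> tuple[str, bool]:
    """Return (value, is_stale). Stale data is an explicit, visible choice."""
    now = time.monotonic()
    entry = _cache.get(key)
    if entry is not None and now - entry[1] < TTL_SECONDS:
        return entry[0], False              # fresh hit
    try:
        value = fetch_from_origin(key)
        _cache[key] = (value, now)
        return value, False
    except Exception:
        # Origin failed: the caller sees is_stale=True and decides whether
        # a degraded answer is acceptable for this journey.
        if entry is not None and now - entry[1] < STALE_OK_SECONDS:
            return entry[0], True
        raise                               # nothing safe to serve
```

The flag forces the "where is stale acceptable" question to be answered per call site instead of silently everywhere.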
Worked example. Retries turned a small outage into a full incident
A dependency slows down. Callers time out and retry with no jitter. Load multiplies, queues fill, and what started as “a bit slow” becomes total failure. This is why resilience is a system property, not a library checkbox.
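The fix for this pattern is a bounded retry: a small attempt budget, exponential backoff, and jitter so callers do not retry in lockstep. A minimal sketch, where the names and defaults (`retry_with_budget`, `TransientError`, a three-attempt budget) are illustrative assumptions rather than a specific library's API:

```python
import random
import time

class TransientError(Exception):
    """Errors worth retrying (timeouts, 503s); everything else fails fast."""

def retry_with_budget(call, *, max_attempts=3, base_delay=0.1, cap=2.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget spent: fail cleanly instead of piling on load
            # Exponential backoff with full jitter. Without the random
            # spread, synchronized callers retry in lockstep and multiply
            # load exactly when the dependency is weakest.
            backoff = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```

Note that only `TransientError` is retried; permanent failures propagate immediately, which is the other half of keeping retries safe.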
Common mistakes in resilience
Resilience mistakes that cause major incidents
Most cascading failures come from unbounded retry and missing load controls.
- Retrying every failure. Only retry transient errors, and always enforce retry budgets with jitter.
- Missing circuit breaker controls. Open circuits quickly when dependency health drops to prevent full-system cascades.
- No backpressure strategy. Use queue limits, admission control, or graceful degradation to survive overload safely.
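The circuit breaker mistake above is easiest to see as a tiny state machine: closed (calls pass, failures counted), open (fail fast until a cooldown elapses), half-open (one trial call decides). A sketch under assumed thresholds, not a production implementation — real breakers also need thread safety and failure-rate windows:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                # Fail fast: the point is to stop hammering a sick dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit and resets the count
        return result
```

The fast failure is the feature: callers get an immediate, cheap error they can degrade around, instead of a slow timeout that ties up their own capacity.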
Verification. A resilience review in five questions
Resilience review checklist
Check these five controls before production release.
- Timeout and retry budget. Define timeout boundaries and cap retries per request path.
- Idempotency guarantee. Confirm retries cannot create duplicate side effects or data corruption.
- Safe fallback behaviour. Specify how the user journey degrades when dependencies fail.
- Saturation detection. Alert on queue growth, timeout ratio, and error budget burn rate.
- Rollback and recovery. Document rollback triggers and the exact rollback command path.
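Two of these controls, backpressure and saturation detection, can be sketched together as a bounded admission queue that sheds load and exposes its depth as a signal. The class name and depth limit are illustrative assumptions:

```python
from collections import deque

class AdmissionQueue:
    """Bounded queue with load shedding: a minimal backpressure sketch.

    An unbounded queue converts overload into ever-growing latency;
    rejecting early with an explicit signal is the graceful-degradation move.
    """
    def __init__(self, max_depth=100):
        self.max_depth = max_depth
        self._queue: deque = deque()

    def submit(self, request) -> bool:
        if len(self._queue) >= self.max_depth:
            return False  # shed load: caller sees a fast, explicit rejection
        self._queue.append(request)
        return True

    def depth_ratio(self) -> float:
        """Queue growth relative to capacity: a saturation signal to alert on."""
        return len(self._queue) / self.max_depth
```

Alerting on `depth_ratio` rising toward 1.0, alongside timeout ratio and error budget burn, catches saturation before the queue starts rejecting.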
Reflection prompt
Where do timeouts or retries make things worse in your current system?
Mental model
Resilience and performance
Resilience is how you behave on bad days. Performance is how you behave on normal days.
1. Load
2. System
3. Latency
4. Errors
5. Actions
Assumptions to keep in mind
- Budgets exist. Budgets make trade-offs explicit and keep systems stable.
- Degradation is designed. Degrade gracefully instead of failing catastrophically.
Check yourself
Quick check. Resilience and scale
Why use circuit breakers?
To stop failure cascades when dependencies are down.
What is backpressure?
Slowing or shedding load to protect the system.
Why can retries be dangerous?
They can multiply load during outages.
Where should caches sit?
Close to reads that need speed but can accept staleness.
What is graceful degradation?
Continuing with reduced features instead of total failure.
Why plan for scale early?
Because traffic patterns rarely stay small.
What is a simple scaling model?
Capacity equals throughput per node times number of nodes.
What should you monitor in scale tests?
Latency, error rate, and saturation.
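The simple scaling model from the quiz (capacity equals throughput per node times number of nodes) becomes useful once you invert it and add headroom. A small sketch; the function name and the 70% headroom figure are illustrative assumptions, not rules from this module:

```python
import math

def nodes_needed(target_rps: float, per_node_rps: float, headroom: float = 0.7) -> int:
    """Invert capacity = per-node throughput x node count.

    headroom < 1.0 keeps each node below saturation so latency stays flat
    on bad days instead of collapsing at the first traffic spike.
    """
    usable_per_node = per_node_rps * headroom
    return math.ceil(target_rps / usable_per_node)

# Example: 10,000 req/s at 500 req/s per node with 70% headroom
# -> 10000 / 350 = 28.57..., so 29 nodes.
```

The headroom term is where resilience and performance meet: sizing to 100% utilisation optimises the mean while guaranteeing bad tail latency under any disturbance.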
Artefact and reflection
Artefact
A resilience checklist for one critical user journey
Reflection
Where in your work would explaining why timeouts and retries can increase harm change a decision, and what evidence would make you trust that change?
Optional practice
Adjust timeouts and retries in a test environment and observe how the risk balance shifts.