MODULE 15 OF 22 · APPLIED

Scalability Patterns

30 min read · 4 outcomes · Interactive quiz

By the end of this module you will be able to:

  • Distinguish horizontal scaling from vertical scaling and identify when each applies
  • Apply caching at the correct layer for a given performance problem, and design an appropriate invalidation strategy
  • Explain read replicas and database sharding as database scaling strategies and identify the trade-offs of each
  • Design a queue-based architecture to decouple producers from consumers and absorb traffic spikes

Real-world case study · April 2012

Instagram scales to 30 million users with 3 engineers. The architecture that made it possible.

When Facebook acquired Instagram in April 2012 for one billion dollars, Instagram had 30 million registered users and 13 employees, only 3 of whom were engineers maintaining the backend infrastructure. The system was handling 1,000 photo uploads per second at peak, serving tens of millions of users daily.

The Instagram engineering team published a detailed breakdown of their architecture in May 2011, when they had 1 million users. The choices they made then scaled to 30 million. Their application servers were stateless Django instances behind an HAProxy load balancer, allowing horizontal scaling by simply adding more instances. PostgreSQL stored the core data, with read replicas routing the majority of read traffic away from the primary. Redis backed the feeds and session state. Photo storage went to Amazon S3, with Amazon CloudFront as the CDN for static asset delivery.

None of these components were invented by Instagram. They assembled proven scalability patterns: stateless application servers for horizontal scaling, read replicas to reduce database read pressure, Redis caching for high-frequency reads, and a CDN for static content. The architecture was not clever; it was composed of well-understood patterns applied to the right bottleneck at each stage of growth.

If you have 30 million users and 3 engineers, what architecture decisions let you scale without adding more engineers?

The module begins with the most fundamental decision: whether to scale a single server up or to scale the service out.

15.1 Vertical versus horizontal scaling

Vertical scaling (scaling up) means adding capacity to a single server: more CPU cores, more RAM, faster storage. It is operationally simple: the application requires no changes and deployment complexity does not increase. Its drawbacks are the hardware ceiling, which is finite and increasingly expensive at the high end, and the single point of failure: if the server fails, the service is unavailable.

Horizontal scaling (scaling out) means adding more instances of a service and distributing load across them via a load balancer. It has no hard ceiling: adding instances adds capacity. Each instance can fail without taking the entire service down, and cost scales roughly linearly with instance count rather than superlinearly as with high-spec servers.

Horizontal scaling requires that application servers be stateless. If a server stores session state locally (a user's login session in memory), a second server cannot serve that user's subsequent requests. All state must be externalised: session state in Redis, user data in the database, uploaded files in object storage. Any instance must be able to serve any request.
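
To make the requirement concrete, here is a minimal sketch of externalised session state using the redis-py client. The key scheme, TTL, and function names are illustrative assumptions, not any particular production implementation.

```python
# Minimal sketch: session state lives in Redis, not in instance memory,
# so any application server behind the load balancer can serve any request.
# Assumes a Redis instance on localhost; key scheme and TTL are illustrative.
import json
import uuid

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 3600  # expire idle sessions after an hour

def create_session(user_id: str) -> str:
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS,
            json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None
```

Because the session lives in Redis, a request routed to any instance resolves the same session, which is exactly what allows the load balancer to treat instances as interchangeable.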

Common misconception

You should always prefer horizontal scaling over vertical scaling.

Vertical scaling is often the correct first step. It requires no application changes, avoids distributed-system complexity (load balancers, external state stores, distributed caches, network partitions), and for many systems at early stages the hardware ceiling is never approached: a single powerful database server can carry the load for years. The right question is: what is the actual bottleneck? If a single database server is the constraint and the application cannot easily be made stateless, vertical scaling buys time with minimal risk. Scale vertically until you cannot, then scale horizontally only the components that need it.

Scaling out adds capacity to do more of the same work; caching, the next pattern, removes work that need not be repeated at all.

15.2 Caching

Caching stores the result of an expensive computation or database query so it can be returned quickly on subsequent requests without repeating the work. It is the most impactful single optimisation for read-heavy systems, and also one of the most common sources of subtle bugs.

The cache-aside pattern is the most common implementation: the application checks the cache first; on a cache miss, it fetches from the database, stores the result in the cache with a time-to-live (TTL), and returns it to the caller. On the next request for the same key, the cache returns the stored value directly.
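
A minimal cache-aside sketch in Python, assuming a Redis cache; fetch_product_from_db, the key scheme, and the TTL are illustrative stand-ins rather than a real API.

```python
# Cache-aside: check the cache, fall back to the database on a miss,
# then populate the cache with a TTL for subsequent requests.
import json

import redis

cache = redis.Redis(decode_responses=True)
PRODUCT_TTL_SECONDS = 300  # illustrative TTL

def fetch_product_from_db(product_id: str) -> dict:
    # Stand-in for the real database query.
    return {"id": product_id, "name": "example"}

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:                       # cache hit: skip the work
        return json.loads(cached)
    product = fetch_product_from_db(product_id)  # cache miss: do the work
    cache.setex(key, PRODUCT_TTL_SECONDS, json.dumps(product))
    return product
```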

Caching can be applied at multiple layers, each with different trade-offs. A CDN (Content Delivery Network) caches static assets and HTML pages at edge locations geographically close to users, reducing both latency and origin server load. An API gateway can cache GET responses for public, idempotent endpoints. An application cache (Redis is the de facto standard) stores database query results and computed values with configurable TTLs.

The invalidation strategy must be designed before implementing the cache, not after discovering stale data in production. TTL expiry is the simplest approach: the cache entry expires after N seconds. Write-through invalidation updates the cache on every write, keeping it consistent at the cost of write overhead. Event-driven invalidation uses domain events to explicitly remove or update cache entries when the underlying data changes, providing stronger consistency at higher implementation complexity.
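
Continuing the cache-aside sketch above, a write-through variant keeps the entry fresh by updating the cache as part of the write path; save_product_to_db is again a hypothetical stand-in.

```python
# Write-through invalidation: every write updates both the database and
# the cache, so readers never see data older than the last write.
def save_product_to_db(product_id: str, fields: dict) -> dict:
    # Stand-in for the real UPDATE ... RETURNING query.
    return {"id": product_id, **fields}

def update_product(product_id: str, fields: dict) -> dict:
    product = save_product_to_db(product_id, fields)
    cache.setex(f"product:{product_id}", PRODUCT_TTL_SECONDS,
                json.dumps(product))  # refresh the entry on the write path
    return product
```

The trade-off named above is visible here: every write now pays the cost of a cache update, in exchange for a staleness window of zero rather than the TTL.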

There are only two hard things in computer science: cache invalidation and naming things.

Phil Karlton - Widely attributed, circa 1996

The aphorism exists because stale cache data is subtle and difficult to diagnose. A cache returns incorrect data silently: no error, no exception, just wrong results that may persist for the duration of the TTL. The difficulty is not implementing a cache; it is designing the invalidation strategy so that the system's observable behaviour remains correct when data changes.

Premature optimisation is the root of all evil.

Donald Knuth - Structured Programming with go to Statements (1974)

Knuth's warning applies directly to scalability. Teams that design for millions of users before they have a hundred often build complexity they never need. The skill is knowing which critical three percent matters.

Caching absorbs repeated reads, but the database itself eventually becomes the bottleneck. Two strategies address it: read replicas and sharding.

15.3 Database scaling: read replicas and sharding

Application servers scale horizontally with relative ease: they are stateless, and adding a new instance requires only provisioning and registration with the load balancer. Databases are harder because they hold shared, mutable state. Two strategies address the most common database bottlenecks.

Read replicas address read throughput. The primary database handles all writes. One or more read-only replicas receive a copy of all writes via replication and serve read queries. The majority of database operations in most applications are reads; routing them to replicas reduces load on the primary dramatically. The cost is replication lag: replicas are slightly behind the primary. For most reads (a product catalogue, a news feed), this is acceptable. For reads that must reflect a just-completed write (a user viewing their own updated profile), the application must read from the primary.
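
A routing sketch for the read-your-own-writes case. The Connection class is a stand-in for a real driver or pool (e.g. psycopg); the point is the needs_fresh flag directing a read to the primary.

```python
# Route reads to replicas by default; send reads that must reflect a
# just-completed write to the primary instead.
import random

class Connection:
    """Stand-in for a real database connection or pool."""
    def __init__(self, name: str):
        self.name = name
    def execute(self, query: str, params: dict) -> None:
        print(f"[{self.name}] {query} {params}")

primary = Connection("primary")
replicas = [Connection("replica-1"), Connection("replica-2")]

def run_read(query: str, params: dict, *, needs_fresh: bool = False) -> None:
    conn = primary if needs_fresh else random.choice(replicas)
    conn.execute(query, params)

run_read("SELECT * FROM feed WHERE user_id = %(id)s", {"id": 7})
run_read("SELECT * FROM profile WHERE user_id = %(id)s", {"id": 7},
         needs_fresh=True)  # the user just edited their own profile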

Sharding addresses write throughput and data volume. The dataset is partitioned horizontally across multiple database instances. Each shard holds a subset of the data. Queries for a given record go to the shard that holds it. Sharding allows the write load to be distributed across multiple primaries, breaking through the vertical scaling limit of a single database.
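
A hash-based shard routing sketch; the shard names and key scheme are illustrative. Note that this naive modulo mapping is part of what makes re-sharding (discussed next) painful: changing the shard count remaps most keys.

```python
# Map a partition key to a shard deterministically with a stable hash.
import hashlib

SHARDS = ["users-shard-0", "users-shard-1", "users-shard-2", "users-shard-3"]

def shard_for(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All rows for one user land on one shard, so single-user queries stay
# local; a query spanning many users must fan out to every shard.
print(shard_for("user-12345"))
```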

The cost of sharding is significant: cross-shard queries are expensive and sometimes impossible without application-level joins; re-sharding when the partition key produces uneven distribution requires data migration; transactions across shards require distributed transaction protocols. Sharding should be adopted only when simpler strategies (vertical scaling, read replicas, caching) have been exhausted and a specific write throughput or data volume bottleneck has been identified.

Read replicas and sharding scale the stored data; the final pattern, queue-based load levelling, handles bursts of incoming work.

15.4 Queue-based load levelling

Queues decouple the rate at which work is accepted from the rate at which it is processed. This is valuable for absorbing traffic spikes: during a peak, the API enqueues requests at the incoming rate (which can be very high) and workers process them at a sustainable rate (which is bounded by their capacity). The queue absorbs the difference.

Without a queue, a traffic spike that doubles the incoming request rate requires either doubling the processing capacity (which may not be possible instantly) or dropping requests. With a queue, the spike is absorbed and processed as capacity allows, at the cost of increased latency for requests submitted during the peak.
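
An in-process illustration of the idea using only the Python standard library: a burst of fifty jobs is accepted instantly while a single worker drains them at its own bounded rate. A real system would use a durable broker, but the shape is the same.

```python
# Load levelling in miniature: enqueue a spike instantly, drain it at a
# sustainable rate. The queue absorbs the difference as depth.
import queue
import threading
import time

jobs: queue.Queue[int] = queue.Queue()

def worker() -> None:
    while True:
        job = jobs.get()
        time.sleep(0.1)  # simulate bounded processing capacity
        print(f"processed job {job}; queue depth now {jobs.qsize()}")
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

for i in range(50):   # a traffic spike: submissions arrive far faster
    jobs.put(i)       # than the worker can process them

jobs.join()           # the backlog drains as capacity allows
```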

The queue depth is the key operational metric: a growing queue means workers are not keeping up. Auto-scaling workers based on queue depth is the standard cloud pattern. Amazon SQS with Lambda or ECS auto-scaling, and Google Cloud Pub/Sub with Cloud Run, both support this pattern natively. The queue depth threshold for scaling triggers should be set based on the acceptable latency SLO (Service Level Objective) for the queued work.
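
The relationship between queue depth, drain rate, and the latency SLO can be made explicit. This back-of-the-envelope rule (all numbers illustrative, not from any cloud provider's documentation) computes the worker count needed to clear the current backlog within the SLO.

```python
# Workers needed so the current backlog clears within the latency SLO:
# required drain rate = depth / SLO; divide by per-worker throughput.
import math

def desired_workers(queue_depth: int,
                    jobs_per_worker_per_second: float,
                    latency_slo_seconds: float) -> int:
    required_rate = queue_depth / latency_slo_seconds
    return max(1, math.ceil(required_rate / jobs_per_worker_per_second))

# 6,000 queued jobs, 5 jobs/s per worker, 120 s SLO -> 10 workers
print(desired_workers(6_000, 5.0, 120.0))
```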

Queue-based load levelling is appropriate when the work is asynchronous and the caller does not need an immediate result: image processing, report generation, email delivery, and data exports are all good candidates. It is not appropriate when the user needs the result within their current interaction; that requires either a synchronous response or a queued job combined with a polling or push mechanism for completion.

The cloud-native approach to handling variable load is to decouple acceptance from processing using a queue. The API surface accepts requests at any rate; the workers process them at a rate they can sustain.

AWS Well-Architected Framework - Performance Efficiency Pillar: Queue-Based Load Levelling pattern

The key insight is that acceptance rate and processing rate are two separate design concerns. An HTTP endpoint can accept thousands of requests per second and return 202 Accepted immediately; the processing can happen minutes later. Decoupling these rates allows both to be optimised independently and prevents traffic spikes from causing failures.
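
A minimal sketch of the accept-then-process split, assuming Flask and reusing an in-process queue for illustration; a production system would enqueue to a durable broker such as SQS or Pub/Sub instead.

```python
# Accept at the incoming rate, return 202 immediately, process later.
import queue
from flask import Flask, request

app = Flask(__name__)
jobs: queue.Queue[dict] = queue.Queue()  # stand-in for a durable broker

@app.post("/exports")
def accept_export():
    jobs.put(request.get_json())      # acceptance: fast, at any rate
    return {"status": "queued"}, 202  # processing happens asynchronously
```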

15.5 Check your understanding

A news website sees a 50x traffic spike whenever a major story breaks. The current architecture is a single application server and a single PostgreSQL database. Which improvement provides the highest immediate impact with the least implementation risk?

The editorial team updates a breaking story, but readers see the old version for up to 60 seconds due to a 60-second cache TTL. Which approach reduces this staleness window without removing caching entirely?

Instagram served 30 million users with 3 backend engineers. Their application servers stored no local state. Why is statelessness a prerequisite for horizontal scaling?

What is the primary operational challenge introduced by database sharding that makes it a last-resort scaling strategy?

Having explored the core scalability patterns, use the interactive diagram below to visualise how vertical scaling, caching, database sharding, and queue-based levelling complement one another under different load conditions.

Explore the concepts interactively

Use this interactive diagram to explore the concepts discussed in this module. Click on elements to see how they relate to each other and to the patterns covered above.

Key takeaways

  • Horizontal scaling requires stateless application servers with all state externalised (Redis for sessions, database for persistence). It provides resilience as well as capacity.
  • Caching is the highest-impact single optimisation for read-heavy systems. Design the invalidation strategy before implementing the cache; stale data bugs are silent and hard to diagnose.
  • Read replicas scale read throughput with low complexity. Sharding scales write throughput and data volume at high operational cost. Exhaust simpler strategies before adopting sharding.
  • Queue-based load levelling decouples the acceptance rate from the processing rate, absorbing spikes without dropping requests. Use queue depth as the auto-scaling trigger.
  • Identify the actual bottleneck before choosing a scaling pattern. Applying the wrong pattern adds complexity without solving the problem.

Standards and sources cited in this module

  1. AWS Well-Architected Framework: Performance Efficiency Pillar. docs.aws.amazon.com.

    Scaling compute resources, data management, and queue-based load levelling

    The most thorough practical guide to scalability decisions in cloud-native systems. The queue-based load levelling quote in Section 15.4 is drawn from this framework. Used throughout Section 15.3 for the read replica and sharding trade-offs.

  2. Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media.

    Chapter 5: Replication, Chapter 6: Partitioning

    The authoritative reference for database replication, sharding, and distributed data systems. Used in Section 15.3 for the replication lag explanation and the sharding key trade-offs.

  3. Instagram Engineering Blog. What Powers Instagram: Hundreds of Instances, Dozens of Technologies. May 2011.


    The primary source for the Instagram opening story. The architecture details (HAProxy, Django, PostgreSQL read replicas, Redis, S3, CloudFront) are drawn from this post and used to illustrate the practical application of the patterns in this module.

What comes next: You have completed the Applied stage. Stage 3 takes the patterns from Stages 1 and 2 and applies them to real-world digital and cloud-scale systems. Module 16 starts with hexagonal and clean architecture: the patterns that make infrastructure replaceable by placing the domain at the centre.

Module 15 of 22 in Applied