Module 24 of 26 · Practice & Strategy

Platforms and distributed systems

30 min read · 3 outcomes · Interactive + terminal · 5 references

By the end of this module you will be able to:

  • Explain MapReduce and the shared-nothing architecture that enables petabyte-scale processing
  • Compare Snowflake, BigQuery, and Databricks as cloud data platform choices
  • Explain the open table formats Delta Lake and Apache Iceberg and their role in the data lakehouse pattern

Batch and stream as choices on a latency-correctness boundary

Batch and stream are choices on a latency-correctness boundary, not philosophical opposites.

[Figure: Batch vs stream on the latency-correctness boundary. Four cards: (1) Batch (Hadoop / Spark): reproducible, large joins; (2) Micro-batch (Spark Structured Streaming): 5-minute latency, near-real-time; (3) Stream (Apache Flink): sub-second decisions, stateful; (4) Combined, Lambda / Kappa (Kreps + Marz): both, or stream as truth. Callout: the choice is latency-vs-reproducibility, not philosophical. Streaming a daily reconciliation report is overhead. Batch processing a fraud alert is too slow. Match the architecture to the decision window.]

Batch and stream are not opposites; they are choices on a latency-and-correctness boundary. Batch wins for reproducibility, large joins, and complex SQL. Stream wins for sub-second decisions and continuous state. The Kappa architecture (Jay Kreps) collapses the two into one stream; the Lambda architecture (Nathan Marz) runs both side by side.

CAP theorem: pick two of consistency, availability, and partition tolerance

CAP theorem: pick two of consistency, availability, and partition tolerance; the practical choice is CP vs AP.

[Figure: CAP theorem, pick two (Gilbert and Lynch, 2002). Three cards: Consistency (every read sees the latest write), Availability (every request gets a response), Partition tolerance (works under network split). Callout: network partition is unavoidable, so the real choice is CP vs AP. Banks pick CP: refuse the transaction rather than show a stale balance. Social feeds pick AP: show possibly stale content rather than fail.]

CAP theorem: a distributed system can guarantee any two of consistency, availability, and partition tolerance, but not all three. Network partitions are unavoidable in real deployments, so the choice is really between CP (stay consistent during a partition, at the cost of refusing some requests) and AP (stay available during a partition, at the cost of possibly stale reads). Gilbert and Lynch proved this formally in 2002.
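The CP versus AP choice can be sketched with a single replica that is cut off from its primary. This is a toy model, not a real replication protocol: the `Replica` class, the `mode` labels, and `read_during_partition` are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


@dataclass
class Replica:
    mode: str                  # "CP" or "AP" (labels used only in this sketch)
    value: int = 0             # last value this replica has received
    partitioned: bool = False  # True while cut off from the primary

    def replicate(self, new_value: int) -> None:
        # Updates from the primary only arrive while the network is healthy.
        if not self.partitioned:
            self.value = new_value

    def read(self) -> int:
        # CP: refuse rather than risk returning a stale value.
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest write during a partition")
        # AP: always answer, even though the value may be stale.
        return self.value


def read_during_partition(mode: str) -> str:
    replica = Replica(mode=mode)
    replica.replicate(100)       # balance of 100 reaches the replica
    replica.partitioned = True   # the network splits
    replica.replicate(40)        # a later write (40) never arrives
    try:
        return f"answered {replica.read()}"
    except Unavailable:
        return "refused"


print(read_during_partition("CP"))
print(read_during_partition("AP"))
```

The CP replica behaves like the bank in the figure: it refuses the read. The AP replica behaves like the social feed: it answers with the last value it saw, which is no longer the latest write.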

Stream recovery uses source offset, checkpoint, and exactly-once sink

Stream recovery uses three mechanisms: source offset, checkpointed state, and exactly-once sink.

[Figure: Stream replay, the savepoint trio (Apache Flink). Three cards: (1) Source offset (Apache Kafka): restart position in the log; (2) Checkpointed state (Apache Flink): aggregations and windows; (3) Exactly-once sink (Apache Kafka): idempotent or transactional write. Callout: without all three, recovery silently corrupts. No offset: re-process from now and lose history. No checkpointed state: aggregations reset. Non-idempotent sink: duplicates on retry.]

Stream recovery uses three mechanisms: source offset (where to restart), checkpointed state (what was computed), and an exactly-once sink (no duplicate writes). This module calls the combination the savepoint trio, after Apache Flink's savepoint mechanism; without all three, recovery silently double-counts or loses events.
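A minimal sketch of the three mechanisms working together, in plain Python rather than Flink or Kafka. The in-memory `LOG`, the `IdempotentSink` class, and the `run` function are hypothetical stand-ins: a real system would use a durable log, a durable checkpoint store, and a transactional or idempotent external sink.

```python
import copy

# Replayable source: the list index plays the role of the source offset.
LOG = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]


class IdempotentSink:
    """Rows are keyed by source offset, so a replayed event overwrites its
    earlier write instead of creating a duplicate."""

    def __init__(self):
        self.rows = {}

    def write(self, offset, key, running_total):
        self.rows[offset] = (key, running_total)


def run(log, sink, checkpoint, stop_before=None, checkpoint_every=2):
    # Restore the restart position and the computed state together.
    offset = checkpoint["offset"]
    state = copy.deepcopy(checkpoint["state"])
    while offset < len(log):
        if stop_before is not None and offset == stop_before:
            return checkpoint  # simulated crash: progress since the checkpoint is lost
        key, amount = log[offset]
        state[key] = state.get(key, 0) + amount
        sink.write(offset, key, state[key])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = {"offset": offset, "state": copy.deepcopy(state)}
    return {"offset": offset, "state": state}


sink = IdempotentSink()
# First attempt crashes just before offset 3; the last checkpoint was taken at offset 2.
recovered = run(LOG, sink, {"offset": 0, "state": {}}, stop_before=3)
# Second attempt resumes from the checkpoint and replays offset 2.
final = run(LOG, sink, recovered)
print(final["state"], len(sink.rows))
```

Offset 2 is processed twice, once before the crash and once on replay. The running totals stay correct because state is restored from the checkpoint rather than carried over, and the sink holds exactly one row per offset because the replayed write overwrites the earlier one.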


Real-world scale · 2004

Google MapReduce: processing the entire web with commodity hardware

In 2004, Google engineers Jeff Dean and Sanjay Ghemawat published "MapReduce: Simplified Data Processing on Large Clusters." The paper described a programming model that decomposed large computation tasks into two phases: a Map phase that processes each input record independently and emits key-value pairs, and a Reduce phase that aggregates all values for each unique key.

The insight was that most large-scale data transformations could be expressed in this pattern, and that pattern was trivially parallelisable. Counting word frequencies across a billion documents: each Map function processes one document and emits (word, 1) pairs; each Reduce function sums the counts for one word. With 10,000 machines, the job runs in minutes instead of months.
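The word-count example can be sketched in a few lines of single-process Python. This is an illustration of the programming model only, not Google's implementation or Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are chosen for this sketch.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    # Each document is processed independently, so documents can be
    # spread across any number of machines.
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    # Group all values by key so each reducer sees one word's counts.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    # Each key is reduced independently of every other key.
    return (word, sum(counts))


def word_count(documents: list[str]) -> dict[str, int]:
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


print(word_count(["the cat", "the dog the"]))
```

Neither `map_phase` nor `reduce_phase` shares state with any other call, which is what makes the pattern parallelisable: the framework only has to handle the grouping step in between.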

MapReduce became the foundation of Hadoop, which became the foundation of the entire big data ecosystem. Although the original MapReduce framework has largely been replaced by Apache Spark and cloud-native query engines, the shared-nothing, partition-parallel architecture it introduced is still the underlying model of every major data platform today.

Google needed to index billions of web pages. A single powerful machine could not hold the web. MapReduce split the problem into independent units that could run in parallel on thousands of commodity machines. What fundamental insight does this encode about large-scale data processing?

With the learning outcomes established, this module begins by examining shared-nothing architecture and Apache Spark in depth.

24.1 Shared-nothing architecture and Apache Spark

Shared-nothing architecture distributes data and computation across nodes that do not share memory or storage. Each node processes its local partition independently, communicating only to shuffle data between partitions (for joins and aggregations). This architecture scales close to linearly: doubling the number of nodes approximately doubles throughput, because each additional node handles an additional partition. Shuffle traffic and skewed partitions are what keep real workloads from scaling perfectly.
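A small sketch of the partition-then-aggregate pattern, run sequentially in one process for clarity. The function names are hypothetical, and the choice of CRC32 as the partitioning hash is an assumption made so that the example is deterministic; real engines use their own partitioners.

```python
import zlib
from collections import Counter


def node_for(key: str, num_nodes: int) -> int:
    # A stable hash sends every record with the same key to the same node.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def shuffle_by_key(records, num_nodes):
    # The shuffle is the only step where data moves between nodes.
    partitions = [[] for _ in range(num_nodes)]
    for key, value in records:
        partitions[node_for(key, num_nodes)].append((key, value))
    return partitions


def aggregate_locally(partition):
    # Each node works only on its own partition; no shared memory or storage.
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def distributed_sum(records, num_nodes):
    partitions = shuffle_by_key(records, num_nodes)
    per_node = [aggregate_locally(p) for p in partitions]  # parallel on a real cluster
    # After the shuffle no key spans two nodes, so results merge without conflicts.
    merged = {}
    for result in per_node:
        merged.update(result)
    return merged


records = [("uk", 1), ("us", 2), ("uk", 3), ("de", 4)]
print(distributed_sum(records, num_nodes=3))
```

The result is the same for any number of nodes, which is the property that lets a cluster grow without changing the job.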

Apache Spark replaced Hadoop MapReduce by keeping intermediate results in memory (RAM) rather than writing them to disk between each computation step. For iterative algorithms like machine learning training, where the same data is processed many times, this in-memory execution can reduce job time by 10-100x compared to disk-bound MapReduce. Spark introduced resilient distributed datasets (RDDs) and later DataFrames and Datasets, which provide a tabular, SQL-friendly API while retaining the underlying parallel execution model.

Spark's lazy evaluation model builds a directed acyclic graph (DAG) of transformations before executing any computation. When an action (such as count or write) is triggered, the Catalyst optimiser rewrites the query plan for DataFrame and SQL operations to eliminate redundant steps, push filters as early as possible, and select efficient physical execution plans. This query optimisation is why idiomatic Spark code (using the DataFrame API, not RDDs) typically outperforms hand-tuned lower-level code: RDD operations are opaque functions that the optimiser cannot inspect or reorder.
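The idea of recording a plan first and optimising it only when an action runs can be shown with a toy class. This is not Spark's API and not how Catalyst is implemented: `LazyFrame`, `with_column`, `filter`, and `optimised_plan` are hypothetical names, and the only rewrite rule sketched here is moving a filter ahead of a derived column that the filter does not read.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame."""

    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []

    def with_column(self, name, fn):
        # Transformations only extend the plan; nothing is computed yet.
        return LazyFrame(self.rows, self.plan + [("with_column", name, fn)])

    def filter(self, column, predicate):
        return LazyFrame(self.rows, self.plan + [("filter", column, predicate)])

    def optimised_plan(self):
        # Push each filter earlier, past any derived column it does not depend on.
        plan = list(self.plan)
        moved = True
        while moved:
            moved = False
            for i in range(1, len(plan)):
                prev, step = plan[i - 1], plan[i]
                if step[0] == "filter" and prev[0] == "with_column" and prev[1] != step[1]:
                    plan[i - 1], plan[i] = step, prev
                    moved = True
        return plan

    def collect(self):
        # The action: only now is the (optimised) plan executed.
        rows = [dict(r) for r in self.rows]
        for kind, name, fn in self.optimised_plan():
            if kind == "filter":
                rows = [r for r in rows if fn(r[name])]
            else:
                for r in rows:
                    r[name] = fn(r)
        return rows


calls = []

def expensive_double(row):
    calls.append(row["amount"])  # records how often the costly step runs
    return row["amount"] * 2


data = [
    {"country": "uk", "amount": 1},
    {"country": "us", "amount": 2},
    {"country": "uk", "amount": 3},
]
frame = LazyFrame(data).with_column("doubled", expensive_double).filter(
    "country", lambda c: c == "uk"
)
calls_before_collect = len(calls)  # still zero: building the plan ran nothing
result = frame.collect()
print(result, len(calls))
```

Although the code asks for the derived column before the filter, the expensive function runs only for the rows that survive the filter. An optimiser can make that swap because it can see which column each step reads and writes, which is the advantage a declarative API has over opaque functions.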

With an understanding of shared-nothing architecture and Apache Spark in place, the discussion can now turn to cloud data platforms (Snowflake, BigQuery, and Databricks), which build directly on these foundations.

Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. When running on large datasets, the in-memory advantage is decisive for iterative algorithms.

Matei Zaharia et al., 'Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing', NSDI (2012)

The performance advantage of in-memory execution is not just about speed: it makes entire categories of algorithm (iterative ML training, graph algorithms, interactive analytics) practical at scale. Hadoop MapReduce made batch analytics possible at petabyte scale; Spark made interactive and iterative analytics practical at the same scale.


24.2 Cloud data platforms: Snowflake, BigQuery, and Databricks

Snowflake (launched 2014) introduced the multi-cluster, shared data architecture. Storage and compute are separated: data is stored in cloud object storage (S3, Azure Blob, GCS); compute clusters (virtual warehouses) are spun up on demand, independently sized, and can be paused when not in use. Multiple warehouses can query the same data simultaneously without contention. This separation means storage cost is always on (data must persist) but compute cost is only incurred while a warehouse is running.

Google BigQuery (launched 2010, publicly available 2012) is a serverless query engine: there are no clusters to manage. Users submit SQL queries and, under on-demand pricing, are billed per terabyte of data scanned. BigQuery's Dremel execution engine distributes queries across thousands of servers automatically. The trade-off is that query cost is proportional to data scanned, which incentivises partitioning and clustering tables to reduce scan scope.
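The cost effect of partitioning is simple arithmetic, sketched below. The price of $5 per terabyte, the use of decimal terabytes, the table layout, and the function names are all assumptions made for illustration; actual pricing and billing units should be taken from the provider's current documentation.

```python
from datetime import date, timedelta

# Assumed on-demand price, for illustration only.
PRICE_PER_TB_USD = 5.0


def bytes_scanned(partitions, start, end, pruned=True):
    """Bytes read for a query over [start, end] on a table of daily partitions."""
    if not pruned:
        # Without partition pruning, every partition is read.
        return sum(partitions.values())
    return sum(size for day, size in partitions.items() if start <= day <= end)


def scan_cost_usd(num_bytes, price_per_tb=PRICE_PER_TB_USD):
    return num_bytes / 1e12 * price_per_tb


# Hypothetical table: 100 daily partitions of 10 GB each (1 TB in total).
today = date(2024, 4, 9)
table = {today - timedelta(days=i): 10e9 for i in range(100)}
week_start = today - timedelta(days=6)

full_scan = scan_cost_usd(bytes_scanned(table, week_start, today, pruned=False))
pruned_scan = scan_cost_usd(bytes_scanned(table, week_start, today))
print(f"full scan: ${full_scan:.2f}, pruned scan: ${pruned_scan:.2f}")
```

The query is the same in both cases; only the table layout changes how many bytes the engine has to read, and therefore what the query costs.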

Databricks (founded 2013, based on Apache Spark) unifies batch processing, streaming, and machine learning in a single platform. Databricks Runtime optimises Spark performance with proprietary improvements to the Catalyst optimiser and Delta Engine. It is a common choice for organisations whose workloads blend data engineering (ETL), data science (ML training), and SQL analytics in a single environment.

With an understanding of cloud data platforms (Snowflake, BigQuery, and Databricks) in place, the discussion can now turn to the data lakehouse (Delta Lake and Apache Iceberg), which builds directly on these foundations.

Common misconception

Cloud data warehouses like Snowflake and BigQuery are just hosted databases and offer no architectural advantages over on-premise SQL servers.

Cloud warehouses differ architecturally from traditional RDBMS in several critical ways: (1) compute/storage separation allows independent scaling, which reduces the need to over-provision; (2) columnar storage with vectorised execution can process analytical queries 10-100x faster than row-store engines; (3) massively parallel query execution spreads large scans across hundreds of nodes; (4) serverless or auto-scaling compute reduces idle capacity cost. As an illustration, a large analytical query that takes 45 minutes on an on-premise SQL Server might run in about 8 seconds on BigQuery at a fraction of the total cost.

24.3 The data lakehouse: Delta Lake and Apache Iceberg

Data lakes (raw files in object storage, schema-on-read) offered cheap scalable storage but no ACID guarantees, no schema enforcement, and poor query performance on small files. Data warehouses offered ACID and performance but were expensive and required schema-on-write. The lakehouse pattern combines cheap open-format storage with warehouse-quality reliability by adding a transactional metadata layer over the data lake.

Delta Lake (Databricks, open-sourced 2019) adds a transaction log to Parquet files stored in object storage. Every write creates a new log entry describing the files added or removed. This provides ACID transactions (readers never see partial writes), time travel (querying the state of the table at an earlier version or timestamp, within the retained history), schema evolution (adding columns without rewriting history), and efficient upserts and deletes. Delta Lake is the foundation of the Databricks Lakehouse Platform.
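The way a log of added and removed files yields time travel can be sketched with a toy class. This is not Delta Lake's log format or API: `ToyLogTable`, its methods, and the file names are hypothetical, and the real protocol also handles concurrent writers, schema, and checkpoints of the log itself.

```python
class ToyLogTable:
    """A table defined entirely by an append-only log of file changes."""

    def __init__(self):
        self.log = []  # one entry per commit: {"add": [...], "remove": [...]}

    def commit(self, add=(), remove=()):
        # A commit becomes visible all at once when its entry is appended.
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1  # the new version number

    def files_at(self, version=None):
        # Replay the log up to the requested version to find the live files.
        if version is None:
            version = len(self.log) - 1
        live = set()
        for entry in self.log[: version + 1]:
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live

    def restore(self, version):
        # Restoring is itself a new commit, so history is never rewritten.
        target, current = self.files_at(version), self.files_at()
        return self.commit(add=target - current, remove=current - target)


table = ToyLogTable()
v0 = table.commit(add=["part-000.parquet"])
v1 = table.commit(add=["part-001.parquet"])
v2 = table.commit(add=["part-002-bad.parquet"], remove=["part-001.parquet"])  # buggy write
before_bug = table.files_at(v1)  # time travel: the table as of version 1
v3 = table.restore(v1)
print(sorted(table.files_at()))
```

Because old data files are not deleted when a commit removes them from the live set, an earlier version can be read or restored without any separate backup. The bad version remains queryable afterwards, which helps when investigating what went wrong.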

Apache Iceberg (created at Netflix, an Apache top-level project since 2020) solves the same problems with a different architecture: a tree of metadata (a table metadata file pointing to manifest lists, which point to manifests, which point to data files) rather than a sequential log. Iceberg's hidden partitioning allows queries to scan only relevant partitions without requiring the user to specify the partition scheme in the query. Iceberg is engine-agnostic: it works with Spark, Flink, Trino, and BigQuery. It has become a widely adopted open alternative to Delta Lake in vendor-neutral architectures.
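Hidden partitioning can be illustrated with a toy table that derives the partition value from a timestamp column on write and uses it to prune on read. This is a sketch of the idea only, not Iceberg's metadata format or API: `ToyPartitionedTable`, `day_transform`, and the one-list-per-day layout are hypothetical simplifications.

```python
from datetime import datetime, date


def day_transform(ts: datetime) -> date:
    # The partition value is derived from the data; users never supply it.
    return ts.date()


class ToyPartitionedTable:
    def __init__(self):
        # Stands in for manifests: partition value -> rows in that "data file".
        self.files = {}

    def append(self, row):
        self.files.setdefault(day_transform(row["ts"]), []).append(row)

    def scan(self, ts_from: datetime, ts_to: datetime):
        # The query filters on the timestamp column only. The planner maps
        # that range to partition values and skips files outside it.
        lo, hi = day_transform(ts_from), day_transform(ts_to)
        files_read = [day for day in self.files if lo <= day <= hi]
        rows = [
            row
            for day in files_read
            for row in self.files[day]
            if ts_from <= row["ts"] <= ts_to
        ]
        return rows, len(files_read)


events = ToyPartitionedTable()
events.append({"ts": datetime(2024, 1, 1, 9, 0), "event": "login"})
events.append({"ts": datetime(2024, 1, 2, 14, 30), "event": "purchase"})
events.append({"ts": datetime(2024, 1, 3, 8, 15), "event": "logout"})

rows, files_read = events.scan(datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 23, 59))
print([r["event"] for r in rows], files_read)
```

The caller never mentions a partition column, yet only one of the three files is read. Keeping the partition scheme in table metadata rather than in queries also means the scheme can change later without rewriting the queries.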

24.4 Check your understanding

A data engineering team runs a nightly Spark job that reads 2TB of raw events, applies 12 sequential transformations, and writes aggregated results to a data warehouse. The job takes 4 hours and the team needs to reduce it to under 30 minutes. Which architectural change offers the greatest benefit?

A team using a data lake (CSV files in S3) discovers that a pipeline bug wrote incorrect customer records last Tuesday. They need to: (1) view the table as it was before the bad write, and (2) restore the correct state. Which table format supports this without manual backup restoration?

A BigQuery table containing 10 years of web clickstream data (5TB total) is queried for events from the last 7 days. A query engineer notices that every query scans the full 5TB and costs $25 per run, even though only 50GB of recent data is relevant. What is the most effective fix?


Key takeaways

  • Shared-nothing architecture distributes data and computation across independent nodes, each processing its own partition. This scales close to linearly and is the underlying model of every major data platform (Spark, Snowflake, BigQuery, Databricks).
  • Apache Spark replaced Hadoop MapReduce by executing in memory rather than writing to disk between steps, providing speedups of up to 10-100x for iterative algorithms. The DataFrame API enables Catalyst query optimisation automatically.
  • Cloud warehouses (Snowflake, BigQuery, Databricks) separate compute from storage, enabling independent scaling, serverless operation, and columnar execution across hundreds of nodes for analytical queries.
  • Data lakehouse formats (Delta Lake, Apache Iceberg) add ACID transactions, time travel, and schema evolution to raw Parquet files in cloud object storage, combining the cost efficiency of data lakes with the reliability of warehouses.

Standards and sources cited in this module

  1. Jeff Dean and Sanjay Ghemawat, 'MapReduce: Simplified Data Processing on Large Clusters', OSDI (2004)

    The founding paper for distributed data processing. Introduced the Map/Reduce programming model that underpins all subsequent big data frameworks.

  2. Matei Zaharia et al., 'Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing', NSDI (2012)

    The Spark paper introducing RDDs and in-memory distributed computation. Source for the 100x performance improvement claim over Hadoop MapReduce.

  3. Delta Lake documentation: ACID transactions and time travel

    Reference for Delta Lake's transaction log, time travel queries, schema evolution, and RESTORE TABLE commands.

  4. Apache Iceberg documentation

    Reference for Iceberg's snapshot model, hidden partitioning, and engine-agnostic open table format specification.

  5. Snowflake: Multi-cluster Shared Data Architecture (technical whitepaper)

    Explains the compute/storage separation, virtual warehouse scaling, and multi-tenant architecture that distinguishes Snowflake from traditional MPP warehouses.
