Module 24 of 26 · Practice & Strategy

Platforms and distributed systems

30 min read 3 outcomes Interactive + terminal 5 references

By the end of this module you will be able to:

Explain MapReduce and the shared-nothing architecture that enables petabyte-scale processing
Compare Snowflake, BigQuery, and Databricks as cloud data platform choices
Explain the open table formats Delta Lake and Apache Iceberg and their role in the data lakehouse pattern

Server racks in a large data centre representing distributed computing infrastructure at hyperscale

Real-world scale · 2004

Google MapReduce: processing the entire web with commodity hardware

In 2004, Google engineers Jeff Dean and Sanjay Ghemawat published "MapReduce: Simplified Data Processing on Large Clusters." The paper described a programming model that decomposed large computation tasks into two phases: a Map phase that processes each input record independently and emits key-value pairs, and a Reduce phase that aggregates all values for each unique key.

The insight was that most large-scale data transformations could be expressed in this pattern, and that pattern was trivially parallelisable. Counting word frequencies across a billion documents: each Map function processes one document and emits (word, 1) pairs; each Reduce function sums the counts for one word. With 10,000 machines, the job runs in minutes instead of months.

MapReduce became the foundation of Hadoop, which became the foundation of the entire big data ecosystem. Although the original MapReduce framework has largely been replaced by Apache Spark and cloud-native query engines, the shared-nothing, partition-parallel architecture it introduced is still the underlying model of every major data platform today.

Google needed to index billions of web pages. A single powerful machine could not hold the web. MapReduce split the problem into independent units that could run in parallel on thousands of commodity machines. What fundamental insight does this encode about large-scale data processing?

With the learning outcomes established, this module begins by examining shared-nothing architecture and apache spark in depth.

24.1 Shared-nothing architecture and Apache Spark

Shared-nothing architecture distributes data and computation across nodes that do not share memory or storage. Each node processes its local partition independently, communicating only to shuffle data between partitions (for joins and aggregations). This architecture scales linearly: doubling the number of nodes approximately doubles throughput, because each additional node handles an additional partition.

Apache Spark replaced Hadoop MapReduce by keeping intermediate results in memory (RAM) rather than writing them to disk between each computation step. For iterative algorithms like machine learning training, where the same data is processed many times, this in-memory execution reduces job time by 10-100x compared to disk-bound MapReduce. Spark introduced resilient distributed datasets (RDDs) and later DataFrames and Datasets, which provide a columnar, SQL-friendly API while retaining the underlying parallel execution model.

Spark's lazy evaluation model builds a directed acyclic graph (DAG) of transformations before executing any computation. When an action (such as count or write) is triggered, the Catalyst optimiser rewrites the DAG to eliminate redundant steps, push filters as early as possible, and select efficient physical execution plans. This query optimisation is why writing idiomatic Spark code (using the DataFrame API, not RDDs) consistently outperforms manually optimised lower-level code.

With an understanding of shared-nothing architecture and apache spark in place, the discussion can now turn to cloud data platforms: snowflake, bigquery, and databricks, which builds directly on these foundations.

“Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. When running on large datasets, the in-memory advantage is decisive for iterative algorithms.”
Matei Zaharia et al., 'Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing', NSDI (2012)
The performance advantage of in-memory execution is not just about speed: it makes entire categories of algorithm (iterative ML training, graph algorithms, interactive analytics) practical at scale. Hadoop MapReduce made batch analytics possible at petabyte scale; Spark made interactive and iterative analytics practical at the same scale.

Loading interactive component...

24.2 Cloud data platforms: Snowflake, BigQuery, and Databricks

Snowflake (launched 2014) introduced the multi-cluster, shared data architecture. Storage and compute are separated: data is stored in cloud object storage (S3, Azure Blob, GCS); compute clusters (virtual warehouses) are spun up on demand, independently sized, and can be paused when not in use. Multiple warehouses can query the same data simultaneously without contention. This separation means storage cost is always on (data must persist) but compute cost is only incurred when queries run.

Google BigQuery (launched 2010, publicly available 2012) is a serverless query engine: there are no clusters to manage. Users submit SQL queries and are billed per terabyte of data scanned. BigQuery's Dremel execution engine distributes queries across thousands of servers automatically. The trade-off is that query cost is proportional to data scanned, which incentivises partitioning and clustering tables to reduce scan scope.

Databricks (founded 2013, based on Apache Spark) unifies batch processing, streaming, and machine learning in a single platform. Databricks Runtime optimises Spark performance with proprietary improvements to the Catalyst optimiser and Delta Engine. It is the platform of choice for organisations whose workloads blend data engineering (ETL), data science (ML training), and SQL analytics in a single environment.

With an understanding of cloud data platforms: snowflake, bigquery, and databricks in place, the discussion can now turn to the data lakehouse: delta lake and apache iceberg, which builds directly on these foundations.

Common misconception

“Cloud data warehouses like Snowflake and BigQuery are just hosted databases and offer no architectural advantages over on-premise SQL servers.”

Cloud warehouses differ architecturally from traditional RDBMS in several critical ways: (1) compute/storage separation allows independent scaling, so you never over-provision; (2) columnar storage with vectorised execution processes analytical queries 10-100x faster than row-store engines; (3) massively parallel query execution across hundreds of nodes for large scans; (4) serverless or auto-scaling compute eliminates idle capacity cost. A query that takes 45 minutes on an on-premise SQL Server can run in 8 seconds on BigQuery at a fraction of the total cost.

24.3 The data lakehouse: Delta Lake and Apache Iceberg

Data lakes (raw files in object storage, schema-on-read) offered cheap scalable storage but no ACID guarantees, no schema enforcement, and poor query performance on small files. Data warehouses offered ACID and performance but were expensive and required schema-on-write. The lakehouse pattern combines cheap open-format storage with warehouse-quality reliability by adding a transactional metadata layer over the data lake.

Delta Lake (Databricks, open-sourced 2019) adds a transaction log to Parquet files stored in object storage. Every write creates a new log entry describing the files added or removed. This provides ACID transactions (readers never see partial writes), time travel (querying the state of the table at any past timestamp), schema evolution (adding columns without rewriting history), and efficient upserts and deletes. Delta Lake is the foundation of the Databricks Lakehouse Platform.

Apache Iceberg (Netflix, Apache project since 2020) solves the same problems with a different architecture: a tree of metadata files (manifest lists, manifests, and data files) rather than a sequential log. Iceberg's hidden partitioning allows queries to scan only relevant partitions without requiring the user to specify the partition scheme in the query. Iceberg is engine-agnostic: it works with Spark, Flink, Trino, and BigQuery. It has become the open standard alternative to Delta Lake in vendor-neutral architectures.

Abstract data flow visualisation representing distributed data pipelines moving through a lakehouse architecture — The data lakehouse combines the cost efficiency of cloud object storage with the reliability guarantees of a data warehouse. Delta Lake and Iceberg provide ACID transactions, time travel, and schema evolution on top of Parquet files in S3 or GCS.

24.4 Check your understanding

A data engineering team runs a nightly Spark job that reads 2TB of raw events, applies 12 sequential transformations, and writes aggregated results to a data warehouse. The job takes 4 hours and the team needs to reduce it to under 30 minutes. Which architectural change offers the most use?

A team using a data lake (CSV files in S3) discovers that a pipeline bug wrote incorrect customer records last Tuesday. They need to: (1) view the table as it was before the bad write, and (2) restore the correct state. Which table format supports this without manual backup restoration?

A BigQuery table containing 10 years of web clickstream data (5TB total) is queried for events from the last 7 days. A query engineer notices that every query scans the full 5TB and costs $25 per run, even though only 50GB of recent data is relevant. What is the most effective fix?

Loading interactive component...

Key takeaways

Shared-nothing architecture distributes data and computation across independent nodes, each processing its own partition. This scales linearly and is the underlying model of every major data platform (Spark, Snowflake, BigQuery, Databricks).
Apache Spark replaced Hadoop MapReduce by executing in memory rather than writing to disk between steps, providing 10-100x speedup for iterative algorithms. The DataFrame API enables Catalyst query optimisation automatically.
Cloud warehouses (Snowflake, BigQuery, Databricks) separate compute from storage, enabling independent scaling, serverless operation, and columnar execution across hundreds of nodes for analytical queries.
Data lakehouse formats (Delta Lake, Apache Iceberg) add ACID transactions, time travel, and schema evolution to raw Parquet files in cloud object storage, combining the cost efficiency of data lakes with the reliability of warehouses.

Standards and sources cited in this module

Jeff Dean and Sanjay Ghemawat, 'MapReduce: Simplified Data Processing on Large Clusters', OSDI (2004)
The founding paper for distributed data processing. Introduced the Map/Reduce programming model that underpins all subsequent big data frameworks.
Matei Zaharia et al., 'Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing', NSDI (2012)
The Spark paper introducing RDDs and in-memory distributed computation. Source for the 100x performance improvement claim over Hadoop MapReduce.
Delta Lake documentation: ACID transactions and time travel
Reference for Delta Lake's transaction log, time travel queries, schema evolution, and RESTORE TABLE commands.
Apache Iceberg documentation
Reference for Iceberg's snapshot model, hidden partitioning, and engine-agnostic open table format specification.
Snowflake: Multi-cluster Shared Data Architecture (technical whitepaper)
Explains the compute/storage separation, virtual warehouse scaling, and multi-tenant architecture that distinguishes Snowflake from traditional MPP warehouses.

Back: Advanced analytics and inference Next: Governance, regulation, and compliance

Module 24 of 26 · Practice & Strategy