Module 12 of 26 · Applied

Architectures and pipelines

30 min read · 4 outcomes · Interactive ETL/ELT animator + drag challenge · 5 standards cited

By the end of this module you will be able to:

Distinguish ETL from ELT and explain when each is appropriate
Compare batch and streaming processing for a given latency requirement
Describe the key architectural difference between a data warehouse and a data lake
Identify the role of each component in a modern data stack

Streaming service interface on a screen, representing the data architecture behind content delivery

Real-world scale · ongoing

Netflix processes 500 billion events per day. Not all of them need the same speed.

Netflix processes over 500 billion events per day from user interactions: plays, pauses, searches, ratings, and scroll behaviour. Billing aggregations run as nightly batch jobs. Recommendation updates run in near real time through Apache Kafka streams.

The architecture choice determines what is possible. The Foundations stage covered what data is and how it is governed. This module opens the Applied stage by examining how data moves through systems at scale.

Billing aggregations can wait hours. But when a user finishes an episode, the next recommendation must appear in seconds. How does architecture make both possible?

A data pipeline is a sequence of automated processes that moves data from source systems to a destination, applying transformations along the way. Every pipeline must answer three questions: where does the data come from, what processing is required, and where does it need to go?
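The three questions can be sketched as three pluggable functions. This is a minimal illustration, not a production design: the names run_pipeline, extract_events, and normalise are hypothetical, and the in-memory list stands in for a real destination.

```python
from typing import Callable, Iterable, Iterator

Record = dict


def run_pipeline(
    extract: Callable[[], Iterable[Record]],   # where does the data come from?
    transform: Callable[[Record], Record],     # what processing is required?
    load: Callable[[Record], None],            # where does it need to go?
) -> int:
    """Move every record from source to destination and return the count."""
    count = 0
    for record in extract():
        load(transform(record))
        count += 1
    return count


def extract_events() -> Iterator[Record]:
    """Hypothetical source: two playback events held in memory."""
    yield {"user": "u1", "event": "PLAY"}
    yield {"user": "u2", "event": "pause"}


def normalise(record: Record) -> Record:
    """Example transformation: lower-case the event name."""
    return {**record, "event": record["event"].lower()}


destination = []  # stand-in for a warehouse table or lake path
loaded = run_pipeline(extract_events, normalise, destination.append)
print(loaded, destination)
```

Swapping any one of the three functions changes the pipeline without touching the other two, which is why the three questions are usually answered separately.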

With the learning outcomes established, this module begins by examining ETL versus ELT in depth.

12.1 ETL versus ELT

ETL (Extract, Transform, Load) transforms data outside the destination in a separate processing layer, then loads clean data. ELT (Extract, Load, Transform) loads raw data first, then transforms inside the destination using its compute power.

ETL dominated when warehouse compute was expensive. ELT emerged because cloud warehouses like BigQuery, Snowflake, and Redshift offer elastic compute that makes in-warehouse transformation faster and cheaper than external processing. dbt (data build tool) is the dominant ELT transformation tool, using SQL models versioned in Git.
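The difference in ordering can be shown with a small sketch. It assumes an in-memory SQLite database as a stand-in for a cloud warehouse, and the table names payments and raw_payments are hypothetical. Both functions produce the same clean table; what differs is where the transformation runs.

```python
import sqlite3

# Hypothetical raw rows: untrimmed amounts and inconsistent currency casing.
raw_rows = [("t1", " 12.50 ", "gbp"), ("t2", "7.00", "GBP")]


def etl(rows):
    """ETL: transform in Python first, then load only the clean rows."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE payments (id TEXT, amount REAL, currency TEXT)")
    clean = [(i, float(a.strip()), c.upper()) for i, a, c in rows]
    warehouse.executemany("INSERT INTO payments VALUES (?, ?, ?)", clean)
    return warehouse


def elt(rows):
    """ELT: load raw rows untouched, then transform with SQL in the warehouse."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE raw_payments (id TEXT, amount TEXT, currency TEXT)")
    warehouse.executemany("INSERT INTO raw_payments VALUES (?, ?, ?)", rows)
    # This SELECT plays the role that a SQL model would play in a tool like dbt.
    warehouse.execute(
        """
        CREATE TABLE payments AS
        SELECT id,
               CAST(TRIM(amount) AS REAL) AS amount,
               UPPER(currency) AS currency
        FROM raw_payments
        """
    )
    return warehouse


query = "SELECT id, amount, currency FROM payments ORDER BY id"
etl_result = etl(raw_rows).execute(query).fetchall()
elt_result = elt(raw_rows).execute(query).fetchall()
print(etl_result == elt_result)  # True
```

In the ELT version the raw table is still available after the transformation, so a model can be rewritten and rerun without extracting the data from the source again.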

“Data architecture defines the blueprint for managing data assets by aligning with organisational strategy to establish strategic data requirements and designs to meet those requirements.”
DAMA-DMBOK2 (2017) - Chapter 4, Data Architecture
DAMA frames architecture as strategy-driven, not technology-driven. The choices between ETL and ELT, batch and streaming, and warehouse and lakehouse all depend on what the organisation needs from its data, not on which technology is newest.

Common misconception

“ELT is always better than ETL because it uses the warehouse's compute.”

ELT is more efficient when the destination has elastic compute (cloud warehouses). But ETL remains appropriate when transformation requires external libraries (Python ML models, geospatial processing), when data must be cleansed before entering a regulated destination, or when the destination has limited compute capacity. The choice is architectural, not dogmatic.
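The conditions in this paragraph can be written down as a simple rule of thumb. The sketch below is illustrative only: PipelineRequirements and choose_pattern are hypothetical names, and a real decision would weigh cost, team skills, and existing tooling as well.

```python
from dataclasses import dataclass


@dataclass
class PipelineRequirements:
    needs_external_libraries: bool         # e.g. Python ML or geospatial code
    must_cleanse_before_load: bool         # e.g. a regulated destination
    destination_has_elastic_compute: bool  # e.g. a cloud warehouse


def choose_pattern(req: PipelineRequirements) -> str:
    """Return "ETL" or "ELT" using only the three conditions named above."""
    if req.needs_external_libraries or req.must_cleanse_before_load:
        return "ETL"
    if not req.destination_has_elastic_compute:
        return "ETL"
    return "ELT"


cloud_sql_only = PipelineRequirements(False, False, True)
geospatial_job = PipelineRequirements(True, False, True)
print(choose_pattern(cloud_sql_only), choose_pattern(geospatial_job))  # ELT ETL
```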

With an understanding of ETL versus ELT in place, the discussion can now turn to batch versus streaming, which builds directly on these foundations.

Data pipelines run on infrastructure like this. The choice between batch and streaming processing determines how quickly data reaches analysts and applications.

12.2 Batch versus streaming

Batch processing collects data over a period (hourly, daily, weekly), then processes it all at once. Streaming processing handles data continuously as it arrives, event by event. The trade-off is latency versus complexity.
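A toy comparison makes the trade-off concrete. The sketch assumes a hypothetical list of (user, minutes watched) events; batch_totals and StreamingTotals are illustrative names, not part of any framework.

```python
from collections import defaultdict

# Hypothetical events: (user, minutes watched).
events = [("u1", 3), ("u2", 5), ("u1", 4)]


def batch_totals(collected):
    """Batch: wait until the period ends, then process everything at once."""
    totals = defaultdict(int)
    for user, minutes in collected:
        totals[user] += minutes
    return dict(totals)


class StreamingTotals:
    """Streaming: keep state and update it as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, user, minutes):
        self.totals[user] += minutes
        return self.totals[user]  # a fresh answer is available immediately


stream = StreamingTotals()
running = [stream.on_event(user, minutes) for user, minutes in events]

print(batch_totals(events))  # {'u1': 7, 'u2': 5}
print(running)               # [3, 5, 7]
```

Both approaches reach the same totals. The streaming version has an answer after every event, but it has to hold state between events, which is where the extra complexity comes from.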

Batch is simpler, cheaper, and sufficient for most analytical workloads. Streaming is necessary when decisions depend on fresh data: fraud detection (seconds matter), recommendation engines (the next suggestion must appear before the user leaves), and real-time dashboards (operations centres monitoring live systems).

Apache Kafka is the most widely used event streaming platform. A well-provisioned cluster can ingest millions of events per second. Apache Flink and Spark Structured Streaming process those events with transformations and aggregations.

“Use streaming when the business value of low-latency data justifies the operational complexity. Use batch for everything else.”
AWS Well-Architected Framework, Data Analytics Lens (2023) - Design Principle: Right-size data processing
AWS's guidance reflects industry consensus: streaming adds operational complexity (exactly-once semantics, out-of-order events, backpressure handling). Unless the use case genuinely requires sub-minute latency, batch processing is simpler and more cost-effective.
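Two of these complications, duplicate delivery and out-of-order events, can be illustrated in plain Python. The sketch below is a simplification under stated assumptions: events are hypothetical (event_id, event_time_in_seconds) pairs, deduplication by ID stands in for the much more involved exactly-once machinery of real engines, and there is no watermark to decide when a window is final.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size


def window_start(event_time: int) -> int:
    """Map an event time to the start of its fixed-size window."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_per_window(events):
    """Count unique events per event-time window, ignoring arrival order."""
    seen_ids = set()
    counts = defaultdict(int)
    for event_id, event_time in events:
        if event_id in seen_ids:
            continue  # duplicate delivery: processing it again would double count
        seen_ids.add(event_id)
        # Group by when the event happened, not by when it arrived.
        counts[window_start(event_time)] += 1
    return dict(counts)


# "c" arrives after "b" although it happened earlier; "b" is delivered twice.
events = [("a", 5), ("b", 65), ("c", 59), ("b", 65), ("d", 120)]
print(count_per_window(events))  # {0: 2, 60: 1, 120: 1}
```

A batch job sidesteps most of this because it sees the complete, settled data set before it starts. A streaming job must make the same decisions while data is still arriving.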

With an understanding of batch versus streaming in place, the discussion can now turn to where the data is stored: warehouse, lake, and lakehouse.

Real-time dashboards require streaming pipelines. Most analytical reporting works perfectly well with batch processing on hourly or daily schedules.

12.3 Warehouse, lake, and lakehouse

A data warehouse (Snowflake, BigQuery, Redshift) stores structured, schema-enforced data optimised for SQL analytics. It provides fast query performance but requires data to be modelled before loading.

A data lake (S3, Azure Data Lake, GCS) stores raw data in any format: structured, semi-structured, and unstructured. It provides flexibility but risks becoming a "data swamp" without governance, cataloguing, and quality controls.

A data lakehouse (for example the Databricks Lakehouse, built on open table formats such as Delta Lake and Apache Iceberg) combines both: open file formats stored in a lake with warehouse-like features (ACID transactions, schema enforcement, time-travel queries). The lakehouse pattern emerged around 2020 and has become a widely adopted architecture for organisations that need both analytical and machine learning workloads.
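Time travel is easier to picture with a toy model. The class below is a hypothetical illustration of the idea only: each commit produces a new immutable snapshot, and a reader can ask for any earlier version. Real table formats achieve this with metadata and transaction logs over files, not with in-memory tuples.

```python
class VersionedTable:
    """Toy table in which every commit creates a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # version 0 is the empty table

    def commit(self, new_rows) -> int:
        """Append rows as one all-or-nothing step and return the new version."""
        self._snapshots.append(self._snapshots[-1] + tuple(new_rows))
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one by version number."""
        if version is None:
            version = len(self._snapshots) - 1
        return list(self._snapshots[version])


table = VersionedTable()
v1 = table.commit([{"id": 1}])
v2 = table.commit([{"id": 2}])

print(table.read())    # [{'id': 1}, {'id': 2}]
print(table.read(v1))  # [{'id': 1}]
```

Because old snapshots are never modified, a report can be reproduced against the data exactly as it stood at an earlier version.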

Common misconception

“A data lake is just a cheap data warehouse.”

A data lake stores raw, unstructured, and semi-structured data that a traditional warehouse handles poorly or not at all (images, log files, sensor data, JSON documents). The key difference is schema enforcement: warehouses require schema-on-write (define the structure before loading), while lakes allow schema-on-read (interpret the structure when querying). A lakehouse adds warehouse-like governance to lake storage.
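The two schema strategies can be contrasted in a few lines. This is a sketch under assumptions: SCHEMA, write_with_schema, and read_with_schema are hypothetical names, a Python list stands in for a warehouse table, and a list of JSON strings stands in for files in a lake.

```python
import json

SCHEMA = {"sensor_id": str, "reading": float}  # assumed target structure


def write_with_schema(table, record):
    """Schema-on-write: validate before storing; bad records never get in."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: anything was stored; structure is applied when querying."""
    for line in raw_lines:
        doc = json.loads(line)
        try:
            yield {"sensor_id": str(doc["sensor_id"]), "reading": float(doc["reading"])}
        except (KeyError, TypeError, ValueError):
            continue  # this record does not fit the schema chosen for this query


warehouse_table = []
write_with_schema(warehouse_table, {"sensor_id": "s1", "reading": 21.5})

lake_lines = [
    '{"sensor_id": "s1", "reading": 21.5}',
    '{"sensor_id": "s2", "reading": "n/a"}',
    '{"note": "free text"}',
]
print(list(read_with_schema(lake_lines)))  # [{'sensor_id': 's1', 'reading': 21.5}]
```

The lake accepted all three lines and deferred the problem to query time. The warehouse would have rejected two of them at load time, which is the trade between flexibility and guaranteed structure.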


12.4 Check your understanding

A fintech startup uses BigQuery as its data warehouse. Raw transaction data lands in BigQuery via Fivetran connectors. A data engineer writes SQL transformations in dbt to create analytics-ready tables. Which pipeline pattern is this?

A logistics company needs to calculate optimal delivery routes. Route calculations require Python geospatial libraries (not SQL). The results feed into an operational database. Should this use ETL or ELT?

A bank runs nightly batch pipelines for regulatory reporting. The CFO asks whether switching to streaming would improve report accuracy. What is the best response?


Check your understanding

A retail company loads daily sales CSVs into a data warehouse using an ETL process. They want to add real-time inventory updates. Which architecture change is most appropriate?

Key takeaways

ETL transforms data outside the destination; ELT transforms inside. ELT dominates in cloud warehouses (BigQuery, Snowflake, Redshift) because elastic compute makes in-warehouse transformation faster and cheaper.
Batch processing handles data in scheduled chunks; streaming handles it continuously. Streaming adds complexity (exactly-once semantics, backpressure) and should only be used when sub-minute latency provides genuine business value.
Data warehouses enforce schema-on-write for structured analytics. Data lakes store any format with schema-on-read. Lakehouses (Delta Lake, Iceberg) combine both with ACID transactions and time-travel queries.
The modern data stack follows the pattern: source, ingest (Fivetran/Airbyte), warehouse (cloud), transform (dbt), serve (BI/ML), observe (quality monitoring).
Architecture choices should be driven by business requirements (latency, scale, cost), not by technology trends. Netflix uses both batch and streaming because different use cases demand different latencies.

Standards and sources cited in this module

DAMA-DMBOK2 (2017)
Chapter 4, Data Architecture
Defines data architecture as strategy-driven, not technology-driven. Provides the conceptual framework for pipeline and storage architecture decisions.
AWS Well-Architected Framework, Data Analytics Lens (2023)
Design Principles
Industry guidance on right-sizing data processing: use streaming only when latency justifies complexity.
Databricks, 'Lakehouse: A New Generation of Open Platforms' (2021)
Full paper
Academic paper introducing the lakehouse architecture combining lake storage flexibility with warehouse-like governance.
dbt Labs, 'What is dbt?' (2024)
Documentation
dbt is the dominant ELT transformation tool. Used in the terminal simulation and referenced throughout the ELT discussion.
Netflix Technology Blog, 'Evolution of the Netflix Data Pipeline' (2016)
Full post
Source for the 500 billion events/day figure and the batch/streaming architecture discussion in the opening case study.
