Module 12 of 26 · Applied

Architectures and pipelines

30 min read · 4 outcomes · Interactive ETL/ELT animator + drag challenge · 5 standards cited

By the end of this module you will be able to:

  • Distinguish ETL from ELT and explain when each is appropriate
  • Compare batch and streaming processing for a given latency requirement
  • Describe the key architectural difference between a data warehouse and a data lake
  • Identify the role of each component in a modern data stack

Pipeline as a chain of contracts from source to serve

A pipeline is a chain of explicit contracts from source through validate to serve, with quarantine and replay as a closed loop.

Figure: five cards, left to right, each tagged with a standard: 1 Source (system, file, event; ODCS v3.1), 2 Ingest (capture raw payload; Databricks Bronze), 3 Validate (schema, quality, freshness; UK GDQF), 4 Transform (clean, join, model; Databricks Silver), 5 Serve (API, dataset, report; Databricks Gold). Arrows: extract, land raw, approved rows, published contract. Callout: rejected rows route to quarantine with a reason and re-enter validate after repair; pipelines without this loop silently lose data.

A pipeline is a chain of explicit contracts, not a flat sequence of scripts. Source, ingest, validate, transform, and serve each carry their own contract; failed rows route to quarantine and, once repaired, replay back into validate. The Databricks medallion architecture maps onto the same shape: Bronze at ingest, Silver at transform, and Gold at serve.
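The quarantine-and-replay loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Quarantine` class, the `check_row` rules (an `order_id` must exist and `amount` must be numeric), and the `repair_amount` step are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Row = dict  # a raw record as landed by ingest


@dataclass
class Quarantine:
    """Holds rejected rows together with the reason they were rejected."""
    entries: list = field(default_factory=list)

    def add(self, row: Row, reason: str) -> None:
        self.entries.append((row, reason))

    def drain(self) -> list:
        """Hand back all quarantined entries and empty the store."""
        drained, self.entries = self.entries, []
        return drained


def check_row(row: Row) -> Optional[str]:
    """Hypothetical validate contract: a rejection reason, or None if approved."""
    if "order_id" not in row:
        return "missing order_id"
    if not isinstance(row.get("amount"), (int, float)):
        return "amount is not numeric"
    return None


def validate(rows: list, quarantine: Quarantine) -> list:
    """Pass approved rows on; route rejected rows to quarantine with a reason."""
    approved = []
    for row in rows:
        reason = check_row(row)
        if reason is None:
            approved.append(row)
        else:
            quarantine.add(row, reason)
    return approved


def replay(quarantine: Quarantine, repair: Callable[[Row, str], Row]) -> list:
    """Repair quarantined rows and send them back through validate."""
    repaired = [repair(row, reason) for row, reason in quarantine.drain()]
    return validate(repaired, quarantine)


def repair_amount(row: Row, reason: str) -> Row:
    """Hypothetical repair: parse string amounts, leave other rows untouched."""
    if reason == "amount is not numeric" and isinstance(row.get("amount"), str):
        return {**row, "amount": float(row["amount"])}
    return row


landed = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": "12.50"},  # repairable: amount arrived as text
    {"amount": 3.0},                     # repair_amount cannot fix a missing id
]
quarantine = Quarantine()
approved = validate(landed, quarantine)
recovered = replay(quarantine, repair_amount)
```

After the replay, the second row has rejoined the approved flow, while the third is back in quarantine with its reason attached. Nothing is silently dropped, which is the property the loop exists to guarantee.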

Medallion architecture promotion gates between layers

Medallion architecture moves data Bronze to Silver to Gold with explicit promotion gates between layers.

Figure: three cards, left to right: 1 Bronze layer (raw and immutable, no validation), 2 Silver layer (cleaned, deduplicated, schema-enforced), 3 Gold layer (business-ready aggregates). Arrows: cleansed by, aggregated by. Callout: promotion gates are the quality control. Bronze to Silver runs cleansing and schema checks; Silver to Gold runs aggregation logic against the business contract. Skip a gate and the layer below leaks defects up.

Medallion architecture moves data through Bronze (raw, immutable), Silver (cleaned, deduplicated, schema-enforced), and Gold (business-ready, aggregated). The promotion gate between each layer is a named quality check; without it, the layers collapse into one bag of files. Databricks documents the same shape.
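As a rough sketch, the two promotion gates can be written as plain Python functions over lists of dictionaries. The column names (`sale_id`, `store`, `amount`) and the per-store total used as the Gold aggregate are assumptions made for illustration; a real medallion implementation would run on a table format such as Delta Lake rather than on in-memory lists.

```python
def bronze_to_silver(bronze_rows):
    """Gate 1: schema check, cleansing, and deduplication.

    Builds new rows rather than editing Bronze, which stays raw and immutable.
    """
    required = {"sale_id", "store", "amount"}
    seen_ids = set()
    silver, rejected = [], []
    for row in bronze_rows:
        if not required.issubset(row):
            rejected.append(row)  # fails the schema check
            continue
        if row["sale_id"] in seen_ids:
            continue  # duplicate of a row already promoted
        seen_ids.add(row["sale_id"])
        silver.append({
            "sale_id": row["sale_id"],
            "store": row["store"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return silver, rejected


def silver_to_gold(silver_rows):
    """Gate 2: aggregate cleaned rows into a business-ready total per store."""
    totals = {}
    for row in silver_rows:
        totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]
    return totals


bronze = [
    {"sale_id": 1, "store": " leeds ", "amount": "10.00"},
    {"sale_id": 1, "store": " leeds ", "amount": "10.00"},  # duplicate
    {"sale_id": 2, "store": "Leeds", "amount": "5.50"},
    {"sale_id": 3, "store": "York", "amount": "7.25"},
    {"sale_id": 4, "store": "York"},                        # missing amount
]
silver, rejected = bronze_to_silver(bronze)
gold = silver_to_gold(silver)
```

Without the first gate, the duplicate sale and the inconsistent store spellings would flow straight into the Gold totals, which is the "defects leak up" failure the gates are meant to prevent.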

Four named pipeline failure modes with four actions

Pipeline failures have four named modes with four different actions; naming the mode turns alerts into actions.

Figure: four cards, left to right: 1 Schema drift (reject + alert producer; ODCS v3.1), 2 Quality miss (hold + repair; UK GDQF Pr.5), 3 Freshness miss (refresh + reschedule; UK GDQF Pr.3), 4 Contract break (notify consumer; ODCS v3.1). Callout: misclassification has a direct cost. A quality miss treated as schema drift discards repairable rows; a freshness miss treated as a contract break burns consumer trust. Name the mode first.

Pipeline failures have four named modes: schema drift, quality miss, freshness miss, and downstream contract break. Each has a different action: reject + alert, hold + repair, refresh, notify consumer. Naming the mode turns a vague "pipeline broken" alert into a specific action.
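One way to make the mode-to-action mapping explicit is a small classifier. The sketch below is illustrative only: the fields of the `batch` dictionary (`columns`, `expected_columns`, `invalid_rows`, `age_minutes`, `max_age_minutes`, `output_matches_contract`) and the order in which the checks run are assumptions, not part of any cited standard.

```python
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    SCHEMA_DRIFT = "schema drift"
    QUALITY_MISS = "quality miss"
    FRESHNESS_MISS = "freshness miss"
    CONTRACT_BREAK = "contract break"


# Each named mode maps to exactly one action.
ACTIONS = {
    FailureMode.SCHEMA_DRIFT: "reject + alert producer",
    FailureMode.QUALITY_MISS: "hold + repair",
    FailureMode.FRESHNESS_MISS: "refresh + reschedule",
    FailureMode.CONTRACT_BREAK: "notify consumer",
}


def classify(batch: dict) -> Optional[FailureMode]:
    """Name the failure mode for a batch, or return None if it is healthy."""
    if set(batch["columns"]) != set(batch["expected_columns"]):
        return FailureMode.SCHEMA_DRIFT
    if batch["invalid_rows"] > 0:
        return FailureMode.QUALITY_MISS
    if batch["age_minutes"] > batch["max_age_minutes"]:
        return FailureMode.FRESHNESS_MISS
    if not batch["output_matches_contract"]:
        return FailureMode.CONTRACT_BREAK
    return None


# Hypothetical batch: schema is intact, but three rows failed quality rules.
batch = {
    "columns": ["order_id", "amount"],
    "expected_columns": ["order_id", "amount"],
    "invalid_rows": 3,
    "age_minutes": 20,
    "max_age_minutes": 60,
    "output_matches_contract": True,
}
mode = classify(batch)
action = ACTIONS[mode] if mode else "none"
```

Because the schema matches, this batch is classified as a quality miss and held for repair rather than rejected outright, so the repairable rows are not discarded.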


Real-world scale · ongoing

Netflix processes 500 billion events per day. Not all of them need the same speed.

Netflix processes over 500 billion events per day from user interactions: plays, pauses, searches, ratings, and scroll behaviour. Billing aggregations run as nightly batch jobs. Recommendation updates run in near-real-time through Apache Kafka streams.

The architecture choice determines what is possible. The Foundations stage covered what data is and how it is governed. This module opens the Applied stage by examining how data moves through systems at scale.

Billing aggregations can wait hours. But when a user finishes an episode, the next recommendation must appear in seconds. How does architecture make both possible?

A data pipeline is a sequence of automated processes that moves data from source systems to a destination, applying transformations along the way. Every pipeline must answer three questions: where does the data come from, what processing is required, and where does it need to go?

With the learning outcomes established, this module begins by examining ETL versus ELT in depth.

12.1 ETL versus ELT

ETL (Extract, Transform, Load) transforms data outside the destination in a separate processing layer, then loads clean data. ELT (Extract, Load, Transform) loads raw data first, then transforms inside the destination using its compute power.

ETL dominated when warehouse compute was expensive. ELT emerged because cloud warehouses like BigQuery, Snowflake, and Redshift offer elastic compute that makes in-warehouse transformation faster and cheaper than external processing. dbt (data build tool) is the dominant ELT transformation tool, using SQL models versioned in Git.
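The difference in where the transformation runs can be shown with Python's built-in `sqlite3` module. An in-memory SQLite database stands in for the warehouse here purely for illustration; the table names and the sample `raw_orders` rows are made up, and a real ELT setup would run the SQL step in a cloud warehouse, typically as a dbt model.

```python
import sqlite3

# Hypothetical raw extract: amounts arrive as text, one of them empty.
raw_orders = [("A1", " 19.50 ", "gbp"), ("A2", "5.00", "GBP"), ("A3", "", "GBP")]


def run_etl(conn):
    """ETL: clean in Python, outside the destination, then load only clean rows."""
    clean = [
        (order_id, float(amount), currency.upper())
        for order_id, amount, currency in raw_orders
        if amount.strip()
    ]
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", clean)


def run_elt(conn):
    """ELT: load raw rows untouched, then transform with SQL inside the destination."""
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, currency TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id,
               CAST(TRIM(amount) AS REAL) AS amount,
               UPPER(currency) AS currency
        FROM raw_orders
        WHERE TRIM(amount) <> ''
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths produce the same clean table. The practical difference is that after ELT the three raw rows are still in the destination, so the transformation can be rewritten and rerun without extracting again.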

Data architecture defines the blueprint for managing data assets by aligning with organisational strategy to establish strategic data requirements and designs to meet those requirements.

DAMA-DMBOK2 (2017) - Chapter 4, Data Architecture

DAMA frames architecture as strategy-driven, not technology-driven. The choices between ETL and ELT, batch and streaming, and warehouse and lakehouse all depend on what the organisation needs from its data, not on which technology is newest.

Common misconception

ELT is always better than ETL because it uses the warehouse's compute.

ELT is more efficient when the destination has elastic compute (cloud warehouses). But ETL remains appropriate when transformation requires external libraries (Python ML models, geospatial processing), when data must be cleansed before entering a regulated destination, or when the destination has limited compute capacity. The choice is architectural, not dogmatic.

With an understanding of ETL versus ELT in place, the discussion can now turn to batch versus streaming, which builds directly on these foundations.

12.2 Batch versus streaming

Batch processing collects data over a period (hourly, daily, weekly), then processes it all at once. Streaming processing handles data continuously as it arrives, event by event. The trade-off is latency versus complexity.

Batch is simpler, cheaper, and sufficient for most analytical workloads. Streaming is necessary when decisions depend on fresh data: fraud detection (seconds matter), recommendation engines (the next suggestion must appear before the user leaves), and real-time dashboards (operations centres monitoring live systems).
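The contrast can be sketched with a single aggregation, plays per user, computed both ways. The event tuples and the `StreamingPlayCounter` class are hypothetical; a production stream would read from a platform such as Kafka rather than from a Python list.

```python
from collections import defaultdict

# Hypothetical events as (user, action) pairs, in arrival order.
events = [("u1", "play"), ("u2", "play"), ("u1", "pause"), ("u1", "play")]


def batch_play_counts(collected):
    """Batch: wait until the whole period has been collected, then compute once."""
    counts = defaultdict(int)
    for user, action in collected:
        if action == "play":
            counts[user] += 1
    return dict(counts)


class StreamingPlayCounter:
    """Streaming: update state per event, so a fresh result exists after each one."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, user, action):
        if action == "play":
            self.counts[user] += 1
        return dict(self.counts)  # snapshot of the current result


batch_result = batch_play_counts(events)

counter = StreamingPlayCounter()
snapshots = [counter.on_event(user, action) for user, action in events]
```

The final streaming snapshot equals the batch result. What streaming buys is the intermediate snapshots, available as each event arrives, at the cost of keeping long-lived state that must survive restarts and failures.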

Apache Kafka is the dominant streaming platform. It handles event ingestion at millions of events per second. Apache Flink and Spark Structured Streaming process those events with transformations and aggregations.

Use streaming when the business value of low-latency data justifies the operational complexity. Use batch for everything else.

AWS Well-Architected Framework, Data Analytics Lens (2023) - Design Principle: Right-size data processing

AWS's guidance reflects industry consensus: streaming adds operational complexity (exactly-once semantics, out-of-order events, backpressure handling). Unless the use case genuinely requires sub-minute latency, batch processing is simpler and more cost-effective.
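To give a sense of why out-of-order events add complexity, the sketch below counts events in 60-second event-time windows and uses a watermark to decide when a window is final. It is a simplified illustration of the general technique, not the API of Flink or Spark Structured Streaming; the class name and the `WINDOW_SECONDS` and `ALLOWED_LATENESS` values are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # size of each tumbling window (assumed)
ALLOWED_LATENESS = 30  # how far behind the newest event we still accept (assumed)


class TumblingWindowCounter:
    """Counts events per event-time window, tolerating limited disorder."""

    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed = {}                      # window start -> final count
        self.dropped = []                     # events that arrived too late
        self.watermark = 0                    # newest event time minus lateness

    def on_event(self, event_time: int) -> None:
        window_start = event_time - event_time % WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        if window_start in self.closed or window_end <= self.watermark:
            self.dropped.append(event_time)   # its window is already final
            return
        self.open_windows[window_start] += 1
        self.watermark = max(self.watermark, event_time - ALLOWED_LATENESS)
        # Finalise every open window that the watermark has passed.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self.watermark:
                self.closed[start] = self.open_windows.pop(start)


counter = TumblingWindowCounter()
# Arrival order differs from event-time order: 40 and 20 arrive late.
for event_time in [5, 50, 65, 40, 95, 20, 130]:
    counter.on_event(event_time)
```

The event at time 40 arrives late but before its window closes, so it is counted. The event at time 20 arrives after the watermark has finalised the first window, so it is dropped. A batch job that waits for all the data never faces this decision.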

With an understanding of batch versus streaming in place, the discussion can now turn to warehouse, lake, and lakehouse, which builds directly on these foundations.

12.3 Warehouse, lake, and lakehouse

A data warehouse (Snowflake, BigQuery, Redshift) stores structured, schema-enforced data optimised for SQL analytics. It provides fast query performance but requires data to be modelled before loading.

A data lake (S3, Azure Data Lake, GCS) stores raw data in any format: structured, semi-structured, and unstructured. It provides flexibility but risks becoming a "data swamp" without governance, cataloguing, and quality controls.

A data lakehouse (Databricks Lakehouse, Apache Iceberg, Delta Lake) combines both: open file formats stored in a lake with warehouse-like features (ACID transactions, schema enforcement, time-travel queries). The lakehouse pattern emerged around 2020 as the dominant modern architecture for organisations that need both analytical and machine learning workloads.

Common misconception

A data lake is just a cheap data warehouse.

A data lake stores raw, unstructured, and semi-structured data that a warehouse cannot handle (images, log files, sensor data, JSON documents). The key difference is schema enforcement: warehouses require schema-on-write (define the structure before loading), while lakes allow schema-on-read (interpret the structure when querying). A lakehouse adds warehouse-like governance to lake storage.
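Schema-on-write and schema-on-read can be contrasted with two toy stores. Both classes below (`WarehouseTable`, `LakeStore`) and the sensor schema are hypothetical stand-ins written for this example; they model only the point at which the schema is applied, not the storage or query engines of real products.

```python
import json

# Hypothetical schema: column name -> Python type.
SCHEMA = {"sensor_id": str, "reading": float}


class WarehouseTable:
    """Schema-on-write: a row must match the schema before it is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def load(self, row: dict) -> None:
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for column, expected_type in self.schema.items():
            if not isinstance(row[column], expected_type):
                raise ValueError(f"{column} is not {expected_type.__name__}")
        self.rows.append(row)


class LakeStore:
    """Schema-on-read: store any payload as-is; apply a schema when querying."""

    def __init__(self):
        self.objects = []

    def put(self, payload: str) -> None:
        self.objects.append(payload)  # no checks at write time

    def read_as(self, schema):
        for payload in self.objects:
            try:
                record = json.loads(payload)
                yield {col: typ(record[col]) for col, typ in schema.items()}
            except (json.JSONDecodeError, KeyError, TypeError, ValueError):
                continue  # this object cannot be read under this schema


payloads = [
    '{"sensor_id": "s1", "reading": 21.5}',
    '{"sensor_id": "s2", "reading": "19.0", "firmware": "2.1"}',
    "not json at all",
]

lake = LakeStore()
for payload in payloads:
    lake.put(payload)
lake_rows = list(lake.read_as(SCHEMA))

warehouse = WarehouseTable(SCHEMA)
warehouse.load(json.loads(payloads[0]))
try:
    warehouse.load(json.loads(payloads[1]))
    second_row_rejected = False
except ValueError:
    second_row_rejected = True
```

The lake accepts all three payloads and only discovers at read time that one is unusable under this schema. The warehouse refuses the second row at load time, so queries never encounter it.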

12.4 Check your understanding

A fintech startup uses BigQuery as its data warehouse. Raw transaction data lands in BigQuery via Fivetran connectors. A data engineer writes SQL transformations in dbt to create analytics-ready tables. Which pipeline pattern is this?

A logistics company needs to calculate optimal delivery routes. Route calculations require Python geospatial libraries (not SQL). The results feed into an operational database. Should this use ETL or ELT?

A bank runs nightly batch pipelines for regulatory reporting. The CFO asks whether switching to streaming would improve report accuracy. What is the best response?

Check your understanding

A retail company loads daily sales CSVs into a data warehouse using an ETL process. They want to add real-time inventory updates. Which architecture change is most appropriate?

Key takeaways

  • ETL transforms data outside the destination; ELT transforms inside. ELT dominates in cloud warehouses (BigQuery, Snowflake, Redshift) because elastic compute makes in-warehouse transformation faster and cheaper.
  • Batch processing handles data in scheduled chunks; streaming handles it continuously. Streaming adds complexity (exactly-once semantics, backpressure) and should only be used when sub-minute latency provides genuine business value.
  • Data warehouses enforce schema-on-write for structured analytics. Data lakes store any format with schema-on-read. Lakehouses (Delta Lake, Iceberg) combine both with ACID transactions and time-travel queries.
  • The modern data stack follows the pattern: source, ingest (Fivetran/Airbyte), warehouse (cloud), transform (dbt), serve (BI/ML), observe (quality monitoring).
  • Architecture choices should be driven by business requirements (latency, scale, cost), not by technology trends. Netflix uses both batch and streaming because different use cases demand different latencies.

Standards and sources cited in this module

  1. DAMA-DMBOK2 (2017)

    Chapter 4, Data Architecture

    Defines data architecture as strategy-driven, not technology-driven. Provides the conceptual framework for pipeline and storage architecture decisions.

  2. AWS Well-Architected Framework, Data Analytics Lens (2023)

    Design Principles

    Industry guidance on right-sizing data processing: use streaming only when latency justifies complexity.

  3. Databricks, 'Lakehouse: A New Generation of Open Platforms' (2021)

    Full paper

    Academic paper introducing the lakehouse architecture combining lake storage flexibility with warehouse-like governance.

  4. dbt Labs, 'What is dbt?' (2024)

    Documentation

    dbt is the dominant ELT transformation tool. Used in the terminal simulation and referenced throughout the ELT discussion.

  5. Netflix Technology Blog, 'Evolution of the Netflix Data Pipeline' (2016)

    Full post

    Source for the 500 billion events/day figure and the batch/streaming architecture discussion in the opening case study.
