Data Pipelines and the Medallion Pattern
By the end of this module you will be able to:
- Describe the bronze, silver, and gold layers of the medallion architecture and explain what transformations occur at each tier
- Explain change data capture as a mechanism for near-real-time ingestion without full table scans, and name the tools that implement it
- Select appropriate orchestration and data quality tooling for a given pipeline SLA requirement

Real-world programme · HMRC Making Tax Digital · 2019 to present
10 billion transactions. One pipeline failure. 12 million businesses unable to file.
HMRC's Making Tax Digital programme processes over 10 billion VAT transactions per year. Before MTD, VAT reconciliation was done quarterly with paper returns filed by post. MTD required building a near-real-time pipeline: source accounting software submitting via API, validation against HMRC's business rules, ingestion into the HMRC data lake, compliance checks, and publication to the digital tax account.
Any failure at any stage meant businesses could not file their VAT return. A data quality issue in the validation layer could silently reject valid submissions. A pipeline that ran successfully but produced incorrect figures could trigger compliance investigations against businesses that had done nothing wrong. At 10 billion transactions a year, the error rate that is acceptable in a prototype becomes a six-figure support burden in production.
The previous module covered the governance and risk frameworks that sit around digital systems. This module goes inside one of the most critical infrastructure components of any digital programme: the data pipeline. The medallion architecture is the pattern that makes pipelines of this scale maintainable, auditable, and recoverable.
When 10 billion transactions depend on your pipeline, what does a data quality failure actually cost?
With the learning outcomes established, the module begins with the fundamentals: what a data pipeline is and how its parts fit together.
7.1 What a data pipeline is
A data pipeline is a sequence of automated steps that moves data from one or more source systems to a destination where it can be used. Every pipeline has three fundamental concerns: ingestion (getting data from the source), transformation (changing it into the shape needed), and loading (writing it to the destination). This sequence is called ETL (Extract, Transform, Load) in its traditional form, or ELT (Extract, Load, Transform) in modern cloud architectures where transformation happens inside the data platform after loading.
Two delivery modes exist: batch and streaming. A batch pipeline collects data over a period (hourly, daily, nightly) and processes it in bulk. Batch is simpler to build and reason about but introduces latency proportional to the collection interval. A streaming pipeline processes each record as it arrives, reducing latency to seconds but requiring more complex infrastructure to handle ordering, backpressure, and exactly-once semantics.
The trade-off is latency versus throughput. Streaming systems optimise for low latency per record; batch systems optimise for high throughput across many records. Most enterprise programmes use both: streaming for operational events that need immediate action (fraud alerts, system anomalies) and batch for analytical workloads where slightly older data is acceptable (daily sales reports, monthly regulatory submissions).
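The three concerns can be made concrete with a minimal sketch. This is not a real platform: the record fields, the in-memory "warehouse", and the pence conversion are all illustrative, but the shape - extract, transform, and load as separate, composable steps - is the essence of every batch pipeline.

```python
# Minimal batch ETL sketch: ingestion, transformation, and loading as
# three separate steps. All names and data are illustrative.

def extract(source_rows):
    """Ingestion: pull raw records from the source system."""
    return list(source_rows)

def transform(rows):
    """Transformation: reshape records into the form the destination needs."""
    return [
        {"id": int(r["id"]), "amount_pence": round(float(r["amount"]) * 100)}
        for r in rows
    ]

def load(rows, destination):
    """Loading: write the transformed records to the destination."""
    destination.extend(rows)
    return len(rows)

source = [{"id": "1", "amount": "19.99"}, {"id": "2", "amount": "5.00"}]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
print(loaded)        # 2
print(warehouse[0])  # {'id': 1, 'amount_pence': 1999}
```

In ETL the `transform` step runs before the data reaches the destination; in ELT the raw rows are loaded first and the same reshaping happens inside the platform afterwards.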
Batch and streaming pipelines both need a consistent way to organise the data they move. Section 7.2 introduces the medallion architecture - the pattern that separates raw ingestion, cleaning, and business aggregation into distinct, independently recoverable layers.
7.2 The Medallion Architecture
The medallion architecture organises a data platform into three layers of progressively increasing data quality. Databricks developed and popularised the pattern; it has since been adopted across UK financial services, the NHS, and central government departments.
- Bronze: raw data exactly as ingested from source systems, never modified or deleted
- Silver: cleansed and conformed data, with validated types and resolved duplicates
- Gold: aggregated and business-ready data serving analytical queries and reports
The core design principle is separation of concerns. A pipeline that mixes raw ingestion, cleaning logic, and business aggregation in a single step is fragile: when a source system changes its format, or a cleaning rule needs updating, the whole thing must be rebuilt or carefully untangled. Separating these into three distinct layers means failures are contained, transformations are independently testable, and reprocessing is safe.
If a cleaning rule in the silver layer is found to be wrong three months after deployment, the fix is to correct the silver logic and rerun it against the unchanged bronze data. The bronze layer is the recovery source. Without it, that recovery path does not exist.
“The goal of the Bronze layer is to preserve the raw data exactly as it arrived from the source. Data in the Bronze layer is not modified or deleted.”
Databricks Medallion Architecture Documentation - 2022, databricks.com
This constraint is what makes error recovery possible. When a downstream transformation error is discovered weeks later, engineers return to bronze and reprocess with corrected logic. If the bronze layer had modified or discarded records during ingestion, that recovery path would not exist. The cost of storing raw data is trivial compared to the cost of not having it.
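The recovery path can be sketched in a few lines. The field names and the cleaning rule are hypothetical; the point is that because bronze is never mutated, fixing the silver bug is simply a matter of rerunning corrected logic over the same raw records.

```python
# Sketch of bronze-to-silver reprocessing: bronze is append-only and never
# mutated, so a silver bug is fixed by rerunning corrected logic over bronze.
# Field names and the cleaning rule are illustrative.

bronze = [  # raw records, exactly as ingested; never modified
    {"vat_number": "GB123", "amount": "100.00"},
    {"vat_number": "gb456", "amount": "250.50"},
]

def silver_v1(raw):
    # Deployed rule, later found to be wrong: silently drops records
    # whose VAT number is not already upper case.
    return [r for r in raw if r["vat_number"].isupper()]

def silver_v2(raw):
    # Corrected rule: normalise case instead of dropping the record.
    return [dict(r, vat_number=r["vat_number"].upper()) for r in raw]

print(len(silver_v1(bronze)))  # 1 -- a valid record was silently lost
print(len(silver_v2(bronze)))  # 2 -- rerun over unchanged bronze recovers it
```

Had ingestion upper-cased or discarded records on the way into bronze, `silver_v2` would have nothing faithful to rerun against.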
Common misconception
“ETL is the same thing as a data pipeline.”
ETL (Extract, Transform, Load) is one pattern within the broader concept of a data pipeline. Modern cloud data architectures predominantly use ELT (Extract, Load, Transform) where raw data is loaded into the platform first and transformation happens in-place using SQL or DataFrame operations. ETL implies transformation happening before the data reaches its destination, which is the older on-premises pattern. The medallion architecture is an ELT approach: bronze holds the raw extracted data, and silver and gold apply transformations after loading.

The medallion pattern defines where data should live at each stage of quality. But how does data get into the bronze layer efficiently? Section 7.3 covers change data capture - the ingestion technique that avoids rescanning entire tables on every pipeline run.
7.3 Change Data Capture
Change Data Capture (CDC) is a technique for detecting and propagating row-level changes in a source database without scanning the entire table. Rather than extracting all rows at each pipeline run, CDC reads the database transaction log and extracts only the rows that were inserted, updated, or deleted since the last extraction.
In PostgreSQL, the transaction log is called the Write-Ahead Log (WAL). In Oracle it is the Redo Log. In SQL Server it is the transaction log exposed through SQL Server CDC tables. These logs record every data modification in sequence. A CDC tool reads the log and converts each entry into a structured change event.
Debezium, an open-source CDC platform maintained as part of the Red Hat portfolio, implements CDC connectors for PostgreSQL, MySQL, Oracle, SQL Server, and MongoDB. Debezium publishes change events to Apache Kafka topics. A bronze layer pipeline consuming from those topics receives row-level changes in near real time, with latency typically under 5 seconds, compared to the multi-hour latency of nightly batch extraction.
CDC reduces source database load by orders of magnitude for large, slowly-changing tables. A product catalogue of 2 million rows where 500 records change daily extracts 500 records per day with CDC versus 2 million records per day with full table extraction. This difference matters when the source database is a production OLTP (Online Transaction Processing) system that cannot sustain heavy analytical read load.
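A bronze consumer applies these change events one at a time. Debezium events really do carry an operation code ("c" create, "u" update, "d" delete, "r" snapshot read) with "before"/"after" row images, but the handler below is a simplified sketch: the Kafka consumer wiring, schemas, and error handling are omitted, and the target "table" is just a dict.

```python
# Sketch of a bronze consumer applying Debezium-style change events.
# Topic/consumer wiring is omitted; the target table is an in-memory dict.

def apply_change(table, event):
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                # delete: only "before" is populated
        table.pop(event["before"]["id"], None)

products = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Widget"}},
    {"op": "u", "before": {"id": 1, "name": "Widget"},
                "after": {"id": 1, "name": "Widget v2"}},
    {"op": "d", "before": {"id": 1, "name": "Widget v2"}, "after": None},
]
for e in events:
    apply_change(products, e)
print(products)  # {} -- the row was created, updated, then deleted
```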
Common misconception
“Streaming architecture means low-latency delivery.”
Streaming architecture is not the same as low latency. Apache Kafka can achieve end-to-end latencies of tens of milliseconds under normal conditions, but under backpressure (when consumers are slower than producers), events accumulate in topic partitions and latency increases. Kafka consumer groups lag behind the head of the log. A streaming pipeline with consumer lag measured in millions of events can have latencies of minutes or hours. Low latency requires both a streaming architecture and consumer capacity sufficient to keep up with production rate.
CDC handles efficient ingestion. Once data is flowing into the medallion layers, something must schedule, sequence, and monitor all the steps. Section 7.4 covers Apache Airflow and dbt - the orchestration tools that keep pipelines running reliably at scale.
7.4 Pipeline Orchestration
Orchestration is the scheduling, sequencing, monitoring, and error handling of pipeline tasks. An orchestrator ensures tasks run in dependency order, retries on transient failure, alerts when SLAs are at risk, and provides a centralised view of pipeline state across an entire data platform.
Apache Airflow, originally developed at Airbnb in 2014 and open-sourced in 2015, represents pipelines as DAGs (Directed Acyclic Graphs). A DAG is a set of tasks with explicit dependency relationships: task B does not run until task A has completed successfully. Airflow's SLA callbacks fire when a task exceeds its configured deadline, enabling proactive alerting before the end-to-end SLA is breached.
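The ordering guarantee at the heart of a DAG can be shown without Airflow itself. The sketch below is not the Airflow API - Airflow adds scheduling, retries, and SLA callbacks on top - but the standard library's topological sorter demonstrates the core rule: a task appears in the execution order only after everything it depends on. Task names are illustrative.

```python
# Pure-Python sketch of DAG execution order (not the Airflow API):
# a task runs only after all of its upstream dependencies have completed.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "extract":   set(),          # no upstream dependencies
    "clean":     {"extract"},    # clean waits for extract
    "aggregate": {"clean"},      # aggregate waits for clean
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'clean', 'aggregate']
```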
dbt (data build tool) handles the transformation layer of the pipeline. dbt expresses transformations as SQL SELECT statements compiled into views or materialised tables. Each dbt model is version-controlled, testable (dbt tests assert row counts, null rates, uniqueness), and produces lineage documentation showing which source tables feed which downstream models. A dbt model is an engineering artefact that can be reviewed, tested, and deployed with the same rigour as application code.
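For illustration, a silver-layer deduplication model might look like the following sketch. The model and column names are hypothetical, and `QUALIFY` is platform-specific syntax (supported on platforms such as Databricks and Snowflake):

```sql
-- models/silver/silver_vat_submissions.sql (illustrative names throughout)
-- Cleansing step: deduplicate bronze submissions, keeping the latest per id.
select *
from {{ ref('bronze_vat_submissions') }}
qualify row_number() over (
    partition by submission_id
    order by received_at desc
) = 1
```

A companion `schema.yml` would then declare dbt's built-in `unique` and `not_null` tests against `submission_id`, so every run asserts the deduplication actually worked.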
Airflow and dbt are complementary: Airflow orchestrates the execution sequence and monitors SLA compliance across the whole pipeline; dbt handles the SQL transformation logic within the silver and gold layers.
Airflow and dbt ensure pipelines run in the right order and on schedule. But a pipeline that completes on time can still produce wrong results. Section 7.5 covers data quality frameworks and the independent SLA dimensions that govern what "done" actually means.
7.5 Data Quality and Pipeline SLAs
Pipeline monitoring must cover two independent dimensions: execution quality (did the job run?) and data quality (did the job produce correct results?). A pipeline that completes successfully but produces a gold table where 15% of records are duplicated has met its execution SLA but failed its data quality SLA. Both require independent alerting.
Great Expectations is an open-source Python framework for defining, documenting, and asserting data quality rules. A Great Expectations suite defines expectations against a dataset: expect the row count to be between 4.7 million and 5 million; expect null rate in the submission_id column to be zero; expect the period_end field to contain valid ISO 8601 dates. When a suite runs against a new dataset, any violated expectation produces an alert and an HTML report documenting the failure.
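The Great Expectations API itself is not reproduced here, but the three expectations just described can be illustrated in plain Python. The dataset, the row-count bounds, and the failure labels are all hypothetical.

```python
# Plain-Python illustration of the three expectations described above
# (not the Great Expectations API; data and thresholds are illustrative).
from datetime import date

dataset = [
    {"submission_id": "S1", "period_end": "2024-03-31"},
    {"submission_id": "S2", "period_end": "2024-03-31"},
]

def check(rows):
    failures = []
    if not (1 <= len(rows) <= 1_000_000):              # row count in range
        failures.append("row_count_out_of_range")
    if any(r["submission_id"] is None for r in rows):  # zero null rate
        failures.append("null_submission_id")
    for r in rows:                                     # valid ISO 8601 dates
        try:
            date.fromisoformat(r["period_end"])
        except ValueError:
            failures.append(f"bad_date:{r['period_end']}")
    return failures

print(check(dataset))  # [] -- all expectations hold
```

A framework like Great Expectations adds what this sketch lacks: declarative suites, alerting, and the HTML reports that document each failure for the team.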
The NHS England data quality framework defines four dimensions for health data, applicable to any domain:
- Completeness: no required fields are null
- Validity: values conform to defined code sets
- Consistency: the same fact reported by different systems agrees
- Timeliness: data arrives within the defined window
A pipeline SLA defines both a completion deadline (timeliness) and minimum quality thresholds across the other three.
A pipeline that runs on time but silently degrades data quality is worse than a pipeline that fails loudly. Failure alerts the team. Silent degradation reaches downstream consumers undetected.
“Pipeline reliability is a product of two independent components: execution reliability and data quality reliability. Either one failing constitutes a pipeline SLA breach.”
Apache Airflow Documentation - Stable, airflow.apache.org
This framing is important for governance conversations. Business stakeholders often focus on uptime (did it run?) and neglect data quality (was the output correct?). An SLA should define both: the pipeline must run by 6am and the gold table must pass quality checks with zero critical violations. Both conditions must be met for the SLA to be considered satisfied.
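A two-condition SLA check can be sketched directly. The 06:00 deadline and the zero-critical-violations threshold mirror the example above; the function name and signature are illustrative.

```python
# Sketch: an SLA verdict requiring BOTH conditions; either failing alone
# is a breach. Deadline and thresholds mirror the example in the text.
from datetime import time

def sla_met(completed_at, critical_violations, deadline=time(6, 0)):
    on_time = completed_at <= deadline        # execution dimension
    quality_ok = critical_violations == 0     # data quality dimension
    return on_time and quality_ok

print(sla_met(time(5, 47), critical_violations=0))  # True
print(sla_met(time(5, 47), critical_violations=3))  # False: on time, wrong output
```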

Scenario questions
A data engineering team at a UK bank discovers that 8% of mortgage application records were silently dropped during silver layer processing six weeks ago, due to a null handling rule that rejected applications with a missing secondary income field. What property of the medallion architecture makes full recovery possible?
HMRC receives VAT submission data from 12 different accounting software vendors, with a combined daily volume of 28 million records. The current pipeline runs a nightly full table extraction that takes 6.5 hours and exhausts the extraction window before all sources have been ingested. A data architect proposes Debezium CDC. Which outcome most accurately describes the result?
A nightly pipeline producing gold layer compliance reports has an SLA of 06:00. On 14 March, Airflow reports all tasks completed successfully by 05:47. The compliance team opens their Power BI dashboard and finds that 22% of reporting period totals are zero where they should not be. Which statement best describes the SLA status?
In a Medallion Architecture, raw JSON events land in the Bronze layer. A data engineer notices that 15% of events have a null 'customer_id' field. At which layer should this data quality issue be addressed?
Key takeaways
- The medallion architecture separates raw ingestion (bronze), cleaning and conforming (silver), and business-ready aggregation (gold) into distinct layers, isolating failures and enabling safe reprocessing from the bronze source.
- Bronze data is never modified; its append-only, raw-preservation property is what makes recovery possible when downstream transformation errors are discovered weeks or months later.
- Change data capture reads only changed rows from the source database transaction log, reducing extraction volume by orders of magnitude and enabling near-real-time bronze ingestion without heavy source database load.
- Apache Airflow orchestrates pipeline execution and SLA monitoring via DAG dependencies; dbt handles SQL transformation logic within silver and gold layers as version-controlled, testable models.
- Pipeline monitoring must cover execution (did the job run?) and data quality (did the job produce correct results?) independently; monitoring execution alone creates false confidence in pipeline reliability.
- The NHS England data quality framework defines four dimensions applicable to any domain: completeness, validity, consistency, and timeliness. All four must be defined in a pipeline SLA alongside the completion deadline.
Standards and sources cited in this module
Databricks Medallion Architecture Documentation
2022, databricks.com/glossary/medallion-architecture
Canonical description of the bronze, silver, and gold layer pattern with ACID properties and Delta Lake implementation. Primary reference for Sections 7.2 through 7.4.
Apache Airflow Documentation, SLA and DAG Management
Stable, airflow.apache.org/docs
DAG-based pipeline orchestration, SLA callback configuration, and task dependency management. Referenced in Section 7.4 and quoted in Section 7.5.
dbt Analytics Engineering Best Practices
docs.getdbt.com/best-practices
Model structure, testing, and lineage tracking for silver layer transformations. Referenced throughout Section 7.4 for transformation engineering practice.
Debezium Documentation, Change Data Capture
debezium.io/documentation
CDC connector implementation for PostgreSQL, MySQL, Oracle, and SQL Server. Named in Section 7.3 for near-real-time bronze ingestion architecture.
NHS England Data Quality Framework
nhsengland.nhs.uk, 2022
Four-dimension data quality model (completeness, validity, consistency, timeliness) cited in Section 7.5 as a cross-domain quality framework applied to pipeline SLA design.
HMRC Making Tax Digital: VAT Technical Specification
developer.service.hmrc.gov.uk
Primary case study source for the opening story. Documents the API submission, validation, and compliance checking pipeline that processes over 10 billion transactions annually.
The pipeline infrastructure is in place. The next module covers what you do with the data once it arrives in the gold layer: measurement frameworks, OKRs, KPIs, and experimentation that turn data into decisions.
Module 7 of 15 in Applied