Applied Data · Module 1

Data architectures and pipelines

Data architecture is how data is organised, moved, and protected across systems.

20 min · 4 outcomes · Data Intermediate

Previously

Start with Data Intermediate

Move into models, pipelines, and applied analytics while keeping reliability in view.

This module

Data architectures and pipelines

Data architecture is how data is organised, moved, and protected across systems.

Next

Data governance and stewardship

Governance is agreeing how data is handled so people can work quickly without being reckless.

Progress

Mark this module complete when you can explain it without rereading every paragraph.

Why this matters

Imagine a daily batch pipeline that loads meter readings. A single upstream change can silently corrupt every dashboard and model downstream; good architecture is what catches that early instead of letting it look like success.

What you will be able to do

  • 1 Explain data architectures and pipelines in your own words and apply them to a realistic scenario.
  • 2 Explain why a pipeline is safer when its interfaces are explicit and tested.
  • 3 Check the assumption "Contracts are versioned" and explain what changes if it is false.
  • 4 Check the assumption "Monitoring exists" and explain what changes if it is false.

Before you begin

  • Foundations-level vocabulary and concepts
  • Confidence with basic diagrams and section terminology

Common ways people get this wrong

  • Breaking downstream. A field rename can break many consumers. Contracts and tests prevent this.
  • Data drift. The same pipeline can produce different meaning over time. Track distribution changes.

Main idea at a glance

Diagram

Stage 1

Sources

Data originates from multiple sources across systems.

I think distinguishing between real-time and batch sources early saves redesign pain later.

Data architecture is how data is organised, moved, and protected across systems. It sets the lanes so teams can build without tripping over each other. Pipelines exist because raw data is messy and scattered. They pull from sources, clean and combine, and land it where people and products can use it.

There are two broad ways data moves. Batch means scheduled chunks. Streaming means small events flowing continuously. Both need clear boundaries so one team's changes do not break another's work. If a pipeline fails, dashboards go blank, models drift, and trust drops.

When you design a pipeline, think about ownership at each hop, the contracts between steps, and how to recover when something breaks. A simple diagram often exposes gaps before a single line of code is written.
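The "diagram before code" idea can be sketched even more cheaply as data. A minimal sketch, assuming illustrative hop and team names (none of these come from the module): list every hop with its owner and contract, then look for gaps before writing pipeline code.

```python
# Each hop names an owner and a contract; None marks a gap.
# Hop, team, and contract names here are purely illustrative.
hops = [
    {"name": "ingest", "owner": "platform-team", "contract": "raw_readings_v1"},
    {"name": "transform", "owner": None, "contract": "clean_readings_v1"},
    {"name": "serve", "owner": "analytics-team", "contract": None},
]

# Collect every hop that is missing an owner or a contract.
gaps = [
    f"hop '{h['name']}' is missing: "
    + ", ".join(k for k in ("owner", "contract") if not h[k])
    for h in hops
    if not (h["owner"] and h["contract"])
]

for gap in gaps:
    print(gap)
```

Running this surfaces the unowned transform hop and the uncontracted serve hop; that is the same gap-finding the diagram does, expressed as a checklist.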

Worked example. A pipeline that fails silently is worse than one that fails loudly

Imagine a daily batch pipeline that loads meter readings. One day, a source system changes a column name from meter_id to meterId. The ingestion step still runs. The storage step still runs. Your dashboard still loads. It just starts showing zeros because the join keys no longer match.

My opinion is that silent failure is the main enemy of data work. It looks like success and it teaches people to distrust the whole system. If you build only one thing into a pipeline, build a check that screams when the shape or meaning changes.
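A check that screams can be very small. A minimal sketch, assuming the column names from the worked example (`meter_id`, plus illustrative `reading` and `timestamp` fields): validate the shape at ingestion so the rename is caught at the boundary it crossed, not on the dashboard.

```python
def check_shape(rows, required=frozenset({"meter_id", "reading", "timestamp"})):
    """Scream when the shape changes instead of silently producing zeros."""
    for i, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            raise ValueError(f"row {i}: missing required columns {sorted(missing)}")

good = [{"meter_id": "m1", "reading": 4.2, "timestamp": "2024-01-01T00:00:00Z"}]
check_shape(good)  # passes silently

# The source system's rename from the worked example.
renamed = [{"meterId": "m1", "reading": 4.2, "timestamp": "2024-01-01T00:00:00Z"}]
try:
    check_shape(renamed)
except ValueError as e:
    print(e)  # the rename is caught at ingestion, not as zeros on a dashboard
```

The design choice is to raise rather than log-and-continue: a loud failure at 02:00 is cheaper than a quiet lie at 09:00.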

Common mistakes (pipeline edition)

Pipeline failure patterns

These are the highest-frequency causes of trust loss.

  1. No ingestion contract

    Skipping types, required fields, allowed ranges, and units creates silent breakage.

  2. Batch versus streaming by fashion

    Choose based on latency and reliability needs, not tooling trends.

  3. No recovery design

    If the 02:00 run fails, define what users see at 09:00 and how recovery happens.

  4. No owner per hop

    Without ownership, failures persist because nobody has a clear duty to fix.
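The first pattern above, the missing ingestion contract, is the easiest to fix. A minimal sketch, assuming a hypothetical two-field contract for meter readings (field names, types, and the kWh range are illustrative): declare types, required fields, allowed ranges, and units in one place, then validate every record against it.

```python
# Hypothetical contract: field -> (expected type, allowed range or None).
# Units are stated in the field name so they cannot drift silently.
CONTRACT = {
    "meter_id": (str, None),
    "reading_kwh": (float, (0.0, 100000.0)),
}

def validate(record):
    """Return a list of contract violations; empty list means the record passes."""
    errors = []
    for field, (ftype, rng) in CONTRACT.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, ftype):
            errors.append(f"{field}: expected {ftype.__name__}, got {type(value).__name__}")
        elif rng and not (rng[0] <= value <= rng[1]):
            errors.append(f"{field}: {value} outside allowed range {rng}")
    return errors

print(validate({"meter_id": "m1", "reading_kwh": 12.5}))   # []
print(validate({"meter_id": "m1", "reading_kwh": -3.0}))   # range violation
```

Returning a list of violations, rather than raising on the first one, lets the pipeline report every problem in a batch at once.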

Verification. Prove you can reason about it

Pipeline verification drill

Use one real dataset and prove operability.

  1. Sketch the pipeline

    Draw source-to-consumer flow and list one failure mode per hop.

  2. Write one ingestion contract sentence

    Example: every record requires a UTC timestamp and a meter identifier as string.

  3. Select movement mode with justification

    Choose batch or streaming using a specific latency requirement.
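Step 3 of the drill can be made mechanical. A minimal sketch, assuming an illustrative one-hour threshold (the real cutoff depends on your tooling and SLAs): derive the movement mode from a concrete staleness requirement rather than from fashion.

```python
def choose_movement(max_staleness_seconds):
    """Pick batch vs streaming from a stated latency requirement.
    The 3600-second threshold is an illustrative assumption, not a rule."""
    # A scheduled batch run cannot satisfy sub-hour freshness needs.
    return "streaming" if max_staleness_seconds < 3600 else "batch"

print(choose_movement(86400))  # daily report: batch is enough
print(choose_movement(60))     # live usage alerts: streaming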

Diagram

Stage 1

Input quality

Validate structure and constraints as data enters.

I think input validation is non-negotiable because downstream fixes cost exponentially more.

How to use Data Intermediate

This is where you stop being impressed by dashboards and start asking whether the data deserves trust.

Good practice
Treat every dataset like a service. It has an owner, a contract, and quality guarantees. If those do not exist, you are relying on luck.
Bad practice
Assuming that because data is in the warehouse it must be correct. Warehouses can store lies very efficiently.
Best practice
Write down the failure modes per pipeline hop and the detection signal for each. That turns data work into an operable system, not a fragile project.

Mental model

Pipeline with contracts

A pipeline is safer when interfaces are explicit and tested.

  1. Source

  2. Ingest

  3. Contract

  4. Transform

  5. Serve
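The five stages above can be sketched as functions that each validate their input contract before doing work, so breakage surfaces at the boundary it crossed. A minimal sketch with illustrative field names; real stages would use a declared contract rather than inline assertions.

```python
# Each stage checks its input contract first, then does its work.
def ingest(raw):
    assert all("meter_id" in r for r in raw), "ingest contract violated"
    return raw

def transform(rows):
    assert all("meter_id" in r for r in rows), "transform contract violated"
    return [{**r, "reading": r["reading"] * 1.0} for r in rows]

def serve(rows):
    return {r["meter_id"]: r["reading"] for r in rows}

source = [{"meter_id": "m1", "reading": 4.2}]
print(serve(transform(ingest(source))))
```

If a producer renames `meter_id`, the failure happens at the first boundary after the change, with a message naming the stage, instead of three hops later as zeros.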

Assumptions to keep in mind

  • Contracts are versioned. Breaking changes should be deliberate and communicated. Versioning makes change safe.
  • Monitoring exists. If a job fails silently, the first alarm is a business incident. Monitor early.

Failure modes to notice

  • Breaking downstream. A field rename can break many consumers. Contracts and tests prevent this.
  • Data drift. The same pipeline can produce different meaning over time. Track distribution changes.
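Tracking distribution changes does not require heavy tooling to start. A minimal sketch using the standard library, assuming an illustrative rule (flag when the current batch's mean shifts more than half a baseline standard deviation; the threshold is an assumption, not a standard):

```python
import statistics

def drift_alert(baseline, current, threshold=0.5):
    """Flag when the current batch's mean drifts away from the baseline
    by more than `threshold` baseline standard deviations (illustrative rule)."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    shift = abs(statistics.mean(current) - mu)
    return shift > threshold * sigma

baseline = [10.0, 11.0, 9.5, 10.5, 10.2]
print(drift_alert(baseline, [10.1, 10.4, 9.9]))  # same regime: False
print(drift_alert(baseline, [0.0, 0.0, 0.0]))    # zeros after a silent break: True
```

Even this crude check would have caught the meter_id rename from the worked example, because a join that silently produces zeros moves the distribution immediately.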

Check yourself

Quick check. Architectures and pipelines

Why do pipelines exist

To move, clean, and combine data so it can be used reliably.

Scenario. A dashboard is correct at 09:00 and wrong at 10:00. Name one pipeline failure mode that fits

A late arriving feed, a join key change, a schema change, a backfill re-running with different logic, or a duplicate event stream inflating counts.

What is batch movement

Moving data in scheduled chunks.

What is streaming movement

Moving small events continuously as they happen.

Scenario. A producer changes a field name and consumers silently break. What boundary was missing

A data contract boundary. You needed schema compatibility rules, versioning, and an alert on breaking changes.

Why start with a diagram

It reveals missing steps and ownership before building, and it makes failure modes and responsibilities visible.

Artefact and reflection

Artefact

A one-page decision note with assumption, evidence, and chosen action

Reflection

Where in your work would explaining data architectures and pipelines in your own words, applied to a realistic scenario, change a decision, and what evidence would make you trust that change?

Optional practice

Drag and connect sources, processors, and consumers to see how data flows and where ownership sits.

Source DAMA DMBOK 2 (Data Management Body of Knowledge, 2nd Edition)
Source ISO/IEC 11179 metadata registries
Source ISO/IEC 27701:2025 privacy information management
Source ICO data protection principles and UK GDPR guidance