Applied Data · Module 1
Data architectures and pipelines
Data architecture is how data is organised, moved, and protected across systems.
Previously
Start with Data Intermediate
Move into models, pipelines, and applied analytics while keeping reliability in view.
Next
Data governance and stewardship
Governance is agreeing how data is handled so people can work quickly without being reckless.
Progress
Mark this module complete when you can explain it without rereading every paragraph.
Why this matters
Imagine a daily batch pipeline that loads meter readings. If one upstream change silently corrupts the output, every dashboard and decision downstream inherits the error, and nobody notices until trust is already lost. Designing the architecture and the pipeline boundaries is how you prevent that.
What you will be able to do
- Explain data architectures and pipelines in your own words and apply them to a realistic scenario.
- Explain why a pipeline is safer when its interfaces are explicit and tested.
- Check the assumption "Contracts are versioned" and explain what changes if it is false.
- Check the assumption "Monitoring exists" and explain what changes if it is false.
Before you begin
- Foundations-level vocabulary and concepts
- Confidence with basic diagrams and section terminology
Common ways people get this wrong
- Breaking downstream. A field rename can break many consumers. Contracts and tests prevent this.
- Data drift. The same pipeline can produce different meaning over time. Track distribution changes.
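The data drift point above can be made concrete with a small check between two runs of the same pipeline. This is a minimal sketch using plain lists of readings; the function name and the three-standard-deviation threshold are illustrative choices, not a standard.

```python
# Minimal drift check: flag when today's distribution has moved far
# from yesterday's, even though the pipeline itself "succeeded".
from statistics import mean, stdev

def drift_alert(baseline: list[float], current: list[float], max_shift: float = 3.0) -> bool:
    """Flag when the current mean sits more than max_shift baseline
    standard deviations away from the baseline mean."""
    base_mean, base_std = mean(baseline), stdev(baseline)
    if base_std == 0:
        return mean(current) != base_mean
    return abs(mean(current) - base_mean) / base_std > max_shift

yesterday = [10.1, 9.8, 10.0, 10.3, 9.9]
today = [52.0, 49.5, 51.2, 50.7, 50.1]  # same schema, very different meaning
print(drift_alert(yesterday, today))  # True
```

A mean-shift test is the simplest possible signal; the point is that some distribution check runs on every load, so a change in meaning is caught as early as a change in schema.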
Main idea at a glance
Diagram, stage 1 (Sources): data originates from multiple sources across systems.
I think distinguishing between real-time and batch sources early saves redesign pain later.
Data architecture is how data is organised, moved, and protected across systems. It sets the lanes so teams can build without tripping over each other. Pipelines exist because raw data is messy and scattered. They pull from sources, clean and combine, and land it where people and products can use it.
There are two broad ways data moves. Batch means scheduled chunks. Streaming means small events flowing continuously. Both need clear boundaries so one team's changes do not break another's work. If a pipeline fails, dashboards go blank, models drift, and trust drops.
When you design a pipeline, think about ownership at each hop, the contracts between steps, and how to recover when something breaks. A simple diagram often exposes gaps before a single line of code is written.
Worked example. A pipeline that fails silently is worse than one that fails loudly
Imagine a daily batch pipeline that loads meter readings. One day, a source system changes a column name from meter_id to meterId.
The ingestion step still runs. The storage step still runs. Your dashboard still loads. It just starts showing zeros because the join keys no longer match.
My opinion is that silent failure is the main enemy of data work. It looks like success and it teaches people to distrust the whole system. If you build only one thing into a pipeline, build a check that screams when the shape or meaning changes.
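That "scream when the shape changes" check can be very small. Here is a minimal sketch, assuming records arrive as plain dicts; the field names and the contract set are illustrative assumptions, not part of any real system.

```python
# Fail loudly the moment a record no longer matches the expected shape,
# instead of letting a rename produce zero-valued joins downstream.
REQUIRED_FIELDS = {"meter_id", "reading", "read_at"}

def assert_expected_shape(record: dict) -> dict:
    """Raise immediately if the incoming record violates the contract."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        # A rename such as meter_id -> meterId lands here, on the day it
        # happens, rather than showing up as zeros on a dashboard.
        raise ValueError(f"contract violated, missing fields: {sorted(missing)}")
    return record

assert_expected_shape({"meter_id": "m1", "reading": 42.0, "read_at": "2024-01-01T00:00:00Z"})
try:
    assert_expected_shape({"meterId": "m1", "reading": 42.0, "read_at": "2024-01-01T00:00:00Z"})
except ValueError as e:
    print(e)  # contract violated, missing fields: ['meter_id']
```

Running this at the ingestion boundary turns the scenario in the worked example from a quiet data-quality incident into a loud, same-day failure.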
Common mistakes (pipeline edition)
Pipeline failure patterns
These are the highest-frequency causes of trust loss.
- No ingestion contract. Skipping types, required fields, allowed ranges, and units creates silent breakage.
- Batch versus streaming by fashion. Choose based on latency and reliability needs, not tooling trends.
- No recovery design. If the 02:00 run fails, define what users see at 09:00 and how recovery happens.
- No owner per hop. Without ownership, failures persist because nobody has a clear duty to fix them.
Verification. Prove you can reason about it
Pipeline verification drill
Use one real dataset and prove operability.
- Sketch the pipeline. Draw the source-to-consumer flow and list one failure mode per hop.
- Write one ingestion contract sentence. Example: every record requires a UTC timestamp and a meter identifier as string.
- Select the movement mode with justification. Choose batch or streaming using a specific latency requirement.
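The one-sentence contract in the drill above can be turned into executable checks. This is a minimal sketch; the field names and the ISO-8601 Z-suffix convention are assumptions for illustration, not a prescribed schema.

```python
# Contract sentence: every record requires a UTC timestamp and a
# meter identifier as string. Return violations instead of guessing.
from datetime import datetime, timezone

def check_contract(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    problems = []
    if not isinstance(record.get("meter_id"), str):
        problems.append("meter_id must be a string")
    ts = record.get("timestamp")
    try:
        # Accept a trailing Z as UTC by normalising it to +00:00.
        parsed = datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
        if parsed.utcoffset() != timezone.utc.utcoffset(None):
            problems.append("timestamp must be UTC")
    except ValueError:
        problems.append("timestamp must be an ISO-8601 datetime")
    return problems

assert check_contract({"meter_id": "m-17", "timestamp": "2024-05-01T02:00:00Z"}) == []
assert check_contract({"meter_id": 17, "timestamp": "yesterday"}) == [
    "meter_id must be a string",
    "timestamp must be an ISO-8601 datetime",
]
```

Notice that the checker reports every violation rather than stopping at the first; that makes the failure message useful to the producing team, not just the consuming one.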
Diagram, stage 1 (Input quality): validate structure and constraints as data enters.
I think input validation is non-negotiable because downstream fixes cost exponentially more.
How to use Data Intermediate
This is where you stop being impressed by dashboards and start asking whether the data deserves trust.
- Good practice. Treat every dataset like a service: it has an owner, a contract, and quality guarantees. If those do not exist, you are relying on luck.
- Bad practice. Assuming "it is in the warehouse" means it is correct. Warehouses can store lies very efficiently.
- Best practice. Write down the failure modes per pipeline hop and the detection signal for each. That turns data work into an operable system, not a fragile project.
Mental model
Pipeline with contracts
A pipeline is safer when interfaces are explicit and tested.
1. Source
2. Ingest
3. Contract
4. Transform
5. Serve
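The five stages above can be sketched as explicit functions, with the contract stage failing loudly at the boundary between teams. All function names and the record shape here are illustrative assumptions.

```python
# Each stage is an explicit function; the contract stage is the tested
# interface that stops a bad shape from reaching transform and serve.

def source() -> list[dict]:
    # Stage 1: raw records exactly as the upstream system emits them.
    return [{"meter_id": "m-1", "reading": "42.5"}]

def ingest(records: list[dict]) -> list[dict]:
    # Stage 2: land the records without reshaping them.
    return list(records)

def contract(records: list[dict]) -> list[dict]:
    # Stage 3: the explicit, tested boundary between producer and consumer.
    for r in records:
        assert "meter_id" in r and "reading" in r, f"contract violated: {r}"
    return records

def transform(records: list[dict]) -> list[dict]:
    # Stage 4: clean and type the data once the contract holds.
    return [{**r, "reading": float(r["reading"])} for r in records]

def serve(records: list[dict]) -> list[dict]:
    # Stage 5: hand a trustworthy shape to dashboards and models.
    return records

result = serve(transform(contract(ingest(source()))))
print(result)  # [{'meter_id': 'm-1', 'reading': 42.5}]
```

Keeping the contract as its own stage, rather than folding it into transform, is what makes a field rename a one-line failure message instead of a week of silent zeros.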
Assumptions to keep in mind
- Contracts are versioned. Breaking changes should be deliberate and communicated. Versioning makes change safe.
- Monitoring exists. If a job fails silently, the first alarm is a business incident. Monitor early.
Check yourself
Quick check. Architectures and pipelines
Why do pipelines exist?
To move, clean, and combine data so it can be used reliably.
Scenario. A dashboard is correct at 09:00 and wrong at 10:00. Name one pipeline failure mode that fits.
A late arriving feed, a join key change, a schema change, a backfill re-running with different logic, or a duplicate event stream inflating counts.
What is batch movement?
Moving data in scheduled chunks.
What is streaming movement?
Moving small events continuously as they happen.
Scenario. A producer changes a field name and consumers silently break. What boundary was missing?
A data contract boundary. You needed schema compatibility rules, versioning, and an alert on breaking changes.
Why start with a diagram?
It reveals missing steps and ownership before building, and it makes failure modes and responsibilities visible.
Artefact and reflection
Artefact
A one-page decision note with assumption, evidence, and chosen action
Reflection
Where in your work would being able to explain data architectures and pipelines, and apply them to a realistic scenario, change a decision, and what evidence would make you trust that change?
Optional practice
Drag and connect sources, processors, and consumers to see how data flows and where ownership sits.