This level moves from concepts to systems. The focus is on design, governance, interoperability, and analysis that fit real organisations. These notes are framed around the questions I ask in real work, so you start thinking like a data professional rather than a tool user.
Data architecture is how data is organised, moved, and protected across systems. It sets the lanes so teams can build without tripping over each other. Pipelines exist because raw data is messy and scattered. They pull from sources, clean and combine, and land it where people and products can use it.
There are two broad ways data moves. Batch means scheduled chunks. Streaming means small events flowing continuously. Both need clear boundaries so one team's changes do not break another's work. If a pipeline fails, dashboards go blank, models drift, and trust drops.
When you design a pipeline, think about ownership at each hop, the contracts between steps, and how to recover when something breaks. A simple diagram often exposes gaps before a single line of code is written.
Imagine a daily batch pipeline that loads meter readings. One day, a source system changes a column name from meter_id to meterId.
The ingestion step still runs. The storage step still runs. Your dashboard still loads. It just starts showing zeros because the join keys no longer match.
My opinion is that silent failure is the main enemy of data work. It looks like success and it teaches people to distrust the whole system.
If you build only one thing into a pipeline, build a check that screams when the shape or meaning changes.
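As a minimal sketch of such a check, assuming batches arrive as lists of dicts and using illustrative column names from the meter example above:

```python
EXPECTED_COLUMNS = {"meter_id", "reading_kwh", "read_at_utc"}  # illustrative contract

def check_shape(batch: list[dict]) -> None:
    """Fail loudly if the batch is empty or its columns drift from the contract."""
    if not batch:
        raise ValueError("Ingestion contract violated: batch is empty")
    actual = set(batch[0].keys())
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS
    if missing or unexpected:
        # A rename such as meter_id -> meterId shows up here as one missing
        # and one unexpected column, instead of a dashboard full of zeros.
        raise ValueError(
            f"Ingestion contract violated: missing={missing}, unexpected={unexpected}"
        )
```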
Checklist
Pipeline failure patterns
These are the highest-frequency causes of trust loss.
- No ingestion contract: Skipping types, required fields, allowed ranges, and units creates silent breakage.
- Batch versus streaming by fashion: Choose based on latency and reliability needs, not tooling trends.
- No recovery design: If the 02:00 run fails, define what users see at 09:00 and how recovery happens.
- No owner per hop: Without ownership, failures persist because nobody has a clear duty to fix.
Checklist
Pipeline verification drill
Use one real dataset and prove operability.
- Sketch the pipeline: Draw source-to-consumer flow and list one failure mode per hop.
- Write one ingestion contract sentence: Example: every record requires a UTC timestamp and a meter identifier as string (see the sketch after this list).
- Select movement mode with justification: Choose batch or streaming using a specific latency requirement.
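One way to turn that contract sentence into an executable check, sketched with only the standard library; the field names are assumptions carried over from the meter example:

```python
from datetime import datetime, timedelta

def validate_record(record: dict) -> list[str]:
    """Return contract violations for one record; an empty list means it passes."""
    errors = []
    try:
        parsed = datetime.fromisoformat(str(record.get("read_at_utc")))
        if parsed.utcoffset() != timedelta(0):  # naive or non-UTC timestamps fail
            errors.append("timestamp is not UTC")
    except ValueError:
        errors.append("timestamp is missing or unparseable")
    if not isinstance(record.get("meter_id"), str):
        errors.append("meter identifier must be a string")
    return errors

violations = validate_record({"read_at_utc": "2024-05-01T02:00:00+00:00", "meter_id": "M-1001"})
assert violations == []
```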
Diagram summary
- Sources: Data originates from multiple sources across systems.
- Ingestion: Pull data, buffer it, and apply initial quality checks.
- Storage: Organised landing zones for raw, processed, and historical data.
- Processing: Transform raw data into forms that answer real questions.
- Consumption: Data reaches people and products for decisions.
Flow: Sources -> Ingestion; Ingestion -> Storage; Storage -> Processing; Processing -> Consumption
Diagram summary
- Input quality: Validate structure and constraints as data enters.
- Processing quality: Check logical coherence as data is transformed.
- Output quality: Confirm delivered data meets expectations.
- Pass: Data meets all quality criteria and moves forward.
- Quarantine: Prevent use of data that failed checks.
- Alert: Inform owners that something needs attention.
Flow: Input quality -> Processing quality; Processing quality -> Output quality; Output quality -> Pass; Input quality -> Quarantine; Processing quality -> Alert
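A minimal sketch of that pass/quarantine/alert routing, assuming records arrive as dicts; the single check and the print-based alert are illustrative stand-ins for real validation and alerting:

```python
def route_record(record: dict, checks: list) -> str:
    """Run quality checks in order; quarantine on first failure and alert the owner."""
    for check in checks:
        ok, reason = check(record)
        if not ok:
            print(f"ALERT: {reason} -> quarantining record")  # stand-in for a real alert channel
            return "quarantine"
    return "pass"

def has_meter_id(record: dict):
    """Illustrative input-quality check for the meter-reading example."""
    return ("meter_id" in record, "missing meter_id")

print(route_record({"meterId": "M-1001"}, [has_meter_id]))   # quarantine
print(route_record({"meter_id": "M-1001"}, [has_meter_id]))  # pass
```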
Interactive tool
Design a simple data pipeline
Drag and connect sources, processors, and consumers to see how data flows and where ownership sits.
Retrieval check
Quick check. Architectures and pipelines
Why do pipelines exist
To move, clean, and combine data so it can be used reliably.
Scenario. A dashboard is correct at 09:00 and wrong at 10:00. Name one pipeline failure mode that fits
A late arriving feed, a join key change, a schema change, a backfill re-running with different logic, or a duplicate event stream inflating counts.
What is batch movement
Moving data in scheduled chunks.
What is streaming movement
Moving small events continuously as they happen.
Scenario. A producer changes a field name and consumers silently break. What boundary was missing
A data contract boundary. You needed schema compatibility rules, versioning, and an alert on breaking changes.
Why start with a diagram
It reveals missing steps and ownership before building, and it makes failure modes and responsibilities visible.
Practice prompts
How to use Data Intermediate
This is where you stop being impressed by dashboards and start asking whether the data deserves trust.
- Good practice: Treat every dataset like a service. It has an owner, a contract, and quality guarantees. If those do not exist, you are relying on luck.
- Bad practice: Assuming that “it is in the warehouse” means “it is correct”. Warehouses can store lies very efficiently.
- Best practice: Write down the failure modes per pipeline hop and the detection signal for each. That turns data work into an operable system, not a fragile project.
Governance is agreeing how data is handled so people can work quickly without being reckless. Ownership is the person or team that decides purpose and access. Stewardship is the day-to-day care of definitions, quality, and metadata. Accountability means someone can explain how a change was made and why.
Policies are not paperwork for their own sake. They are guardrails. Who can see a column, how long data is kept, what checks run before sharing. Even a shared spreadsheet is governance in miniature. Who edits, who reviews, what happens when something looks off.
Trust grows when policies are clear, controls are enforced, and feedback loops exist. If a report is wrong and nobody owns it, confidence collapses quickly.
If a team shares a spreadsheet called “final_final_v7”, that is governance, just done badly. There is still access control (who has the link), retention (how long it stays in inboxes), and change control (who overwrote which cell).
The only difference is that it is informal, invisible, and impossible to audit.
My opinion: governance should feel like good design. It should make the safe thing the easy thing.
When governance feels like punishment, teams route around it and create risk you cannot even see.
Checklist
Governance failure patterns
Governance fails when policy and operating reality diverge.
- Policy disconnected from reality: Exceptions become standard practice when policies ignore day-to-day constraints.
- Ownership without time allocation: A title alone does not deliver accountability unless operational time is assigned.
- Manual checks for repeatable risks: Automate schema and row-count drift detection to reduce silent failures (a minimal drift check is sketched after this list).
- Temporary access with no expiry: Unbounded temporary access becomes a persistent security and compliance risk.
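A minimal row-count drift check of the kind described above; the 20% tolerance and the trailing-mean baseline are illustrative choices, not a standard, and should be tuned to the dataset's natural variation:

```python
def row_count_drift(today: int, history: list[int], tolerance: float = 0.2) -> bool:
    """Flag drift when today's count deviates from the trailing mean by more than tolerance."""
    baseline = sum(history) / len(history)
    return abs(today - baseline) / baseline > tolerance

if row_count_drift(today=52_000, history=[98_000, 101_000, 99_500]):
    print("ALERT: row count drifted; page the dataset owner")
```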
Checklist
Governance verification drill
Turn governance language into checks that can be executed.
- Define accountability envelope: Write purpose, owner, steward, retention period, and access scope for one dataset.
- Add one breaking-change guardrail: Define an automated check that catches schema breaks before release.
- Write an accountability sentence: State what investigation and response should look like if a report is wrong.
Diagram summary
- People: Define accountability and decide what data means.
- Policies: Codify how data is handled and protected.
- Data assets: Governed resources that teams rely on.
- Decisions: What people actually do with the data.
- Review: Learn from outcomes and adjust.
Flow: People -> Policies; Policies -> Data assets; Data assets -> Decisions; Decisions -> Review; Review -> Policies
Interactive tool
Try governing a dataset
Choose policies and see how risk, access, and usability shift for a sample dataset.
Retrieval check
Quick check. Governance and stewardship
Why does governance exist
To balance safe data use with speed and clarity.
What does ownership mean
Deciding purpose and access.
What does stewardship mean
Caring for definitions, quality, and metadata.
Why are policies useful
They are guardrails that prevent accidental misuse.
Scenario. A team asks for full access “temporarily” to ship a feature. What is a safer governance response
Grant least-privilege access for the minimum time, log it, and require justification and review. Temporary access without expiry is permanent risk.
What happens without governance
Confusion, rework, and falling trust in reports.
Interoperability means systems understand each other. It is shared meaning, not just shared pipes. Standards help through common formats, schemas, and naming. A schema is the agreed structure and data types. When systems misalign, data lands in the wrong fields, numbers become strings, or meaning gets lost.
Formats like JSON or CSV carry data, but standards and contracts explain what each field means. An API (application programming interface) without a contract is guesswork. A file without a schema requires detective work.
Small mismatches cause big effects. Dates in different orders, currencies missing, names split differently. Aligning schemas early saves hours of cleanup and prevents silent errors.
A join works only if the key represents the same thing on both sides. That sounds obvious until you meet real data.
One system uses customer_id as “account holder”. Another uses it as “billing contact”. Both are “customer”, until you try to reconcile charges.
My opinion: schema alignment is not a technical chore. It is a meaning negotiation. You need the business definition written down, not just the column name.
Cardinality describes how many records on one side of a relationship relate to how many on the other.
Checklist
Cardinality patterns
Know these patterns before writing joins on production data.
- One-to-one: Each record matches at most one record on the other side.
- One-to-many: One record can match multiple records on the other side.
- Many-to-many: Multiple records match multiple records and can multiply row counts quickly.
If a key value $k$ appears $a_k$ times in table A and $b_k$ times in table B, then an inner join produces $a_k \cdot b_k$ rows for that key.
Total join rows across all keys is:
$$\sum_k a_k b_k$$
This is why “duplicated keys” are not a small detail. They can change both correctness and performance.
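A quick way to see the multiplication, sketched in plain Python with `Counter`; the key lists are toy data:

```python
from collections import Counter

table_a_keys = ["k1", "k1", "k2", "k3"]        # a_k: k1 appears twice
table_b_keys = ["k1", "k1", "k1", "k2", "k2"]  # b_k: k1 three times, k2 twice

a_counts, b_counts = Counter(table_a_keys), Counter(table_b_keys)

# Inner-join row count: sum of a_k * b_k over the shared keys
join_rows = sum(a_counts[k] * b_counts[k] for k in a_counts.keys() & b_counts.keys())
print(join_rows)  # 2*3 + 1*2 = 8 rows, from only 4 and 5 input rows
```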
Checklist
Join trust checklist
Do this before sharing any joined metric.
- Confirm key meaning: Validate business definition match, not just data type match.
- Check uniqueness and expected duplication: Where keys are non-unique, define whether multiplicative joins are acceptable.
- Compare row-count deltas: Explain row-count changes from pre-join to post-join before publishing.
Diagram summary
- Start join: Begin by asking if foundations are aligned.
- Keys aligned: Do both sides define the join key the same way?
- Reconcile: Define business meaning explicitly before proceeding.
- Check types: Do types and nullable constraints match?
- Transform: Insert transformation and validation at boundaries.
- Cardinality: Are keys unique or do they create row explosion?
- Profile: Measure actual key distributions and duplication.
- Run join: Execute and compare row counts before and after.
- Publish: Share results with assumptions and caveats documented.
Flow: Start join -> Keys aligned; Keys aligned -> Reconcile; Keys aligned -> Check types; Check types -> Transform; Transform -> Cardinality; Cardinality -> Profile; Profile -> Run join; Run join -> Publish
Interactive tool
Map data between systems
Align two simple schemas and see what breaks when fields do not match.
Retrieval check
Quick check. Interoperability and standards
What is interoperability
Systems sharing data with the same meaning.
What is a schema
An agreed structure and types for data.
Scenario. Two systems both have a field called `date` but one means local time and one means UTC. What should you do
Write the meaning explicitly in the contract, convert at the boundary, and add validation. Same name does not mean same meaning.
How do standards help
They reduce guesswork and make systems predictable.
What happens when schemas mismatch
Fields break, meaning is lost, and errors spread.
Why do contracts matter for APIs
They tell clients exactly what to send and expect.
Analysis is asking good questions of data and checking that the answers hold up. Descriptive thinking asks what happened. Diagnostic thinking asks why. Statistics exist to separate signal from noise. Averages summarise, distributions show spread, trends show direction.
Averages hide detail. A long tail or a split between groups can change the story. Trends can be seasonal or random. Always pair a number with context. When it was measured, who is included, what changed.
Insight is not a chart. It is a statement backed by data and understanding. Decisions follow insight, and they should note assumptions so they can be revisited.
If two things move together, it might be causation, or it might be a shared driver, or it might be coincidence.
In real organisations this becomes painful when a dashboard shows “A rose, then B rose”, and someone writes a strategy based on it.
My opinion: the best analysts are sceptical. They do not say “the chart says”. They say “the chart suggests, under these assumptions”.
The mean $\bar{x}$ is sensitive to outliers. The median is the middle value when sorted.
If the mean and median disagree strongly, that is a clue that the distribution is skewed or has outliers.
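A tiny worked example of that disagreement, using the standard library; the ticket counts are invented:

```python
from statistics import mean, median

# Monthly support tickets per customer: most file a few, one files many.
tickets = [1, 1, 2, 2, 3, 3, 4, 120]

print(mean(tickets))    # 17.0: dragged up by the outlier
print(median(tickets))  # 2.5: the typical customer
```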
Pearson correlation between variables $X$ and $Y$ can be written as:
$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
- $\operatorname{cov}(X, Y)$: covariance (how the variables vary together)
- $\sigma_X, \sigma_Y$: standard deviations of $X$ and $Y$
- $r$: correlation coefficient in $[-1, 1]$
Interpretation: $r$ measures linear association. It does not tell you the direction of causality, and it can be distorted by outliers.
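A from-scratch sketch of the formula above (population form), so nothing hides behind a library call:

```python
from math import sqrt

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y), population form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0: perfectly linear, says nothing about causality
```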
- $H_0$: a null hypothesis (for example: no difference between groups)
- $H_1$: an alternative hypothesis (there is a difference)
- Compute a test statistic from data and derive a p-value under $H_0$
The p-value is not “the probability the null is true”. It is the probability of observing data as extreme as yours (or more extreme), assuming $H_0$ is true.
This distinction matters because people misuse p-values constantly.
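One concrete way to obtain a p-value that matches this definition is a permutation test: shuffle the group labels many times and count how often the shuffled difference is at least as extreme as the observed one. A sketch, assuming two numeric groups and exchangeable labels under $H_0$:

```python
import random

def permutation_p_value(group_a: list[float], group_b: list[float], trials: int = 10_000) -> float:
    """Approximate two-sided p-value for a difference in means under H0."""
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    extreme = 0
    for _ in range(trials):
        random.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        # Count shuffles at least as extreme as the observed difference
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1
    return extreme / trials
```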
Checklist
Analysis failure patterns
These are the most common causes of confident but wrong conclusions.
- Correlation treated as causation: Association alone cannot justify causal claims without proper design.
- Single metric without uncertainty: A lone number without spread or caveats hides risk and variability.
- Baseline mismatch: Comparing groups with different baselines creates false performance claims.
- Metric definition drift ignored: Changes in logging, filtering, or exclusions can invalidate trend comparisons.
Checklist
Defensible insight checklist
Write this before presenting any high-impact claim.
- State the insight sentence: Write the conclusion in one sentence, then list explicit assumptions underneath.
- Check one counterexample: Test one alternative explanation that could also fit the data.
- Define mind-changing evidence: State what new data would make you revise the conclusion.
Diagram summary
- Raw data: Observations from the world.
- Aggregation: Calculate means, distributions, and correlations.
- Insight: Write one sentence summarising what you think the data shows.
- Challenge: Find one alternative explanation that fits the same data.
- Decision: Use the insight to make a choice and run an experiment.
- Feedback: Collect results and compare to prediction.
Flow: Raw data -> Aggregation; Aggregation -> Insight; Insight -> Challenge; Challenge -> Decision; Decision -> Feedback; Feedback -> Aggregation
Interactive tool
Explore patterns in data
Adjust simple filters and aggregations and watch insights change.
Retrieval check
Quick check. Analysis and insight
What is descriptive thinking
Explaining what happened.
What is diagnostic thinking
Explaining why something happened.
Scenario. The average looks fine but users complain. What should you check next
Percentiles and segments. Look at tails (p95, p99) and group splits. The mean often hides the pain.
Why does context matter
Time, group, and change affect interpretation.
What is an insight
A statement backed by data and understanding.
Data work is mostly uncertainty management. Probability is how we stay honest about that.
You do not need to love maths to use probability well. You need to be disciplined about what you are claiming.
If a pipeline succeeds 99% of the time, it still fails 1 day in 100. Over a year that is multiple failures.
The question is not “is 99 good”. The question is “what happens on the failure days, and what does it cost”.
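The arithmetic behind that claim, assuming one independent run per day; both the independence and the fixed daily rate are simplifications:

```python
daily_success = 0.99
runs_per_year = 365

expected_failures = runs_per_year * (1 - daily_success)  # ~3.65 failures a year
p_at_least_one = 1 - daily_success ** runs_per_year      # ~0.974, assuming independent runs

print(f"{expected_failures:.1f} expected failures, "
      f"{p_at_least_one:.0%} chance of at least one")
```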
Checklist
Probability failure patterns
These errors make reliable systems look safer than they are.
- Percentage and probability mixed: 12% and a probability of 0.12 are the same quantity; treating them as different numbers in a calculation creates bad results.
- Rare treated as impossible: Low frequency events still dominate impact in many operational systems.
- Normality assumed by default: Heavy-tail behaviour is common in outages, latency, and fraud patterns.
Checklist
Probability sanity checks
Answer these before accepting reliability claims.
- Expected count check: If probability is 1%, estimate expected events over 10,000 runs.
- Sampling blind-spot check: If monitoring samples 1% of events, identify what failure types might be missed.
Diagram summary
- Define event: State clearly what event you are measuring and in what population.
- Estimate: Calculate the probability from historical data or reasoning.
- Convert: Multiply probability by scale to get expected number of failures.
- Assess impact: Evaluate damage, recovery effort, and user harm per failure.
- Adjust: Update alerts, redundancy, and recovery designs based on impact.
Flow: Define event -> Estimate; Estimate -> Convert; Convert -> Assess impact; Assess impact -> Adjust; Adjust -> Estimate
Retrieval check
Quick check. Probability and distributions
What does probability help you do in data work
Stay honest about uncertainty and avoid overconfident claims from limited observations.
Scenario. A pipeline succeeds 99% of the time. Over a year, why can that still be painful
Because 1% failure still means multiple failures across many runs, and those failures can hit on high impact days.
What is a distribution
A description of how values are spread, not just the average.
Why can the mean be misleading
Outliers and skew can make the mean hide the typical experience.
What is one reason heavy tails matter in services
Rare slow or failing events can dominate user experience and cost, even if the average looks fine.
Inference is the art of learning about a bigger reality from limited observations.
This matters because most datasets are not the full world. They are a sample, often a biased one.
You analyse only customers who completed a journey because that is what is easy to track.
Your dashboard shows high satisfaction.
The people who dropped off never appear, so the system looks healthier than it is.
Checklist
Inference failure patterns
Inference breaks when sample limitations are hidden.
- Observed treated as universal truth: Sample evidence does not automatically generalise to all users or contexts.
- Point estimate without uncertainty: No interval or sample size means no basis for confidence (see the interval sketch after this list).
- A/B tests treated as truth machines: Instrumentation gaps and assignment bias can invalidate conclusions.
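As a sketch of what an interval adds, here is a normal-approximation interval for a proportion; the approximation itself is an assumption and gets rough for small samples or extreme rates:

```python
from math import sqrt

def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a proportion (normal approximation)."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(proportion_ci(6, 10))      # roughly (0.30, 0.90): 10 users tell you very little
print(proportion_ci(600, 1000))  # roughly (0.57, 0.63): same rate, far tighter claim
```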
Checklist
Sampling integrity checklist
Use this before publishing any experiment result.
- Inclusion and exclusion map: State clearly who is represented and who is missing.
- Dropout mechanism check: Document what causes people or events to disappear from the dataset.
- Instrumentation drift detector: Define how you will detect measurement process changes over time.
Retrieval check
Quick check. Inference and experiments
What is inference
Learning about a bigger reality from limited observations.
Scenario. You analyse only customers who completed a journey. What is the risk
Survivorship bias. You miss the people who dropped out, so you overestimate success and satisfaction.
Why does sample size matter
Small samples produce noisy estimates and can make random variation look like a pattern.
What is one question you ask before trusting a result
Who is included, who is missing, and why.
What breaks an A/B test quietly
Bias in who gets assigned, changes in instrumentation, and differences in data capture between groups.
Modelling is not magic. It is choosing inputs, choosing an objective, and checking failure modes.
The purpose of modelling is not to impress people. It is to make a useful prediction with known limitations.
If only 1% of cases are fraud, a model that always predicts “not fraud” gets 99% accuracy.
That is why evaluation needs multiple metrics and a clear cost model for errors.
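The arithmetic of the accuracy trap, with invented numbers:

```python
# 10,000 cases, 1% fraud; the "model" always predicts not-fraud.
positives, total = 100, 10_000
true_negatives = total - positives

accuracy = true_negatives / total  # 0.99: looks excellent
recall = 0 / positives             # 0.0: catches no fraud at all

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```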
Checklist
Model risk patterns
These mistakes create high apparent performance and poor real outcomes.
- Leakage: The model sees information that proxies the answer and inflates performance.
- Single-metric optimisation: Optimising one metric can increase false positives, workload, or trust harm.
- Static thresholds: Thresholds must be reviewed as behaviour, costs, and base rates change.
Checklist
Minimal model review
Run this before sign-off and after major context changes.
- Label governance: Define label ownership and how label quality is checked.
- Feature proxy review: Inspect top features for hidden proxies and bias channels.
- Error-cost framing: Quantify false-positive and false-negative cost by stakeholder group.
- Human-in-loop design: Specify where human review occurs and what authority that reviewer has.
Retrieval check
Quick check. Modelling basics
Why can 99% accuracy be useless
If the positive cases are rare, a model can look accurate while missing every important case.
What is leakage
When the model learns from information that would not be available at prediction time, often a proxy for the answer.
Why do thresholds matter
They trade false positives against false negatives, which changes workload, cost, and harm.
What is one question you ask about labels
Who decides the label and whether it is reliable and consistent.
What does human in the loop mean in practice
A defined point where a person reviews or overrides the model, with clear criteria and feedback into improvement.
A mature organisation treats important datasets like products. They have owners, documentation, quality expectations, and support.
This is how you reduce “shadow spreadsheets” and make reuse normal.
If every request becomes a one-off extract, you are not serving data. You are doing bespoke reporting at scale.
A data product replaces that with a stable interface, clear meaning, and quality guarantees.
Checklist
Data product one-page template
This is the minimum viable contract for reusable datasets; a machine-readable sketch follows the list.
- Name: State what the product is and what it is not.
- Owner: Name accountable owner and support route.
- Refresh: State update frequency and freshness target.
- Quality: List mandatory checks and failure handling behaviour.
- Access: Define who can use it and under what conditions.
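One way to make that one-pager machine-readable, sketched as a dataclass; every field name and value here is illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class DataProductPage:
    """Minimum viable contract for a reusable dataset."""
    name: str
    owner: str                 # accountable person or team, plus support route
    refresh: str               # update frequency and freshness target
    quality_checks: list[str] = field(default_factory=list)
    access: str = "least-privilege, documented approval"

meter_readings = DataProductPage(
    name="daily_meter_readings",
    owner="energy-data-team (support: #data-help)",
    refresh="daily by 06:00 UTC, freshness alert after 09:00",
    quality_checks=["schema matches contract", "row count within 20% of trailing mean"],
)
```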
Retrieval check
Quick check. Data as a product
What does it mean to treat data as a product
Important datasets have owners, documentation, quality expectations, and support, like a service.
Why does 'send me the extract' culture hurt
It creates one off work, inconsistent definitions, and fragile decision making.
Name two things a data product page should include
Owner, refresh cadence, definition, quality checks, and access rules.
What is one benefit of stable interfaces for data
Teams can reuse data without repeated bespoke work and without silently changing meaning.
What is one risk if ownership is unclear
Problems do not get fixed, and trust in the dataset collapses.
Data risk is broader than security. Misuse, misinterpretation, and neglect can harm people and decisions. Ethics asks whether we should use data in a certain way, not just whether we can. Strategic value comes from using data to improve services, not just to collect more of it.
Risk points appear along the lifecycle. Collection without consent, processing without checks, sharing beyond purpose, keeping data forever. Controls and culture reduce these risks. A small habit, like logging changes or reviewing outliers, prevents large mistakes.
Treat data as a long term asset. Good stewardship, clear value cases, and honest communication build trust that lasts beyond a single project.
Checklist
Risk and ethics judgement drill
Use this to build practical judgement, not abstract compliance language.
- Choose a least-bad option: Pick one risky scenario and justify the decision explicitly.
- Name likely harm and stakeholder: State who is most exposed and what concrete harm looks like.
- Add one realistic control: Choose a safeguard a small team can actually run consistently.
Diagram summary
- Collect: Data originates from users, systems, or sensors.
- Consent: Verify that collection matches declared purpose and user understanding.
- Process: Clean, aggregate, and prepare data for use.
- Bias controls: Test whether processing creates unfair outcomes for groups.
- Share: Provide data to consumers inside and outside the organisation.
- Access control: Log who accessed what and enforce least-privilege access.
- Archive: Move old data to cold storage or delete it per retention policy.
- Retention: Regularly review whether data still serves its purpose.
Flow: Collect -> Consent; Consent -> Process; Process -> Bias controls; Bias controls -> Share; Share -> Access control; Access control -> Archive; Archive -> Retention
Interactive tool
Spot data risks
Review short scenarios and identify the risks and consequences.
Retrieval check
Quick check. Risk, ethics, and strategic value
What is data risk beyond security
Misuse, misinterpretation, or harm from poor handling.
Scenario. A vendor offers a discount if you share richer customer data. What question should you ask first
Does this fit the original purpose and consent. If not, the right answer is usually no, or a redesigned consent and minimisation approach.
Why do ethics matter
To ensure data use respects people and purpose.
Where do risks appear
All along the lifecycle. Collect, process, share, and retain.
What builds long term value
Clear purpose, stewardship, and honest communication.
Why log changes and outliers
Small habits catch issues before they spread.
Intermediate reuses the data quality sandbox, the data flow visualiser, and the data interpretation exercises introduced in Foundations. The tools gain more options, but the familiar interfaces stay the same so you can focus on thinking, not navigation.
A dashboard is a user interface for decision-making. If the UX is confusing, you do not get slightly worse decisions. You get busy people making fast guesses.
A chart uses a truncated y-axis so a small change looks dramatic. The story is compelling, funding arrives, and later the organisation realises the effect size was tiny.
This is not only a chart problem. It is a governance problem, because the decision process did not demand context and uncertainty.
Checklist
Communication failure patterns
These errors cause confident misinterpretation.
- Truncated axes: Missing baselines exaggerate change and distort priority decisions.
- Unclear time windows: Users compare unlike periods and infer false trends.
- Mixed units on one chart: Counts, rates, and percentages require explicit separation or annotation.
- Inaccessible colour design: Poor contrast or colour-only encoding excludes users and hides insight.
Checklist
Chart quality checklist
Use this checklist before publishing dashboards to leadership.
- Unit clarity: A reader should know units immediately without guessing.
- Population and time-window clarity: State who is included and what timeframe is represented.
- Scale integrity: Use honest, consistent scales across comparable views.
- Insight plus caveat sentence: State one key conclusion and one limitation in plain language.
This level is CPD-friendly because it builds professional judgement. The best evidence is not “I read a page”. It is “I designed a control, checked a failure mode, and changed how I will work”.
Checklist
CPD evidence template
Capture one honest, reusable record from this level.
- What I studied: Pipelines, governance, interoperability, analysis, and risk.
- What I applied: One pipeline diagram for a real dataset plus one silent-failure check.
- What I learned: One insight about schema, keys, or definitions that will change my standard practice.
- Evidence artefact: Attach the diagram with a short note on assumptions and verification checks.