CPD timing for this level

Intermediate time breakdown

This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.

  • Reading: 29m (4,388 words · base 22m × 1.3)
  • Labs: 75m (5 activities × 15m)
  • Checkpoints: 25m (5 blocks × 5m)
  • Reflection: 40m (5 modules × 8m)
Estimated guided time: 2h 49m, based on page content and disclosed assumptions.
Claimed level hours: 10h. The claim includes reattempts, deeper practice, and capstone work.
The claimed hours are higher than the current on-page estimate by about 7h. That gap is where I will add more guided practice and assessment-grade work so the hours are earned, not declared.

What changes at this level

Level expectations

I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.

Anchor standards (course wide)
  • DAMA-DMBOK (data management framework)
  • UK GDPR and ICO guidance (where privacy matters)
Assessment intent: applied. Schemas, pipelines, and trust signals.
Assessment style: scenario format.
Pass standard: coming next.

Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.

Evidence you can save (CPD friendly)
  • A small pipeline design with failure modes and the detection signal for each hop.
  • A schema contract note: key fields, constraints, and how you handle backwards compatibility.
  • A governance decision log: one rule you introduced, why it matters, and what evidence proves it is working.

Data Intermediate


CPD tracking

Fixed hours for this level: 10. Timed assessment time is included once on pass.


CPD and certification alignment (guidance, not endorsed)

This level is about real organisations: contracts, governance, pipelines, and evidence. It maps well to:

  • DAMA-DMBOK and CDMP-style expectations (governance, stewardship, quality, lineage)
  • BCS data and analysis oriented professional skills (clarity, communication, defensible decision-making)
  • Cloud data engineering tracks (AWS, Azure, Google Cloud) for architecture, pipelines, and operating reality
How to use Data Intermediate
This is where you stop being impressed by dashboards and start asking whether the data deserves trust.
Good practice: treat every dataset like a service. It has an owner, a contract, and quality guarantees. If those do not exist, you are relying on luck.

This level moves from concepts to systems. The focus is on design, governance, interoperability, and analysis that fit real organisations. The notes are written as my own checklist so you start thinking like a data professional, not just a tool user.


🏗️

Data architectures and pipelines

Concept block: Pipeline with contracts
A pipeline is safer when interfaces are explicit and tested.
Assumptions: contracts are versioned; monitoring exists.
Failure modes: breaking downstream; data drift.

Data architecture is how data is organised, moved, and protected across systems. It sets the lanes so teams can build without tripping over each other. Pipelines exist because raw data is messy and scattered. They pull from sources, clean and combine, and land it where people and products can use it.

There are two broad ways data moves. Batch means scheduled chunks. Streaming means small events flowing continuously. Both need clear boundaries so one team's changes do not break another's work. If a pipeline fails, dashboards go blank, models drift, and trust drops.

When you design a pipeline, think about ownership at each hop, the contracts between steps, and how to recover when something breaks. A simple diagram often exposes gaps before a single line of code is written.

Worked example. A pipeline that fails silently is worse than one that fails loudly

Imagine a daily batch pipeline that loads meter readings. One day, a source system changes a column name from meter_id to meterId. The ingestion step still runs. The storage step still runs. Your dashboard still loads. It just starts showing zeros because the join keys no longer match.

My opinion: silent failure is the main enemy of data work. It looks like success and it teaches people to distrust the whole system. If you build only one thing into a pipeline, build a check that screams when the shape or meaning changes.
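
To make that concrete, here is a minimal sketch of an ingestion check that fails loudly when the shape changes. The column names, types, and the `meter_readings.csv` file are illustrative assumptions, not a fixed implementation.

```python
import pandas as pd

# Illustrative contract for a meter readings feed (names are assumptions).
REQUIRED_COLUMNS = ["meter_id", "timestamp", "reading_kwh"]

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly at ingestion if the shape or meaning of the batch changes."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # The day the source renames meter_id to meterId, this raises here
        # instead of letting the dashboard quietly show zeros.
        raise ValueError(f"Contract broken: missing columns {missing}")
    if df.empty:
        raise ValueError("Contract broken: batch is empty")

    out = df.copy()
    out["meter_id"] = out["meter_id"].astype("string")              # identifier, never arithmetic
    out["timestamp"] = pd.to_datetime(out["timestamp"], utc=True)   # must parse as UTC
    out["reading_kwh"] = out["reading_kwh"].astype("float64")
    return out

if __name__ == "__main__":
    batch = pd.read_csv("meter_readings.csv")  # hypothetical source file
    validated = validate_batch(batch)
    print(f"Loaded {len(validated)} validated rows")
```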

Common mistakes (pipeline edition)

  • Building ingestion without a contract (types, required fields, allowed ranges, units).
  • Treating “batch vs streaming” as a fashion choice rather than a latency and reliability decision.
  • Forgetting recovery. If the pipeline fails at 02:00, what happens at 09:00 when people open the dashboard.
  • Not naming an owner for each step, then acting surprised when nothing gets fixed.

Verification. Prove you can reason about it

  • Draw a pipeline for one real dataset you know. Identify one failure mode per hop.
  • Write one “contract sentence” for ingestion. Example: “Every record must have a timestamp in UTC and a meter identifier treated as a string.”
  • Decide whether the dataset is batch or streaming and justify it with a latency requirement.

Simple data flow: sources to insight with clear boundaries

  • Sources: apps, sensors, files.
  • Ingestion: collect, queue, validate.
  • Storage: databases, lakes, warehouse.
  • Processing: clean, join, aggregate.
  • Consumption: dashboards, models, APIs.

Quick check. Architectures and pipelines

Why do pipelines exist

Scenario: A dashboard is correct at 09:00 and wrong at 10:00. Name one pipeline failure mode that fits

What is batch movement

What is streaming movement

Scenario: A producer changes a field name and consumers silently break. What boundary was missing

Why start with a diagram

📋

Data governance and stewardship

Concept block: Governance as an operating model
Governance works when it connects policy to enforcement and evidence.
Assumptions: policies are enforceable; stewards have authority.
Failure modes: policy without tooling; approval bottlenecks.

Governance is agreeing how data is handled so people can work quickly without being reckless. Ownership is the person or team that decides purpose and access. Stewardship is the day to day care of definitions, quality, and metadata. Accountability means someone can explain how a change was made and why.

Policies are not paperwork for their own sake. They are guardrails. Who can see a column, how long data is kept, what checks run before sharing. Even a shared spreadsheet is governance in miniature. Who edits, who reviews, what happens when something looks off.

Trust grows when policies are clear, controls are enforced, and feedback loops exist. If a report is wrong and nobody owns it, confidence collapses quickly.

Worked example. The spreadsheet is still governance

If a team shares a spreadsheet called “final_final_v7”, that is governance, just done badly. There is still access control (who has the link), retention (how long it stays in inboxes), and change control (who overwrote which cell). The only difference is that it is informal, invisible, and impossible to audit.

My opinion: governance should feel like good design. It should make the safe thing the easy thing. When governance feels like punishment, teams route around it and create risk you cannot even see.

Common mistakes (governance edition)

  • Writing policies that do not match reality, then ignoring the exceptions until they become the norm.
  • Treating “owner” as a title rather than a responsibility with time allocated.
  • Relying on manual checks for things that should be automated, like schema changes and row count drift.
  • Giving everyone access “temporarily” and then never removing it.

Verification. Prove it is not just words

  • Choose one dataset and write: purpose, owner, steward, retention period, and who can access it.
  • Define one check that would catch a breaking schema change (a minimal sketch follows this list).
  • Write one sentence explaining what “accountability” would look like if the report was wrong.
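
As a concrete version of the checks above, here is a minimal sketch that compares today's extract against a recorded baseline. The file name, columns, thresholds, and stored baseline are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical baseline recorded from the last known-good load.
BASELINE = {
    "columns": ["customer_id", "postcode", "signup_date"],
    "row_count": 120_000,
}
MAX_ROW_DRIFT = 0.10  # assumed threshold: flag if row count moves more than 10%

def check_dataset(df: pd.DataFrame, baseline: dict) -> list[str]:
    """Return a list of governance findings; an empty list means the load looks healthy."""
    findings = []

    # Breaking schema change: columns added or removed since the baseline.
    added = set(df.columns) - set(baseline["columns"])
    removed = set(baseline["columns"]) - set(df.columns)
    if added or removed:
        findings.append(f"Schema change: added={sorted(added)}, removed={sorted(removed)}")

    # Row count drift: a silent filter or a duplicated load both show up here.
    drift = abs(len(df) - baseline["row_count"]) / baseline["row_count"]
    if drift > MAX_ROW_DRIFT:
        findings.append(f"Row count drift of {drift:.1%} vs baseline {baseline['row_count']}")

    return findings

if __name__ == "__main__":
    today = pd.read_parquet("customers_extract.parquet")  # hypothetical extract
    for finding in check_dataset(today, BASELINE):
        print("GOVERNANCE CHECK FAILED:", finding)
```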

Governance loop: people, policies, data, decisions

  • People: owner, steward, users.
  • Policies: access, retention, checks.
  • Data assets: tables, files, dashboards.
  • Decision loop: use, review, improve.

Quick check. Governance and stewardship

Why does governance exist

What does ownership mean

What does stewardship mean

Why are policies useful

Scenario: A team asks for full access “temporarily” to ship a feature. What is a safer governance response

What happens without governance

🔗

Interoperability and standards

Concept block: Interoperability boundary
Standards reduce translation work by creating stable interfaces and shared meaning.
Assumptions: terms are mapped; interfaces are stable.
Failure modes: translation drift; standards as theatre.

Interoperability means systems understand each other. It is shared meaning, not just shared pipes. Standards help through common formats, schemas, and naming. A schema is the agreed structure and data types. When systems misalign, data lands in the wrong fields, numbers become strings, or meaning gets lost.

Formats like JSON or CSV carry data, but standards and contracts explain what each field means. An API (application programming interface) without a contract is guesswork. A file without a schema requires detective work.

Small mismatches cause big effects. Dates in different orders, currencies missing, names split differently. Aligning schemas early saves hours of cleanup and prevents silent errors.

Worked example. Join keys are mathematics, not vibes

A join works only if the key represents the same thing on both sides. That sounds obvious until you meet real data. One system uses customer_id as “account holder”. Another uses it as “billing contact”. Both are “customer”, until you try to reconcile charges.

My opinion: schema alignment is not a technical chore. It is a meaning negotiation. You need the business definition written down, not just the column name.

Maths ladder. Cardinality and why joins explode

Foundations. One-to-one, one-to-many, many-to-many

Cardinality describes how many records relate to how many records.

  • One-to-one: each record matches at most one record on the other side.
  • One-to-many: one record matches many on the other side.
  • Many-to-many: many records match many, which can multiply rows quickly.

Undergraduate. Why many-to-many joins can blow up row counts

If a key value $k$ appears $a_k$ times in table A and $b_k$ times in table B, then an inner join produces $a_k \cdot b_k$ rows for that key. The total number of join rows across all keys is:

$\sum_k a_k b_k$

This is why “duplicated keys” are not a small detail. They can change both correctness and performance.
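
A minimal pandas sketch of that effect, with made-up tables: a uniqueness check before the join and a row-count comparison after it.

```python
import pandas as pd

# Made-up example: customers and payments share customer_id, but both sides have duplicates.
customers = pd.DataFrame({"customer_id": [1, 1, 2], "name": ["Ann", "Ann (dup)", "Bea"]})
payments = pd.DataFrame({"customer_id": [1, 1, 1, 2], "amount": [10, 20, 30, 40]})

# Check uniqueness on both sides before trusting the join.
for label, df in [("customers", customers), ("payments", payments)]:
    dupes = df["customer_id"].duplicated().sum()
    print(f"{label}: {len(df)} rows, {dupes} duplicated key values")

joined = customers.merge(payments, on="customer_id", how="inner")

# Key 1 appears 2 times on one side and 3 on the other: 2 x 3 = 6 joined rows.
# Key 2 appears once on each side: 1 x 1 = 1. Total 7 rows, not 4.
print(f"rows before: {len(customers)} and {len(payments)}, rows after join: {len(joined)}")
```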

Verification. What you check before you trust a join

  • Confirm key definitions match (meaning, not only type).
  • Check uniqueness on both sides. If not unique, decide if duplication is expected and safe.
  • Compare row counts before and after the join and explain the change.

Interoperability check: alignment vs mismatch

  • Aligned: fields match, types agree, meaning is clear.
  • Mismatch: wrong types, missing fields, swapped meanings.

Quick check. Interoperability and standards

What is interoperability

What is a schema

Scenario: Two systems both have a field called `date` but one means local time and one means UTC. What should you do

How do standards help

What happens when schemas mismatch

Why do contracts matter for APIs

📊

Data analysis and insight generation

Concept block: Metric choice is a decision
Analysis is not only calculation. It is choosing what question you are answering.
Assumptions: the metric reflects reality; uncertainty is acknowledged.
Failure modes: spurious correlation; Goodhart effects.

Analysis is asking good questions of data and checking that the answers hold up. Descriptive thinking asks what happened. Diagnostic thinking asks why. Statistics exist to separate signal from noise. Averages summarise, distributions show spread, trends show direction.

Averages hide detail. A long tail or a split between groups can change the story. Trends can be seasonal or random. Always pair a number with context. When it was measured, who is included, what changed.

Insight is not a chart. It is a statement backed by data and understanding. Decisions follow insight, and they should note assumptions so they can be revisited.

Worked example. Correlation is not a permission slip for causation

If two things move together, it might be causation, or it might be a shared driver, or it might be coincidence. In real organisations this becomes painful when a dashboard shows “A rose, then B rose”, and someone writes a strategy based on it.

My opinion: the best analysts are sceptical. They do not say “the chart says”. They say “the chart suggests, under these assumptions”.

Maths ladder. From intuition to inference

Foundations. Mean, median, and why you need both

The mean $\bar{x}$ is sensitive to outliers. The median is the middle value when sorted. If the mean and median disagree strongly, that is a clue that the distribution is skewed or has outliers.

A level. Correlation (Pearson) and what it measures

Pearson correlation between variables $X$ and $Y$ can be written as:

$r = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$

  • $\mathrm{cov}(X,Y)$: covariance (how the variables vary together)
  • $\sigma_X, \sigma_Y$: standard deviations of $X$ and $Y$
  • $r$: correlation coefficient in $[-1, 1]$

Interpretation: $r$ measures linear association. It does not tell you direction of causality, and it can be distorted by outliers.
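
A small numpy sketch with made-up numbers, showing how a single extreme point drags the mean and swings Pearson $r$, while the median barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=200)        # made-up measurements
y = x + rng.normal(0, 5, size=200)     # a loosely related variable

def summary(x, y, label):
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label}: mean={x.mean():.1f}, median={np.median(x):.1f}, r={r:.2f}")

summary(x, y, "clean")

# One extreme, mismatched point (for example a unit error in a single record)
# is enough to move the mean and distort the correlation.
x_out = np.append(x, 500.0)
y_out = np.append(y, 0.0)
summary(x_out, y_out, "with outlier")
```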

Undergraduate. A minimal taste of hypothesis testing

A typical structure:

  • $H_0$: a null hypothesis (for example: no difference between groups)
  • $H_1$: an alternative hypothesis (there is a difference)
  • Compute a test statistic from data and derive a p-value under $H_0$

The p-value is not “the probability the null is true”. It is the probability of observing data as extreme as yours (or more) assuming $H_0$ is true. This distinction matters because people misuse p-values constantly.
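
A minimal permutation-test sketch with made-up group data, to show the structure (null hypothesis, test statistic, p-value) without leaning on any particular named test.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up outcomes for two groups (for example, task completion times).
group_a = rng.normal(10.0, 2.0, size=40)
group_b = rng.normal(11.0, 2.0, size=40)

observed = group_b.mean() - group_a.mean()   # test statistic: difference in means

# Under H0 (no difference) the group labels are exchangeable,
# so shuffle labels many times and see how extreme the observed difference is.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
diffs = []
for _ in range(10_000):
    permuted = rng.permutation(pooled)
    diffs.append(permuted[n_a:].mean() - permuted[:n_a].mean())

# Two-sided p-value: share of permuted differences at least as extreme as observed.
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed difference: {observed:.2f}, p-value: {p_value:.3f}")
```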

Common mistakes (analysis edition)

  • Treating correlation as causation without a design that supports causal claims.
  • Reporting one metric without uncertainty, spread, or context.
  • Comparing groups with different baselines and calling it “performance”.
  • Forgetting that the definition of the metric can change (new logging, new filtering, new exclusions).

Verification. Prove your insight is defensible

  • Write your insight as a sentence and list the assumptions underneath it.
  • Show one counterexample or alternative explanation you checked.
  • State what data would change your mind.

From data to decision: question, summarise, decide

  • Raw data: events, readings, logs.
  • Aggregation: group, filter, summarise.
  • Insight: statement linked to evidence.
  • Decision: action, experiment, follow up.

Quick check. Analysis and insight

What is descriptive thinking

What is diagnostic thinking

Scenario: The average looks fine but users complain. What should you check next

Why does context matter

What is an insight

🎲

Probability and distributions (uncertainty without the panic)

Concept block: Distributions shape decisions
A mean is not a system. Distributions show variability and risk.
Assumptions: variation matters; outliers are explained.
Failure modes: mean worship; ignoring skew.

Data work is mostly uncertainty management. Probability is how we stay honest about that. You do not need to love maths to use probability well. You need to be disciplined about what you are claiming.

Worked example. “It usually works” is not a reliability statement

If a pipeline succeeds 99% of the time, it still fails 1 day in 100. Over a year that is multiple failures. The question is not “is 99 good”. The question is “what happens on the failure days, and what does it cost”.
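
The arithmetic behind that, as a short sketch. The daily schedule and the 99% figure come from the example; everything else is illustrative.

```python
import numpy as np

p_fail = 0.01          # pipeline fails 1 day in 100
days = 365             # one run per day for a year

expected_failures = days * p_fail            # roughly 3.7 failure days per year
p_at_least_one = 1 - (1 - p_fail) ** days    # roughly 0.97: a failure-free year is unlikely

# Simulate a few "years" to see the spread, not just the average.
rng = np.random.default_rng(7)
failures_per_year = rng.binomial(days, p_fail, size=5)

print(f"expected failure days per year: {expected_failures:.1f}")
print(f"probability of at least one failure in a year: {p_at_least_one:.2f}")
print(f"five simulated years: {failures_per_year}")
```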

Common mistakes with probability

  • Storing some values as percentages (0–100) and others as probabilities (0–1), then comparing numbers that are not on the same scale.
  • Treating rare events as impossible because you have not seen them yet.
  • Assuming normal distributions for everything. Many real-world systems have heavy tails.

Verification. A simple sanity check

  • If an event happens with probability 1%, how often would you expect it over 10,000 runs.
  • If your monitoring only samples 1% of events, what might you miss.

🧪

Inference, sampling, and experiments

Concept block: Inference needs design
Inference works when you design what you can claim, and how you will test it.
Assumptions: sampling is honest; claims are bounded.
Failure modes: p-hacking behaviour; ignoring confounders.

Inference is the art of learning about a bigger reality from limited observations. This matters because most datasets are not the full world. They are a sample, often a biased one.

Worked example. The “successful customers” dataset that hides the problem

You analyse only customers who completed a journey because that is what is easy to track. Your dashboard shows high satisfaction. The people who dropped off never appear, so the system looks healthier than it is.
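
A tiny simulation of that trap, with made-up numbers: satisfaction is only recorded for people who complete the journey, so the measured average looks better than the true population average.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users = 10_000

# Made-up model: underlying satisfaction score 0-10, and unhappy users drop out more often.
satisfaction = rng.normal(6.0, 2.0, size=n_users).clip(0, 10)
p_complete = 0.2 + 0.07 * satisfaction          # happier users are more likely to finish
completed = rng.random(n_users) < p_complete

print(f"true average satisfaction (everyone):  {satisfaction.mean():.2f}")
print(f"measured average (completers only):    {satisfaction[completed].mean():.2f}")
print(f"share of users the dashboard never sees: {1 - completed.mean():.0%}")
```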

Common mistakes in inference

  • Confusing “we observed it” with “it is true for everyone”.
  • Reporting point estimates without uncertainty or sample size.
  • Treating A/B tests as truth machines without checking bias and instrumentation.

Verification. Ask the sampling questions

  • Who is included. Who is missing.
  • What would cause a person or event to drop out of the dataset.
  • If the measurement process changes, how would you detect it.

🤖

Modelling basics (regression, classification, and evaluation)

Concept block: Model choices
A model is a simplified story of the world. Pick the story that fits the decision.
Assumptions: models are tested on reality; errors are understood.
Failure modes: overfitting; wrong objective.

Modelling is not magic. It is choosing inputs, choosing an objective, and checking failure modes. The purpose of modelling is not to impress people. It is to make a useful prediction with known limitations.

Worked example. 99% accuracy that is still useless

If only 1% of cases are fraud, a model that always predicts “not fraud” gets 99% accuracy. That is why evaluation needs multiple metrics and a clear cost model for errors.
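
A short sketch of that trap, with made-up labels: the "always not fraud" model scores about 99% accuracy and catches nothing.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Made-up labels: roughly 1% of cases are fraud.
y_true = (rng.random(n) < 0.01).astype(int)

# A "model" that always predicts not-fraud.
y_pred = np.zeros(n, dtype=int)

accuracy = (y_pred == y_true).mean()
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_positives / y_true.sum()            # share of fraud actually caught

print(f"accuracy: {accuracy:.3f}")                # about 0.99, looks impressive
print(f"fraud cases caught: {true_positives} of {y_true.sum()} (recall {recall:.2f})")
```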

Common mistakes in modelling

  • Leakage: the model sees a proxy for the answer.
  • Optimising one metric and ignoring harm elsewhere (false positives, workload, trust).
  • Choosing a threshold once and never revisiting when behaviour changes.

Verification. A minimal model review

  • What is the label and who decides it.
  • What are the top 3 features, and what proxies might they represent.
  • What is the cost of false positives and false negatives.
  • What does “human in the loop” mean here, in practice.

📦

Data as a product (making datasets usable, not just available)

Concept block: Data as a product
Data is a product when it has an owner, an interface, and support expectations.
Assumptions: ownership is stable; interfaces are versioned.
Failure modes: unowned datasets; support as hero work.

A mature organisation treats important datasets like products. They have owners, documentation, quality expectations, and support. This is how you reduce “shadow spreadsheets” and make reuse normal.

Worked example. The “can you send me the extract” culture

If every request becomes a one-off extract, you are not serving data. You are doing bespoke reporting at scale. A data product replaces that with a stable interface, clear meaning, and quality guarantees.

Verification. Write a data product page in five lines

  • Name: what it is and what it is not.
  • Owner: who is accountable.
  • Refresh: how often it updates.
  • Quality: what checks run.
  • Access: who can use it and why.
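
One way to capture those five lines as a small, machine-checkable record that lives next to the dataset. The field values are placeholders, not a prescribed format.

```python
# A data product page as a plain record that can be validated in CI.
# All values here are illustrative placeholders.
DATA_PRODUCT = {
    "name": "Daily meter readings (cleaned). Not raw sensor events.",
    "owner": "energy-data-team@example.org",
    "refresh": "daily by 06:00 UTC",
    "quality_checks": ["schema contract", "row count drift", "null rate on meter_id"],
    "access": "analysts and billing services, via the reporting warehouse",
}

REQUIRED_FIELDS = ["name", "owner", "refresh", "quality_checks", "access"]
missing = [field for field in REQUIRED_FIELDS if not DATA_PRODUCT.get(field)]
if missing:
    raise ValueError(f"Data product page incomplete: {missing}")
```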

⚖️

Risk, ethics and strategic value

Concept block: Risk and value together
Value without safeguards creates harm. Safeguards without value create waste.
Assumptions: risk is revisited; value is measurable.
Failure modes: set and forget; vanity metrics.

Data risk is broader than security. Misuse, misinterpretation, and neglect can harm people and decisions. Ethics asks whether we should use data in a certain way, not just whether we can. Strategic value comes from using data to improve services, not just to collect more of it.

Risk points appear along the lifecycle. Collection without consent, processing without checks, sharing beyond purpose, keeping data forever. Controls and culture reduce these risks. A small habit, like logging changes or reviewing outliers, prevents large mistakes.

Treat data as a long term asset. Good stewardship, clear value cases, and honest communication build trust that lasts beyond a single project.

Verification and reflection. Show professional judgement

  • Pick one risky scenario from the tool below and write the “least bad” option, with a justification.
  • Name the stakeholder most likely to be harmed if you get this wrong, and what “harm” looks like in practice.
  • Describe one control you would add that is realistic for a small team, not only for large enterprises.

Risk and ethics view: lifecycle with careful choices

  • Collect: consent and purpose confirmed.
  • Process: checks for bias and errors.
  • Share: controlled access and logs.
  • Archive or delete: retention reviewed, data minimised.

Quick check. Risk, ethics, and strategic value

What is data risk beyond security

Scenario: A vendor offers a discount if you share richer customer data. What question should you ask first

Why do ethics matter

Where do risks appear

What builds long term value

Why log changes and outliers


📊

Shared dashboards you already know

Intermediate will reuse the data quality sandbox, the data flow visualiser, and the data interpretation exercises introduced in Foundations. They will gain more options, but the familiar interfaces stay so you can focus on thinking, not navigation.

Visualisation and communication (the part that decides whether anyone listens)

A dashboard is a user interface for decision-making. If the UX is confusing, you do not get slightly worse decisions. You get busy people making fast guesses.

Worked example. The misleading chart that got a team funded

A chart uses a truncated y-axis so a small change looks dramatic. The story is compelling, funding arrives, and later the organisation realises the effect size was tiny. This is not only a chart problem. It is a governance problem, because the decision process did not demand context and uncertainty.
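
A minimal matplotlib sketch of the same trick, with made-up figures: the left panel truncates the y-axis so a small change looks dramatic, the right panel keeps the baseline at zero.

```python
import matplotlib.pyplot as plt

periods = ["Q1", "Q2"]
values = [96.0, 97.5]   # made-up metric: a 1.5 point change

fig, (ax_trunc, ax_honest) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated axis: the bars look like a dramatic jump.
ax_trunc.bar(periods, values)
ax_trunc.set_ylim(95, 98)
ax_trunc.set_title("Truncated y-axis")

# Honest axis: the same numbers, with the baseline at zero.
ax_honest.bar(periods, values)
ax_honest.set_ylim(0, 100)
ax_honest.set_title("Baseline at zero")

fig.suptitle("Same data, different story")
fig.tight_layout()
plt.show()
```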

Common mistakes in data communication

  • Truncated axes and missing baselines.
  • No time window clarity, so people compare unlike periods.
  • Mixing counts, rates, and percentages on one chart without explanation.
  • Colour choices that fail for colour-blind users or overload attention.

Verification. A chart quality checklist

  • Can a reader tell what the unit is without guessing.
  • Can a reader tell what time window and population are included.
  • Is the scale honest and consistent.
  • Is there at least one sentence that states the insight and the caveat.

🧾

CPD evidence (keep it honest and specific)

This level is CPD-friendly because it builds professional judgement. The best evidence is not “I read a page”. It is “I designed a control, checked a failure mode, and changed how I will work”.

  • What I studied: pipelines, governance, interoperability, analysis, and risk.
  • What I applied: one pipeline diagram for a real dataset and one check that would catch silent failure.
  • What I learned: one insight about meaning contracts (schema, keys, or definitions) that I will now insist on.
  • Evidence artefact: the diagram plus a short note on assumptions and checks.
