CPD timing for this level
Intermediate time breakdown
This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.
What changes at this level
Level expectations
I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.
Schemas, pipelines, and trust signals.
Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.
- A small pipeline design with failure modes and the detection signal for each hop.
- A schema contract note: key fields, constraints, and how you handle backwards compatibility.
- A governance decision log: one rule you introduced, why it matters, and what evidence proves it is working.
Data Intermediate
CPD tracking
Fixed hours for this level: 10. Timed assessment time is included once on pass.
CPD and certification alignment (guidance, not endorsed)
This level is about real organisations: contracts, governance, pipelines, and evidence. It maps well to:
- DAMA DMBOK and CDMP style expectations (governance, stewardship, quality, lineage)
- BCS data and analysis oriented professional skills (clarity, communication, defensible decision-making)
- Cloud data engineering tracks (AWS, Azure, Google Cloud) for architecture, pipelines, and operating reality
This level moves from concepts to systems. The focus is on design, governance, interoperability, and analysis that fit real organisations. The notes are written as my own checklist so you start thinking like a data professional, not just a tool user.
🏗️Data architectures and pipelines
Data architecture is how data is organised, moved, and protected across systems. It sets the lanes so teams can build without tripping over each other. Pipelines exist because raw data is messy and scattered. They pull from sources, clean and combine, and land it where people and products can use it.
There are two broad ways data moves. Batch means scheduled chunks. Streaming means small events flowing continuously. Both need clear boundaries so one team's changes do not break another's work. If a pipeline fails, dashboards go blank, models drift, and trust drops.
When you design a pipeline, think about ownership at each hop, the contracts between steps, and how to recover when something breaks. A simple diagram often exposes gaps before a single line of code is written.
Worked example. A pipeline that fails silently is worse than one that fails loudly
A producer renames a field from meter_id to meterId.
The ingestion step still runs. The storage step still runs. Your dashboard still loads. It just starts showing zeros because the join keys no longer match.
My opinion: silent failure is the main enemy of data work. It looks like success and it teaches people to distrust the whole system. If you build only one thing into a pipeline, build a check that screams when the shape or meaning changes.
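To make "scream loudly" concrete, here is a minimal Python sketch of a shape check at the ingestion step. The column names and the validate_batch function are my own invented examples, not part of any particular tool.

```python
# Minimal shape check: fail loudly if the incoming records change shape.
EXPECTED_COLUMNS = {"timestamp_utc", "meter_id", "reading_kwh"}  # assumed contract

def validate_batch(batch: list[dict]) -> None:
    if not batch:
        raise ValueError("Ingestion received zero records; refusing to continue silently")
    actual = set(batch[0].keys())
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS
    if missing or unexpected:
        # A rename (meter_id -> meterId) shows up here as one missing and one
        # unexpected column, instead of as a dashboard full of zeros.
        raise ValueError(f"Schema drift detected. Missing: {missing}. Unexpected: {unexpected}.")

try:
    validate_batch([{"timestamp_utc": "2024-01-01T00:00:00Z", "meterId": "A1", "reading_kwh": 3.2}])
except ValueError as err:
    print(err)  # the renamed field is caught at ingestion, not discovered on the dashboard
```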
Common mistakes (pipeline edition)
- Building ingestion without a contract (types, required fields, allowed ranges, units).
- Treating “batch vs streaming” as a fashion choice rather than a latency and reliability decision.
- Forgetting recovery. If the pipeline fails at 02:00, what happens at 09:00 when people open the dashboard.
- Not naming an owner for each step, then acting surprised when nothing gets fixed.
Verification. Prove you can reason about it
- Draw a pipeline for one real dataset you know. Identify one failure mode per hop.
- Write one “contract sentence” for ingestion. Example: “Every record must have a timestamp in UTC and a meter identifier treated as a string.”
- Decide whether the dataset is batch or streaming and justify it with a latency requirement.
Simple data flow
Sources to insight with clear boundaries
Sources
Apps, sensors, files.
Ingestion
Collect, queue, validate.
Storage
Databases, lakes, warehouse.
Processing
Clean, join, aggregate.
Consumption
Dashboards, models, APIs.
Quick check. Architectures and pipelines
Why do pipelines exist
Scenario: A dashboard is correct at 09:00 and wrong at 10:00. Name one pipeline failure mode that fits
What is batch movement
What is streaming movement
Scenario: A producer changes a field name and consumers silently break. What boundary was missing
Why start with a diagram
📋Data governance and stewardship
Governance is agreeing how data is handled so people can work quickly without being reckless. Ownership is the person or team that decides purpose and access. Stewardship is the day to day care of definitions, quality, and metadata. Accountability means someone can explain how a change was made and why.
Policies are not paperwork for their own sake. They are guardrails. Who can see a column, how long data is kept, what checks run before sharing. Even a shared spreadsheet is governance in miniature. Who edits, who reviews, what happens when something looks off.
Trust grows when policies are clear, controls are enforced, and feedback loops exist. If a report is wrong and nobody owns it, confidence collapses quickly.
Worked example. The spreadsheet is still governance
If a team shares a spreadsheet called “final_final_v7”, that is governance, just done badly. There is still access control (who has the link), retention (how long it stays in inboxes), and change control (who overwrote which cell). The only difference is that it is informal, invisible, and impossible to audit.
My opinion: governance should feel like good design. It should make the safe thing the easy thing. When governance feels like punishment, teams route around it and create risk you cannot even see.
Common mistakes (governance edition)
- Writing policies that do not match reality, then ignoring the exceptions until they become the norm.
- Treating “owner” as a title rather than a responsibility with time allocated.
- Relying on manual checks for things that should be automated, like schema changes and row count drift.
- Giving everyone access “temporarily” and then never removing it.
Verification. Prove it is not just words
- Choose one dataset and write: purpose, owner, steward, retention period, and who can access it.
- Define one check that would catch a breaking schema change (one option is sketched after this list).
- Write one sentence explaining what “accountability” would look like if the report was wrong.
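One way to make that check concrete is to compare today's dataset against an agreed baseline. This is a sketch only: the baseline file, field names, and the 20% drift threshold are assumptions you would agree with the data owner.

```python
import json
from pathlib import Path

BASELINE_PATH = Path("dataset_baseline.json")  # e.g. {"columns": [...], "row_count": 12345}

def check_against_baseline(columns: list[str], row_count: int) -> list[str]:
    """Return governance alerts instead of relying on someone noticing by eye."""
    baseline = json.loads(BASELINE_PATH.read_text())
    alerts = []
    if set(columns) != set(baseline["columns"]):
        alerts.append(f"Schema changed: baseline {baseline['columns']}, current {columns}")
    expected = baseline["row_count"]
    if expected and abs(row_count - expected) / expected > 0.2:  # assumed drift tolerance
        alerts.append(f"Row count drift: expected about {expected}, got {row_count}")
    return alerts
```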
Governance loop
People, policies, data, decisions
People
Owner, steward, users.
Policies
Access, retention, checks.
Data assets
Tables, files, dashboards.
Decision loop
Use, review, improve.
Quick check. Governance and stewardship
Why does governance exist
What does ownership mean
What does stewardship mean
Why are policies useful
Scenario: A team asks for full access “temporarily” to ship a feature. What is a safer governance response
What happens without governance
🔗Interoperability and standards
Interoperability means systems understand each other. It is shared meaning, not just shared pipes. Standards help through common formats, schemas, and naming. A schema is the agreed structure and data types. When systems misalign, data lands in the wrong fields, numbers become strings, or meaning gets lost.
Formats like JSON or CSV carry data, but standards and contracts explain what each field means. An API (application programming interface) without a contract is guesswork. A file without a schema requires detective work.
Small mismatches cause big effects. Dates in different orders, currencies missing, names split differently. Aligning schemas early saves hours of cleanup and prevents silent errors.
Worked example. Join keys are mathematics, not vibes
One system uses customer_id to mean “account holder”. Another uses it as “billing contact”. Both are “customer”, until you try to reconcile charges.
My opinion: schema alignment is not a technical chore. It is a meaning negotiation. You need the business definition written down, not just the column name.
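A lightweight way to write that negotiation down is a field-level contract that records the business definition next to the column name. Every value below is an invented placeholder.

```python
# A field contract that captures meaning, not just type.
CUSTOMER_ID_CONTRACT = {
    "field": "customer_id",
    "type": "string",
    "definition": "The account holder who is billed, not the billing contact",
    "source_of_truth": "CRM account record",
    "owner": "Customer data steward",
}
```

Once this exists, a disagreement between two systems is visible before the join, not after the reconciliation fails.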
Maths ladder. Cardinality and why joins explode
Foundations. One-to-one, one-to-many, many-to-many
Cardinality describes how many records on one side of a relationship can match how many records on the other.
- One-to-one: each record matches at most one record on the other side.
- One-to-many: one record matches many on the other side.
- Many-to-many: many records match many, which can multiply rows quickly.
Undergraduate. Why many-to-many joins can blow up row counts
If a key value appears m times on the left and n times on the right, the join produces m × n rows for that key. This is why “duplicated keys” are not a small detail. They can change both correctness and performance.
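A tiny pandas example makes the multiplication visible. The data is invented; the behaviour is ordinary inner-join behaviour.

```python
import pandas as pd

left = pd.DataFrame({"customer_id": ["C1", "C1", "C2"], "charge": [10, 20, 5]})
right = pd.DataFrame({"customer_id": ["C1", "C1", "C2"], "region": ["N", "S", "E"]})

joined = left.merge(right, on="customer_id")
print(len(left), len(right), len(joined))  # 3 3 5 -> C1 alone contributes 2 x 2 = 4 rows

# pandas can also enforce the expected cardinality and fail loudly:
# left.merge(right, on="customer_id", validate="one_to_one")  # would raise MergeError here
```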
Verification. What you check before you trust a join
- Confirm key definitions match (meaning, not only type).
- Check uniqueness on both sides. If not unique, decide if duplication is expected and safe.
- Compare row counts before and after the join and explain the change.
Interoperability check
Alignment vs mismatch
Aligned
Fields match, types agree, meaning is clear.
Mismatch
Wrong types, missing fields, swapped meanings.
Quick check. Interoperability and standards
What is interoperability
What is a schema
Scenario: Two systems both have a field called `date` but one means local time and one means UTC. What should you do
How do standards help
What happens when schemas mismatch
Why do contracts matter for APIs
📊Data analysis and insight generation
Analysis is asking good questions of data and checking that the answers hold up. Descriptive thinking asks what happened. Diagnostic thinking asks why. Statistics exist to separate signal from noise. Averages summarise, distributions show spread, trends show direction.
Averages hide detail. A long tail or a split between groups can change the story. Trends can be seasonal or random. Always pair a number with context. When it was measured, who is included, what changed.
Insight is not a chart. It is a statement backed by data and understanding. Decisions follow insight, and they should note assumptions so they can be revisited.
Worked example. Correlation is not a permission slip for causation
If two things move together, it might be causation, or it might be a shared driver, or it might be coincidence. In real organisations this becomes painful when a dashboard shows “A rose, then B rose”, and someone writes a strategy based on it.
My opinion: the best analysts are sceptical. They do not say “the chart says”. They say “the chart suggests, under these assumptions”.
Maths ladder. From intuition to inference
Foundations. Mean, median, and why you need both
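The mean is pulled around by extreme values; the median is not, so a skewed distribution splits them apart. A quick illustration with invented response times:

```python
from statistics import mean, median

response_ms = [110, 120, 115, 130, 125, 118, 122, 127, 119, 2400]  # one slow outlier

print(mean(response_ms))    # 348.6 - dragged up by a single extreme value
print(median(response_ms))  # 121.0 - closer to the typical experience
```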
A level. Correlation (Pearson) and what it measures
Pearson's correlation is r = cov(X, Y) / (σ_X σ_Y), where:
- cov(X, Y): covariance (how the variables vary together)
- σ_X, σ_Y: standard deviations of X and Y
- r: correlation coefficient in [−1, 1]
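In practice you rarely compute this by hand. A minimal check with numpy (the numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2x, with a little noise

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))  # close to 1: strong linear association, which still proves nothing about causation
```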
Undergraduate. A minimal taste of hypothesis testing
A typical structure:
- H₀: a null hypothesis (for example: no difference between groups)
- H₁: an alternative hypothesis (there is a difference)
- Compute a test statistic from the data and derive a p-value under H₀
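As a minimal sketch of that structure, here is Welch's t-test on two invented samples using scipy. The data and the choice of test are assumptions for illustration, not a recommendation for every situation.

```python
from scipy import stats

group_a = [0.12, 0.14, 0.11, 0.13, 0.15, 0.12, 0.13]  # invented daily conversion rates
group_b = [0.16, 0.15, 0.17, 0.14, 0.18, 0.16, 0.17]

# H0: the two group means are equal. equal_var=False gives Welch's t-test.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)  # a small p-value is evidence against H0, not proof of a causal story
```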
Common mistakes (analysis edition)
- Treating correlation as causation without a design that supports causal claims.
- Reporting one metric without uncertainty, spread, or context.
- Comparing groups with different baselines and calling it “performance”.
- Forgetting that the definition of the metric can change (new logging, new filtering, new exclusions).
Verification. Prove your insight is defensible
- Write your insight as a sentence and list the assumptions underneath it.
- Show one counterexample or alternative explanation you checked.
- State what data would change your mind.
From data to decision
Question, summarise, decide
Raw data
Events, readings, logs.
Aggregation
Group, filter, summarise.
Insight
Statement linked to evidence.
Decision
Action, experiment, follow up.
Quick check. Analysis and insight
What is descriptive thinking
What is diagnostic thinking
Scenario: The average looks fine but users complain. What should you check next
Why does context matter
What is an insight
🎲Probability and distributions (uncertainty without the panic)
Data work is mostly uncertainty management. Probability is how we stay honest about that. You do not need to love maths to use probability well. You need to be disciplined about what you are claiming.
Worked example. “It usually works” is not a reliability statement
If a pipeline succeeds 99% of the time, it still fails 1 day in 100. Over a year that is multiple failures. The question is not “is 99 good”. The question is “what happens on the failure days, and what does it cost”.
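The arithmetic is worth doing explicitly. A back-of-envelope sketch for a daily pipeline:

```python
p_fail = 0.01          # "99% reliable"
runs_per_year = 365

expected_failures = p_fail * runs_per_year
p_at_least_one = 1 - (1 - p_fail) ** runs_per_year

print(expected_failures)         # 3.65 expected failure days per year
print(round(p_at_least_one, 3))  # ~0.974 - near-certain to fail at least once
```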
Common mistakes with probability
- Storing some values as percentages and others as probabilities, then comparing numbers that are not on the same scale.
- Treating rare events as impossible because you have not seen them yet.
- Assuming normal distributions for everything. Many real-world systems have heavy tails.
Verification. A simple sanity check
- If an event happens with probability 1%, how often would you expect it over 10,000 runs.
- If your monitoring only samples 1% of events, what might you miss.
🧪Inference, sampling, and experiments
Inference is the art of learning about a bigger reality from limited observations. This matters because most datasets are not the full world. They are a sample, often a biased one.
Worked example. The “successful customers” dataset that hides the problem
You analyse only customers who completed a journey because that is what is easy to track. Your dashboard shows high satisfaction. The people who dropped off never appear, so the system looks healthier than it is.
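A tiny simulation shows how large the gap can be. All the numbers are invented; the mechanism (biased drop-off) is the point.

```python
import random

random.seed(1)
customers = []
for _ in range(1000):
    satisfied = random.random() < 0.6                           # true satisfaction: 60%
    completed = random.random() < (0.9 if satisfied else 0.3)   # unhappy people drop off more
    customers.append((satisfied, completed))

observed = [s for s, done in customers if done]
print(sum(s for s, _ in customers) / len(customers))  # ~0.60 in reality
print(sum(observed) / len(observed))                  # ~0.82 among the journeys we can see
```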
Common mistakes in inference
- Confusing “we observed it” with “it is true for everyone”.
- Reporting point estimates without uncertainty or sample size.
- Treating A/B tests as truth machines without checking bias and instrumentation.
Verification. Ask the sampling questions
- Who is included. Who is missing.
- What would cause a person or event to drop out of the dataset.
- If the measurement process changes, how would you detect it.
🤖Modelling basics (regression, classification, and evaluation)
Modelling is not magic. It is choosing inputs, choosing an objective, and checking failure modes. The purpose of modelling is not to impress people. It is to make a useful prediction with known limitations.
Worked example. 99% accuracy that is still useless
If only 1% of cases are fraud, a model that always predicts “not fraud” gets 99% accuracy. That is why evaluation needs multiple metrics and a clear cost model for errors.
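The arithmetic, spelled out with plain numbers:

```python
total = 10_000
fraud = 100                       # 1% of cases

true_negatives = total - fraud    # the "always not fraud" model gets every non-fraud case right
accuracy = true_negatives / total
recall_on_fraud = 0 / fraud       # and catches none of the fraud

print(accuracy)         # 0.99 - looks impressive
print(recall_on_fraud)  # 0.0  - operationally useless
```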
Common mistakes in modelling
- Leakage: the model sees a proxy for the answer.
- Optimising one metric and ignoring harm elsewhere (false positives, workload, trust).
- Choosing a threshold once and never revisiting when behaviour changes.
Verification. A minimal model review
- What is the label and who decides it.
- What are the top 3 features, and what proxies might they represent.
- What is the cost of false positives and false negatives.
- What does “human in the loop” mean here, in practice.
📦Data as a product (making datasets usable, not just available)
A mature organisation treats important datasets like products. They have owners, documentation, quality expectations, and support. This is how you reduce “shadow spreadsheets” and make reuse normal.
Worked example. The “can you send me the extract” culture
If every request becomes a one-off extract, you are not serving data. You are doing bespoke reporting at scale. A data product replaces that with a stable interface, clear meaning, and quality guarantees.
Verification. Write a data product page in five lines
- Name: what it is and what it is not.
- Owner: who is accountable.
- Refresh: how often it updates.
- Quality: what checks run.
- Access: who can use it and why.
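One way to keep those five lines honest is to store them as a machine-readable record next to the dataset, so they can be checked and versioned. Every value below is a placeholder; the structure is the point.

```python
DATA_PRODUCT = {
    "name": "Daily meter readings (cleaned)",
    "owner": "Metering data team",
    "refresh": "Daily by 06:00 UTC",
    "quality_checks": ["schema matches contract", "row count within 20% of baseline"],
    "access": "Analysts via the warehouse reporting role; raw identifiers restricted",
}
```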
⚖️Risk, ethics and strategic value
Data risk is broader than security. Misuse, misinterpretation, and neglect can harm people and decisions. Ethics asks whether we should use data in a certain way, not just whether we can. Strategic value comes from using data to improve services, not just to collect more of it.
Risk points appear along the lifecycle. Collection without consent, processing without checks, sharing beyond purpose, keeping data forever. Controls and culture reduce these risks. A small habit, like logging changes or reviewing outliers, prevents large mistakes.
Treat data as a long term asset. Good stewardship, clear value cases, and honest communication build trust that lasts beyond a single project.
Verification and reflection. Show professional judgement
- Pick one risky scenario from the tool below and write the “least bad” option, with a justification.
- Name the stakeholder most likely to be harmed if you get this wrong, and what “harm” looks like in practice.
- Describe one control you would add that is realistic for a small team, not only for large enterprises.
Risk and ethics view
Lifecycle with careful choices
Collect
Consent and purpose confirmed.
Process
Checks for bias and errors.
Share
Controlled access and logs.
Archive or delete
Retention reviewed, data minimised.
Quick check. Risk, ethics, and strategic value
What is data risk beyond security
Scenario: A vendor offers a discount if you share richer customer data. What question should you ask first
Why do ethics matter
Where do risks appear
What builds long term value
Why log changes and outliers
📊Shared dashboards you already know
Intermediate will reuse the data quality sandbox, the data flow visualiser, and the data interpretation exercises introduced in Foundations. They will gain more options, but the familiar interfaces stay so you can focus on thinking, not navigation.
Visualisation and communication (the part that decides whether anyone listens)
A dashboard is a user interface for decision-making. If the UX is confusing, you do not get slightly worse decisions. You get busy people making fast guesses.
Worked example. The misleading chart that got a team funded
A chart uses a truncated y-axis so a small change looks dramatic. The story is compelling, funding arrives, and later the organisation realises the effect size was tiny. This is not only a chart problem. It is a governance problem, because the decision process did not demand context and uncertainty.
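The fix is usually mechanical. A minimal matplotlib sketch with an honest baseline (the figures are invented):

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
signups = [980, 1005, 1010, 1030]

fig, ax = plt.subplots()
ax.plot(months, signups, marker="o")
ax.set_ylim(bottom=0)   # keep the zero baseline so a 5% change looks like 5%
ax.set_ylabel("Sign-ups per month")
ax.set_title("Sign-ups, Jan to Apr (all regions)")
plt.show()
```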
Common mistakes in data communication
- Truncated axes and missing baselines.
- No time window clarity, so people compare unlike periods.
- Mixing counts, rates, and percentages on one chart without explanation.
- Colour choices that fail for colour-blind users or overload attention.
Verification. A chart quality checklist
- Can a reader tell what the unit is without guessing.
- Can a reader tell what time window and population are included.
- Is the scale honest and consistent.
- Is there at least one sentence that states the insight and the caveat.
🧾CPD evidence (keep it honest and specific)
This level is CPD-friendly because it builds professional judgement. The best evidence is not “I read a page”. It is “I designed a control, checked a failure mode, and changed how I will work”.
- What I studied: pipelines, governance, interoperability, analysis, and risk.
- What I applied: one pipeline diagram for a real dataset and one check that would catch silent failure.
- What I learned: one insight about meaning contracts (schema, keys, or definitions) that I will now insist on.
- Evidence artefact: the diagram plus a short note on assumptions and checks.
