CPD timing for this level
Intermediate time breakdown
This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.
What changes at this level
Level expectations
I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.
Schemas, pipelines, and trust signals.
Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.
- A small pipeline design with failure modes and the detection signal for each hop.
- A schema contract note: key fields, constraints, and how you handle backwards compatibility.
- A governance decision log: one rule you introduced, why it matters, and what evidence proves it is working.
Data Intermediate
CPD tracking
Fixed hours for this level: 10. Timed assessment time is included once on pass.
CPD and certification alignment (guidance, not endorsed)
This level is about real organisations: contracts, governance, pipelines, and evidence. It maps well to:
- DAMA DMBOK and CDMP style expectations (governance, stewardship, quality, lineage)
- BCS data and analysis oriented professional skills (clarity, communication, defensible decision-making)
- Cloud data engineering tracks (AWS, Azure, Google Cloud) for architecture, pipelines, and operating reality
This level moves from concepts to systems. The focus is on design, governance, interoperability, and analysis that fit real organisations. The notes are written as my own checklist so you start thinking like a data professional, not just a tool user.
🏗️Data architectures and pipelines
Data architecture is how data is organised, moved, and protected across systems. It sets the lanes so teams can build without tripping over each other. Pipelines exist because raw data is messy and scattered. They pull from sources, clean and combine, and land it where people and products can use it.
There are two broad ways data moves. Batch means scheduled chunks. Streaming means small events flowing continuously. Both need clear boundaries so one team's changes do not break another's work. If a pipeline fails, dashboards go blank, models drift, and trust drops.
When you design a pipeline, think about ownership at each hop, the contracts between steps, and how to recover when something breaks. A simple diagram often exposes gaps before a single line of code is written.
Worked example. A pipeline that fails silently is worse than one that fails loudly
A producer renames a field from meter_id to meterId.
The ingestion step still runs. The storage step still runs. Your dashboard still loads. It just starts showing zeros because the join keys no longer match.
My opinion: silent failure is the main enemy of data work. It looks like success and it teaches people to distrust the whole system. If you build only one thing into a pipeline, build a check that screams when the shape or meaning changes.
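To make "scream loudly" concrete, here is a minimal Python sketch of a shape check at the ingestion step. The column names and the validate_batch function are my own invented examples, not part of any particular tool.

```python
# Minimal shape check: fail loudly if the incoming records change shape.
EXPECTED_COLUMNS = {"timestamp_utc", "meter_id", "reading_kwh"}  # assumed contract

def validate_batch(batch: list[dict]) -> None:
    if not batch:
        raise ValueError("Ingestion received zero records; refusing to continue silently")
    actual = set(batch[0].keys())
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS
    if missing or unexpected:
        # A rename (meter_id -> meterId) shows up here as one missing and one
        # unexpected column, instead of as a dashboard full of zeros.
        raise ValueError(f"Schema drift detected. Missing: {missing}. Unexpected: {unexpected}.")

try:
    validate_batch([{"timestamp_utc": "2024-01-01T00:00:00Z", "meterId": "A1", "reading_kwh": 3.2}])
except ValueError as err:
    print(err)  # the renamed field is caught at ingestion, not discovered on the dashboard
```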
Common mistakes (pipeline edition)
- Building ingestion without a contract (types, required fields, allowed ranges, units).
- Treating “batch vs streaming” as a fashion choice rather than a latency and reliability decision.
- Forgetting recovery. If the pipeline fails at 02:00, what happens at 09:00 when people open the dashboard.
- Not naming an owner for each step, then acting surprised when nothing gets fixed.
Verification. Prove you can reason about it
- Draw a pipeline for one real dataset you know. Identify one failure mode per hop.
- Write one “contract sentence” for ingestion. Example: “Every record must have a timestamp in UTC and a meter identifier treated as a string.”
- Decide whether the dataset is batch or streaming and justify it with a latency requirement.
Simple data flow
Sources to insight with clear boundaries
Sources
Apps, sensors, files.
Ingestion
Collect, queue, validate.
Storage
Databases, lakes, warehouse.
Processing
Clean, join, aggregate.
Consumption
Dashboards, models, APIs.
Quick check. Architectures and pipelines
Why do pipelines exist
Scenario: A dashboard is correct at 09:00 and wrong at 10:00. Name one pipeline failure mode that fits
What is batch movement
What is streaming movement
Scenario: A producer changes a field name and consumers silently break. What boundary was missing
Why start with a diagram
📋Data governance and stewardship
Governance is agreeing how data is handled so people can work quickly without being reckless. Ownership is the person or team that decides purpose and access. Stewardship is the day to day care of definitions, quality, and metadata. Accountability means someone can explain how a change was made and why.
Policies are not paperwork for their own sake. They are guardrails. Who can see a column, how long data is kept, what checks run before sharing. Even a shared spreadsheet is governance in miniature. Who edits, who reviews, what happens when something looks off.
Trust grows when policies are clear, controls are enforced, and feedback loops exist. If a report is wrong and nobody owns it, confidence collapses quickly.
Worked example. The spreadsheet is still governance
If a team shares a spreadsheet called “final_final_v7”, that is governance, just done badly. There is still access control (who has the link), retention (how long it stays in inboxes), and change control (who overwrote which cell). The only difference is that it is informal, invisible, and impossible to audit.
My opinion: governance should feel like good design. It should make the safe thing the easy thing. When governance feels like punishment, teams route around it and create risk you cannot even see.
Common mistakes (governance edition)
- Writing policies that do not match reality, then ignoring the exceptions until they become the norm.
- Treating “owner” as a title rather than a responsibility with time allocated.
- Relying on manual checks for things that should be automated, like schema changes and row count drift.
- Giving everyone access “temporarily” and then never removing it.
Verification. Prove it is not just words
- Choose one dataset and write: purpose, owner, steward, retention period, and who can access it.
- Define one check that would catch a breaking schema change (one option is sketched after this list).
- Write one sentence explaining what “accountability” would look like if the report was wrong.
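One way to make that check concrete is to compare today's dataset against an agreed baseline. This is a sketch only: the baseline file, field names, and the 20% drift threshold are assumptions you would agree with the data owner.

```python
import json
from pathlib import Path

BASELINE_PATH = Path("dataset_baseline.json")  # e.g. {"columns": [...], "row_count": 12345}

def check_against_baseline(columns: list[str], row_count: int) -> list[str]:
    """Return governance alerts instead of relying on someone noticing by eye."""
    baseline = json.loads(BASELINE_PATH.read_text())
    alerts = []
    if set(columns) != set(baseline["columns"]):
        alerts.append(f"Schema changed: baseline {baseline['columns']}, current {columns}")
    expected = baseline["row_count"]
    if expected and abs(row_count - expected) / expected > 0.2:  # assumed drift tolerance
        alerts.append(f"Row count drift: expected about {expected}, got {row_count}")
    return alerts
```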
Governance loop
People, policies, data, decisions
People
Owner, steward, users.
Policies
Access, retention, checks.
Data assets
Tables, files, dashboards.
Decision loop
Use, review, improve.
Quick check. Governance and stewardship
Why does governance exist
What does ownership mean
What does stewardship mean
Why are policies useful
Scenario: A team asks for full access “temporarily” to ship a feature. What is a safer governance response
What happens without governance
🔗Interoperability and standards
Interoperability means systems understand each other. It is shared meaning, not just shared pipes. Standards help through common formats, schemas, and naming. A schema is the agreed structure and data types. When systems misalign, data lands in the wrong fields, numbers become strings, or meaning gets lost.
Formats like JSON or CSV carry data, but standards and contracts explain what each field means. An API (application programming interface) without a contract is guesswork. A file without a schema requires detective work.
Small mismatches cause big effects. Dates in different orders, currencies missing, names split differently. Aligning schemas early saves hours of cleanup and prevents silent errors.
Worked example. Join keys are mathematics, not vibes
One system uses customer_id to mean “account holder”. Another uses it as “billing contact”. Both are “customer”, until you try to reconcile charges.
My opinion: schema alignment is not a technical chore. It is a meaning negotiation. You need the business definition written down, not just the column name.
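A lightweight way to write that negotiation down is a field-level contract that records the business definition next to the column name. Every value below is an invented placeholder.

```python
# A field contract that captures meaning, not just type.
CUSTOMER_ID_CONTRACT = {
    "field": "customer_id",
    "type": "string",
    "definition": "The account holder who is billed, not the billing contact",
    "source_of_truth": "CRM account record",
    "owner": "Customer data steward",
}
```

Once this exists, a disagreement between two systems is visible before the join, not after the reconciliation fails.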
Maths ladder. Cardinality and why joins explode
Foundations. One-to-one, one-to-many, many-to-many
Cardinality describes how many records on one side of a relationship can match how many records on the other.
- One-to-one: each record matches at most one record on the other side.
- One-to-many: one record matches many on the other side.
- Many-to-many: many records match many, which can multiply rows quickly.
Undergraduate. Why many-to-many joins can blow up row counts
If a key value appears m times on the left and n times on the right, the join produces m × n rows for that key. This is why “duplicated keys” are not a small detail. They can change both correctness and performance.
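A tiny pandas example makes the multiplication visible. The data is invented; the behaviour is ordinary inner-join behaviour.

```python
import pandas as pd

left = pd.DataFrame({"customer_id": ["C1", "C1", "C2"], "charge": [10, 20, 5]})
right = pd.DataFrame({"customer_id": ["C1", "C1", "C2"], "region": ["N", "S", "E"]})

joined = left.merge(right, on="customer_id")
print(len(left), len(right), len(joined))  # 3 3 5 -> C1 alone contributes 2 x 2 = 4 rows

# pandas can also enforce the expected cardinality and fail loudly:
# left.merge(right, on="customer_id", validate="one_to_one")  # would raise MergeError here
```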
Verification. What you check before you trust a join
- Confirm key definitions match (meaning, not only type).
- Check uniqueness on both sides. If not unique, decide if duplication is expected and safe.
- Compare row counts before and after the join and explain the change.
Interoperability check
Alignment vs mismatch
Aligned
Fields match, types agree, meaning is clear.
Mismatch
Wrong types, missing fields, swapped meanings.
Quick check. Interoperability and standards
What is interoperability
What is a schema
Scenario: Two systems both have a field called `date` but one means local time and one means UTC. What should you do
How do standards help
What happens when schemas mismatch
Why do contracts matter for APIs
📊Data analysis and insight generation
Analysis is asking good questions of data and checking that the answers hold up. Descriptive thinking asks what happened. Diagnostic thinking asks why. Statistics exist to separate signal from noise. Averages summarise, distributions show spread, trends show direction.
Averages hide detail. A long tail or a split between groups can change the story. Trends can be seasonal or random. Always pair a number with context. When it was measured, who is included, what changed.
Insight is not a chart. It is a statement backed by data and understanding. Decisions follow insight, and they should note assumptions so they can be revisited.
Worked example. Correlation is not a permission slip for causation
If two things move together, it might be causation, or it might be a shared driver, or it might be coincidence. In real organisations this becomes painful when a dashboard shows “A rose, then B rose”, and someone writes a strategy based on it.
My opinion: the best analysts are sceptical. They do not say “the chart says”. They say “the chart suggests, under these assumptions”.
Maths ladder. From intuition to inference
Foundations. Mean, median, and why you need both
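The mean is pulled around by extreme values; the median is not, so a skewed distribution splits them apart. A quick illustration with invented response times:

```python
from statistics import mean, median

response_ms = [110, 120, 115, 130, 125, 118, 122, 127, 119, 2400]  # one slow outlier

print(mean(response_ms))    # 348.6 - dragged up by a single extreme value
print(median(response_ms))  # 121.0 - closer to the typical experience
```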
A level. Correlation (Pearson) and what it measures
Pearson's correlation is r = cov(X, Y) / (σ_X σ_Y), where:
- cov(X, Y): covariance (how the variables vary together)
- σ_X, σ_Y: standard deviations of X and Y
- r: correlation coefficient in [−1, 1]
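In practice you rarely compute this by hand. A minimal check with numpy (the numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly 2x, with a little noise

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))  # close to 1: strong linear association, which still proves nothing about causation
```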
Undergraduate. A minimal taste of hypothesis testing
A typical structure:
- H₀: a null hypothesis (for example: no difference between groups)
- H₁: an alternative hypothesis (there is a difference)
- Compute a test statistic from the data and derive a p-value under H₀
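As a minimal sketch of that structure, here is Welch's t-test on two invented samples using scipy. The data and the choice of test are assumptions for illustration, not a recommendation for every situation.

```python
from scipy import stats

group_a = [0.12, 0.14, 0.11, 0.13, 0.15, 0.12, 0.13]  # invented daily conversion rates
group_b = [0.16, 0.15, 0.17, 0.14, 0.18, 0.16, 0.17]

# H0: the two group means are equal. equal_var=False gives Welch's t-test.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)  # a small p-value is evidence against H0, not proof of a causal story
```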
Common mistakes (analysis edition)
- Treating correlation as causation without a design that supports causal claims.
- Reporting one metric without uncertainty, spread, or context.
- Comparing groups with different baselines and calling it “performance”.
- Forgetting that the definition of the metric can change (new logging, new filtering, new exclusions).
Verification. Prove your insight is defensible
- Write your insight as a sentence and list the assumptions underneath it.
- Show one counterexample or alternative explanation you checked.
- State what data would change your mind.
From data to decision
Question, summarise, decide
Raw data
Events, readings, logs.
Aggregation
Group, filter, summarise.
Insight
Statement linked to evidence.
Decision
Action, experiment, follow up.
Quick check. Analysis and insight
What is descriptive thinking
What is diagnostic thinking
Scenario: The average looks fine but users complain. What should you check next
Why does context matter
What is an insight
🎲Probability and distributions (uncertainty without the panic)
Data work is mostly uncertainty management. Probability is how we stay honest about that. You do not need to love maths to use probability well. You need to be disciplined about what you are claiming.
Worked example. “It usually works” is not a reliability statement
If a pipeline succeeds 99% of the time, it still fails 1 day in 100. Over a year that is multiple failures. The question is not “is 99 good”. The question is “what happens on the failure days, and what does it cost”.
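The arithmetic is worth doing explicitly. A back-of-envelope sketch for a daily pipeline:

```python
p_fail = 0.01          # "99% reliable"
runs_per_year = 365

expected_failures = p_fail * runs_per_year
p_at_least_one = 1 - (1 - p_fail) ** runs_per_year

print(expected_failures)         # 3.65 expected failure days per year
print(round(p_at_least_one, 3))  # ~0.974 - near-certain to fail at least once
```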
Common mistakes with probability
- Storing some values as percentages and others as probabilities, then comparing numbers that are not on the same scale.
- Treating rare events as impossible because you have not seen them yet.
- Assuming normal distributions for everything. Many real-world systems have heavy tails.
Verification. A simple sanity check
- If an event happens with probability 1%, how often would you expect it over 10,000 runs.
- If your monitoring only samples 1% of events, what might you miss.
🧪Inference, sampling, and experiments
Inference is the art of learning about a bigger reality from limited observations. This matters because most datasets are not the full world. They are a sample, often a biased one.
Worked example. The “successful customers” dataset that hides the problem
You analyse only customers who completed a journey because that is what is easy to track. Your dashboard shows high satisfaction. The people who dropped off never appear, so the system looks healthier than it is.
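A tiny simulation shows how large the gap can be. All the numbers are invented; the mechanism (biased drop-off) is the point.

```python
import random

random.seed(1)
customers = []
for _ in range(1000):
    satisfied = random.random() < 0.6                           # true satisfaction: 60%
    completed = random.random() < (0.9 if satisfied else 0.3)   # unhappy people drop off more
    customers.append((satisfied, completed))

observed = [s for s, done in customers if done]
print(sum(s for s, _ in customers) / len(customers))  # ~0.60 in reality
print(sum(observed) / len(observed))                  # ~0.82 among the journeys we can see
```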
Common mistakes in inference
- Confusing “we observed it” with “it is true for everyone”.
- Reporting point estimates without uncertainty or sample size.
- Treating A/B tests as truth machines without checking bias and instrumentation.
Verification. Ask the sampling questions
- Who is included. Who is missing.
- What would cause a person or event to drop out of the dataset.
- If the measurement process changes, how would you detect it.
🤖Modelling basics (regression, classification, and evaluation)
Modelling is not magic. It is choosing inputs, choosing an objective, and checking failure modes. The purpose of modelling is not to impress people. It is to make a useful prediction with known limitations.
Worked example. 99% accuracy that is still useless
If only 1% of cases are fraud, a model that always predicts “not fraud” gets 99% accuracy. That is why evaluation needs multiple metrics and a clear cost model for errors.
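The arithmetic, spelled out with plain numbers:

```python
total = 10_000
fraud = 100                       # 1% of cases

true_negatives = total - fraud    # the "always not fraud" model gets every non-fraud case right
accuracy = true_negatives / total
recall_on_fraud = 0 / fraud       # and catches none of the fraud

print(accuracy)         # 0.99 - looks impressive
print(recall_on_fraud)  # 0.0  - operationally useless
```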
Common mistakes in modelling
- Leakage: the model sees a proxy for the answer.
- Optimising one metric and ignoring harm elsewhere (false positives, workload, trust).
- Choosing a threshold once and never revisiting when behaviour changes.
Verification. A minimal model review
- What is the label and who decides it.
- What are the top 3 features, and what proxies might they represent.
- What is the cost of false positives and false negatives.
- What does “human in the loop” mean here, in practice.
📦Data as a product (making datasets usable, not just available)
A mature organisation treats important datasets like products. They have owners, documentation, quality expectations, and support. This is how you reduce “shadow spreadsheets” and make reuse normal.
Worked example. The “can you send me the extract” culture
If every request becomes a one-off extract, you are not serving data. You are doing bespoke reporting at scale. A data product replaces that with a stable interface, clear meaning, and quality guarantees.
Verification. Write a data product page in five lines
- Name: what it is and what it is not.
- Owner: who is accountable.
- Refresh: how often it updates.
- Quality: what checks run.
- Access: who can use it and why.
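One way to keep those five lines honest is to store them as a machine-readable record next to the dataset, so they can be checked and versioned. Every value below is a placeholder; the structure is the point.

```python
DATA_PRODUCT = {
    "name": "Daily meter readings (cleaned)",
    "owner": "Metering data team",
    "refresh": "Daily by 06:00 UTC",
    "quality_checks": ["schema matches contract", "row count within 20% of baseline"],
    "access": "Analysts via the warehouse reporting role; raw identifiers restricted",
}
```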
⚖️Risk, ethics and strategic value
Data risk is broader than security. Misuse, misinterpretation, and neglect can harm people and decisions. Ethics asks whether we should use data in a certain way, not just whether we can. Strategic value comes from using data to improve services, not just to collect more of it.
Risk points appear along the lifecycle. Collection without consent, processing without checks, sharing beyond purpose, keeping data forever. Controls and culture reduce these risks. A small habit, like logging changes or reviewing outliers, prevents large mistakes.
Treat data as a long term asset. Good stewardship, clear value cases, and honest communication build trust that lasts beyond a single project.
Verification and reflection. Show professional judgement
- Pick one risky scenario from the tool below and write the “least bad” option, with a justification.
- Name the stakeholder most likely to be harmed if you get this wrong, and what “harm” looks like in practice.
- Describe one control you would add that is realistic for a small team, not only for large enterprises.
Risk and ethics view
Lifecycle with careful choices
Collect
Consent and purpose confirmed.
Process
Checks for bias and errors.
Share
Controlled access and logs.
Archive or delete
Retention reviewed, data minimised.
Quick check. Risk, ethics, and strategic value
What is data risk beyond security
Scenario: A vendor offers a discount if you share richer customer data. What question should you ask first
Why do ethics matter
Where do risks appear
What builds long term value
Why log changes and outliers
📊Shared dashboards you already know
Intermediate will reuse the data quality sandbox, the data flow visualiser, and the data interpretation exercises introduced in Foundations. They will gain more options, but the familiar interfaces stay so you can focus on thinking, not navigation.
Visualisation and communication (the part that decides whether anyone listens)
A dashboard is a user interface for decision-making. If the UX is confusing, you do not get slightly worse decisions. You get busy people making fast guesses.
Worked example. The misleading chart that got a team funded
A chart uses a truncated y-axis so a small change looks dramatic. The story is compelling, funding arrives, and later the organisation realises the effect size was tiny. This is not only a chart problem. It is a governance problem, because the decision process did not demand context and uncertainty.
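The fix is usually mechanical. A minimal matplotlib sketch with an honest baseline (the figures are invented):

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
signups = [980, 1005, 1010, 1030]

fig, ax = plt.subplots()
ax.plot(months, signups, marker="o")
ax.set_ylim(bottom=0)   # keep the zero baseline so a 5% change looks like 5%
ax.set_ylabel("Sign-ups per month")
ax.set_title("Sign-ups, Jan to Apr (all regions)")
plt.show()
```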
Common mistakes in data communication
- Truncated axes and missing baselines.
- No time window clarity, so people compare unlike periods.
- Mixing counts, rates, and percentages on one chart without explanation.
- Colour choices that fail for colour-blind users or overload attention.
Verification. A chart quality checklist
- Can a reader tell what the unit is without guessing.
- Can a reader tell what time window and population are included.
- Is the scale honest and consistent.
- Is there at least one sentence that states the insight and the caveat.
🧾CPD evidence (keep it honest and specific)
This level is CPD-friendly because it builds professional judgement. The best evidence is not “I read a page”. It is “I designed a control, checked a failure mode, and changed how I will work”.
- What I studied: pipelines, governance, interoperability, analysis, and risk.
- What I applied: one pipeline diagram for a real dataset and one check that would catch silent failure.
- What I learned: one insight about meaning contracts (schema, keys, or definitions) that I will now insist on.
- Evidence artefact: the diagram plus a short note on assumptions and checks.
