CPD timing for this level
Advanced time breakdown
This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.
What changes at this level
Level expectations
I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.
Architecture, governance, and trade-offs at scale.
Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.
- A short 'numbers I do not trust' list for a dashboard and the checks that would earn trust.
- A distribution and bias review: what could distort conclusions and how you would communicate uncertainty.
- An architecture trade-off note: warehouse vs lakehouse vs hybrid, including cost, latency, governance, and ownership.
Data Advanced
CPD tracking
Fixed hours for this level: 12. Time for the timed assessment is counted once, on a pass.
CPD and certification alignment (guidance, not endorsed)
Advanced is about scale, uncertainty, and strategy. This is where you learn to be precise when stakes are high. It maps well to:
- DAMA DMBOK and CDMP mindset (governance and stewardship at scale)
- Cloud data architecture tracks (warehouse, lakehouse, streaming, reliability, cost)
- Statistics foundations used across respected professional curricula (sampling, bias, uncertainty, distribution thinking)
This level assumes you are comfortable with Foundations and Intermediate. The focus now is scale, complexity, mathematical reasoning, and strategic impact. The notes are written as my perspective on how senior data professionals and architects think when systems get serious.
🔢Mathematical foundations of data systems
Maths in data systems describes patterns, uncertainty, and change. Abstraction turns messy reality into numbers we can reason about. At small scale numbers feel friendly. At scale, tiny errors compound and variation matters more than a single “best guess”.
My opinion: people do not fear maths because it is hard. They fear it because it is often introduced without kindness. If I introduce a symbol, I will tell you what it means, and I will show you a concrete example before I move on.
Mean (average)
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Definitions:
- $x_i$: the $i$-th value in your dataset
- $n$: number of values
- $\bar{x}$: mean of the values
Variance (spread)
$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Definitions:
- $\sigma^2$: variance
- $x_i - \bar{x}$: deviation from the mean
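A minimal sketch of these definitions in Python. The values are invented for illustration, and it uses the population form of variance to match the formula above.

```python
# Compute the mean and variance by hand, following the definitions above.
values = [18, 20, 22, 19, 21]            # invented sample

n = len(values)
mean = sum(values) / n                    # x-bar: the single "best guess"
deviations = [x - mean for x in values]   # x_i - x-bar
variance = sum(d * d for d in deviations) / n  # average squared deviation

print(f"mean={mean:.2f}, variance={variance:.2f}")
```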
Probability distributions describe how likely values are. A fair coin has a distribution of 0.5 heads and 0.5 tails. Real data distributions are rarely symmetrical. Knowing the shape stops us from trusting averages blindly.
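A small example of why shape matters. With one extreme value, the mean and the median tell different stories; the numbers below are invented.

```python
# A skewed distribution: most values are small, one is extreme.
from statistics import mean, median

incomes = [21_000, 23_000, 24_000, 25_000, 26_000, 250_000]

print(f"mean   = {mean(incomes):,.0f}")    # pulled up by the outlier
print(f"median = {median(incomes):,.0f}")  # closer to the typical value
```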
Maths ladder (optional). From intuition to rigour, properly explained
Foundations. Ratios, units, and “does this number even make sense”
Expert data work starts with units. A number without a unit is a rumour. If one system logs energy in kWh and another in MWh, the data can be perfectly stored and perfectly wrong. Simple check: write the unit beside the value and ask if the magnitude is plausible.
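A minimal sketch of that check in Python, with hypothetical readings and unit labels: write the unit next to the value, then normalise to one unit before comparing or aggregating.

```python
# Hypothetical energy readings logged in different units.
readings = [
    {"site": "A", "energy": 4_200, "unit": "kWh"},
    {"site": "B", "energy": 4.1,   "unit": "MWh"},  # similar magnitude, different unit
]

def to_kwh(value, unit):
    """Normalise every reading to kWh before any comparison."""
    factors = {"kWh": 1, "MWh": 1_000}
    return value * factors[unit]

for r in readings:
    print(r["site"], to_kwh(r["energy"], r["unit"]), "kWh")
```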
Next step. Dot product (a simple way to combine measurements)
Intuition: it combines two lists of measurements into one number. In modelling, dot products appear everywhere (for example, linear models).
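A sketch of the intuition, with invented features and weights: the dot product $a \cdot b = \sum_i a_i b_i$ multiplies the lists element by element and sums the results into one number, which is exactly what a simple linear model does.

```python
# Combine two lists of measurements into a single score.
features = [3.0, 1.0, 250.0]       # e.g. bedrooms, bathrooms, floor area (invented)
weights  = [40_000, 15_000, 900]   # how much each feature contributes (invented)

score = sum(f * w for f, w in zip(features, weights))  # the dot product
print(score)
```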
Deeper. Bayes’ theorem (updating beliefs with evidence)
Bayes’ theorem connects conditional probabilities: $P(H \mid E) = \dfrac{P(E \mid H)\,P(H)}{P(E)}$
- $H$: a hypothesis (for example, “the alert is a real incident”)
- $E$: observed evidence (for example, “we saw a suspicious login pattern”)
- $P(H)$: prior belief about $H$ before seeing $E$
- $P(E \mid H)$: likelihood of seeing $E$ if $H$ is true
- $P(E)$: overall probability of seeing $E$
Why it matters: many “data decisions” are really belief updates under uncertainty. The maths keeps you honest about what evidence can and cannot justify.
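A minimal worked update for the alert example. All of the probabilities below are invented to show the mechanics, not taken from any real system.

```python
# Bayes update: how much should one piece of evidence move our belief?
p_h = 0.01              # P(H): prior that an alert is a real incident
p_e_given_h = 0.90      # P(E|H): suspicious login pattern seen if it is real
p_e_given_not_h = 0.05  # P(E|not H): same pattern seen when it is benign

# P(E) via the law of total probability
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

p_h_given_e = p_e_given_h * p_h / p_e  # posterior P(H|E)
print(f"P(real incident | evidence) = {p_h_given_e:.1%}")  # roughly 15%
```

Even strong-looking evidence leaves the posterior well below certainty here, because the prior was small. That is the honesty the maths enforces.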
Rigour direction (taste). Error propagation and compounding uncertainty
In real pipelines, uncertainty compounds. A value is measured, transformed, aggregated, and modelled. If each step adds error, you can end up with a final metric that looks precise but is not.
The serious lesson: do not only track “numbers”. Track how the numbers were produced and what could distort them. This is why lineage and verification exist.
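A back-of-the-envelope sketch of compounding. The 2% per-stage error is an assumption for illustration only; the point is that modest per-step error grows once it multiplies through a pipeline.

```python
# Worst-case drift when each stage adds a small relative error.
per_stage_error = 0.02  # assumed 2% relative error per step
stages = ["measure", "transform", "aggregate", "model"]

compounded = 1.0
for stage in stages:
    compounded *= (1 + per_stage_error)

print(f"worst-case drift after {len(stages)} stages: {compounded - 1:.1%}")  # ~8.2%
```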
From observations to vectors
Raw values into structure
Raw observations
Temperatures: 18, 20, 22.
Vector
[18, 20, 22]
Aggregation
Mean 20, variance measures spread.
Quick check. Mathematical foundations
Scenario: The average call wait time improved, but complaints increased. Give a data reason that could explain both being true
What does variance capture
What is standard deviation
Scenario: You compare two conversion rates from small samples. What should you be careful about before declaring a winner
Why can averages lie
🗄️Data models and abstraction at scale
Models are simplified representations of reality. They exist so teams can agree on how data fits together. Abstraction hides detail to make systems manageable. The risk is that hidden detail was needed for a decision you care about.
Entity relationships show how things connect. Customers place orders, orders contain items. Dimensional models separate facts (events) from dimensions (who, what, when). Simpler models are easy to query but may miss nuance. Richer models can be harder to govern.
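A toy sketch of that separation, with hypothetical table and field names: facts record events, and dimensions carry the who/what/when context you join back in when you query.

```python
# Dimensions: slowly changing context about customers and products.
dim_customer = {101: {"name": "Asha", "region": "North"}}
dim_product = {"P9": {"name": "Kettle", "category": "Home"}}

# Facts: one row per event, pointing at the dimensions by key.
fact_orders = [
    {"order_id": 1, "customer_id": 101, "product_id": "P9", "quantity": 2, "total": 49.98},
]

# A query joins the event back to its context.
for order in fact_orders:
    customer = dim_customer[order["customer_id"]]
    print(customer["region"], order["total"])
```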
Design trade-offs are unavoidable. A lean model may skip location because it is not needed today. Later, when someone asks about regional patterns, the model cannot answer. Bias also hides in models: if a field is dropped, whole groups can disappear from analysis.
Worked example. The field you delete is the question you cannot answer later
A team drops location data “because it is messy”. Six months later, an incident requires regional analysis. The team scrambles for ad-hoc extracts and guesses, because the model made the question impossible.
My opinion: data models are long-term commitments. When you drop a field, you are not only simplifying. You are deciding which questions future you is not allowed to ask.
Verification. Check your model before you build on it
- Name three questions your model must answer today.
- Name one question it should answer in six months, even if you do not need it yet.
- Identify one field that is high risk (personal data, sensitive attributes) and state how it will be protected or removed.
Abstraction and loss
What the model keeps vs removes
Raw world
All details captured.
Simplified model
Key entities and fields only.
Lost detail
Questions that are now impossible.
Quick check. Models and abstraction
What is abstraction
How can models create bias
Why do dimensional models separate facts and dimensions
Scenario: A team deletes location because it is messy. Six months later you need regional analysis. What happened
What is a design trade-off
📈Advanced analytics and inference
Inference is about drawing conclusions while admitting uncertainty. Correlation means two things move together. Causation means one affects the other. Mistaking correlation for causation leads to confident but wrong decisions.
Sampling takes a subset of the population. If the sample is biased or too small, the answer will drift from reality. Confidence is how sure we are that the sample reflects the population. Errors creep in when data is noisy, samples are skewed, or models are overconfident.
Statistics is humility with numbers. Every estimate should come with a range and a note on what could be wrong.
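As a sketch of “every estimate should come with a range”, here is a small-sample comparison of two conversion rates. The counts are invented, and the normal approximation is crude at these sample sizes, which is part of the point.

```python
import math

def rate_with_interval(conversions, visitors, z=1.96):
    """Conversion rate with a rough 95% interval (normal approximation)."""
    p = conversions / visitors
    half_width = z * math.sqrt(p * (1 - p) / visitors)
    return p, p - half_width, p + half_width

for name, conversions, visitors in [("A", 11, 90), ("B", 16, 95)]:
    p, lo, hi = rate_with_interval(conversions, visitors)
    print(f"{name}: {p:.1%}  (95% interval roughly {lo:.1%} to {hi:.1%})")

# The intervals overlap heavily, so "B wins" is not yet a safe conclusion.
```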
Common mistakes (the expensive ones)
- Treating “statistically significant” as “important”, without checking effect size and practical impact.
- Running many comparisons until something looks exciting, then stopping.
- Treating a model score as truth rather than a measurement with uncertainty and failure modes.
- Reporting a single number without showing distribution or tail behaviour.
Sampling and bias
Population vs sample
Population
All data points.
Sample
Subset that may miss parts of the population.
Quick check. Analytics and inference
What is correlation
What is causation
Scenario: Your dataset only includes customers who completed a journey. What bias risk does that introduce
Why is sampling risky
Why include confidence
☁️Data platforms and distributed systems
Data systems distribute to handle scale and resilience. Latency is the time it takes to respond. Consistency is how aligned copies of data are. Failures are normal in distributed systems, so we plan for them instead of hoping they do not happen.
A simple intuition: you can respond fast by reading nearby copies, but those copies might be slightly out of date. Or you can wait for all copies to agree and respond slower. Users and use cases decide which trade-off is acceptable.
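A toy model of that intuition, with invented replica values and latencies: the nearby read answers in a couple of milliseconds but returns the old value, while waiting for every copy costs you the slowest replica.

```python
# Fast-but-maybe-stale read versus slow-but-agreed read (toy numbers).
replicas = [
    {"name": "local",  "value": 41, "latency_ms": 2},   # not yet updated
    {"name": "peer-1", "value": 42, "latency_ms": 40},
    {"name": "peer-2", "value": 42, "latency_ms": 55},
]

fast_read = replicas[0]  # respond from the nearest copy
print("fast read:", fast_read["value"], f"({fast_read['latency_ms']} ms, may be stale)")

agreed_value = max(r["value"] for r in replicas)     # stand-in for "latest agreed value"
total_wait = max(r["latency_ms"] for r in replicas)  # must wait for the slowest replica
print("consistent read:", agreed_value, f"({total_wait} ms)")
```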
Worked example. The dashboard says “yesterday”, but the decision is “now”
Eventual consistency can be perfectly acceptable for a monthly report. It can be unacceptable for fraud detection, outage response, or operational dispatch.
My opinion: the right question is not “is it consistent”. The right question is “consistent enough for which decision, at which time”.
Distributed trade-offs
Nodes, replication, and failure paths
Nodes
Store and serve data.
Replication
Copies for resilience.
Failure paths
What happens when a node is slow or down.
Quick check. Platforms and distributed systems
Why do systems distribute
What is latency
What is consistency
Scenario: A fraud system must be correct now. Is eventual consistency a good fit
Why is perfection impossible
⚖️Governance, regulation and accountability
Regulation exists to protect people and markets. Accountability means someone can explain what data is used, why, and with what safeguards. Auditability means we can trace who did what and when. These are not just legal boxes. They build trust with users and stakeholders.
Ethics and trust sit beside regulation. If a decision harms people, compliance alone is not enough. Long term consequences include fines, loss of reputation, and slower delivery because teams stop trusting data.
Governance at scale. A practical view of DAMA style coverage
Many organisations use a DAMA DMBOK style lens to describe data management capabilities. I treat it as an orientation map, not scripture. The useful part is that it forces you to look at the whole system, not only the warehouse.
Data management capability map (plain English)
What you must be able to do, not what you must call it
Governance and ownership
Decision rights, policy, accountability, stewardship.
Architecture and modelling
Canonical models, contracts, schemas, and context boundaries.
Quality, metadata, lineage
Definitions, freshness, profiling, and traceability.
Security and privacy
Access, retention, audit, encryption, monitoring.
Platforms and operations
Storage choices, reliability, cost, and incident response.
Delivery and consumption
Dashboards, APIs, data products, and user experience.
Common mistakes (enterprise governance edition)
- Calling something “governed” because there is a document, not because controls are enforced.
- Creating committees with no decision rights, then wondering why teams route around them.
- Treating metadata as optional. When incidents happen, metadata becomes the evidence trail.
Verification. A defensible explanation a regulator would accept
- Write one paragraph explaining what the dataset is for, who can access it, and why.
- State what would trigger an investigation (unexpected access, unusual exports, anomalous changes).
- Describe one control that reduces harm, not only one that reduces paperwork.
Oversight across the lifecycle
Checks at each stage
Collect
Consent and lawful basis.
Store
Access controls and retention.
Use
Purpose limits and monitoring.
Share
Contracts, masking, logging.
Quick check. Governance, regulation, and accountability
Why does regulation exist
What is accountability
Why is auditability useful
Scenario: A dataset is compliant to share, but it will predict something sensitive people did not expect. What should you do
Why is ethics more than compliance
💼Data as a strategic and economic asset
Data creates value when it improves decisions, products, and relationships. Network effects appear when sharing makes each participant better off. Competitive advantage comes from combining quality data with disciplined execution, not from hoarding alone.
Monetisation can be direct (selling insights) or indirect (better products). Lock-in can help or hurt: it keeps customers, but it can also trap you with legacy systems. Long term risk comes from overcollecting, underprotecting, or failing to renew data pipelines.
Data as a product (the difference between reuse and “please send me the extract”)
If every request becomes a one-off extract, you are not running a data capability. You are running a bespoke reporting service. A data product is a dataset with an interface, documentation, quality guarantees, and an owner. It is designed for consumers.
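A sketch of what that looks like written down. The names, addresses, and guarantees below are hypothetical; the point is that a data product carries an owner, an interface, and explicit guarantees, not just rows.

```python
# A data product described as a contract a consumer can check.
data_product = {
    "name": "orders_daily",
    "owner": "payments-team@example.org",            # someone accountable
    "interface": "warehouse view: analytics.orders_daily",
    "documentation": "what each column means, and what it does not mean",
    "quality_guarantees": {
        "freshness": "loaded by 06:00 UTC",
        "completeness": "all settled orders, refunds within 24h",
    },
}

# A consumer can verify the contract exists before relying on the data.
assert data_product["owner"] and data_product["quality_guarantees"]
```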
Data mesh, used properly (not as a slogan)
Data mesh is a response to a real organisational problem: central teams become bottlenecks because domains do not own their data. The useful idea is domain ownership plus platform support plus federated governance. The dangerous version is “every team does whatever they want”.
Worked example. A data mesh that failed because the platform was missing
Leaders announce “data mesh”. Domains are told to publish data products. There is no shared platform, no templates, and no quality tooling. Domains publish inconsistent datasets and consumers lose trust.
My opinion: you cannot decentralise responsibility without centralising enablement. If you want domain ownership, you must provide a self-serve platform and a small set of enforced standards.
Verification. Strategy that is not just motivational posters
- Name one value outcome (time saved, risk reduced, revenue protected) and how you would measure it.
- Name one dependency (people, platform, governance) that could block the value.
- Write one uncomfortable trade-off you will accept, and why.
Value over time
Invest, build trust, realise outcomes
Invest
Quality, sharing, analysis.
Trust
Users rely on the data.
Value
Better decisions, new revenue, reduced waste.
Quick check. Data as a strategic asset
What creates data value
What is a network effect
Scenario: A data mesh programme fails quickly. Name one missing ingredient that often explains it
How can lock-in hurt
Why think long term
🧾CPD evidence (senior level, still honest)
At this level, your evidence should show judgement under uncertainty. You are no longer proving you can repeat definitions. You are proving you can make trade-offs, explain them, and defend them.
- What I studied: advanced maths foundations, inference, distributed trade-offs, governance, and strategy.
- What I applied: one concrete decision. Example: “I chose eventual consistency for reporting but required stronger guarantees for operational alerts.”
- What could go wrong: one failure mode and the control that would detect it early.
- Evidence artefact: a short decision record (ADR-style) with assumptions, trade-offs, and verification steps.
