CPD timing for this level

Foundations time breakdown

This is the first pass of a defensible timing model for this level, based on what is actually on the page: reading, labs, checkpoints, and reflection.

  • Reading: 38m (5,741 words · base 29m × 1.3)
  • Labs: 180m (12 activities × 15m)
  • Checkpoints: 30m (6 blocks × 5m)
  • Reflection: 48m (6 modules × 8m)
  • Estimated guided time: 4h 56m, based on page content and disclosed assumptions.
  • Claimed level hours: 8h. The claim includes reattempts, deeper practice, and capstone work.
The claimed hours are higher than the current on-page estimate by about 3h. That gap is where I will add more guided practice and assessment-grade work so the hours are earned, not declared.

What changes at this level

Level expectations

I want each level to feel independent, but also clearly deeper than the last. This panel makes the jump explicit so the value is obvious.

Anchor standards (course wide)
  • DAMA-DMBOK (data management framework)
  • UK GDPR and ICO guidance (where privacy matters)
Assessment intent (Foundations): vocabulary, formats, and basic quality reasoning.

Assessment style: mixed format. Pass standard: coming next.

Not endorsed by a certification body. This is my marking standard for consistency and CPD evidence.

Evidence you can save (CPD friendly)
  • A one page dataset definition: what it means, unit, owner, update frequency, and the decision it supports.
  • A simple data quality checklist you ran on a real dataset (missingness, duplicates, invalid values) plus what you changed.
  • A lifecycle map: where the data comes from, where it is stored, who uses it, and what could make it wrong.

Data Foundations


CPD tracking

Fixed hours for this level: 8. Timed assessment time is included once on pass.


CPD and certification alignment (guidance, not endorsed)

This course is written to support defensible CPD evidence and practical competence. It also covers skills that map well to respected pathways, without claiming endorsement:

  • DAMA-DMBOK and the CDMP mindset (governance, stewardship, meaning, and quality)
  • CompTIA Data+ as a baseline for applied data competence
  • Vendor data engineering tracks (AWS, Azure, Google Cloud) for architecture and delivery patterns
How to use Data Foundations
If you are new, I will keep this simple without lying. If you are experienced, I will keep it rigorous without showing off.
Good practice
Pick one dataset you know and apply each concept to it: meaning, units, missingness, ownership, and what could make it wrong.

This level sets out how data exists, moves, and creates value before any heavy analysis or security work. It keeps the language simple, introduces light maths, and shows how real systems depend on data discipline.


📊

What data is and why it matters

Concept block
Event to decision
Data becomes useful when it keeps meaning from capture to decision.
Assumptions
Definitions are shared
Units are explicit
Failure modes
Numbers without meaning
Decision drift

Data starts as recorded observations: numbers on a meter, text in a form, or pixels in a photo. When we add structure it becomes information that people can read. When we apply it to decisions it becomes knowledge. Data existed long before computers: bank ledgers, census books, medical charts. Modern systems are data driven because every click, sensor, and transaction can be captured and turned into feedback.

Banking relies on clean transaction data to spot fraud. Energy grids depend on meter readings to balance supply and demand. Healthcare teams use lab results and symptoms to guide care. AI systems learn from past data to make predictions, which means they also inherit any gaps or mistakes. Keeping the difference between raw data, information, and knowledge clear helps us avoid mixing facts with opinions.

From event to decision

Real world to recorded data to informed action

Real world event

A card payment or a patient temperature.

Recorded data

Numbers, text, or images captured at the time.

Information

Structured so people and systems can read it.

Decision

Approve a payment, adjust a turbine, triage a patient.

Quick check. What data is and why it matters

What is data

Scenario: A spreadsheet says '12'. What extra information turns that into something usable

How does data become information

Scenario: Two teams report different revenue numbers for the same month. Name two likely data reasons before you blame the people

How does information become knowledge

Why do AI models inherit data issues

🧠

Data, information, knowledge, judgement

Concept block
DIKW as a map
DIKW is useful when it keeps facts separate from interpretation and decision.
Assumptions
Interpretation is stated
Judgement is owned
Failure modes
Facts and opinions mixed
Automation without judgement

I want a simple model in your head that stays useful even when the tools change. A common one is DIKW. The point is not the pyramid. The point is the distinction.

DIKW (useful version)

From recorded observations to decisions you can defend

Data

Recorded observations.

Example: readings, clicks, timestamps.

Information

Data with context and meaning.

Example: units, location, who collected it, what it represents.

Knowledge

Patterns you can explain.

Example: demand rises at 18:00, outages cluster after storms.

Judgement

Action under uncertainty.

Example: intervene, hold, investigate, or automate with guardrails.

Worked example. A number without context is a rumour

Suppose a dashboard shows “12.4”. Is that 12.4 kWh. 12.4 MWh. 12.4 percent. 12.4 incidents. 12.4 minutes. The number is not the problem. The missing context is the problem.

My opinion: if you cannot answer “what does this represent” and “what would make it wrong”, you do not have information yet. You have vibes with a font size.

Common mistakes (DIKW edition)

  • Treating charts as truth without asking how the data was produced.
  • Mixing “measurement” with “meaning”. The sensor measured something. You are interpreting it.
  • Skipping uncertainty. Many datasets are estimates, not direct observations.

Verification. Prove you can separate meaning from numbers

  • Pick one metric you care about. Write the unit, the definition, and the decision it supports.
  • Write one way it could be wrong (missing data, unit mismatch, selection bias, duplication).
  • Write one check that would detect that failure mode.

📏

Units, notation, and the difference between percent and probability

Concept block
Units protect meaning
Units and notation are how you stop data from lying through ambiguity.
Assumptions
Units are written where used
Conversions are controlled
Failure modes
Unit mismatch
Ambiguous notation

Data work goes wrong when people are casual about units. Units are not decoration. Units are the meaning. This is why I teach it early and I teach it bluntly.

Worked example. kWh and MWh are both “energy” and still not the same number

If one dataset records energy in kWh and another records energy in MWh, then the same physical quantity will appear with numbers that differ by a factor of 1000. A join can be perfectly correct and the final answer can be perfectly wrong.
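
A minimal sketch of that defence in Python: convert everything to one unit before you combine datasets, and fail loudly on units you do not recognise. The field names and records here are invented for illustration.

```python
# Minimal sketch: normalise energy readings to kWh before combining datasets.
# Field names ("site", "energy", "unit") are illustrative, not a real schema.

TO_KWH = {"kWh": 1, "MWh": 1_000}  # 1 MWh = 1,000 kWh

def to_kwh(value, unit):
    """Convert an energy value to kWh, failing loudly on unknown units."""
    if unit not in TO_KWH:
        raise ValueError(f"Unknown energy unit: {unit!r}")
    return value * TO_KWH[unit]

readings = [
    {"site": "A", "energy": 12.4, "unit": "kWh"},
    {"site": "B", "energy": 0.5, "unit": "MWh"},  # same kind of quantity, different unit
]

print([to_kwh(r["energy"], r["unit"]) for r in readings])  # [12.4, 500.0] — now comparable
```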

A small cheat sheet you can reuse

  • Percent: out of 100. Example: 12% means 12 out of 100.
  • Probability: out of 1. Example: 0.12 means 12 out of 100.
  • Rate: per unit time. Example: 3 requests per second.
  • Count: how many. Example: 3 outages.
  • Amount: quantity with unit. Example: 3 kWh.

Verification. Spot the three most common confusion traps

  • If a field is a percentage, is it stored as 12 or 0.12. Write it down.
  • If a timestamp exists, is it UTC. If not, what is it.
  • If a value looks “too big” or “too small”, check the unit before you argue about the trend.

📝

Data representation and formats

Concept block
Representation layers
Representation choices affect what can be stored, exchanged, and trusted.
Assumptions
Encoding is consistent
Schema reflects reality
Failure modes
Silent truncation
Incompatible formats

Computers store everything using bits (binary digits) because hardware can reliably tell two states apart. A byte is eight bits, which can represent 256 distinct values. Encoding maps symbols to numbers, while a file format adds structure on top. CSV is plain text with commas, JSON wraps name value pairs, XML uses nested tags, images store grids of pixels, and audio stores wave samples. The wrong format or encoding breaks systems because the receiver cannot parse what was intended.

A byte can represent 0 to 255. Powers of two help size things: 2^3 = 8 means three binary places can represent eight values. Plain English: two multiplied by itself three times equals eight. Binary choices stack quickly.
  • Bit: smallest unit, either 0 or 1.
  • Byte: 8 bits, often one character in simple encodings.
  • 2^n: number of combinations with n bits.

Characters to numbers to bits

Encoding then binary storage

Characters

"A", "7", "e"

Numbers via encoding

"A" -> 65, "7" -> 55, "e" -> 101

Bits in memory

65 -> 01000001
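
A tiny Python sketch of the same chain, using only built-ins: character to code point to bit pattern.

```python
# Character -> code point -> bit pattern, using Python built-ins only.
for ch in ["A", "7", "e"]:
    code = ord(ch)               # "A" -> 65, "7" -> 55, "e" -> 101
    bits = format(code, "08b")   # 65 -> "01000001"
    print(ch, code, bits)
```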

Worked example. The same text, different bytes

Here is a simple truth that causes surprising damage in real systems: the same characters can be stored as different bytes depending on encoding. If one system writes text as UTF-8 and another reads it as something else, the data is not “slightly wrong”. It is wrong.

My opinion: if your system depends on humans “remembering” encodings, it is already broken. It should be explicit in the interface contract and tested like any other behaviour.
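
Here is a minimal sketch of that failure in Python. The sample text is invented; the point is that a UTF-8 byte sequence read with the wrong encoding produces different characters, and no error tells you so.

```python
# Write text as UTF-8 bytes, then read them back with the wrong encoding.
name = "Müller"                  # invented sample text
data = name.encode("utf-8")      # the bytes that actually sit on disk or on the wire

print(data.decode("utf-8"))      # Müller   (correct round trip)
print(data.decode("latin-1"))    # MÃ¼ller  (classic mojibake: wrong decode, no exception raised)
```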

Common mistakes (and how to avoid them)

  • Assuming “text is text” and skipping the encoding field in a file export.
  • Mixing CSV with commas inside fields but not using proper quoting.
  • Treating JSON as “order matters”, then writing brittle parsers that depend on property order.
  • Losing leading zeros by storing identifiers as numbers (postcode, meter IDs, account IDs). If it is an identifier, it is usually a string.

Tiny habit, huge payoff

If a value is “a label”, store it as a string. If it is “a quantity you do maths on”, store it as a number. This simple rule prevents a lot of silent data damage.
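
A small sketch of that rule in Python, using an invented CSV with meter IDs. Treat the field names as illustrative, not as a real schema.

```python
import csv
import io

# Invented CSV: an identifier with leading zeros plus a numeric reading.
raw = "meter_id,reading_kwh\n00421,12.4\n07310,9.1\n"

for row in csv.DictReader(io.StringIO(raw)):
    as_string = row["meter_id"]          # "00421" — identifier preserved, joins keep working
    as_number = int(row["meter_id"])     # 421     — leading zeros silently gone
    reading = float(row["reading_kwh"])  # quantities are the values you do maths on
    print(as_string, as_number, reading)
```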

Verification. How you know you understood it

  • Use the tool above and confirm you can explain the difference between the character "A", the number 65, and the bits 01000001.
  • Take one file format you use at work and write down what makes it structured. For example: delimiters, quoting rules, schema expectations, or metadata.
  • Explain, in one paragraph, why “binary” is about representation and not about meaning.

Maths ladder (optional). From intuition to rigour

You can learn data without advanced maths, but you cannot become an expert without eventually becoming comfortable with symbols. The goal here is not to show off. It is to make the symbols friendly and precise.

Foundations. Powers of two and counting possibilities

If a system has n bits, each bit has two possible states (0 or 1). The total number of possible bit patterns is:

2^n
  • n: number of bits (an integer)
  • 2^n: number of distinct patterns (how many different values you can represent)

Example: n = 8 (one byte). Then 2^8 = 256. So a byte can represent 256 distinct values, typically 0 to 255.

Next step. Base conversion and why it matters for data

A binary number is a sum of powers of two. If you see 01000001, the 1s mark which powers are included:

01000001_2 = 0\cdot2^7 + 1\cdot2^6 + 0\cdot2^5 + 0\cdot2^4 + 0\cdot2^3 + 0\cdot2^2 + 0\cdot2^1 + 1\cdot2^0

That equals 64 + 1 = 65. Why it matters: when data gets corrupted at the byte level (bad encoding, wrong parsing, truncation), the meaning upstream is gone. You cannot “fix it later” reliably because you do not know what the original bits were meant to represent.
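
The same conversion checked in Python, so you can see the manual expansion and the built-in base-2 parser agree.

```python
bits = "01000001"

# Manual sum of powers of two (the expansion above, written as code).
value = sum(int(b) * 2 ** i for i, b in enumerate(reversed(bits)))
print(value)              # 65

# Python's base-2 parser agrees, and 65 is "A" in ASCII/UTF-8.
print(int(bits, 2))       # 65
print(chr(65))            # A
```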

Deeper. Information content (intuition)

The less predictable something is, the more information it carries. If a value is always the same, it carries no surprise. A common formal measure is entropy. In the simplest discrete case:

H(X) = -\sum_x p(x)\log_2 p(x)
  • X: a random variable (the thing that can take different values)
  • x: a particular value of X
  • p(x): probability that X = x
  • H(X): entropy in bits

Example: a fair coin has p(heads) = 0.5 and p(tails) = 0.5. Then H(X) = 1 bit. A biased coin has less. Why it matters in data: highly predictable fields can still be important (for joining and identifiers), but they often carry little information for modelling. This is one reason “more columns” is not the same as “more value”.
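
A small Python sketch of the formula, so you can check the fair and biased coin numbers yourself.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum p(x) * log2 p(x)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   — fair coin, maximum surprise for two outcomes
print(entropy([0.9, 0.1]))   # ~0.47 — biased coin carries less information
print(entropy([1.0]))        # 0.0   — always the same value, no information at all
```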

Quick check. Representation and formats

What is a bit

What does encoding do

Scenario: A colleague opens a CSV and names look corrupted (odd symbols). What is the likely cause

What is CSV

Scenario: When would you pick JSON over CSV

Scenario: A dataset has leading zeros in IDs but Excel keeps removing them. What should you do

Why does binary suit computers

🔌

Standards, schemas, and interoperability

Concept block
Standards reduce translation
Standards reduce repeated translation work by creating shared meaning and stable interfaces.
Assumptions
Standards are adopted
Mappings are maintained
Failure modes
Paper standards
Semantic drift

Interoperability is a boring word for a very expensive problem: two systems can both be “correct” and still disagree because they mean different things. Standards are the shared rules that reduce translation work. Not because standards are morally pure, but because without shared meaning you spend your life reconciling spreadsheets and arguing in meetings.

What a standard really is

A standard can be: a file format (CSV, JSON), a schema (field definitions), a data model (how entities relate), or a message contract (API request and response). Good standards do two jobs: they make systems compatible, and they make errors visible earlier.

Worked example. “Customer” broke your dashboard, not your code

System A records “customer” as the bill payer. System B records “customer” as the person who contacted support. A dashboard joins them and reports “customers contacted”. Leadership changes policy based on that number. Nobody wrote a bug. The definition was the bug.

Common mistakes in standards work

  • Choosing field names first and definitions later.
  • Treating schemas as documentation only, not as something you validate in pipelines.
  • Changing a contract without versioning, then blaming “downstream” when things break.

Verification. A small contract you can write today

  • Pick one dataset. Write 5 fields with units, allowed ranges, and what “missing” means.
  • Write one identifier field and state whether it is a string or number, and why.
  • Write what changes are breaking, and how you would version them.
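
One possible shape for the small contract described above, written as a Python dictionary you could validate rows against. Every field name, unit, and range here is invented for illustration, not a recommendation for your data.

```python
# An invented, machine-checkable contract: five fields with units, allowed
# ranges, and an explicit statement of what "missing" means.
CONTRACT = {
    "meter_id":    {"type": str,   "unit": None,                      "missing": "never allowed"},
    "reading_kwh": {"type": float, "unit": "kWh", "range": (0, 100_000),
                    "missing": "meter offline; keep the row and flag it"},
    "read_at":     {"type": str,   "unit": "UTC timestamp, ISO 8601", "missing": "never allowed"},
    "region":      {"type": str,   "unit": None,                      "missing": "map to 'unknown'"},
    "tariff_pct":  {"type": float, "unit": "percent, 0-100 not 0-1", "range": (0, 100),
                    "missing": "no silent default"},
}

def check_row(row):
    """Return a list of contract violations for one record."""
    problems = []
    for field, rule in CONTRACT.items():
        value = row.get(field)
        if value is None:
            problems.append(f"{field}: missing ({rule['missing']})")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}, got {type(value).__name__}")
            continue
        if "range" in rule and not (rule["range"][0] <= value <= rule["range"][1]):
            problems.append(f"{field}: {value} outside {rule['range']}")
    return problems

row = {"meter_id": "00421", "reading_kwh": 12.4,
       "read_at": "2024-05-01T18:00:00Z", "region": "north", "tariff_pct": 12.0}
print(check_row(row))  # [] — this record satisfies the contract
```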

🌍

Open data, data sharing, and FAIR thinking

Concept block
Open and FAIR as choices
Open data and FAIR principles help when you are clear about audience, risk, and governance.
Assumptions
Audience is known
Risk is assessed
Failure modes
Accidental disclosure
Unusable open data

Open data is not “everything on the internet”. It is a choice about access and reuse. Some data should be open because it improves transparency and innovation. Some data must stay restricted because it contains personal or security-sensitive information. A mature organisation can explain the difference without hand-waving.

The data spectrum (closed, shared, open)

Most real-world data lives in the middle: shared with specific parties under agreements. The useful question is not “open or closed”. It is “who can access, for what purpose, with what safeguards, and for how long”.

FAIR as a quality lens

FAIR means findable, accessible, interoperable, reusable. It does not automatically mean public. It is a lens you can use to judge whether a dataset is actually usable by someone who is not already in your team.

Common mistakes in data sharing

  • Publishing data with no metadata, then acting surprised when people misuse it.
  • Sharing data without a licence or usage rules, then arguing about “ownership” later.
  • Removing identifiers and assuming it is anonymous. Many datasets can be re-identified via linkage.

Verification. Make a dataset shareable in a way you would trust

  • Write a title, description, update frequency, and contact owner.
  • List the units and definitions for key fields.
  • State what the dataset can and cannot be used for.
  • State whether it is open, shared, or restricted, and why.

📈

Visualisation basics (so charts do not lie to you)

Concept block
Charts are arguments
Visualisation is how you argue with evidence. Poor charts mislead quietly.
Assumptions
Axes are honest
Context is included
Failure modes
Misleading scale
Chart without question

Visualisation is part of data literacy. A chart is an argument. It can be honest or misleading. The goal in Foundations is not to become a designer. The goal is to stop being fooled by bad charts, including your own.

Worked example. Same data, different story

Two charts show the same numbers. One uses a consistent scale. The other uses a cropped axis so small changes look huge. If you react emotionally to the second chart, that is not a personal flaw. That is a design choice manipulating attention.
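
If you want to see this on your own screen, here is a minimal matplotlib sketch: the same invented numbers plotted twice, once on a full scale and once on a cropped axis.

```python
import matplotlib.pyplot as plt

# The same invented monthly figures, plotted twice.
months = ["Jan", "Feb", "Mar", "Apr"]
values = [102, 104, 103, 105]

fig, (full, cropped) = plt.subplots(1, 2, figsize=(8, 3))

full.bar(months, values)
full.set_ylim(0, 120)          # scale includes zero: the change looks modest
full.set_title("Full scale")

cropped.bar(months, values)
cropped.set_ylim(100, 106)     # cropped axis: the same change looks dramatic
cropped.set_title("Cropped axis")

plt.tight_layout()
plt.show()
```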

Verification. Four questions before you trust a chart

  • What is the unit.
  • What is the time window.
  • What is included and excluded.
  • Would a different scale change the feeling of the story.

Data quality and meaning

Concept block
Quality is a loop
Quality is not a one-time check. It is detection and repair over time.
Assumptions
Quality has owners
Expectations are written
Failure modes
Quality checked once
Fixing symptoms

Quality means data is accurate (close to the truth), complete (not missing key pieces), and timely (fresh enough to be useful). A sensor reading of 21°C is useless if the timestamp is missing. Noise is random variation that hides patterns, while signal is the meaningful part. Bias creeps in when some groups are missing or when measurements are skewed. Models and dashboards inherit these flaws because they cannot tell if the input is wrong.

Context and metadata preserve meaning: units, collection methods, and who collected the data. If a temperature has no unit, is it Celsius or Fahrenheit? Data without context invites bad decisions.

Worked example. One outlier can rewrite your story

Suppose we record response times for a service (in milliseconds): 110, 120, 115, 118, 5000. The first four values look like a normal service. The last value could be a real outage or a measurement problem. If you only report the average, you might accidentally tell everyone the service is slow when it is usually fine, or tell everyone it is fine when the tail behaviour is hurting real users.

My opinion: whenever someone shows me a single average, I immediately ask “what is the spread?” and “what does bad look like?”. That one habit saves weeks of nonsense.

Maths ladder. Signal, noise, and how to measure “typical”

Foundations. Mean (average) and why it can mislead

The mean of values x_1, x_2, \dots, x_n is:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
  • n: number of values
  • x_i: the i-th value
  • \bar{x}: the mean

In the example 110, 120, 115, 118, 5000, the mean is pulled up sharply by 5000.

A-level. Variance and standard deviation (spread)

A simple measure of spread is the (population) variance:

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
  • \sigma^2: variance
  • \sigma: standard deviation, where \sigma = \sqrt{\sigma^2}

Intuition: variance is “average squared distance from the mean”. Large variance means values are spread out.
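
The example numbers run through these formulas with Python's statistics module; pvariance and pstdev match the population definitions above.

```python
import statistics

times_ms = [110, 120, 115, 118, 5000]

print(statistics.mean(times_ms))       # 1092.6 — dragged up by one extreme value
print(statistics.median(times_ms))     # 118    — closer to typical behaviour
print(statistics.pvariance(times_ms))  # population variance, the sigma^2 above
print(statistics.pstdev(times_ms))     # ~1953.7 — huge spread is itself a warning sign
```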

Undergraduate. Sampling bias and missingness (why data can be wrong even when numbers are correct)

Data can be numerically correct and still misleading if the sample does not represent the population you care about. This is sampling bias. Missingness matters too. Missing completely at random is rare in real systems. Often values are missing because of a reason (sensor downtime, people not completing forms, systems timing out). When missingness has structure, it can distort analysis and models.
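
A minimal pandas sketch of one useful habit: check missingness by group before you trust a field. The frame below is invented; the pattern (one group mostly missing) is the thing to look for.

```python
import pandas as pd

# Invented example: a key field that is mostly missing for one region.
df = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "south"],
    "reading": [12.4,    13.1,    11.1,    None,    None],
})

# Share of missing values per region. Structure in missingness is a warning sign:
# it usually has a cause (sensor downtime, a broken form, a timeout).
missing_by_region = df["reading"].isna().groupby(df["region"]).mean()
print(missing_by_region)   # north 0.0, south ~0.67
```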

Rigour direction (taste, not a full course). Robust statistics

Real systems often produce heavy tails and outliers. Robust methods reduce sensitivity to extremes. Two examples you will meet in serious work:

  • Medians and quantiles: focus on typical behaviour and tail risk.
  • M-estimators: replace squared error with loss functions that punish outliers less aggressively than (x - \mu)^2.

You do not need to memorise these now. The point is to build the instinct: choose summaries that match the decision you are making.

Common mistakes (the ones I keep seeing)

  • Treating missing values as “just blanks” without asking why they are missing.
  • Removing outliers automatically without checking whether they represent real failures.
  • Mixing units (kW vs MW, seconds vs milliseconds) and then trusting the graph.
  • Using accuracy language loosely. “Accurate” is about closeness to truth, not “looks plausible on a dashboard”.
  • Treating an ID as a number, then losing leading zeros and breaking joins later.

Verification. Prove your understanding

  • Take a tiny dataset you control (even five rows) and write down: units, timestamp meaning, and acceptable range.
  • Identify one field that is a quantity and one that is an identifier, and justify the data type choice.
  • Explain the difference between noise and bias with one concrete example of each.

Clean vs noisy data

Quality affects every step

Clean data

Complete timestamps, sensible ranges, clear units.

Noisy or biased data

Missing fields, extreme outliers, underrepresented groups.

Quick check. Data quality and meaning

What is accuracy

What is completeness

What is timeliness

Scenario: A key field is 30 percent missing for one region. What should you do before building a model or a dashboard

How does bias enter data

Why do models inherit data problems

Scenario: A number looks correct but decisions based on it are wrong. What is a common data reason

Why is metadata important

🔄

Data lifecycle and flow

Concept block
Lifecycle and flow
Data moves through stages. Each stage can add value or add risk.
Assumptions
Lineage exists
Retirement is planned
Failure modes
Unknown provenance
Shadow copies

Data starts at collection, gets stored, processed, shared, and eventually archived or deleted. Each step has design choices: where to store, how to process, how to secure, and when to retire. Software architecture cares about where components sit. Cybersecurity cares about protection at each hop. AI pipelines care about how raw data becomes features.

Deletion matters because stale data can mislead, cost money, or breach privacy. A clear lifecycle stops random copies and reduces attack surface.

End to end data lifecycle

Loop with ownership at every step

Collect

Forms, sensors, logs.

Store

Databases, lakes, queues.

Process

Cleaning, joins, enrichment.

Share

APIs, files, dashboards.

Archive or delete

Retention, compliance, cost control.

Quick check. Lifecycle and flow

Name the first lifecycle step

Why is processing needed

Why is sharing controlled

Why does deletion matter

Scenario: A team copies customer data into a personal folder to 'work faster'. Which lifecycle step did they bypass

How does architecture connect

How does cybersecurity connect

How do AI pipelines fit

👥

Data roles and responsibilities

Concept block
Ownership prevents chaos
Data work improves when decision rights and responsibilities are explicit.
Assumptions
Decision rights are explicit
Consumers can report issues
Failure modes
Everyone responsible means no one responsible
Unowned breaking changes

Roles exist so someone is accountable for quality, access, and change. Data owners make decisions about purpose and access. Data stewards guard definitions, metadata, and policy. Data engineers build and maintain pipelines. Data analysts turn data into insights. Data consumers use the outputs responsibly. When roles blur, pipelines stall, privacy is ignored, or dashboards contradict each other.

Role responsibility map

Who does what and why it matters

Owner

Sets purpose, approves access.

Steward

Keeps definitions and metadata clean.

Engineer

Builds and operates pipelines safely.

Analyst and consumer

Uses outputs, shares insights, flags gaps.

Quick check. Roles and responsibilities

What does a data owner decide

What does a data steward maintain

What does a data engineer build

What does a data analyst do

Who is a data consumer

Scenario: A dashboard number looks wrong. What is a sensible first move before arguing

What happens when roles blur

⚖️

Foundations of data ethics and trust

Concept block
Should we use this data
Ethics and trust are decision logic, not vibes. Use a clear decision path.
Assumptions
Purpose is stated
Safeguards exist
Failure modes
Consent theatre
Secondary use creep

Ethics matters from the first data point. Consent means people know and agree to how their data is used. Privacy keeps personal details safe. Transparency builds trust because people can see what is collected and why. Misuse often starts with shortcuts: copying data to test faster or sharing beyond the agreed purpose. Trust erodes slowly and is hard to rebuild.

Trust over time

Small choices add up

Careful use

Purpose and consent checked.

Shortcuts

Copying data, unclear retention.

Trust erosion

People lose confidence and push back.

Quick check. Ethics and trust

What is consent

Why does privacy matter

How is trust built

Scenario: A developer wants to use production customer data to test a feature quickly. What is the safer alternative

How does misuse often start

Why mention ethics early

What is one way to prevent trust erosion

🔗

How data connects everything you learn next

Data is the common thread across the other courses. AI models are only as good as the data they learn from. Cybersecurity controls protect data wherever it sits or moves. Software systems are structured flows of data shaped by design choices. Digital transformation is largely about improving how data is collected, shared, and trusted across journeys. Think of this page as the root note. The other tracks are variations.

Shared practice exercises

These exercises appear across courses so you build one habit: always question how data is used, protected, and interpreted.

📊

Data dashboards you will use throughout the site

These starter dashboards will grow with you. They stay simple now so you can focus on concepts, then expand in Intermediate and Advanced levels.

➡️

Where this takes you next

Data Intermediate digs into architecture, governance, and analytics that build on these habits. If you want to see how data powers other domains right away, jump to:

  1. Data Intermediate
  2. AI Foundations
  3. Cybersecurity Foundations
  4. Software Architecture Foundations
  5. Digitalisation Foundations

CPD support

This Foundations level blends theory and practice in a structured way. Later levels increase complexity and applied depth so your CPD record shows progression, not just exposure.

🧾

Verification and reflection (CPD evidence you can actually use)

You do not need to write a novel for CPD. You need to show judgement and a small change in practice. If you only do one thing after this level, do this: pick one dataset you touch at work (or in a personal project) and improve its meaning.

Reflection prompt (copy into CPD evidence)

  • What I studied: Data foundations, representation, formats, quality, lifecycle, roles, and ethics.
  • What I did: Used the in-browser tools to inspect encoding, quality issues, and lifecycle risks.
  • What I learned: One thing that surprised me about representation or quality, and why it matters.
  • What I will change: One concrete habit. Example: “I will record units and timestamp meaning as metadata before I build the dashboard.”
  • Evidence artefact: A screenshot or short note that shows the dataset, the issue found, and the correction.

My opinion on CPD

If your evidence does not change your behaviour, it is paperwork. Write the smallest honest note that proves you thought and acted.
