Revision is the slower recap surface. Use it to reread, scan, or print the stage in one place after you have already worked through the module-first path.
How to use revision properly
Revision is not the default route through the course. It is the slower surface for recap, printing, annotation, or audit-friendly reading after the shorter module pages have already done the teaching work in smaller chunks.
This level sets out how data exists, moves, and creates value before any heavy analysis or security work. It keeps the language simple, introduces light maths, and shows how real systems depend on data discipline.
Key terms you will need
If any word feels slippery later, come back to this list. It is the quickest way to reset your understanding.
What data is and why it matters
Learning contract
What data is and why it matters
Data starts as recorded observations, for example numbers on a meter, text in a form, or pixels in a photo. When we add structure it becomes information that people can read. When we apply it to decisions it becomes knowledge. Data existed long before computers: think bank ledgers, census books, and medical charts. Modern systems are data driven because every click, sensor, and transaction can be captured and turned into feedback.
Banking relies on clean transaction data to spot fraud. Energy grids depend on meter readings to balance supply and demand. Healthcare teams use lab results and symptoms to guide care. AI systems learn from past data to make predictions, which means they also inherit any gaps or mistakes. Keeping the difference between raw data, information, and knowledge clear helps us avoid mixing facts with opinions.
Here is the short version of how data becomes useful.
Interactive diagram
Keep your eye on meaning. A number is just a symbol until we agree what it stands for, how it was measured, and what decision it should guide.
Optional tool
Data around you
Classify everyday examples as data, information, or knowledge and see immediate feedback.
Use the related workspace if you need the live interactive version.
Knowledge check
Quick check. What data is and why it matters
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Data, information, knowledge, judgement
Learning contract
Data, information, knowledge, judgement
I want a simple model in your head that stays useful even when the tools change. DIKW works because it forces you to separate raw observations from meaning before you make decisions.
DIKW (useful version)
From recorded observations to decisions you can defend
Data
Recorded observations.
Example: readings, clicks, timestamps.
Information
Data with context and meaning.
Example: units, location, who collected it, what it represents.
Knowledge
Patterns you can explain.
Example: demand rises at 18:00, outages cluster after storms.
Judgement
Action under uncertainty.
Example: intervene, hold, investigate, or automate with guardrails.
Worked example. A number without context is a rumour
Suppose a dashboard shows "12.4". That could be 12.4 kWh, 12.4 MWh, 12.4 percent, 12.4 incidents, or 12.4 minutes. The number itself is not the problem; the missing context is.
My opinion is that if you cannot answer “what does this represent” and “what would make it wrong”, you do not have information yet. You have vibes with a font size.
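One cheap habit that enforces this: never pass a bare number around. Here is a minimal sketch of a value that carries its own context; the field names are illustrative, not part of the course material.

```python
from dataclasses import dataclass

# Hypothetical record structure: the fields are examples, not a standard.
@dataclass
class Reading:
    value: float       # the bare number, e.g. 12.4
    unit: str          # what the number measures, e.g. "kWh"
    source: str        # who or what recorded it
    measured_at: str   # when, as an ISO 8601 timestamp

reading = Reading(12.4, "kWh", "meter_1047", "2024-06-01T18:00:00Z")
print(f"{reading.value} {reading.unit} from {reading.source}")
```

Forcing the unit and source to travel with the number turns "12.4" back into information.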
Common mistakes (DIKW edition)
Verification. Prove you can separate meaning from numbers
Knowledge check
Quick check. DIKW
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Units, notation, and the difference between percent and probability
Learning contract
Units, notation, and the difference between percent and probability
Data work goes wrong when people are casual about units. Units are not decoration. Units are the meaning.
This is why I teach it early and I teach it bluntly.
Worked example. kWh and MWh are both “energy” and still not the same number
If one dataset records energy in kWh and another records energy in MWh, then the same physical quantity will appear with numbers that differ by a factor of 1000.
A join can be perfectly correct and the final answer can be perfectly wrong.
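A cheap defence is to convert everything to one canonical unit at the boundary, before any join happens. A minimal sketch (the function names are mine, not from any standard library):

```python
def mwh_to_kwh(mwh: float) -> float:
    """1 MWh = 1000 kWh, so convert to a canonical unit before joining datasets."""
    return mwh * 1000.0

def percent_to_probability(pct: float) -> float:
    """'5 percent' and 'probability 0.05' are the same quantity in different notation."""
    return pct / 100.0

print(mwh_to_kwh(12.4))            # the same energy, now expressed in kWh
print(percent_to_probability(5))   # 0.05
```

If every dataset is converted at the edge, a join can no longer silently mix quantities that differ by a factor of 1000.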
A small cheat sheet you can reuse
Verification. Spot the three most common confusion traps
Knowledge check
Quick check. Units and notation
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Data representation and formats
Learning contract
Data representation and formats
Computers store everything using bits (binary digits) because hardware can reliably tell two states apart. A byte is eight bits, which can represent 256 distinct values. Encoding maps symbols to numbers, while a file format adds structure on top. CSV is plain text with commas, JSON wraps name-value pairs, XML uses nested tags, images store grids of pixels, and audio stores wave samples. The wrong format or encoding breaks systems because the receiver cannot parse what was intended.
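The structural difference is easy to see in code. A minimal sketch, using a made-up meter record, showing that CSV hands you strings while JSON preserves number types:

```python
import csv
import io
import json

# The same record in two formats (field names are illustrative).
csv_text = "meter_id,energy_kwh\nm1047,12.4\n"
json_text = '{"meter_id": "m1047", "energy_kwh": 12.4}'

csv_row = next(csv.DictReader(io.StringIO(csv_text)))
json_row = json.loads(json_text)

print(type(csv_row["energy_kwh"]))   # CSV is plain text: every value arrives as a string
print(type(json_row["energy_kwh"]))  # JSON carries a number type, no extra parsing needed
```

Neither format is "better"; the point is that the reader must know which rules the bytes follow before the data means anything.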
Think of representation in four layers. Each layer must stay consistent or the meaning collapses.
Interactive diagram
A byte can represent 0 to 255. Powers of two help size things: 2^3 = 8 means three binary places can represent eight values. Plain English: two multiplied by itself three times equals eight. Binary choices stack quickly.
Item
Meaning
Bit
Smallest unit, either 0 or 1
Byte
8 bits, often one character in simple encodings
2^n
Number of combinations with n bits
Characters to numbers to bits
Encoding then binary storage
Characters
"A", "7", "e"
Numbers via encoding
"A" -> 65, "7" -> 55, "e" -> 101
Bits in memory
65 -> 01000001
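The three layers above can be walked through in a few lines using Python's built-ins:

```python
# Walk each character through the layers:
# character -> code point (via encoding) -> bits in memory.
for ch in "A7e":
    code = ord(ch)              # e.g. "A" -> 65 in ASCII/Unicode
    bits = format(code, "08b")  # 65 -> "01000001"
    print(ch, code, bits)
```

Running it reproduces the table: "A" becomes 65 and 01000001, "7" becomes 55, "e" becomes 101.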
Optional tool
Text to bytes visualiser
Type text and see characters turn into numbers and bits.
Use the related workspace if you need the live interactive version.
Worked example. The same text, different bytes
Here is a simple truth that causes surprising damage in real systems: the same characters can be stored as different bytes depending on encoding. If one system writes text as UTF-8 and another reads it as something else, the data is not “slightly wrong”. It is wrong.
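You can reproduce this failure in two lines. A minimal sketch: the bytes are written as UTF-8 but misread as Latin-1, and the text silently changes.

```python
text = "café"
raw = text.encode("utf-8")          # "é" becomes the two bytes 0xC3 0xA9
garbled = raw.decode("latin-1")     # each byte misread as its own character

print(garbled)                      # cafÃ© -- classic mojibake
print(raw.decode("utf-8") == text)  # True: declaring the right encoding round-trips cleanly
```

No exception is raised in the misread case, which is exactly why this bug survives in production: the data is wrong and nothing complains.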
My opinion: if your system depends on humans “remembering” encodings, it is already broken. It should be explicit in the interface contract and tested like any other behaviour.
Common mistakes (and how to avoid them)
Verification. How you know you understood it
Maths ladder (optional). From intuition to rigour
You can learn data without advanced maths, but you cannot become an expert without eventually becoming comfortable with symbols. The goal here is not to show off. It is to make the symbols friendly and precise.
Foundations. Powers of two and counting possibilities
If a system has n bits, each bit has two possible states (0 or 1). The total number of possible bit patterns is:
2^n
n: number of bits (an integer)
2^n: number of distinct patterns (how many different values you can represent)
Example: n=8 (one byte). Then 2^8 = 256. So a byte can represent 256 distinct values, typically 0 to 255.
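The counting rule is short enough to check directly:

```python
# 2 ** n = number of distinct patterns that n bits can hold.
for n in (1, 3, 8):
    print(f"{n} bits -> {2 ** n} patterns")
# 1 bit -> 2 patterns, 3 bits -> 8 patterns, 8 bits (one byte) -> 256 patterns
```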
Next step. Base conversion and why it matters for data
A binary number is a sum of powers of two. If you see 01000001, the 1s mark which powers are included:
The set bits sit in the 64s place and the 1s place, so the value equals 64 + 1 = 65.
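Python can do the place-value arithmetic for you, which makes the conversion easy to check by hand:

```python
bits = "01000001"
value = int(bits, 2)         # interpret the string as base 2: 64 + 1

print(value)                 # 65
print(chr(value))            # "A" in ASCII/Unicode
print(format(value, "08b"))  # back to "01000001", so the round trip is lossless
```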
Why it matters: when data gets corrupted at the byte level (bad encoding, wrong parsing, truncation), the meaning upstream is gone. You cannot “fix it later” reliably because you do not know what the original bits were meant to represent.
Deeper. Information content (intuition)
The less predictable something is, the more information it carries. If a value is always the same, it carries no surprise.
A common formal measure is entropy. In the simplest discrete case:
H(X) = −Σ_x p(x) log2 p(x)
X: a random variable (the thing that can take different values)
x: a particular value of X
p(x): probability that X=x
H(X): entropy in bits
Example: a fair coin has p(heads)=0.5, p(tails)=0.5. Then H(X)=1 bit. A biased coin has less.
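The coin example is a minimal sketch of the formula above in code:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits for a discrete distribution.

    Zero-probability outcomes contribute nothing, so they are skipped.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy_bits([0.9, 0.1]))  # biased coin: less than 1 bit
print(entropy_bits([1.0]))       # a constant value carries no surprise at all
```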
Why it matters in data: highly predictable fields can still be important (for joining and identifiers), but they often carry little information for modelling. This is one reason “more columns” is not the same as “more value”.
Knowledge check
Quick check. Representation and formats
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Standards, schemas, and interoperability
Learning contract
Standards, schemas, and interoperability
Interoperability is a boring word for a very expensive problem. Two systems can both be “correct” and still disagree because they mean different things.
Standards are the shared rules that reduce translation work. Not because standards are morally pure, but because without shared meaning you spend your life reconciling spreadsheets and arguing in meetings.
What a standard really is
A standard can be a file format (CSV, JSON), a schema (field definitions), a data model (how entities relate), or a message contract (API request and response).
Good standards do two jobs. They make systems compatible and they make errors visible earlier.
Worked example. “Customer” broke your dashboard, not your code
System A records “customer” as the bill payer. System B records “customer” as the person who contacted support.
A dashboard joins them and reports “customers contacted”.
Leadership changes policy based on that number.
Nobody wrote a bug. The definition was the bug.
Common mistakes in standards work
Verification. A small contract you can write today
Pick one dataset. Write 5 fields with units, allowed ranges, and what “missing” means.
Write one identifier field and state whether it is a string or number, and why.
Write what changes are breaking, and how you would version them.
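A contract like the one above can start as something as small as a dictionary plus one check. A minimal sketch with hypothetical field names, units, and ranges (they are examples, not a standard):

```python
# Hypothetical contract for one dataset.
CONTRACT = {
    "energy_kwh": {"unit": "kWh", "min": 0.0, "max": 100_000.0,
                   "missing_means": "meter offline"},
    "customer_id": {"type": str,
                    "why_string": "IDs can start with zero; a number type would drop it"},
}

def check_row(row: dict) -> list:
    """Return a list of contract violations for one row (empty list means it passes)."""
    problems = []
    spec = CONTRACT["energy_kwh"]
    value = row.get("energy_kwh")
    if value is None:
        problems.append(f"energy_kwh missing ({spec['missing_means']}?)")
    elif not (spec["min"] <= value <= spec["max"]):
        problems.append(f"energy_kwh out of range: {value} {spec['unit']}")
    if not isinstance(row.get("customer_id"), CONTRACT["customer_id"]["type"]):
        problems.append("customer_id must be a string")
    return problems

print(check_row({"energy_kwh": 12.4, "customer_id": "0042"}))  # passes: []
print(check_row({"energy_kwh": -5.0, "customer_id": 42}))      # two violations
```

The value is not the code; it is that the contract now exists in one place where a disagreement between systems becomes a visible error instead of a quiet dashboard bug.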
Knowledge check
Quick check. Standards and interoperability
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Open data, data sharing, and FAIR thinking
Learning contract
Open data, data sharing, and FAIR thinking
Open data is not “everything on the internet”. It is a choice about access and reuse.
Some data should be open because it improves transparency and innovation. Some data must stay restricted because it contains personal or security-sensitive information.
A mature organisation can explain the difference without hand-waving.
The data spectrum (closed, shared, open)
Most real-world data lives in the middle: shared with specific parties under agreements.
The useful question is not “open or closed”. It is “who can access, for what purpose, with what safeguards, and for how long”.
FAIR as a quality lens
FAIR means findable, accessible, interoperable, reusable. It does not automatically mean public.
It is a lens you can use to judge whether a dataset is actually usable by someone who is not already in your team.
Common mistakes in data sharing
Verification. Make a dataset shareable in a way you would trust
Write a title, description, update frequency, and contact owner.
List the units and definitions for key fields.
State what the dataset can and cannot be used for.
State whether it is open, shared, or restricted, and why.
Knowledge check
Quick check. Open data and FAIR
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Visualisation basics (so charts do not lie to you)
Learning contract
Visualisation basics (so charts do not lie to you)
Visualisation is part of data literacy. A chart is an argument. It can be honest or misleading.
The goal in Foundations is not to become a designer. The goal is to stop being fooled by bad charts, including your own.
Worked example. Same data, different story
Two charts show the same numbers. One uses a consistent scale. The other uses a cropped axis so small changes look huge.
If you react emotionally to the second chart, that is not a personal flaw. That is a design choice manipulating attention.
Verification. Four questions before you trust a chart
Knowledge check
Quick check. Visualisation basics
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Data quality and meaning
Learning contract
Data quality and meaning
Quality means data is accurate (close to the truth), complete (not missing key pieces), and timely (fresh enough to be useful). A sensor reading of 21°C is useless if the timestamp is missing. Noise is random variation that hides patterns, while signal is the meaningful part. Bias creeps in when some groups are missing or when measurements are skewed. Models and dashboards inherit these flaws because they cannot tell if the input is wrong.
Context and metadata preserve meaning: units, collection methods, and who collected the data. If a temperature has no unit, is it Celsius or Fahrenheit? Data without context invites bad decisions.
Worked example. One outlier can rewrite your story
Suppose we record response times for a service (in milliseconds): 110, 120, 115, 118, 5000.
The first four values look like a normal service. The last value could be a real outage or a measurement problem.
If you only report the average, you might accidentally tell everyone the service is slow when it is usually fine, or tell everyone it is fine when the tail behaviour is hurting real users.
My opinion: whenever someone shows me a single average, I immediately ask “what is the spread?” and “what does bad look like?”. That one habit saves weeks of nonsense.
Maths ladder. Signal, noise, and how to measure “typical”
Foundations. Mean (average) and why it can mislead
The mean of values x_1, x_2, …, x_n is:
x̄ = (1/n) Σ_{i=1..n} x_i
n: number of values
x_i: the i-th value
x̄: the mean
In the example 110, 120, 115, 118, 5000, the mean is pulled up sharply by 5000.
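You can see the pull of the outlier directly with the standard library:

```python
import statistics

response_ms = [110, 120, 115, 118, 5000]

print(statistics.mean(response_ms))    # 1092.6 -- dragged up by the single 5000
print(statistics.median(response_ms))  # 118 -- what a typical request actually looks like
```

The mean answers "what is the total divided by the count"; the median answers "what does the middle case look like". For a service with rare extreme values, those are very different questions.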
A level. Variance and standard deviation (spread)
A simple measure of spread is the (population) variance:
σ^2 = (1/n) Σ_{i=1..n} (x_i − x̄)^2
σ^2: variance
σ: standard deviation, where σ = √(σ^2)
Intuition: variance is “average squared distance from the mean”. Large variance means values are spread out.
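The definition translates line by line into code, and you can cross-check it against the standard library:

```python
import statistics

values = [110, 120, 115, 118, 5000]

mean = sum(values) / len(values)
variance = sum((x - mean) ** 2 for x in values) / len(values)  # average squared distance
std_dev = variance ** 0.5

print(std_dev)                    # large: one extreme value dominates the spread
print(statistics.pstdev(values))  # the library's population std dev agrees (up to rounding)
```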
Undergraduate. Sampling bias and missingness (why data can be wrong even when numbers are correct)
Data can be numerically correct and still misleading if the sample does not represent the population you care about. This is sampling bias.
Missingness matters too. Missing completely at random is rare in real systems. Often values are missing because of a reason (sensor downtime, people not completing forms, systems timing out).
When missingness has structure, it can distort analysis and models.
Rigour direction (taste, not a full course). Robust statistics
Real systems often produce heavy tails and outliers. Robust methods reduce sensitivity to extremes. Two examples you will meet in serious work:
Medians and quantiles: focus on typical behaviour and tail risk.
M-estimators: replace squared error with loss functions that punish outliers less aggressively than (x − μ)^2.
You do not need to memorise these now. The point is to build the instinct: choose summaries that match the decision you are making.
Inspect a tiny dataset, add your notes, and reveal seeded issues.
Use the related workspace if you need the live interactive version.
Knowledge check
Quick check. Data quality and meaning
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Data lifecycle and flow
Learning contract
Data lifecycle and flow
Data starts at collection, gets stored, processed, shared, and eventually archived or deleted. Each step has design choices: where to store, how to process, how to secure, and when to retire. Software architecture cares about where components sit. Cybersecurity cares about protection at each hop. AI pipelines care about how raw data becomes features.
Deletion matters because stale data can mislead, cost money, or breach privacy. A clear lifecycle stops random copies and reduces attack surface.
End to end data lifecycle
Loop with ownership at every step
Collect
Forms, sensors, logs.
Store
Databases, lakes, queues.
Process
Cleaning, joins, enrichment.
Share
APIs, files, dashboards.
Archive or delete
Retention, compliance, cost control.
Interactive diagram
Optional tool
Lifecycle mapper
Order the lifecycle steps and see if the flow is healthy.
Use the related workspace if you need the live interactive version.
Knowledge check
Quick check. Lifecycle and flow
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Data roles and responsibilities
Learning contract
Data roles and responsibilities
Roles exist so someone is accountable for quality, access, and change. Data owners make decisions about purpose and access. Data stewards guard definitions, metadata, and policy. Data engineers build and maintain pipelines. Data analysts turn data into insights. Data consumers use the outputs responsibly. When roles blur, pipelines stall, privacy is ignored, or dashboards contradict each other.
Role responsibility map
Who does what and why it matters
Owner
Sets purpose, approves access.
Steward
Keeps definitions and metadata clean.
Engineer
Builds and operates pipelines safely.
Analyst and consumer
Uses outputs, shares insights, flags gaps.
Optional tool
Role matcher
Pair scenarios with the role responsible for the next action.
Use the related workspace if you need the live interactive version.
Knowledge check
Quick check. Roles and responsibilities
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
Foundations of data ethics and trust
Learning contract
Foundations of data ethics and trust
Ethics matters from the first data point. Consent means people know and agree to how their data is used. Privacy keeps personal details safe. Transparency builds trust because people can see what is collected and why. Misuse often starts with shortcuts: copying data to test faster or sharing beyond the agreed purpose. Trust erodes slowly and is hard to rebuild.
Trust over time
Small choices add up
Careful use
Purpose and consent checked.
Shortcuts
Copying data, unclear retention.
Trust erosion
People lose confidence and push back.
Optional tool
Ethics scenario helper
Pick the most responsible option for everyday data choices.
Use the related workspace if you need the live interactive version.
Knowledge check
Quick check. Ethics and trust
Use this prompt as a self-check before you move on.
Concept recap
This recap has been simplified for the document surface. Use the workspace if you need the expandable concept tool.
How data connects everything you learn next
Learning contract
How data connects everything you learn next
Data is the common thread across the other courses. AI models are only as good as the data they learn from. Cybersecurity controls protect data wherever it sits or moves. Software systems are structured flows of data shaped by design choices. Digital transformation is largely about improving how data is collected, shared, and trusted across journeys. Think of this page as the root note. The other tracks are variations.
Shared practice exercises
These exercises appear across courses so you build one habit: always question how data is used, protected, and interpreted.
Optional tool
The same data, different meanings
View one dataset through AI, cybersecurity, and business lenses to see how context shapes decisions.
Use the related workspace if you need the live interactive version.
Optional tool
Spot the data risks
Step through the lifecycle and reveal where leaks, misuse, or corruption can creep in.
Use the related workspace if you need the live interactive version.
Optional tool
From raw data to action
Trace how data becomes a decision and where quality or trust can fail along the way.
Use the related workspace if you need the live interactive version.
Data dashboards you will use throughout the site
Learning contract
Data dashboards you will use throughout the site
These starter dashboards will grow with you. They stay simple now so you can focus on concepts, then expand in Intermediate and Advanced levels.
Optional tool
Explore data formats
Toggle between CSV, JSON, images, and audio to see how structure changes usage.
Use the related workspace if you need the live interactive version.
Optional tool
Test data quality
Introduce missing values and noise to watch quality scores shift.
Use the related workspace if you need the live interactive version.
Optional tool
Visualise how data moves
Click through a small flow to see why boundaries and controls matter.
Use the related workspace if you need the live interactive version.
Where this takes you next
Learning contract
Where this takes you next
Data Intermediate digs into architecture, governance, and analytics that build on these habits. If you want to see how data powers other domains right away, jump to:
Verification and reflection (CPD evidence you can actually use)
Learning contract
Verification and reflection (CPD evidence you can actually use)
You do not need to write a novel for CPD. You need to show judgement and a small change in practice.
If you only do one thing after this level, do this: pick one dataset you touch at work (or in a personal project) and improve its meaning.