Data Foundations · Module 8
Data quality and meaning
Quality means data is accurate (close to the truth), complete (not missing key pieces), and timely (fresh enough to be useful).
Previously
Visualisation basics (so charts do not lie to you)
Visualisation is part of data literacy.
This module
Data quality and meaning
Next
Data lifecycle and flow
Data starts at collection, gets stored, processed, shared, and eventually archived or deleted.
Progress
Mark this module complete when you can explain it without rereading every paragraph.
Why this matters
Suppose we record response times for a service (in milliseconds): 110, 120, 115, 118, 5000. Four of those values tell one story; the fifth can rewrite it. This module is about noticing that before it misleads a decision.
What you will be able to do
1. Explain data quality and meaning in your own words and apply it to a realistic scenario.
2. Describe quality as a loop of detection and repair over time, not a one-time check.
3. Check the assumption "Quality has owners" and explain what changes if it is false.
4. Check the assumption "Expectations are written" and explain what changes if it is false.
Before you begin
- No previous technical background required
- Read the section explanation before using tools
Common ways people get this wrong
- Quality checked once. A one-off cleanup creates false confidence. Pipelines need ongoing checks.
- Fixing symptoms. If you only patch outputs, the same bug returns. Trace back to the source.
Quality means data is accurate (close to the truth), complete (not missing key pieces), and timely (fresh enough to be useful). A sensor reading of 21°C is useless if the timestamp is missing. Noise is random variation that hides patterns, while signal is the meaningful part. Bias creeps in when some groups are missing or when measurements are skewed. Models and dashboards inherit these flaws because they cannot tell if the input is wrong.
Context and metadata preserve meaning: units, collection methods, and who collected the data. If a temperature has no unit, is it Celsius or Fahrenheit? Data without context invites bad decisions.
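One way to make this concrete: carry the context with the value instead of relying on memory. A minimal sketch in Python; the field names (`value`, `unit`, `collected_at`, `method`) are illustrative assumptions, not a standard schema.

```python
# Hypothetical sketch: a reading that carries its own metadata.
reading = {
    "value": 21.0,
    "unit": "celsius",
    "collected_at": "2024-05-01T09:30:00Z",  # ISO 8601 timestamp, UTC
    "method": "wall-mounted sensor",          # how it was collected
}

def to_fahrenheit(r):
    # Refuse to convert when the unit is not what we expect,
    # rather than silently producing a wrong number.
    if r["unit"] != "celsius":
        raise ValueError(f"expected celsius, got {r['unit']!r}")
    return r["value"] * 9 / 5 + 32

print(to_fahrenheit(reading))  # 69.8
```

The point of the guard clause is exactly the Celsius-or-Fahrenheit question above: a bare `21.0` cannot answer it, a record with a `unit` field can.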
Worked example. One outlier can rewrite your story
Suppose we record response times for a service (in milliseconds): 110, 120, 115, 118, 5000. The first four values look like a normal service. The last value could be a real outage or a measurement problem. If you only report the average, you might accidentally tell everyone the service is slow when it is usually fine, or tell everyone it is fine when the tail behaviour is hurting real users.
My opinion: whenever someone shows me a single average, I immediately ask “what is the spread?” and “what does bad look like?”. That one habit saves weeks of nonsense.
Foundations. Mean (average) and why it can mislead
The mean $\bar{x}$ of $n$ values $x_1, \dots, x_n$ is:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

- $n$: number of values
- $x_i$: the $i$-th value
- $\bar{x}$: the mean
In the example 110, 120, 115, 118, 5000, the mean is $(110 + 120 + 115 + 118 + 5000)/5 = 1092.6$, pulled up sharply by the single 5000, while the median stays at 118.
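You can check that pull directly with the standard library, no third-party packages needed:

```python
from statistics import mean, median

# Response times in milliseconds; the last value is the suspected outlier.
times_ms = [110, 120, 115, 118, 5000]

print(mean(times_ms))    # 1092.6 — dominated by the single 5000
print(median(times_ms))  # 118 — the typical value, unmoved by the outlier
```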
A level. Variance and standard deviation (spread)
A simple measure of spread is the (population) variance:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

- $\sigma^2$: variance
- $\sigma$: standard deviation, where $\sigma = \sqrt{\sigma^2}$
Intuition: variance is “average squared distance from the mean”. Large variance means values are spread out.
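For the same dataset, the spread is enormous, which is itself a signal. A quick sketch using Python's `statistics` module (population variants, matching the formula above):

```python
from statistics import pvariance, pstdev

times_ms = [110, 120, 115, 118, 5000]

print(pvariance(times_ms))  # population variance, about 3.8 million
print(pstdev(times_ms))     # standard deviation, roughly 1954 ms
```

A standard deviation of roughly 1954 ms against a typical value near 118 ms says the summary statistics are being driven by the tail, not by normal behaviour.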
Undergraduate. Sampling bias and missingness (why data can be wrong even when numbers are correct)
Data can be numerically correct and still misleading if the sample does not represent the population you care about. This is sampling bias. Missingness matters too. Missing completely at random is rare in real systems. Often values are missing because of a reason (sensor downtime, people not completing forms, systems timing out). When missingness has structure, it can distort analysis and models.
Rigour direction (taste, not a full course). Robust statistics
Real systems often produce heavy tails and outliers. Robust methods reduce sensitivity to extremes. Two examples you will meet in serious work:
Medians and quantiles: focus on typical behaviour and tail risk.
M-estimators: replace squared error with loss functions that punish outliers less aggressively than the squared ($L_2$) loss.
You do not need to memorise these now. The point is to build the instinct: choose summaries that match the decision you are making.
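To get a first taste without any special machinery, the standard library's `statistics.quantiles` (using its default "exclusive" method) already separates typical behaviour from tail risk on the worked example:

```python
from statistics import median, quantiles

times_ms = [110, 120, 115, 118, 5000]

print(median(times_ms))  # 118 — typical behaviour, robust to the outlier
# Quartiles: the upper quartile jumps, exposing the heavy tail
# that the median alone hides.
print(quantiles(times_ms, n=4))  # [112.5, 118.0, 2560.0]
```

Reporting a middle quantile and an upper quantile together answers both habitual questions from the worked example: "what is typical?" and "what does bad look like?".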
Common mistakes (the ones I keep seeing)
- Treating missing values as just blanks. Reality: Missing data usually has a reason. Sensor failures, user drop-off, and system timeouts all look like blanks but mean different things. Ask why before you impute or ignore.
- Removing outliers automatically. Reality: That extreme value might be a real failure you should investigate, not noise you should delete. Check before you clean.
- Mixing units and trusting the graph. Reality: If one source reports in kW and another in MW, your join will be mathematically perfect and factually wrong by a factor of 1000.
- Using accuracy language loosely. Reality: Accurate means close to the true value. A dashboard that looks plausible is not necessarily accurate. Know the difference.
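The kW-versus-MW mistake above is cheap to prevent: normalise to one unit at the boundary, and fail loudly on anything unrecognised. A minimal sketch; the unit table and source values are illustrative assumptions.

```python
# Hypothetical sketch: normalise power readings to kilowatts before joining sources.
TO_KW = {"kW": 1, "MW": 1000}

def to_kilowatts(value, unit):
    # Fail loudly on unknown units instead of passing a wrong number downstream.
    if unit not in TO_KW:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_KW[unit]

print(to_kilowatts(2.5, "MW"))   # 2500 — converted
print(to_kilowatts(2500, "kW"))  # 2500 — the two sources now agree
```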
Verification. Prove your understanding
Quality verification drill
Use one tiny dataset and prove that your reasoning is operational.
- Write data meaning constraints: document units, timestamp meaning, and acceptable ranges for a small dataset.
- Separate identifiers from quantities: choose one field of each type and justify storage type and usage.
- Demonstrate noise versus bias: give one concrete example of random variation and one of structural distortion.
Mental model
Quality is a loop
Quality is not a one-time check. It is detection and repair over time.
1. Expectations
2. Detect issues
3. Repair
4. Prevent recurrence
Assumptions to keep in mind
- Quality has owners. If ownership is unclear, quality issues become everybody’s problem and nobody’s job.
- Expectations are written. Rules must be explicit: ranges, nullability, units, and join keys.
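"Written" can mean executable. One way to sketch expectations as code, so the detect step of the loop runs automatically; the field names, ranges, and allowed values here are illustrative assumptions.

```python
# Hypothetical sketch: expectations (ranges, nullability, allowed values)
# written down as data, then applied as checks.
EXPECTATIONS = {
    "latency_ms": {"nullable": False, "min": 0, "max": 60_000},
    "region":     {"nullable": False, "allowed": {"north", "south"}},
}

def check_row(row):
    """Return a list of human-readable issues; empty means the row passes."""
    issues = []
    for field, rule in EXPECTATIONS.items():
        value = row.get(field)
        if value is None:
            if not rule["nullable"]:
                issues.append(f"{field}: null not allowed")
            continue
        if "min" in rule and value < rule["min"]:
            issues.append(f"{field}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            issues.append(f"{field}: above maximum {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"]:
            issues.append(f"{field}: unexpected value {value!r}")
    return issues

print(check_row({"latency_ms": 118, "region": "north"}))  # [] — passes
print(check_row({"latency_ms": None, "region": "east"}))  # two issues
```

Because the rules live in one named structure, they also answer the ownership question: there is a single place to review, version, and assign.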
Check yourself
Quick check. Data quality and meaning
What is accuracy?
How close data is to the truth.
What is completeness?
Having the needed fields present.
What is timeliness?
Data is fresh enough to reflect reality.
Scenario. A key field is 30 percent missing for one region. What should you do before building a model or a dashboard?
Find out why. Check whether collection failed, whether it is expected, and whether the missingness is correlated with something important. Then decide how to handle it and document the decision.
How does bias enter data?
Missing groups, skewed measurements, or flawed collection.
Why do models inherit data problems?
They learn from the input given, including errors.
Scenario. A number looks correct but decisions based on it are wrong. What is a common data reason?
The definition or unit changed, or the context is missing. Without metadata, a correct number can still be misleading.
Why is metadata important?
It explains units, source, and meaning so data is not misread.
Artefact and reflection
Artefact
A short module note with one key definition and one practical example
Reflection
Where in your work would explaining data quality and meaning change a decision, and what evidence would make you trust that change?
Optional practice
Inspect a tiny dataset, add your notes, and reveal seeded issues.