Foundations · Module 2

Data and representation

In AI, the word data sounds fancy, but it is usually boring.


Previously

What AI is and why it matters now

AI is a way of learning patterns from data so a system can make predictions, rank options, or automate decisions.

This module

Data and representation


Next

Supervised and unsupervised learning

When we say a model learns, we mean it changes its internal settings so it can make better guesses.

Progress

Mark this module complete when you can explain it without rereading every paragraph.

Why this matters

I want you to feel this in your bones because it shows up everywhere.

What you will be able to do

  • 1 Explain data and representation in your own words and apply it to a realistic scenario.
  • 2 Explain how data becomes a model input through choices about meaning, encoding, and measurement.
  • 3 Check the assumption "Features have stable meaning" and explain what changes if it is false.
  • 4 Check the assumption "Training data matches use" and explain what changes if it is false.

Before you begin

  • No previous technical background required
  • Read the section explanation before using tools

Common ways people get this wrong

  • Proxy features. A model can learn an unwanted shortcut that correlates with the target. It can look fair on paper and still harm people.
  • Encoding surprises. Small representation choices can change behaviour. One-hot vs ordinal encoding is not a cosmetic decision.
  • Data drift. The same sensor, form, or logging pipeline can change after a release. If you do not notice, the model starts guessing.

Main idea at a glance

From raw data to numbers a model can learn from

Turn messy inputs into a numeric representation a model can learn from.

Stage 1

Raw input

I start with emails, photos, sensor readings, or text. This is the real world: incomplete, noisy, inconsistent.

I think the quality of this step determines whether I build on solid ground or quicksand.

In AI, the word data sounds fancy, but it is usually boring. It is clicks, purchases, support tickets, photos, sensor readings, and text. Data always comes with context. Where did it come from? Who produced it? What is missing? What was measured badly? If you ignore that, you build a confident model on shaky ground.

Some data is structured. That means it fits neatly into rows and columns. Think customer age, number of failed logins, or time since last password reset. Other data is unstructured. That means it looks like raw text, images, audio, or long logs. It still has structure, but you have to extract it.

To train a model, we usually separate inputs from the answer we want. A feature is an input signal. A label is the outcome we want the model to learn to predict.
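In code, that separation is usually explicit. Here is a toy sketch; the field names and values are invented for the example:

```python
# Toy illustration: separating features (inputs) from the label (target).
# All field names and values here are made up for the example.
records = [
    {"length": 120, "num_links": 7, "is_spam": 1},
    {"length": 480, "num_links": 0, "is_spam": 0},
    {"length": 95,  "num_links": 5, "is_spam": 1},
]

features = [(r["length"], r["num_links"]) for r in records]  # input signals
labels = [r["is_spam"] for r in records]                     # outcomes to predict

print(features[0], labels[0])  # → (120, 7) 1
```

Keeping features and labels in separate structures also makes it harder to accidentally feed the answer back in as an input.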

Models cannot understand raw text or images the way humans do. They do not see meaning. They see numbers. If you give a model a photo, it will be turned into numbers first. If you give it an email, it will be turned into numbers first. The model learns patterns in those numbers.
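One of the simplest ways to turn text into numbers is to count words against a fixed vocabulary. This is a minimal bag-of-words sketch; the vocabulary and example text are invented:

```python
from collections import Counter

# Minimal bag-of-words sketch: an email becomes a vector of word counts.
# The vocabulary and the example text are made up for illustration.
vocab = ["free", "meeting", "click", "tomorrow"]

def to_vector(text):
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

print(to_vector("Click here for free free stuff"))  # → [2, 0, 1, 0]
```

The model never sees the words, only the counts, which is exactly why the choice of vocabulary is a representation decision with consequences.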

The simplest numeric form is a vector. For text, we first break it into tokens. Then we map each token to an embedding.
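A crude way to picture the token-to-embedding step is a lookup table. The vectors below are made up; real embeddings are learned from data, and real tokenisation is more sophisticated than splitting on spaces:

```python
# Toy embedding lookup: each token maps to a short list of numbers.
# These vectors are invented; real embeddings are learned from data.
embeddings = {
    "cat":  [0.9, 0.1],
    "dog":  [0.8, 0.2],
    "bank": [0.1, 0.9],
}

tokens = "cat dog".split()          # a crude stand-in for real tokenisation
vectors = [embeddings[t] for t in tokens]
print(vectors)  # → [[0.9, 0.1], [0.8, 0.2]]
```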

The intuition is simple. If two pieces of text are used in similar ways, they often end up with similar numbers. A model can then treat closeness as a hint that the meaning is related. It is not perfect. It is a useful shortcut.
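"Closeness" here is usually measured with cosine similarity. A sketch, reusing the made-up vectors from the intuition above:

```python
import math

# Cosine similarity: a standard way to measure how "close" two vectors are.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented example vectors: cat and dog point roughly the same way, bank does not.
cat, dog, bank = [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]
print(round(cosine(cat, dog), 3))   # high: used in similar ways
print(round(cosine(cat, bank), 3))  # low: used differently
```

The numbers themselves mean nothing in isolation; only the comparison between pairs carries the "similar usage" hint.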

In a real system, representation choices show up as behaviour. If you represent a customer only by “spend last month”, the model may miss that a loyal customer is having a temporary issue. If you represent them by richer behaviour signals, the model may be more useful but also harder to explain.

Suppose you build a model using “postcode” as a feature because it predicts outcomes well. In practice, that can become a proxy for protected attributes. The representation can silently encode social patterns you did not intend to automate.

Bad data creates bad models. If the labels are wrong, the model learns the wrong lesson. If the data is missing whole groups of people, the model will fail on those groups. If the data reflects old behaviour, the model will struggle when the world changes. This is why data work is not busywork. It is the foundation.

Splitting data matters because we want honest feedback. Training data is what the model learns from. Validation data is what you use to make choices during building. Test data is the final check you keep separate until the end. If you test on the same data you trained on, you are grading your own homework with the answer sheet open.
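A minimal version of that split, assuming the rows are independent (time-ordered data needs a different strategy):

```python
import random

# A minimal train/validation/test split, assuming rows are independent.
rows = list(range(100))            # stand-in for 100 examples
random.seed(0)                     # fixed seed so the split is repeatable
random.shuffle(rows)               # shuffle before slicing to avoid ordering bias

train = rows[:70]                  # what the model learns from
validation = rows[70:85]           # for choices made while building
test = rows[85:]                   # kept sealed until the end

print(len(train), len(validation), len(test))  # → 70 15 15
```

The discipline is in the last slice: the test set is only honest if you genuinely never touch it while building.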

Worked example. When "postcode" becomes a shortcut feature


Imagine we build a model to predict whether a customer will miss a payment. Someone suggests adding postcode because it improves the score. The model gets better on the spreadsheet, and the temptation is to ship it.

Here is the uncomfortable truth: postcode can act as a proxy for things we should not automate in a crude way. You might not be explicitly using protected attributes, but proxies still bake in social history. If you do not check this carefully, you are not “data driven”; you are laundering old bias through a new system.
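One crude check is to ask how strongly the suspect feature pins down a sensitive group. A sketch with entirely made-up records and group names:

```python
from collections import Counter

# Hypothetical audit sketch: postcode is never the protected attribute itself,
# but in this invented data it almost determines group membership.
records = [
    ("AB1", "group_x"), ("AB1", "group_x"), ("AB1", "group_x"),
    ("CD2", "group_y"), ("CD2", "group_y"), ("CD2", "group_x"),
]

by_postcode = {}
for postcode, group in records:
    by_postcode.setdefault(postcode, Counter())[group] += 1

# If one group dominates each postcode, the feature is acting as a proxy.
for postcode, counts in by_postcode.items():
    majority, count = counts.most_common(1)[0]
    share = count / sum(counts.values())
    print(postcode, f"{share:.0%} {majority}")
```

A real fairness audit goes much further than this, but even a crosstab like the one above can surface an obvious proxy before it ships.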

Common mistakes in data and representation

Data and representation mistakes to avoid

  1. Treating identifiers as numbers

    If a value is a label, keep it categorical. Numeric encoding can create fake order.

  2. Confusing missing with zero

    Missingness often carries signal. Collapsing it into zero hides useful context.

  3. Allowing leakage from the future

    Any feature unavailable at decision time will produce inflated offline performance.

  4. Picking representation by convenience only

    Easy encoding can block the model from learning the behaviour you actually care about.
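The encoding point in mistakes 1 and 4 is easy to see in code. A sketch with invented category names:

```python
# One-hot vs ordinal encoding for a categorical feature with no real order.
# The plan names are invented for the example.
plans = ["basic", "pro", "enterprise"]

# Ordinal: invents an order (basic < pro < enterprise) the data may not have.
# Fine for genuinely ordered categories, misleading for arbitrary labels.
ordinal = {plan: i for i, plan in enumerate(plans)}
print(ordinal["pro"])  # → 1

# One-hot: each category gets its own column, no fake ordering.
def one_hot(plan):
    return [1 if plan == p else 0 for p in plans]

print(one_hot("pro"))  # → [0, 1, 0]
```

With the ordinal version a linear model can "learn" that enterprise is twice pro, which is an artefact of the encoding, not of the customers.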

Verification. A small checklist before you trust any dataset

Dataset trust checklist

Run this before model training so failure is found early instead of after release.

  1. Explain each feature in plain English

    Document what it means, how it is measured, and where it is sourced from.

  2. Identify missing data and missing groups

    Coverage gaps are often where fairness and reliability failures begin.

  3. Audit label ownership

    Confirm who created labels and what conditions make those labels unreliable.

  4. Predict early drift

    Name which features are likely to change first when business behaviour shifts.
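Step 2 of the checklist can start as a one-screen script. A tiny missingness audit, assuming rows are dicts and None marks a gap (column names invented):

```python
# A tiny missingness audit, assuming rows are dicts and None marks a gap.
# Field names and values are made up for the example.
rows = [
    {"age": 34, "country": "UK"},
    {"age": None, "country": "UK"},
    {"age": 51, "country": None},
]

columns = ["age", "country"]
for col in columns:
    missing = sum(1 for r in rows if r.get(col) is None)
    print(col, f"{missing}/{len(rows)} missing")  # → age 1/3, country 1/3
```

The same loop, grouped by customer segment instead of by column, is how you start finding missing groups rather than just missing values.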

After this section you should be able to

Section outcomes

  1. Explain why representation choices change what a model can learn

    Describe how encoding decisions create or remove learnable signal.

  2. Explain what breaks when labels, groups, or context are missing

    Identify where bias, instability, and blind spots enter the system.

  3. Explain the trade-off between simple features and richer embeddings

    Balance interpretability, coverage, and operational complexity.

Mental model

From event to feature

Data becomes a model input through choices about meaning, encoding, and measurement.

  1. Real world event

  2. Captured data

  3. Cleaning and shaping

  4. Features

  5. Prediction

Assumptions to keep in mind

  • Features have stable meaning. A number is not useful unless we know what it measures, the unit, and the context. The same column name can mean different things across teams.
  • Training data matches use. If training examples do not resemble the real inputs, evaluation results are theatre. Representativeness beats cleverness.
  • We treat leakage as a defect. If information from the future or the label slips into the inputs, the model learns a shortcut. It looks accurate and fails in production.
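The leakage assumption is worth one concrete sketch. Here a feature that only exists after the outcome makes a trivial rule look perfect offline; all fields are invented:

```python
# Leakage sketch: a feature computed *after* the outcome predicts it perfectly
# offline, but would not exist at decision time. All fields are invented.
records = [
    {"days_overdue": 0,  "missed_payment": 0},
    {"days_overdue": 30, "missed_payment": 1},
    {"days_overdue": 12, "missed_payment": 1},
]

# "days_overdue" is only known once the payment was already missed, so a rule
# built on it scores perfectly here and fails the moment it runs in production.
correct = sum((r["days_overdue"] > 0) == bool(r["missed_payment"]) for r in records)
print(f"{correct}/{len(records)} correct offline")  # → 3/3 correct offline
```

The fix is procedural, not statistical: for every feature, ask whether its value is available at the moment the decision is made.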


Key terms

data
Data is recorded observations about the world. It is what the model learns from, not what we wish was true.
feature
A feature is a measurable piece of information about something. For example the length of an email or the number of failed login attempts.
label
A label is the correct answer for training. For example spam or not spam, or refund needed or not.
vector
A vector is a list of numbers that represents an input. The numbers are chosen so the model can work with them.
token
A token is a small chunk of text used as the unit for language models. It might be a word or part of a word.
embedding
An embedding is a numeric vector that tries to place similar items near each other in number space.

Check yourself

Quick check. Data and representation


Scenario. A team says 'we have loads of data' but it is mostly outdated logs. In AI terms, what does data mean?

Recorded observations about the world that the model learns from. If it is outdated or unrepresentative, it can mislead.

Scenario. Your dataset is a mix of spreadsheet rows and customer emails. What is the difference between structured and unstructured data?

Structured fits rows and columns, unstructured is text, images, audio, or logs that need processing.

Scenario. In a spam filter, 'number of links' is used by the model. What is that?

A feature: a measurable input signal used by the model.

Scenario. In training, each email is marked spam or not spam. What is that mark called?

A label: the correct answer used for training.

Why can models not understand raw text or images directly?

They operate on numbers, so inputs must be turned into numeric form.

Scenario. A model only accepts numbers. After processing, your email becomes a list of numbers. What is that representation called?

A vector: a list of numbers that represents an input.

Scenario. You want similar documents to sit near each other for search. What is an embedding for?

To represent items as numbers so similar items end up near each other.

Why do similar things often end up with similar numbers?

The representation is trained to capture patterns of use and meaning as closeness.

Why do we split data into training, validation, and test sets?

To learn, tune choices, and then do an honest final check without cheating.

Scenario. The model performs well in a demo but fails for a real user group. Name one way bad data creates bad models

Wrong labels, missing groups, or outdated data lead to confident but wrong behaviour.

What is data leakage in simple terms?

When information that should not be available sneaks into training or evaluation, making results look better than reality.

Why is a single metric rarely enough?

Because different mistakes matter differently, and one score can hide serious failure modes.

Artefact and reflection

Artefact

A short module note with one key definition and one practical example

Reflection

Where in your work would explaining data and representation in your own words, and applying it to a realistic scenario, change a decision, and what evidence would make you trust that change?

Optional practice

Type in a few short phrases and see how they turn into numeric vectors, then compare how similar they are.

Also in this module

Feature leakage and proxies practice

Practice spotting features that leak the answer or behave like a proxy. Learn what to remove, what to keep, and what to monitor.

Noise and labels practice

See how label noise and messy inputs cap performance, even if your model is powerful.

Overfitting explorer

Compare training performance against test performance and see how overfitting quietly makes a model look better than it is.