Applied · Module 2
Data, features and representation
Raw data is rarely ready for a model.
Previously
Models, parameters and training dynamics
A model is still a function that turns input into output.
This module
Data, features and representation
Raw data is rarely ready for a model.
Next
Evaluation, metrics and failure analysis
Accuracy is an easy number to like because it feels clean.
Progress
Mark this module complete when you can explain it without rereading every paragraph.
Why this matters
My opinion is that if the feature definition is vague, the model will punish you later with confidence and nonsense.
What you will be able to do
- 1 Explain data, features and representation in your own words and apply it to a realistic scenario.
- 2 Describe how representation choices decide what the model can notice and what it will ignore.
- 3 Check the assumption "Meaning is preserved" and explain what changes if it is false.
- 4 Check the assumption "Bias is examined" and explain what changes if it is false.
Before you begin
- Foundations-level vocabulary and concepts
- Confidence with basic diagrams and section terminology
Common ways people get this wrong
- Proxy features. A feature can stand in for a sensitive attribute even if you never included it directly.
- Schema drift. A feature changes meaning after a release. The model keeps working and starts guessing.
Main idea at a glance
From raw data to features
How inputs become a feature vector the model can learn from.
Stage 1
Raw inputs
Start with data as it comes from the source: logs, databases, user events, sensors.
I think raw data quality is underestimated; garbage in guarantees garbage out.
Raw data is rarely ready for a model. Even when it looks clean to a human, it usually contains missing values, inconsistent formats, and little traps like duplicated records. A model does not understand intent. It only sees the numbers you give it, so messy inputs quietly become messy behaviour.
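As a minimal sketch of that kind of cleanup (the field names, values, and date formats here are hypothetical), deduplication and format normalisation might look like this:

```python
from datetime import datetime

# Hypothetical raw records: a duplicated row, a missing value, an odd date format.
raw = [
    {"user_id": "42", "amount": "19.99", "ts": "2024-01-05"},
    {"user_id": "42", "amount": "19.99", "ts": "2024-01-05"},  # exact duplicate
    {"user_id": "7",  "amount": None,    "ts": "05/01/2024"},  # missing amount, day-first date
]

def clean(records):
    seen, out = set(), []
    for r in records:
        key = (r["user_id"], r["amount"], r["ts"])
        if key in seen:  # drop exact duplicates
            continue
        seen.add(key)
        amount = float(r["amount"]) if r["amount"] is not None else None
        # Normalise the two date formats present in this sample to ISO dates.
        ts = r["ts"]
        for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
            try:
                ts = datetime.strptime(r["ts"], fmt).date().isoformat()
                break
            except ValueError:
                continue
        out.append({"user_id": r["user_id"], "amount": amount, "ts": ts})
    return out

cleaned = clean(raw)
```

Even this toy pipeline makes the point: none of these fixes are modelling, but skipping any of them silently changes what the model sees.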
The first job is to decide what the model should pay attention to. Features can be obvious, like the total price of a basket, or subtle, like the time since last login. Good features are stable, meaningful, and available at prediction time. Bad features leak information from the future or smuggle in an identifier that lets the model memorise.
Feature work is where most AI projects live or die
My opinion is that if the feature definition is vague, the model will punish you later with confidence and nonsense.
- Good practice
- For each feature, write what it represents, how it is calculated, and when it is available. If the answer is fuzzy, the feature is risky.
- Bad practice
- Using convenient fields because they correlate, without checking whether they still exist at prediction time. This is how leakage sneaks in wearing a friendly smile.
- Best practice
- Keep a simple feature register. It does not need to be a bureaucratic monster. It just needs to exist, be readable, and be updated when the pipeline changes.
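A feature register does not need tooling; even a list of small records is enough. A sketch, with hypothetical feature names and fields:

```python
from dataclasses import dataclass

@dataclass
class FeatureSpec:
    """One row of a simple feature register (the structure is illustrative)."""
    name: str
    represents: str                 # what the feature means in the real world
    computed_from: str              # how it is calculated
    available_at_prediction: bool   # can we see this before the outcome happens?

register = [
    FeatureSpec("days_since_login", "recency of user activity",
                "now - last_login_ts", True),
    FeatureSpec("order_refunded", "post-event outcome status",
                "join against refunds table", False),  # leaky: exists only after the fact
]

# A trivial check that flags leaky features before they reach training.
leaky = [f.name for f in register if not f.available_at_prediction]
```

The `available_at_prediction` flag is the cheapest leakage defence there is: if you cannot answer it, the feature is exactly the "convenient field" the bad-practice bullet warns about.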
People call this feature engineering. In practice it is careful translation. You are turning a real world situation into signals a model can learn from. If you pick the wrong signals, the model can look accurate in testing and still fail in production because it learned the wrong shortcut.
Representation is the bridge between raw input and features. Sometimes the simplest representation is the best one. A single number for "days since password reset" can beat a complicated text field that mostly contains noise.
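The "days since password reset" example is a one-line encoding (the function name and dates here are illustrative):

```python
from datetime import date

def days_since(event_date: date, today: date) -> int:
    """Encode an event as a single, stable number the model can use."""
    return (today - event_date).days

# A reset on 1 January, scored on 1 March 2024 (a leap year).
feature = days_since(date(2024, 1, 1), date(2024, 3, 1))  # -> 60
```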
Text, images, and time series all need different treatments. For text, you might start with simple counts or categories, then move to an embedding. Embeddings are powerful because they compress meaning into numbers, but they also hide failure modes. If your embedding model was trained on a different language or a different context, it can flatten important distinctions.
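A simple count representation is easy to sketch before reaching for embeddings (the vocabulary and example sentence are invented):

```python
from collections import Counter

def count_vector(text: str, vocab: list[str]) -> list[int]:
    """Count representation: one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["password", "reset", "login"]
vec = count_vector("Password reset failed reset again", vocab)  # -> [1, 2, 0]
```

Notice what this encoding throws away: word order, words outside the vocabulary, and anything about meaning. That loss is a representation decision, whether or not you made it deliberately.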
For images, raw pixels are numbers already, but not good ones by themselves. Lighting, cropping, and camera differences can dominate the signal. For time series, the shape over time matters. Averages can erase patterns, and misaligned timestamps can create fake trends that a model happily learns.
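The point about averages erasing time-series patterns fits in a few lines (the usage series are made up):

```python
def mean(xs):
    return sum(xs) / len(xs)

flat = [10, 10, 10, 10]   # steady daily usage
spike = [0, 0, 0, 40]     # quiet, then a sudden burst

# Same average, very different behaviour: a mean-only feature cannot tell them apart.
assert mean(flat) == mean(spike) == 10.0

def trend(xs):
    """A shape-aware feature: how far the latest value sits from the average."""
    return xs[-1] - mean(xs)
```

`trend` is not the "right" feature, just one that keeps some of the shape the mean destroys.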
All of this affects dimensionality. More dimensions can capture richer detail, but they also increase the chance of learning coincidences, the cost of training, and the risk that your model learns a brittle rule that only holds in the training set.
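A toy illustration of those coincidences, assuming purely random features and labels: with far more dimensions than examples, some feature will agree with the labels well above chance, despite carrying no signal at all.

```python
import random

random.seed(0)
n, dims = 20, 200  # few examples, many random features

labels = [random.choice([0, 1]) for _ in range(n)]
features = [[random.random() for _ in range(n)] for _ in range(dims)]

def agreement(col):
    """Fraction of examples where thresholding this feature matches the label."""
    return sum((x > 0.5) == bool(y) for x, y in zip(col, labels)) / n

# The best random feature looks predictive purely by chance.
best = max(agreement(col) for col in features)
```

Any model trained on these features would "learn" that coincidence, and it would evaporate on new data.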
The hardest failures are silent. If your pipeline adds noise, a model can still reduce loss by fitting patterns that do not generalise. You see an improvement on a familiar dataset and assume the model is smarter. In reality, you changed the data in a way that made the benchmark easier or leaked a hint.
When a model behaves strangely, look at representation before you blame the algorithm. Small encoding choices can flip what the model can and cannot learn. This is why data work is not just "preprocessing". It is the main engineering work.
Mental model
Features control behaviour
Representation choices decide what the model can notice and what it will ignore.
- 1 Raw data
- 2 Prepare
- 3 Representation
- 4 Train
- 5 Behaviour
Assumptions to keep in mind
- Meaning is preserved. If you lose meaning during encoding, the model learns an approximation you did not intend.
- Bias is examined. Representation can encode bias. Check who is missing and who is overrepresented.
Failure modes to notice
- Proxy features. A feature can stand in for a sensitive attribute even if you never included it directly.
- Schema drift. A feature changes meaning after a release. The model keeps working and starts guessing.
Key terms
- feature
- A feature is a measurable input the model uses to make a prediction.
- representation
- A representation is the way you encode data so a model can use it.
- embedding
- An embedding is a numeric vector that places similar items near each other in a learned space.
- dimensionality
- Dimensionality is how many numbers are in your feature vector.
- noise
- Noise is random or irrelevant variation that hides the real signal.
Check yourself
Quick check on features and representation
Why is raw data rarely usable directly?
It is often messy, inconsistent, and not encoded in a stable way a model can learn from.
What is a feature?
A measurable input the model uses to make a prediction.
What is representation in this context?
The encoding choice that turns raw inputs into numbers the model can use.
Give an example of a feature that could leak the future.
A field that includes an outcome label or post-event status that is not available at prediction time.
What is an embedding used for?
To encode items as vectors where similar items end up close together.
Scenario: a model performs well offline, but fails on new products the business launches. What representation risk might explain it?
The encoding does not handle new categories or new vocabulary, or the embedding space was learned on an older world. The representation was not designed for change.
Why can embeddings hide problems?
They compress information, so mismatched training context can erase important distinctions.
What does dimensionality refer to?
How many numbers are in the feature vector.
Why can high dimensional features lead to brittle models?
They make it easier to fit coincidences that do not generalise.
What is noise and why does it matter?
Irrelevant variation that can drown out the real signal and mislead training.
Artefact and reflection
Artefact
A one-page decision note with assumption, evidence, and chosen action
Reflection
Where in your work would explaining data, features and representation change a decision, and what evidence would make you trust that change?
Optional practice
For each feature, write what it represents, how it is calculated, and when it is available. If the answer is fuzzy, the feature is risky.
Also in this module
Search with embeddings
Encode text as vectors and see how semantic similarity finds relevant results that keyword search misses.
Vector search in action
Build a small vector index, run queries and see how distance metrics affect retrieval results.