Applied · Module 2
Data, features and representation
Raw data is rarely ready for a model.
Previously
Models, parameters and training dynamics
A model is still a function that turns input into output.
This module
Data, features and representation
Raw data is rarely ready for a model.
Next
Evaluation, metrics and failure analysis
Accuracy is an easy number to like because it feels clean.
Progress
Mark this module complete when you can explain it without rereading every paragraph.
Why this matters
My opinion is that if the feature definition is vague, the model will punish you later with confidence and nonsense.
What you will be able to do
- 1 Explain data, features and representation in your own words and apply it to a realistic scenario.
- 2 Describe how representation choices decide what the model can notice and what it will ignore.
- 3 Check the assumption "Meaning is preserved" and explain what changes if it is false.
- 4 Check the assumption "Bias is examined" and explain what changes if it is false.
Before you begin
- Foundations-level vocabulary and concepts
- Confidence with basic diagrams and section terminology
Common ways people get this wrong
- Proxy features. A feature can stand in for a sensitive attribute even if you never included it directly.
- Schema drift. A feature changes meaning after a release. The model keeps working and starts guessing.
Main idea at a glance
From raw data to features
How inputs become a feature vector the model can learn from.
Stage 1
Raw inputs
Start with data as it comes from the source: logs, databases, user events, sensors.
I think raw data quality is underestimated; garbage in guarantees garbage out.
Raw data is rarely ready for a model. Even when it looks clean to a human, it usually contains missing values, inconsistent formats, and little traps like duplicated records. A model does not understand intent. It only sees the numbers you give it, so messy inputs quietly become messy behaviour.
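As a minimal sketch of that kind of cleanup (the field names, values, and date formats here are hypothetical), deduplication and format normalisation might look like this:

```python
from datetime import datetime

# Hypothetical raw records: a duplicated row, a missing value, an odd date format.
raw = [
    {"user_id": "42", "amount": "19.99", "ts": "2024-01-05"},
    {"user_id": "42", "amount": "19.99", "ts": "2024-01-05"},  # exact duplicate
    {"user_id": "7",  "amount": None,    "ts": "05/01/2024"},  # missing amount, day-first date
]

def clean(records):
    seen, out = set(), []
    for r in records:
        key = (r["user_id"], r["amount"], r["ts"])
        if key in seen:  # drop exact duplicates
            continue
        seen.add(key)
        amount = float(r["amount"]) if r["amount"] is not None else None
        # Normalise the two date formats present in this sample to ISO dates.
        ts = r["ts"]
        for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
            try:
                ts = datetime.strptime(r["ts"], fmt).date().isoformat()
                break
            except ValueError:
                continue
        out.append({"user_id": r["user_id"], "amount": amount, "ts": ts})
    return out

cleaned = clean(raw)
```

Even this toy pipeline makes the point: none of these fixes are modelling, but skipping any of them silently changes what the model sees.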
The first job is to decide what the model should pay attention to. Features can be obvious, like the total price of a basket, or subtle, like the time since last login. Good features are stable, meaningful, and available at prediction time. Bad features leak information from the future or smuggle in an identifier that lets the model memorise.
Feature work is where most AI projects live or die
My opinion is that if the feature definition is vague, the model will punish you later with confidence and nonsense.
- Good practice
- For each feature, write what it represents, how it is calculated, and when it is available. If the answer is fuzzy, the feature is risky.
- Bad practice
- Using convenient fields because they correlate, without checking whether they still exist at prediction time. This is how leakage sneaks in wearing a friendly smile.
- Best practice
- Keep a simple feature register. It does not need to be a bureaucratic monster. It just needs to exist, be readable, and be updated when the pipeline changes.
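A feature register does not need tooling; even a list of small records is enough. A sketch, with hypothetical feature names and fields:

```python
from dataclasses import dataclass

@dataclass
class FeatureSpec:
    """One row of a simple feature register (the structure is illustrative)."""
    name: str
    represents: str                 # what the feature means in the real world
    computed_from: str              # how it is calculated
    available_at_prediction: bool   # can we see this before the outcome happens?

register = [
    FeatureSpec("days_since_login", "recency of user activity",
                "now - last_login_ts", True),
    FeatureSpec("order_refunded", "post-event outcome status",
                "join against refunds table", False),  # leaky: exists only after the fact
]

# A trivial check that flags leaky features before they reach training.
leaky = [f.name for f in register if not f.available_at_prediction]
```

The `available_at_prediction` flag is the cheapest leakage defence there is: if you cannot answer it, the feature is exactly the "convenient field" the bad-practice bullet warns about.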
People call this feature engineering. In practice it is careful translation. You are turning a real world situation into signals a model can learn from. If you pick the wrong signals, the model can look accurate in testing and still fail in production because it learned the wrong shortcut.
Representation is the bridge between raw input and features. Sometimes the simplest representation is the best one. A single number for "days since password reset" can beat a complicated text field that mostly contains noise.
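The "days since password reset" example is a one-line encoding (the function name and dates here are illustrative):

```python
from datetime import date

def days_since(event_date: date, today: date) -> int:
    """Encode an event as a single, stable number the model can use."""
    return (today - event_date).days

# A reset on 1 January, scored on 1 March 2024 (a leap year).
feature = days_since(date(2024, 1, 1), date(2024, 3, 1))  # -> 60
```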
Text, images, and time series all need different treatments. For text, you might start with simple counts or categories, then move to an embedding. Embeddings are powerful because they compress meaning into numbers, but they also hide failure modes. If your embedding model was trained on a different language or a different context, it can flatten important distinctions.
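A simple count representation is easy to sketch before reaching for embeddings (the vocabulary and example sentence are invented):

```python
from collections import Counter

def count_vector(text: str, vocab: list[str]) -> list[int]:
    """Count representation: one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["password", "reset", "login"]
vec = count_vector("Password reset failed reset again", vocab)  # -> [1, 2, 0]
```

Notice what this encoding throws away: word order, words outside the vocabulary, and anything about meaning. That loss is a representation decision, whether or not you made it deliberately.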
For images, raw pixels are numbers already, but not good ones by themselves. Lighting, cropping, and camera differences can dominate the signal. For time series, the shape over time matters. Averages can erase patterns, and misaligned timestamps can create fake trends that a model happily learns.
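The point about averages erasing time-series patterns fits in a few lines (the usage series are made up):

```python
def mean(xs):
    return sum(xs) / len(xs)

flat = [10, 10, 10, 10]   # steady daily usage
spike = [0, 0, 0, 40]     # quiet, then a sudden burst

# Same average, very different behaviour: a mean-only feature cannot tell them apart.
assert mean(flat) == mean(spike) == 10.0

def trend(xs):
    """A shape-aware feature: how far the latest value sits from the average."""
    return xs[-1] - mean(xs)
```

`trend` is not the "right" feature, just one that keeps some of the shape the mean destroys.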
All of this affects dimensionality. More dimensions can capture richer detail, but they also increase the chance of learning coincidences, the cost of training, and the risk that your model learns a brittle rule that only holds in the training set.
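A toy illustration of those coincidences, assuming purely random features and labels: with far more dimensions than examples, some feature will agree with the labels well above chance, despite carrying no signal at all.

```python
import random

random.seed(0)
n, dims = 20, 200  # few examples, many random features

labels = [random.choice([0, 1]) for _ in range(n)]
features = [[random.random() for _ in range(n)] for _ in range(dims)]

def agreement(col):
    """Fraction of examples where thresholding this feature matches the label."""
    return sum((x > 0.5) == bool(y) for x, y in zip(col, labels)) / n

# The best random feature looks predictive purely by chance.
best = max(agreement(col) for col in features)
```

Any model trained on these features would "learn" that coincidence, and it would evaporate on new data.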
The hardest failures are silent. If your pipeline adds noise, a model can still reduce loss by fitting patterns that do not generalise. You see an improvement on a familiar dataset and assume the model is smarter. In reality, you changed the data in a way that made the benchmark easier or leaked a hint.
When a model behaves strangely, look at representation before you blame the algorithm. Small encoding choices can flip what the model can and cannot learn. This is why data work is not just "preprocessing". It is the main engineering work.
Mental model
Features control behaviour
Representation choices decide what the model can notice and what it will ignore.
- 1 Raw data
- 2 Prepare
- 3 Representation
- 4 Train
- 5 Behaviour
Assumptions to keep in mind
- Meaning is preserved. If you lose meaning during encoding, the model learns an approximation you did not intend.
- Bias is examined. Representation can encode bias. Check who is missing and who is overrepresented.
Failure modes to notice
- Proxy features. A feature can stand in for a sensitive attribute even if you never included it directly.
- Schema drift. A feature changes meaning after a release. The model keeps working and starts guessing.
Key terms
- feature
- A feature is a measurable input the model uses to make a prediction.
- representation
- A representation is the way you encode data so a model can use it.
- embedding
- An embedding is a numeric vector that places similar items near each other in a learned space.
- dimensionality
- Dimensionality is how many numbers are in your feature vector.
- noise
- Noise is random or irrelevant variation that hides the real signal.
Check yourself
Quick check on features and representation
Why is raw data rarely usable directly?
It is often messy, inconsistent, and not encoded in a stable way a model can learn from.
What is a feature?
A measurable input the model uses to make a prediction.
What is representation in this context?
The encoding choice that turns raw inputs into numbers the model can use.
Give an example of a feature that could leak the future.
A field that includes an outcome label or post-event status that is not available at prediction time.
What is an embedding used for?
To encode items as vectors where similar items end up close together.
Scenario: a model performs well offline, but fails on new products the business launches. What representation risk might explain it?
The encoding does not handle new categories or new vocabulary, or the embedding space was learned on an older world. The representation was not designed for change.
Why can embeddings hide problems?
They compress information, so mismatched training context can erase important distinctions.
What does dimensionality refer to?
How many numbers are in the feature vector.
Why can high dimensional features lead to brittle models?
They make it easier to fit coincidences that do not generalise.
What is noise and why does it matter?
Irrelevant variation that can drown out the real signal and mislead training.
Artefact and reflection
Artefact
A one-page decision note with assumption, evidence, and chosen action
Reflection
Where in your work would explaining data, features and representation change a decision, and what evidence would make you trust that change?
Optional practice
For each feature, write what it represents, how it is calculated, and when it is available. If the answer is fuzzy, the feature is risky.
Also in this module
Search with embeddings
Encode text as vectors and see how semantic similarity finds relevant results that keyword search misses.
Vector search in action
Build a small vector index, run queries and see how distance metrics affect retrieval results.