In Modules 1 and 2, you learned what AI is and why data quality matters. This module answers the central question: how does a machine actually learn from data?
AlphaFold learned from examples: here is an amino acid sequence, and here is the 3D structure it folds into. Given enough examples, it discovered patterns that generalised to proteins it had never seen. That process, learning from examples to make predictions on new data, is the core of supervised machine learning. This module explains how it works.
If you are already familiar with learning paradigms and loss functions, use the knowledge checks to confirm your understanding and skip to Module 4: Neural networks from scratch.
The module begins with the dominant paradigm: supervised learning, or learning from labelled examples.
Supervised learning is the most common paradigm in production AI. The "supervised" refers to the fact that the training data includes both inputs and correct answers (labels). The model's job is to learn a function that maps inputs to outputs accurately enough to work on new, unseen data.
Two main categories:
- Classification: predicting a discrete category (spam or not spam, which digit appears in an image).
- Regression: predicting a continuous value (a house price, tomorrow's temperature).
AlphaFold is a supervised learning system: given an input (amino acid sequence) and labelled examples (known 3D structures), it learned to predict new structures. Most commercial AI applications (recommendation systems, fraud detection, medical imaging, language translation) are supervised learning.
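The core idea can be sketched in a few lines of Python: fit a straight line to labelled (input, answer) pairs, then predict for an input the model has never seen. The numbers here are made up for illustration; real systems use libraries such as scikit-learn.

```python
# Minimal supervised learning sketch: fit y = w*x + b by least squares.

def fit_line(xs, ys):
    """Return slope w and intercept b minimising squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# Labelled training examples: inputs xs paired with correct answers ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]   # roughly y = 2x

w, b = fit_line(xs, ys)
prediction = w * 5.0 + b    # generalise to an unseen input
```

The "learning" here is nothing mysterious: the formula chooses the line that makes the squared error on the labelled examples as small as possible.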
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Tom Mitchell - Machine Learning (1997), Chapter 1, Definition 1.1
This is the standard formal definition of machine learning. In supervised learning: E is the labelled training data, T is the prediction task (classification or regression), and P is the accuracy or error metric. A system 'learns' if its performance improves as it sees more training data.
Having seen how models learn from labelled examples, we now turn to unsupervised learning, where no labels exist and the model must find structure on its own.
Unsupervised learning works with data that has no labels. The model's task is to discover structure, patterns, or groupings in the data on its own.
Unsupervised learning is harder to evaluate than supervised learning because there are no "correct answers" to compare against. Success depends on whether the discovered patterns are useful for the downstream task.
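A sketch of the idea: grouping unlabelled one-dimensional values (say, customer spend; the numbers are invented) with a few iterations of k-means. No correct answers are provided; the algorithm discovers the two groups itself.

```python
# Unsupervised sketch: cluster unlabelled 1-D points with k-means.

def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

spend = [10, 12, 11, 95, 100, 98]          # no labels anywhere
centers = sorted(kmeans_1d(spend, centroids=[0.0, 50.0]))
```

Whether the resulting segments are "good" cannot be read off the algorithm; it depends on whether they prove useful downstream, exactly as noted above.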
With supervised and unsupervised learning covered, we turn to the third paradigm: reinforcement learning, where feedback comes as rewards and penalties rather than labels.
Reinforcement learning (RL) is different from both supervised and unsupervised learning. An RL agent interacts with an environment, takes actions, and receives rewards or penalties. The goal is to learn a policy (a strategy for choosing actions) that maximises cumulative reward over time.
Key concepts:
- Agent: the learner that chooses actions.
- Environment: the world the agent interacts with.
- Action: a choice the agent makes at each step.
- Reward: feedback the environment returns after an action.
- Policy: the agent's strategy for choosing actions, improved over time to maximise cumulative reward.
DeepMind's AlphaGo (2016) used reinforcement learning to defeat the world champion at Go. It learned by playing millions of games against itself, receiving a reward of +1 for winning and -1 for losing. RLHF (Reinforcement Learning from Human Feedback) is used to align large language models with human preferences. We cover RL in depth in the Practice & Strategy stage (Module 21).
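The reward-driven loop can be sketched with a two-action bandit, a stateless simplification of RL. The payoffs are made up and fixed for reproducibility; real environments are stochastic and have states.

```python
import random

random.seed(0)             # reproducible exploration

true_reward = [0.2, 0.8]   # hidden payoffs; the agent never sees these directly
values = [0.0, 0.0]        # agent's running estimate of each action's reward
counts = [0, 0]

for a in (0, 1):           # try each action once to initialise the estimates
    counts[a] += 1
    values[a] = true_reward[a]

for step in range(200):
    if random.random() < 0.1:                      # explore occasionally
        action = random.randrange(2)
    else:                                          # otherwise exploit best estimate
        action = max(range(2), key=lambda a: values[a])
    reward = true_reward[action]                   # environment returns a reward
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # running average

best_action = max(range(2), key=lambda a: values[a])
```

Notice there are no labels: the agent is never told "action 1 is correct", only how much reward each choice produced. The explore/exploit trade-off in the loop is a central theme of RL, developed properly in Module 21.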
Whatever the paradigm, a model needs a way to measure its own mistakes. That is the job of the loss function, which we turn to next.
Common misconception
“Machine learning models understand the data they process”
ML models find mathematical patterns that correlate inputs with outputs. A model that classifies cats in images responds to pixel patterns (edges, textures, shapes) that statistically correlate with the label 'cat.' It has no concept of what a cat is. Models can learn spurious correlations: a famous study found that a model learned to classify 'wolf' partly based on snow in the background, not the animal itself. Always test models with adversarial examples and out-of-distribution data.
Common misconception
“You need a PhD in mathematics to understand machine learning”
The core concepts (learning from examples, minimising error, splitting data) are accessible to anyone with basic numeracy. The mathematical notation can be intimidating, but the underlying ideas are often simple. Linear regression, the foundation of many ML techniques, is finding the best-fit line through a scatter plot, something taught in secondary school mathematics. Focus on the intuition first; the formal notation becomes clearer once the concepts are solid.
A loss function (also called a cost function or objective function) measures how wrong the model's predictions are. The model's goal during training is to adjust its internal parameters to make the loss as small as possible.
For regression, a common loss is mean squared error (MSE): for each prediction, calculate the difference between predicted and actual, square it (to penalise large errors more than small ones), and average across all examples.
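MSE as just described, in a few lines (the predictions and targets are invented):

```python
# Mean squared error: average of squared (predicted - actual) differences.

def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

loss = mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
# errors are 0.5, 0.5 and 0.0, so the loss is (0.25 + 0.25 + 0) / 3
```

Squaring means an error of 2 contributes four times as much as an error of 1, which is exactly the "penalise large errors more" behaviour mentioned above.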
For classification, a common loss is cross-entropy loss: it measures how far the model's predicted probability distribution is from the true distribution (where the correct class has probability 1 and all others have probability 0).
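For a single example, cross-entropy reduces to the negative log of the probability the model assigned to the true class. A sketch, with invented probability distributions:

```python
import math

def cross_entropy(probs, true_class):
    """Loss for one example: -log(probability assigned to the true class)."""
    return -math.log(probs[true_class])

confident = cross_entropy([0.05, 0.90, 0.05], true_class=1)  # small loss
uncertain = cross_entropy([0.40, 0.30, 0.30], true_class=1)  # larger loss
```

A confident, correct prediction gives a loss near zero; spreading probability away from the true class drives the loss up, and a confident wrong prediction is punished severely.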
The choice of loss function shapes what the model optimises for. MSE penalises every error according to its squared magnitude, regardless of what kind of mistake it is. A model trained with a custom loss that heavily penalises false negatives (missing a cancer diagnosis) will behave very differently from one that weights false positives (unnecessary biopsies) just as heavily. The loss function encodes your priorities.
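A toy version of such an asymmetric loss. The 10x factor is an illustrative assumption, not a clinical recommendation:

```python
# Asymmetric classification loss: a missed positive (false negative)
# costs 10 times as much as a false alarm (false positive).

def weighted_loss(predicted, actual, fn_weight=10.0, fp_weight=1.0):
    loss = 0.0
    for p, a in zip(predicted, actual):
        if a == 1 and p == 0:
            loss += fn_weight   # false negative: missed a real positive
        elif a == 0 and p == 1:
            loss += fp_weight   # false positive: a false alarm
    return loss

# One mistake each, but very different losses:
miss_positive = weighted_loss([0, 0], [1, 0])   # one false negative
false_alarm = weighted_loss([1, 1], [1, 0])     # one false positive
```

A model trained to minimise this loss will err on the side of flagging borderline cases, because the arithmetic makes caution cheaper than a miss.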
With loss functions covered, one question remains: how do we know whether a model has actually learned anything, rather than memorised its training data? That is the role of training, validation, and test sets.
The most important concept in machine learning evaluation is that you must test your model on data it has never seen during training. Without this separation, you cannot know whether the model has learned generalisable patterns or has simply memorised the training examples (a problem called overfitting).
Data leakage occurs when information from the test set inadvertently influences training. Common causes include normalising the entire dataset before splitting (the mean and standard deviation include test data) or using features derived from the target variable. Data leakage produces misleadingly optimistic evaluation results.
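The correct order of operations can be shown in a short sketch, using synthetic numbers and plain Python in place of a real preprocessing library:

```python
# Split FIRST, then compute normalisation statistics from the training
# portion only. Computing them on the full dataset leaks test information.

def mean_std(values):
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

data = [2.0, 4.0, 6.0, 8.0, 100.0]   # illustrative; last value is an outlier
train, test = data[:4], data[4:]     # the split happens before any statistics

m, s = mean_std(train)               # fitted on training data only
train_norm = [(v - m) / s for v in train]
test_norm = [(v - m) / s for v in test]   # test data reuses training statistics
```

Had the mean and standard deviation been computed over all five values, the test outlier would have shifted both, quietly feeding information about the test set into every "training" feature.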
“Overfitting is the central problem of machine learning.”
Pedro Domingos - The Master Algorithm (2015), Chapter 3
Domingos's claim is deliberately strong. Overfitting, when a model memorises training data rather than learning generalisable patterns, is the failure mode that every ML practitioner must guard against. The train/validation/test split, regularisation, and cross-validation are all defences against overfitting.
A retail company wants to group its customers into segments based on purchasing behaviour. No predefined categories exist. Which learning paradigm is most appropriate?
During training, a model's training loss continues to decrease but its validation loss starts increasing after epoch 8. What is happening and what should you do?
A data scientist normalises the entire dataset (calculating mean and standard deviation from all data) before splitting into train/test sets. Why is this problematic?
John Jumper et al., "Highly accurate protein structure prediction with AlphaFold", Nature (2021)
Results (CASP14 performance), Methods (Evoformer architecture)
The primary scientific publication describing AlphaFold 2. Reports the GDT score of 92.4 at CASP14 and explains the attention-based architecture. Used as the opening case study to illustrate supervised learning at scale.
Tom Mitchell, Machine Learning (1997)
Chapter 1, Definition 1.1; Chapter 2 (Concept Learning)
The foundational ML textbook definition. Mitchell's three-part formulation (task T, performance P, experience E) remains the standard way to formally define learning. Cited in Section 3.1.
Pedro Domingos, The Master Algorithm (2015)
Chapter 3 (Overfitting and the bias-variance tradeoff)
Accessible treatment of overfitting as the central challenge in ML. Used in Section 3.5 to frame why train/validation/test splits matter. Domingos presents overfitting as the problem every ML method must solve.
David Silver et al., "Mastering the game of Go with deep neural networks and tree search", Nature (2016)
Methods (Monte Carlo tree search + neural networks)
Primary reference for AlphaGo. Demonstrates reinforcement learning combined with deep neural networks to master Go, a game with more possible positions than atoms in the universe. Used in Section 3.3.
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition (2022)
Chapter 1 (The Machine Learning Landscape), Chapter 2 (End-to-End ML Project)
The most widely used practical ML textbook. Chapters 1-2 provide the clearest accessible explanation of learning paradigms, loss functions, and train/test splits. Recommended as supplementary reading for this module.
You now understand the three learning paradigms, how loss functions guide learning, and why data splitting prevents overfitting. The next question is: what happens inside the model during training? Module 4 takes you inside a neural network, tracing how individual neurons compute, how layers combine, and how backpropagation adjusts weights to reduce error.
Module 3 of 24 · AI Foundations