In Modules 1 and 2, you learned what AI is and why data quality matters. This module answers the central question: how does a machine actually learn from data?
AlphaFold learned from examples: here is an amino acid sequence, and here is the 3D structure it folds into. Given enough examples, it discovered patterns that generalised to proteins it had never seen. That process, learning from examples to make predictions on new data, is the core of supervised machine learning. This module explains how it works.
If you are already familiar with learning paradigms and loss functions, use the knowledge checks to confirm your understanding and skip to Module 4: Neural networks from scratch.
The module begins with the dominant paradigm: supervised learning, or learning from labelled examples.
Supervised learning is the most common paradigm in production AI. The "supervised" refers to the fact that the training data includes both inputs and correct answers (labels). The model's job is to learn a function that maps inputs to outputs accurately enough to work on new, unseen data.
Two main categories:
- Classification: predicting a discrete category (spam or not spam, which digit appears in an image).
- Regression: predicting a continuous value (a house price, tomorrow's temperature).
AlphaFold is a supervised learning system: given an input (amino acid sequence) and labelled examples (known 3D structures), it learned to predict new structures. Most commercial AI applications (recommendation systems, fraud detection, medical imaging, language translation) are supervised learning.
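The core idea can be sketched in a few lines of Python: fit a straight line to labelled (input, answer) pairs, then predict for an input the model has never seen. The numbers here are made up for illustration; real systems use libraries such as scikit-learn.

```python
# Minimal supervised learning sketch: fit y = w*x + b by least squares.

def fit_line(xs, ys):
    """Return slope w and intercept b minimising squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# Labelled training examples: inputs xs paired with correct answers ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]   # roughly y = 2x

w, b = fit_line(xs, ys)
prediction = w * 5.0 + b    # generalise to an unseen input
```

The "learning" here is nothing mysterious: the formula chooses the line that makes the squared error on the labelled examples as small as possible.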
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Tom Mitchell - Machine Learning (1997), Chapter 1, Definition 1.1
This is the standard formal definition of machine learning. In supervised learning: E is the labelled training data, T is the prediction task (classification or regression), and P is the accuracy or error metric. A system 'learns' if its performance improves as it sees more training data.
Having seen how models learn from labelled examples, we now turn to unsupervised learning, where no labels exist and the model must find structure on its own.
Unsupervised learning works with data that has no labels. The model's task is to discover structure, patterns, or groupings in the data on its own.
Unsupervised learning is harder to evaluate than supervised learning because there are no "correct answers" to compare against. Success depends on whether the discovered patterns are useful for the downstream task.
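A sketch of the idea: grouping unlabelled one-dimensional values (say, customer spend; the numbers are invented) with a few iterations of k-means. No correct answers are provided; the algorithm discovers the two groups itself.

```python
# Unsupervised sketch: cluster unlabelled 1-D points with k-means.

def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

spend = [10, 12, 11, 95, 100, 98]          # no labels anywhere
centers = sorted(kmeans_1d(spend, centroids=[0.0, 50.0]))
```

Whether the resulting segments are "good" cannot be read off the algorithm; it depends on whether they prove useful downstream, exactly as noted above.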
With supervised and unsupervised learning covered, we turn to the third paradigm: reinforcement learning, where feedback comes as rewards and penalties rather than labels.
Reinforcement learning (RL) is different from both supervised and unsupervised learning. An RL agent interacts with an environment, takes actions, and receives rewards or penalties. The goal is to learn a policy (a strategy for choosing actions) that maximises cumulative reward over time.
Key concepts:
- Agent: the learner that chooses actions.
- Environment: the world the agent interacts with.
- Action: a choice the agent makes at each step.
- Reward: feedback the environment returns after an action.
- Policy: the agent's strategy for choosing actions, improved over time to maximise cumulative reward.
DeepMind's AlphaGo (2016) used reinforcement learning to defeat the world champion at Go. It learned by playing millions of games against itself, receiving a reward of +1 for winning and -1 for losing. RLHF (Reinforcement Learning from Human Feedback) is used to align large language models with human preferences. We cover RL in depth in the Practice & Strategy stage (Module 21).
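The reward-driven loop can be sketched with a two-action bandit, a stateless simplification of RL. The payoffs are made up and fixed for reproducibility; real environments are stochastic and have states.

```python
import random

random.seed(0)             # reproducible exploration

true_reward = [0.2, 0.8]   # hidden payoffs; the agent never sees these directly
values = [0.0, 0.0]        # agent's running estimate of each action's reward
counts = [0, 0]

for a in (0, 1):           # try each action once to initialise the estimates
    counts[a] += 1
    values[a] = true_reward[a]

for step in range(200):
    if random.random() < 0.1:                      # explore occasionally
        action = random.randrange(2)
    else:                                          # otherwise exploit best estimate
        action = max(range(2), key=lambda a: values[a])
    reward = true_reward[action]                   # environment returns a reward
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # running average

best_action = max(range(2), key=lambda a: values[a])
```

Notice there are no labels: the agent is never told "action 1 is correct", only how much reward each choice produced. The explore/exploit trade-off in the loop is a central theme of RL, developed properly in Module 21.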
Whatever the paradigm, a model needs a way to measure its own mistakes. That is the job of the loss function, which we turn to next.
Common misconception
“Machine learning models understand the data they process”
ML models find mathematical patterns that correlate inputs with outputs. A model that classifies cats in images responds to pixel patterns (edges, textures, shapes) that statistically correlate with the label 'cat.' It has no concept of what a cat is. Models can learn spurious correlations: a famous study found that a model learned to classify 'wolf' partly based on snow in the background, not the animal itself. Always test models with adversarial examples and out-of-distribution data.
Common misconception
“You need a PhD in mathematics to understand machine learning”
The core concepts (learning from examples, minimising error, splitting data) are accessible to anyone with basic numeracy. The mathematical notation can be intimidating, but the underlying ideas are often simple. Linear regression, the foundation of many ML techniques, is finding the best-fit line through a scatter plot, something taught in secondary school mathematics. Focus on the intuition first; the formal notation becomes clearer once the concepts are solid.
A loss function (also called a cost function or objective function) measures how wrong the model's predictions are. The model's goal during training is to adjust its internal parameters to make the loss as small as possible.
For regression, a common loss is mean squared error (MSE): for each prediction, calculate the difference between predicted and actual, square it (to penalise large errors more than small ones), and average across all examples.
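MSE as just described, in a few lines (the predictions and targets are invented):

```python
# Mean squared error: average of squared (predicted - actual) differences.

def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

loss = mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
# errors are 0.5, 0.5 and 0.0, so the loss is (0.25 + 0.25 + 0) / 3
```

Squaring means an error of 2 contributes four times as much as an error of 1, which is exactly the "penalise large errors more" behaviour mentioned above.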
For classification, a common loss is cross-entropy loss: it measures how far the model's predicted probability distribution is from the true distribution (where the correct class has probability 1 and all others have probability 0).
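For a single example, cross-entropy reduces to the negative log of the probability the model assigned to the true class. A sketch, with invented probability distributions:

```python
import math

def cross_entropy(probs, true_class):
    """Loss for one example: -log(probability assigned to the true class)."""
    return -math.log(probs[true_class])

confident = cross_entropy([0.05, 0.90, 0.05], true_class=1)  # small loss
uncertain = cross_entropy([0.40, 0.30, 0.30], true_class=1)  # larger loss
```

A confident, correct prediction gives a loss near zero; spreading probability away from the true class drives the loss up, and a confident wrong prediction is punished severely.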
The choice of loss function shapes what the model optimises for. MSE penalises every error according to its squared magnitude, regardless of what kind of mistake it is. A model trained with a custom loss that heavily penalises false negatives (missing a cancer diagnosis) will behave very differently from one that weights false positives (unnecessary biopsies) just as heavily. The loss function encodes your priorities.
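A toy version of such an asymmetric loss. The 10x factor is an illustrative assumption, not a clinical recommendation:

```python
# Asymmetric classification loss: a missed positive (false negative)
# costs 10 times as much as a false alarm (false positive).

def weighted_loss(predicted, actual, fn_weight=10.0, fp_weight=1.0):
    loss = 0.0
    for p, a in zip(predicted, actual):
        if a == 1 and p == 0:
            loss += fn_weight   # false negative: missed a real positive
        elif a == 0 and p == 1:
            loss += fp_weight   # false positive: a false alarm
    return loss

# One mistake each, but very different losses:
miss_positive = weighted_loss([0, 0], [1, 0])   # one false negative
false_alarm = weighted_loss([1, 1], [1, 0])     # one false positive
```

A model trained to minimise this loss will err on the side of flagging borderline cases, because the arithmetic makes caution cheaper than a miss.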
With loss functions covered, one question remains: how do we know whether a model has actually learned anything, rather than memorised its training data? That is the role of training, validation, and test sets.
The most important concept in machine learning evaluation is that you must test your model on data it has never seen during training. Without this separation, you cannot know whether the model has learned generalisable patterns or has simply memorised the training examples (a problem called overfitting).
Data leakage occurs when information from the test set inadvertently influences training. Common causes include normalising the entire dataset before splitting (the mean and standard deviation include test data) or using features derived from the target variable. Data leakage produces misleadingly optimistic evaluation results.
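The correct order of operations can be shown in a short sketch, using synthetic numbers and plain Python in place of a real preprocessing library:

```python
# Split FIRST, then compute normalisation statistics from the training
# portion only. Computing them on the full dataset leaks test information.

def mean_std(values):
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

data = [2.0, 4.0, 6.0, 8.0, 100.0]   # illustrative; last value is an outlier
train, test = data[:4], data[4:]     # the split happens before any statistics

m, s = mean_std(train)               # fitted on training data only
train_norm = [(v - m) / s for v in train]
test_norm = [(v - m) / s for v in test]   # test data reuses training statistics
```

Had the mean and standard deviation been computed over all five values, the test outlier would have shifted both, quietly feeding information about the test set into every "training" feature.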
“Overfitting is the central problem of machine learning.”
Pedro Domingos - The Master Algorithm (2015), Chapter 3
Domingos's claim is deliberately strong. Overfitting, when a model memorises training data rather than learning generalisable patterns, is the failure mode that every ML practitioner must guard against. The train/validation/test split, regularisation, and cross-validation are all defences against overfitting.
A retail company wants to group its customers into segments based on purchasing behaviour. No predefined categories exist. Which learning paradigm is most appropriate?
During training, a model's training loss continues to decrease but its validation loss starts increasing after epoch 8. What is happening and what should you do?
A data scientist normalises the entire dataset (calculating mean and standard deviation from all data) before splitting into train/test sets. Why is this problematic?
John Jumper et al., "Highly accurate protein structure prediction with AlphaFold", Nature (2021)
Results (CASP14 performance), Methods (Evoformer architecture)
The primary scientific publication describing AlphaFold 2. Reports the GDT score of 92.4 at CASP14 and explains the attention-based architecture. Used as the opening case study to illustrate supervised learning at scale.
Tom Mitchell, Machine Learning (1997)
Chapter 1, Definition 1.1; Chapter 2 (Concept Learning)
The foundational ML textbook definition. Mitchell's three-part formulation (task T, performance P, experience E) remains the standard way to formally define learning. Cited in Section 3.1.
Pedro Domingos, The Master Algorithm (2015)
Chapter 3 (Overfitting and the bias-variance tradeoff)
Accessible treatment of overfitting as the central challenge in ML. Used in Section 3.5 to frame why train/validation/test splits matter. Domingos presents overfitting as the problem every ML method must solve.
David Silver et al., "Mastering the game of Go with deep neural networks and tree search", Nature (2016)
Methods (Monte Carlo tree search + neural networks)
Primary reference for AlphaGo. Demonstrates reinforcement learning combined with deep neural networks to master Go, a game with more possible positions than atoms in the universe. Used in Section 3.3.
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition (2022)
Chapter 1 (The Machine Learning Landscape), Chapter 2 (End-to-End ML Project)
The most widely used practical ML textbook. Chapters 1-2 provide the clearest accessible explanation of learning paradigms, loss functions, and train/test splits. Recommended as supplementary reading for this module.
You now understand the three learning paradigms, how loss functions guide learning, and why data splitting prevents overfitting. The next question is: what happens inside the model during training? Module 4 takes you inside a neural network, tracing how individual neurons compute, how layers combine, and how backpropagation adjusts weights to reduce error.
Module 3 of 24 · AI Foundations