This is the sixth of eight Foundations modules. You know how to evaluate a model (Module 5). Now the question is: what model architecture should you choose? This module covers the two families that dominated deep learning before transformers arrived: convolutional networks for spatial data and recurrent networks for sequential data.
AlexNet did not just win a competition. It demonstrated that neural networks could learn features directly from raw pixels, eliminating years of manual feature engineering. To understand why this works, you need to understand what a convolutional neural network actually does to an image.
If CNNs and RNNs are already familiar, use the knowledge checks to test yourself and move to Module 7: Responsible AI basics.
With the learning outcomes established, the module begins with convolutional neural networks: architectures that see patterns in space.
A standard neural network treats each input as an independent number. For a 224×224 pixel colour image, that means 150,528 independent inputs (224 × 224 × 3 colour channels). This throws away all spatial structure: the network has no idea that pixel (10,10) is next to pixel (10,11). A CNN preserves spatial relationships by processing the image through a series of local operations.
A convolutional layer slides a small filter (typically 3×3 or 5×5 pixels) across the image. At each position, it computes a dot product between the filter weights and the underlying pixel values, producing a single number. The output of sliding the filter across the entire image is called a feature map. Different filters detect different patterns: edges, textures, corners, curves. The network learns which filters are useful during training.
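To make this concrete, here is a minimal sketch of the sliding-filter computation in plain NumPy (illustrative only; deep learning frameworks use heavily optimised implementations). The 3×3 vertical-edge filter and the 8×8 toy image are assumptions for the example; in a trained CNN the filter weights are learned.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the filter weights and the patch of pixels beneath it
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(8, 8)               # toy 8x8 grayscale image
vertical_edge = np.array([[1., 0., -1.],   # a hand-written vertical-edge filter;
                          [1., 0., -1.],   # in a real CNN these weights are learned
                          [1., 0., -1.]])
print(convolve2d(image, vertical_edge).shape)   # (6, 6) feature map
```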
After convolution, pooling layers reduce the spatial dimensions of the feature maps. Max pooling, the most common variant, takes the maximum value from each small region (e.g. 2×2). This has two effects: it reduces computation and it provides a degree of translation invariance, meaning the network can recognise a cat whether it appears in the top-left or bottom-right of the image.
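A matching sketch of 2×2 max pooling, again in plain NumPy for illustration: each non-overlapping 2×2 region of the feature map is replaced by its maximum, halving both spatial dimensions. The 4×4 toy feature map is an assumption for the example.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Replace each non-overlapping size x size region with its maximum value."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                  # drop any ragged edge
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)                     # toy 4x4 feature map
print(max_pool(fm))                                    # [[ 5.  7.] [13. 15.]]
```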
CNNs stack multiple convolutional and pooling layers. Early layers detect low-level features (edges, colour gradients). Middle layers combine these into textures and parts (fur patterns, wheel shapes). Deep layers compose parts into whole objects (a dog, a car). This hierarchical feature learning is what makes CNNs so effective on visual data and why they eliminated the need for manual feature engineering.
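The sketch below stacks three convolution-and-pooling blocks to show the resulting shapes. PyTorch is assumed here purely for illustration, and the channel counts (16, 32, 64) are arbitrary: the point is that each pooling step halves the spatial dimensions (224 → 112 → 56 → 28) while the number of feature maps grows, giving deeper layers a richer vocabulary of patterns over a coarser grid.

```python
import torch
import torch.nn as nn

# Three conv + pool blocks: spatial size halves at each pool while the
# number of feature maps (channels) grows. Layer sizes are illustrative.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 224 -> 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 112 -> 56
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
)

x = torch.randn(1, 3, 224, 224)    # one 224x224 RGB image
print(backbone(x).shape)           # torch.Size([1, 64, 28, 28])
```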
“We find it interesting that the network has largely learned to separate color information from shape/texture information in the first layer.”
Krizhevsky, A., Sutskever, I. & Hinton, G.E., 'ImageNet Classification with Deep Convolutional Neural Networks' (2012) - Section 6.1, Qualitative Evaluation
The AlexNet paper observed that the network spontaneously discovered features resembling those that vision researchers had spent decades engineering by hand. The filters learned by the first layer closely resemble Gabor filters and colour blobs, the building blocks of biological vision.
With convolutional networks and their treatment of spatial structure in place, the discussion turns to recurrent neural networks, which process sequences.
CNNs excel at spatial data but have no notion of order. For data where sequence matters, such as text, speech, time series, or music, you need an architecture that processes inputs one step at a time and maintains a memory of what it has seen so far. This is the role of recurrent neural networks (RNNs).
At each time step, an RNN takes two inputs: the current element in the sequence and a hidden state that summarises everything the network has seen up to that point. It produces two outputs: a prediction for the current step and an updated hidden state that gets passed to the next step. This creates a chain of computation where information flows forward through the sequence.
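A minimal NumPy sketch of this recurrence (the dimensions and the random toy sequence are assumptions for illustration): the same weight matrices are reused at every step, and the hidden state h is the only thing carried forward.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 16, 5              # illustrative sizes
W_xh = 0.1 * rng.normal(size=(input_dim, hidden_dim))
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                               # hidden state starts empty
for x_t in rng.normal(size=(seq_len, input_dim)):      # toy sequence of 5 steps
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)              # h summarises everything seen so far
# A prediction head (e.g. a linear layer) would read from h at each step or at the end.
```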
In theory, an RNN can learn dependencies across arbitrarily long sequences. In practice, it cannot. During backpropagation through time, gradients are multiplied at each step. If the multiplicative factor is less than 1 (which it usually is), the gradient shrinks exponentially. After 20 or 30 steps, the gradient is effectively zero: the network cannot learn that a word at the beginning of a paragraph affects the meaning at the end. This is the vanishing gradient problem, and it was the primary limitation of vanilla RNNs for over a decade.
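A toy illustration of the decay (the per-step factor of 0.9 is an assumption; the real factor depends on the weights and activations): multiplying the gradient by a number slightly below 1 at every step drives it towards zero within a few dozen steps.

```python
# If backpropagation through time shrinks the gradient by ~0.9 per step,
# the learning signal from 30 steps back has all but vanished.
factor = 0.9
for steps in (5, 10, 20, 30):
    print(f"{steps:>2} steps back: gradient scaled by {factor ** steps:.3f}")
# 5 steps: 0.590, 10 steps: 0.349, 20 steps: 0.122, 30 steps: 0.042
```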
Common misconception
“RNNs can learn long-range dependencies because they have memory.”
Vanilla RNNs have memory in principle but not in practice. The vanishing gradient problem means that information from early time steps is exponentially diluted by the time it reaches later steps. An RNN processing a 200-word paragraph effectively forgets the first sentence by the time it reaches the last. This is why LSTMs and later transformers were necessary.
With recurrent networks and their limitations in place, the discussion turns to LSTMs, which learn what to remember and what to forget.
The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber in 1997, addresses the vanishing gradient problem with a mechanism that explicitly controls information flow. Instead of a single hidden state, an LSTM maintains a cell state: a highway that runs through the entire sequence with minimal modification.
Three gates control what happens to the cell state at each step:
The forget gate decides what to discard from the cell state.
The input gate decides what new information to write into it.
The output gate decides how much of the cell state to expose as the hidden state for the current step.
This gating mechanism allows gradients to flow through the cell state without the multiplicative decay that kills vanilla RNNs. LSTMs can learn dependencies across hundreds of time steps, which made them the dominant architecture for machine translation, speech recognition, and text generation from roughly 2014 to 2017, until transformers superseded them.
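A minimal sketch of a single LSTM step in NumPy (the combined weight matrix W, the dimensions, and the layout of the gate pre-activations are assumptions; real implementations vary): the forget and input gates decide what leaves and enters the cell state, the cell update is mostly additive, and the output gate decides what the rest of the network sees.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x_t, h_prev] to four gate pre-activations."""
    z = np.concatenate([x_t, h_prev]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input and output gates
    g = np.tanh(g)                                 # candidate values to write
    c = f * c_prev + i * g                         # cell state: a mostly additive highway
    h = o * np.tanh(c)                             # hidden state exposed to the next step
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16                      # illustrative sizes
W = 0.1 * rng.normal(size=(input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):        # toy 5-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
```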
“Unlike traditional recurrent neural networks, an LSTM network is well-suited to learning from experience to classify, process and predict time series when there are very long time lags of unknown size between important events.”
Hochreiter, S. & Schmidhuber, J., 'Long Short-Term Memory', Neural Computation (1997) - Abstract
The original LSTM paper identified the core problem (long time lags) and the core solution (constant error flow through the cell state). This architecture dominated sequence modelling for 20 years and remains widely used in time-series applications.
With LSTMs in place, the final topic is when to use which architecture.
Architecture selection is not a matter of personal preference. It is determined by the structure of your data:
Images and other grid-like spatial data → a CNN, which exploits local spatial correlations.
Text, speech, time series, and other ordered data → an RNN or LSTM, which processes elements in sequence.
Tabular data with well-defined features → often a gradient-boosted tree model rather than deep learning at all (see the misconception below).
The general principle: choose the architecture whose inductive biases match the structure of your data. CNNs assume local spatial correlations. RNNs assume temporal ordering. Using the wrong architecture forces the network to learn structure that the right architecture gives it for free.
Common misconception
“Deep learning is always better than traditional machine learning.”
On tabular data with well-defined features, gradient-boosted trees frequently match or outperform deep learning while training in seconds instead of hours. A 2022 benchmark study by Grinsztajn et al. found that tree-based models were superior on 45 tabular datasets. Deep learning's advantage is on unstructured data (images, text, audio) where manual feature engineering is impractical.
A CNN processes a 224×224 image through three convolutional layers with max pooling after each. The spatial dimensions reduce from 224 to 112 to 56 to 28. Why does the number of feature maps (channels) typically increase at each layer?
You are building a model to predict stock prices from historical daily closing prices (a time series). Which architecture is most appropriate as a starting point?
The vanishing gradient problem means that vanilla RNNs cannot learn dependencies between distant time steps. How does the LSTM address this?
Krizhevsky, A., Sutskever, I. & Hinton, G.E., 'ImageNet Classification with Deep Convolutional Neural Networks', Advances in Neural Information Processing Systems (2012)
Full paper
The AlexNet paper that triggered the deep learning revolution in computer vision. Demonstrated that deep CNNs trained on GPUs could dramatically outperform hand-crafted feature engineering. Used as the opening case study.
Hochreiter, S. & Schmidhuber, J., 'Long Short-Term Memory', Neural Computation (1997)
Sections 1-4
The original LSTM paper. Identified the vanishing gradient problem formally and introduced the cell state with gated information flow as the solution. This architecture dominated sequence modelling for two decades.
LeCun, Y., Bengio, Y. & Hinton, G., 'Deep learning', Nature (2015)
Full review article
Authoritative review by the three pioneers of deep learning. Provides historical context for CNNs and RNNs and explains why depth enables hierarchical feature learning. Essential reading for understanding the field.
Grinsztajn, L., Oyallon, E. & Varoquaux, G., 'Why do tree-based models still outperform deep learning on typical tabular data?', Advances in Neural Information Processing Systems (2022)
Full paper
Rigorous benchmark demonstrating that gradient-boosted trees outperform deep learning on 45 tabular datasets. Challenges the assumption that deep learning is universally superior and establishes when tree-based models remain the better choice.
Goodfellow, I., Bengio, Y. & Courville, A., Deep Learning (2016)
Chapters 9 (CNNs) and 10 (RNNs)
The standard deep learning textbook. Chapter 9 provides the mathematical foundation for convolution, pooling, and stride. Chapter 10 covers RNN unfolding, backpropagation through time, and the gating mechanisms in LSTMs and GRUs.
You now understand the two architecture families that dominated deep learning before 2017: CNNs for spatial data and RNNs/LSTMs for sequential data. But building powerful models raises a critical question: how do you ensure they are fair, explainable, and accountable? Module 7 introduces responsible AI: the frameworks, metrics, and practices that prevent AI systems from causing harm.