This is the sixth of eight Foundations modules. You know how to evaluate a model (Module 5). Now the question is: what model architecture should you choose? This module covers the two families that dominated deep learning before transformers arrived: convolutional networks for spatial data and recurrent networks for sequential data.
AlexNet did not just win a competition. It demonstrated that neural networks could learn features directly from raw pixels, eliminating years of manual feature engineering. To understand why this works, you need to understand what a convolutional neural network actually does to an image.
If CNNs and RNNs are already familiar, use the knowledge checks to test yourself and move to Module 7: Responsible AI basics.
With the learning outcomes established, the module begins with convolutional neural networks: architectures that see patterns in space.
A standard neural network treats each input as an independent number. For a 224×224 pixel colour image, that means 150,528 independent inputs (224 × 224 × 3 colour channels). This throws away all spatial structure: the network has no idea that pixel (10,10) is next to pixel (10,11). A CNN preserves spatial relationships by processing the image through a series of local operations.
A convolutional layer slides a small filter (typically 3×3 or 5×5 pixels) across the image. At each position, it computes a dot product between the filter weights and the underlying pixel values, producing a single number. The output of sliding the filter across the entire image is called a feature map. Different filters detect different patterns: edges, textures, corners, curves. The network learns which filters are useful during training.
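To make this concrete, here is a minimal sketch of the sliding-filter computation in plain NumPy (illustrative only; deep learning frameworks use heavily optimised implementations). The 3×3 vertical-edge filter and the 8×8 toy image are assumptions for the example; in a trained CNN the filter weights are learned.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the filter weights and the patch of pixels beneath it
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(8, 8)               # toy 8x8 grayscale image
vertical_edge = np.array([[1., 0., -1.],   # a hand-written vertical-edge filter;
                          [1., 0., -1.],   # in a real CNN these weights are learned
                          [1., 0., -1.]])
print(convolve2d(image, vertical_edge).shape)   # (6, 6) feature map
```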
After convolution, pooling layers reduce the spatial dimensions of the feature maps. Max pooling, the most common variant, takes the maximum value from each small region (e.g. 2×2). This has two effects: it reduces computation and it provides a degree of translation invariance, meaning the network can recognise a cat whether it appears in the top-left or bottom-right of the image.
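A matching sketch of 2×2 max pooling, again in plain NumPy for illustration: each non-overlapping 2×2 region of the feature map is replaced by its maximum, halving both spatial dimensions. The 4×4 toy feature map is an assumption for the example.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Replace each non-overlapping size x size region with its maximum value."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                  # drop any ragged edge
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)                     # toy 4x4 feature map
print(max_pool(fm))                                    # [[ 5.  7.] [13. 15.]]
```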
CNNs stack multiple convolutional and pooling layers. Early layers detect low-level features (edges, colour gradients). Middle layers combine these into textures and parts (fur patterns, wheel shapes). Deep layers compose parts into whole objects (a dog, a car). This hierarchical feature learning is what makes CNNs so effective on visual data and why they eliminated the need for manual feature engineering.
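The sketch below stacks three convolution-and-pooling blocks to show the resulting shapes. PyTorch is assumed here purely for illustration, and the channel counts (16, 32, 64) are arbitrary: the point is that each pooling step halves the spatial dimensions (224 → 112 → 56 → 28) while the number of feature maps grows, giving deeper layers a richer vocabulary of patterns over a coarser grid.

```python
import torch
import torch.nn as nn

# Three conv + pool blocks: spatial size halves at each pool while the
# number of feature maps (channels) grows. Layer sizes are illustrative.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 224 -> 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 112 -> 56
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
)

x = torch.randn(1, 3, 224, 224)    # one 224x224 RGB image
print(backbone(x).shape)           # torch.Size([1, 64, 28, 28])
```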
“We find it interesting that the network has largely learned to separate color information from shape/texture information in the first layer.”
Krizhevsky, A., Sutskever, I. & Hinton, G.E., 'ImageNet Classification with Deep Convolutional Neural Networks' (2012) - Section 6.1, Qualitative Evaluation
The AlexNet paper observed that the network spontaneously discovered features resembling those that vision researchers had spent decades engineering by hand. The filters learned by the first layer closely resemble Gabor filters and colour blobs, the building blocks of biological vision.
With convolutional networks and their treatment of spatial structure in place, the discussion turns to recurrent neural networks, which process sequences.
CNNs excel at spatial data but have no notion of order. For data where sequence matters, such as text, speech, time series, or music, you need an architecture that processes inputs one step at a time and maintains a memory of what it has seen so far. This is the role of recurrent neural networks (RNNs).
At each time step, an RNN takes two inputs: the current element in the sequence and a hidden state that summarises everything the network has seen up to that point. It produces two outputs: a prediction for the current step and an updated hidden state that gets passed to the next step. This creates a chain of computation where information flows forward through the sequence.
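A minimal NumPy sketch of this recurrence (the dimensions and the random toy sequence are assumptions for illustration): the same weight matrices are reused at every step, and the hidden state h is the only thing carried forward.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 16, 5              # illustrative sizes
W_xh = 0.1 * rng.normal(size=(input_dim, hidden_dim))
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                               # hidden state starts empty
for x_t in rng.normal(size=(seq_len, input_dim)):      # toy sequence of 5 steps
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)              # h summarises everything seen so far
# A prediction head (e.g. a linear layer) would read from h at each step or at the end.
```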
In theory, an RNN can learn dependencies across arbitrarily long sequences. In practice, it cannot. During backpropagation through time, gradients are multiplied at each step. If the multiplicative factor is less than 1 (which it usually is), the gradient shrinks exponentially. After 20 or 30 steps, the gradient is effectively zero: the network cannot learn that a word at the beginning of a paragraph affects the meaning at the end. This is the vanishing gradient problem, and it was the primary limitation of vanilla RNNs for over a decade.
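A toy illustration of the decay (the per-step factor of 0.9 is an assumption; the real factor depends on the weights and activations): multiplying the gradient by a number slightly below 1 at every step drives it towards zero within a few dozen steps.

```python
# If backpropagation through time shrinks the gradient by ~0.9 per step,
# the learning signal from 30 steps back has all but vanished.
factor = 0.9
for steps in (5, 10, 20, 30):
    print(f"{steps:>2} steps back: gradient scaled by {factor ** steps:.3f}")
# 5 steps: 0.590, 10 steps: 0.349, 20 steps: 0.122, 30 steps: 0.042
```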
Common misconception
“RNNs can learn long-range dependencies because they have memory.”
Vanilla RNNs have memory in principle but not in practice. The vanishing gradient problem means that information from early time steps is exponentially diluted by the time it reaches later steps. An RNN processing a 200-word paragraph effectively forgets the first sentence by the time it reaches the last. This is why LSTMs and later transformers were necessary.
With recurrent networks and their limitations in place, the discussion turns to LSTMs, which learn what to remember and what to forget.
The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber in 1997, addresses the vanishing gradient problem with a mechanism that explicitly controls information flow. Instead of a single hidden state, an LSTM maintains a cell state: a highway that runs through the entire sequence with minimal modification.
Three gates control what happens to the cell state at each step:
The forget gate decides what to discard from the cell state.
The input gate decides what new information to write into it.
The output gate decides how much of the cell state to expose as the hidden state for the current step.
This gating mechanism allows gradients to flow through the cell state without the multiplicative decay that kills vanilla RNNs. LSTMs can learn dependencies across hundreds of time steps, which made them the dominant architecture for machine translation, speech recognition, and text generation from roughly 2014 to 2017, until transformers superseded them.
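A minimal sketch of a single LSTM step in NumPy (the combined weight matrix W, the dimensions, and the layout of the gate pre-activations are assumptions; real implementations vary): the forget and input gates decide what leaves and enters the cell state, the cell update is mostly additive, and the output gate decides what the rest of the network sees.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x_t, h_prev] to four gate pre-activations."""
    z = np.concatenate([x_t, h_prev]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input and output gates
    g = np.tanh(g)                                 # candidate values to write
    c = f * c_prev + i * g                         # cell state: a mostly additive highway
    h = o * np.tanh(c)                             # hidden state exposed to the next step
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16                      # illustrative sizes
W = 0.1 * rng.normal(size=(input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):        # toy 5-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
```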
“Unlike traditional recurrent neural networks, an LSTM network is well-suited to learning from experience to classify, process and predict time series when there are very long time lags of unknown size between important events.”
Hochreiter, S. & Schmidhuber, J., 'Long Short-Term Memory', Neural Computation (1997) - Abstract
The original LSTM paper identified the core problem (long time lags) and the core solution (constant error flow through the cell state). This architecture dominated sequence modelling for 20 years and remains widely used in time-series applications.
With LSTMs in place, the final topic is when to use which architecture.
Architecture selection is not a matter of personal preference. It is determined by the structure of your data:
Images and other grid-like spatial data → a CNN, which exploits local spatial correlations.
Text, speech, time series, and other ordered data → an RNN or LSTM, which processes elements in sequence.
Tabular data with well-defined features → often a gradient-boosted tree model rather than deep learning at all (see the misconception below).
The general principle: choose the architecture whose inductive biases match the structure of your data. CNNs assume local spatial correlations. RNNs assume temporal ordering. Using the wrong architecture forces the network to learn structure that the right architecture gives it for free.
Common misconception
“Deep learning is always better than traditional machine learning.”
On tabular data with well-defined features, gradient-boosted trees frequently match or outperform deep learning while training in seconds instead of hours. A 2022 benchmark study by Grinsztajn et al. found that tree-based models were superior on 45 tabular datasets. Deep learning's advantage is on unstructured data (images, text, audio) where manual feature engineering is impractical.
A CNN processes a 224×224 image through three convolutional layers with max pooling after each. The spatial dimensions reduce from 224 to 112 to 56 to 28. Why does the number of feature maps (channels) typically increase at each layer?
You are building a model to predict stock prices from historical daily closing prices (a time series). Which architecture is most appropriate as a starting point?
The vanishing gradient problem means that vanilla RNNs cannot learn dependencies between distant time steps. How does the LSTM address this?
Krizhevsky, A., Sutskever, I. & Hinton, G.E., 'ImageNet Classification with Deep Convolutional Neural Networks', Advances in Neural Information Processing Systems (2012)
Full paper
The AlexNet paper that triggered the deep learning revolution in computer vision. Demonstrated that deep CNNs trained on GPUs could dramatically outperform hand-crafted feature engineering. Used as the opening case study.
Hochreiter, S. & Schmidhuber, J., 'Long Short-Term Memory', Neural Computation (1997)
Sections 1-4
The original LSTM paper. Identified the vanishing gradient problem formally and introduced the cell state with gated information flow as the solution. This architecture dominated sequence modelling for two decades.
LeCun, Y., Bengio, Y. & Hinton, G., 'Deep learning', Nature (2015)
Full review article
Authoritative review by the three pioneers of deep learning. Provides historical context for CNNs and RNNs and explains why depth enables hierarchical feature learning. Essential reading for understanding the field.
Grinsztajn, L., Oyallon, E. & Varoquaux, G., 'Why do tree-based models still outperform deep learning on typical tabular data?', Advances in Neural Information Processing Systems (2022)
Full paper
Rigorous benchmark demonstrating that gradient-boosted trees outperform deep learning on 45 tabular datasets. Challenges the assumption that deep learning is universally superior and establishes when tree-based models remain the better choice.
Goodfellow, I., Bengio, Y. & Courville, A., Deep Learning (2016)
Chapters 9 (CNNs) and 10 (RNNs)
The standard deep learning textbook. Chapter 9 provides the mathematical foundation for convolution, pooling, and stride. Chapter 10 covers RNN unfolding, backpropagation through time, and the gating mechanisms in LSTMs and GRUs.
You now understand the two architecture families that dominated deep learning before 2017: CNNs for spatial data and RNNs/LSTMs for sequential data. But building powerful models raises a critical question: how do you ensure they are fair, explainable, and accountable? Module 7 introduces responsible AI: the frameworks, metrics, and practices that prevent AI systems from causing harm.