Understand how the self-attention mechanism, positional encoding, and multi-head attention combine to form the transformer architecture that replaced recurrent neural networks and now powers virtually every frontier AI system.
By the end of this module you will be able to:
- Explain why recurrent networks created a parallelisation and long-range information bottleneck
- Describe how self-attention, positional encoding, and multi-head attention fit together
- Distinguish encoder-only, decoder-only, and encoder-decoder transformer variants
- Explain the role of residual connections and layer normalisation in training deep stacks
- Identify the quadratic attention bottleneck and the main approaches that address it
Google Brain, June 2017
In June 2017, eight researchers at Google published 'Attention Is All You Need', a paper whose title read more like a thesis statement than a headline. The paper proposed replacing recurrence and convolutions entirely with a mechanism called self-attention, allowing the model to relate every token in a sequence to every other token in a single parallel step.
The results were immediate: the transformer outperformed the best existing models on English-to-German and English-to-French translation benchmarks while training in a fraction of the time. Recurrent neural networks processed tokens one at a time, creating an information bottleneck for long sequences. The transformer processed the entire sequence simultaneously, and its attention weights revealed which tokens the model considered relevant to each other.
Within two years, the transformer architecture spawned BERT (encoder-only, 2018), GPT-2 (decoder-only, 2019), and T5 (encoder-decoder, 2019). Every frontier language model, image generator, protein folder, and code assistant released since traces its lineage to that single paper.
Recurrent neural networks process sequences one token at a time, maintaining a hidden state that carries information forward. This sequential processing creates two fundamental limitations. First, training cannot be parallelised across the sequence dimension because each step depends on the previous hidden state. Second, information from early tokens must survive through every intermediate hidden state to influence later tokens, creating a bottleneck that degrades with sequence length.
Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) added gating mechanisms to selectively preserve or forget information, partially mitigating the vanishing gradient problem. But they could not solve the parallelisation bottleneck. Training an LSTM on a sequence of 1,000 tokens requires 1,000 sequential steps regardless of how many GPUs are available. The transformer processes all 1,000 tokens simultaneously.
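To make the contrast concrete, here is a minimal NumPy sketch (toy shapes, randomly initialised weights, nothing from a real model) of why RNN-style processing is inherently sequential while attention-style processing is a single batched operation:

```python
import numpy as np

seq_len, d = 1000, 64
x = np.random.randn(seq_len, d)              # toy input: one vector per token
W_h, W_x = np.random.randn(d, d), np.random.randn(d, d)

# RNN-style processing: each step depends on the previous hidden state,
# so the loop cannot be parallelised across the sequence dimension.
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)        # 1,000 strictly sequential steps

# Attention-style processing: one matrix product relates every token to
# every other token at once, so all positions are handled in parallel.
scores = x @ x.T                             # (1000, 1000) pairwise interactions
```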
With the limits of recurrence established, the next question is how self-attention relates every token to every other.
Self-attention computes a weighted sum of all tokens in a sequence for each output position. For a given token, the mechanism asks: "How relevant is every other token in the sequence to understanding this one?" The answer is computed through three learned linear projections:
- Query (Q): what this token is looking for in the rest of the sequence
- Key (K): what this token offers for other tokens to match against
- Value (V): the information that is passed along once the attention weights are assigned
The attention score between two tokens is the dot product of the query of the first and the key of the second, scaled by the square root of the key dimension (to prevent dot products from growing too large and pushing softmax into saturation). The scores across all keys are passed through softmax to produce a probability distribution, then used to weight the values. The formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
The whole computation reduces to a few dense matrix multiplications and a softmax, making it fully parallelisable across all positions in the sequence. An RNN would need to process the same relationships one step at a time.
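As an illustration, here is a minimal NumPy sketch of that computation; the sequence length, model dimension, and weight matrices are arbitrary placeholders rather than values from any real model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V for one sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Toy example: 6 tokens, model dimension 8, random projections.
n, d = 6, 8
X = np.random.randn(n, d)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (6, 8): one contextualised vector per input token
```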
Self-attention on its own, however, has no notion of word order, which is where positional encoding comes in.
Self-attention treats its input as a set, not a sequence. Without positional information, the sentence "the cat sat on the mat" and "mat the on sat cat the" would produce identical attention patterns. Positional encoding injects sequence order by adding a position-dependent vector to each token embedding before it enters the attention layers.
The original transformer used fixed sinusoidal encodings with different frequencies for each dimension: sine functions for even dimensions, cosine for odd dimensions. Each position gets a unique signature, and the model can learn to attend to relative positions because the encoding of position p+k can be expressed as a linear function of the encoding at position p. Modern variants like RoPE (Rotary Position Embedding) encode relative position directly into the attention computation, enabling better length generalisation.
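A short NumPy sketch of the fixed sinusoidal scheme described above (sequence length and model dimension are arbitrary here):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine on even dimensions, cosine on odd dimensions, with
    geometrically spaced frequencies, as in the original paper."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the token embeddings before the first attention layer.
embeddings = np.random.randn(16, 64)          # 16 tokens, model dimension 64
inputs = embeddings + sinusoidal_positional_encoding(16, 64)
```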
With positional information injected, the next refinement is multi-head attention.
A single attention head captures one type of relationship between tokens. The word "bank" in "river bank" and "bank account" requires different contextual cues. Multi-head attention runs several independent attention computations in parallel, each with its own Q, K, V projections, then concatenates and linearly projects the results.
If the model dimension is 512 and there are 8 heads, each head operates on a 64-dimensional subspace. One head might learn to attend to syntactic relationships (subject-verb agreement), another to semantic similarity, another to positional proximity. The concatenated output captures all these patterns simultaneously. The original transformer used 8 heads; GPT-3 uses 96 heads across a 12,288-dimensional model.
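The head-splitting arithmetic can be seen in a short NumPy sketch; the 512-dimensional model with 8 heads follows the numbers above, while the inputs and weight matrices are random placeholders:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split the model dimension into n_heads subspaces, attend in each
    independently, then concatenate and apply the output projection W_o."""
    n, d_model = X.shape
    d_head = d_model // n_heads                              # 512 / 8 = 64
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                # softmax per head
    heads = weights @ Vh                                     # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # re-concatenate
    return concat @ W_o

d_model, n_heads = 512, 8
X = np.random.randn(10, d_model)
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)   # shape (10, 512)
```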
These pieces come together in the original encoder-decoder architecture.
“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.”
Vaswani et al., 'Attention Is All You Need', 2017 - Abstract
The original transformer has two halves. The encoder processes the entire input sequence with bidirectional self-attention (every token attends to every other token). The decoder generates output tokens one at a time using masked self-attention (each token can only attend to previous tokens) plus cross-attention to the encoder output.
This encoder-decoder design suits sequence-to-sequence tasks like translation, where the full source sentence must be understood before generating the target. Three architectural variants emerged:
- Encoder-only (BERT): bidirectional attention over the whole input, suited to classification and other understanding tasks
- Decoder-only (GPT): causal attention for autoregressive generation, predicting one token at a time
- Encoder-decoder (T5): the original two-half design, suited to translation and other input-to-output transformations
The decoder-only variant dominates current frontier models (GPT-4, Claude, Gemini, Llama) because next-token prediction scales efficiently and generalises well when combined with instruction tuning and reinforcement learning from human feedback.
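Since both the original decoder and today's decoder-only models rely on masked self-attention, a minimal sketch of the causal mask may help; the sequence length is arbitrary and the scores are random stand-ins for QK^T / sqrt(d_k):

```python
import numpy as np

n = 5
scores = np.random.randn(n, n)                    # stand-in attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf                            # future tokens get zero weight

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular: row i spreads weight over tokens 0..i
```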
Two further ingredients make deep stacks of these layers trainable: residual connections and layer normalisation.
Common misconception
“Transformers understand language the way humans do.”
Transformers learn statistical patterns in token sequences. Self-attention computes weighted relevance scores between tokens based on learned projections, not semantic comprehension. A transformer can produce fluent text about quantum physics without any physical understanding. The attention mechanism is a mathematical operation on numerical vectors, not a reasoning process. This distinction matters when evaluating model outputs for factual reliability.
Each sub-layer in the transformer (self-attention and feed-forward) is wrapped with a residual connection and followed by layer normalisation. The residual connection adds the sub-layer input to its output, creating a shortcut path that allows gradients to flow directly through the network during backpropagation. This is critical for training deep networks: the original transformer has 6 encoder and 6 decoder layers; GPT-3 has 96 layers.
Layer normalisation stabilises training by normalising activations across the feature dimension for each token independently. Pre-norm variants (normalise before the sub-layer rather than after) have become standard in large models because they produce more stable training dynamics at scale.
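A pre-norm block can be sketched in a few lines of NumPy; the attention and feed-forward sub-layers are passed in as placeholder callables, and the learned scale and bias of real layer normalisation are omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's activations across the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attention_fn, feed_forward_fn):
    """Pre-norm wiring: normalise first, apply the sub-layer, then add the
    original input back (the residual connection)."""
    x = x + attention_fn(layer_norm(x))      # residual around self-attention
    x = x + feed_forward_fn(layer_norm(x))   # residual around feed-forward
    return x

# Placeholder sub-layers stand in for real attention and MLP weights.
x = np.random.randn(10, 64)
for _ in range(6):                           # the original transformer stacks 6 such layers
    x = pre_norm_block(x, attention_fn=lambda h: 0.5 * h,
                       feed_forward_fn=lambda h: np.maximum(h, 0.0))
```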
The remaining constraint is computational: self-attention's cost grows quadratically with sequence length.
Self-attention has O(n²) computational complexity with respect to sequence length because every token attends to every other token. For a sequence of 1,000 tokens, the attention matrix has 1,000,000 entries. For 100,000 tokens, it has 10 billion entries. This quadratic scaling is the primary constraint on context window size in current models.
Several approaches address this: sparse attention patterns (Longformer, BigBird) that restrict attention to local windows and selected global tokens; linear attention approximations (Performer) that use kernel methods to avoid computing the full attention matrix; and sliding window attention with sink tokens (Mistral) that maintains a fixed memory budget. Flash Attention optimises the memory access pattern of standard attention rather than changing the algorithm, achieving 2-4x speedups by reducing reads and writes to GPU high-bandwidth memory.
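To make the scaling concrete, the snippet below prints how the attention matrix grows with sequence length and builds a sliding-window mask of the kind described above; the window width and sizes are illustrative only:

```python
import numpy as np

# Full attention: n tokens produce an n-by-n score matrix.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {n * n:>15,} attention entries")

# Sliding-window attention caps each query at the last w keys, so cost
# grows as O(n * w) instead of O(n^2).
n, w = 12, 4
i = np.arange(n)[:, None]
j = np.arange(n)[None, :]
window_mask = (j <= i) & (j > i - w)   # causal and within the last w positions
print(window_mask.astype(int))
```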
In the self-attention formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, what is the purpose of dividing by the square root of d_k?
An encoder-only transformer like BERT uses bidirectional self-attention, while a decoder-only model like GPT uses causal (masked) attention. What practical consequence does this have for the tasks each architecture suits?
Standard self-attention has O(n^2) time and memory complexity with respect to sequence length n. A model needs to process a document of 100,000 tokens. What architectural approach addresses the quadratic bottleneck?
You now understand the architecture that underpins every frontier AI system. The transformer is a general-purpose sequence processor, but it was the application of massive scale that turned it into something qualitatively different. How do you train a transformer with 175 billion parameters, and what emergent capabilities appear at scale? Module 10 covers large language models.
Vaswani et al., 'Attention Is All You Need' (2017)
The foundational paper introducing the transformer architecture and self-attention mechanism.
Devlin et al., 'BERT: Pre-training of Deep Bidirectional Transformers' (2018)
Encoder-only transformer that established bidirectional pre-training for NLP understanding tasks.
Jay Alammar, 'The Illustrated Transformer' (2018)
Visual walkthrough of the transformer architecture that remains the most accessible technical explanation available.
Dao et al., 'FlashAttention: Fast and Memory-Efficient Exact Attention' (2022)
Hardware-aware attention algorithm that achieves 2-4x speedups without approximation.