Understand how the self-attention mechanism, positional encoding, and multi-head attention combine to form the transformer architecture that replaced recurrent neural networks and now powers virtually every frontier AI system.
By the end of this module you will be able to:
- Explain why recurrent networks created a parallelisation and long-range information bottleneck
- Describe how self-attention, positional encoding, and multi-head attention fit together
- Distinguish encoder-only, decoder-only, and encoder-decoder transformer variants
- Explain the role of residual connections and layer normalisation in training deep stacks
- Identify the quadratic attention bottleneck and the main approaches that address it
Google Brain, June 2017
In June 2017, eight researchers at Google published 'Attention Is All You Need', a paper whose title read more like a thesis statement than a headline. The paper proposed replacing recurrence and convolutions entirely with a mechanism called self-attention, allowing the model to relate every token in a sequence to every other token in a single parallel step.
The results were immediate: the transformer outperformed the best existing models on English-to-German and English-to-French translation benchmarks while training in a fraction of the time. Recurrent neural networks processed tokens one at a time, creating an information bottleneck for long sequences. The transformer processed the entire sequence simultaneously, and its attention weights revealed which tokens the model considered relevant to each other.
Within two years, the transformer architecture spawned BERT (encoder-only, 2018), GPT-2 (decoder-only, 2019), and T5 (encoder-decoder, 2019). Every frontier language model, image generator, protein folder, and code assistant released since traces its lineage to that single paper.
Recurrent neural networks process sequences one token at a time, maintaining a hidden state that carries information forward. This sequential processing creates two fundamental limitations. First, training cannot be parallelised across the sequence dimension because each step depends on the previous hidden state. Second, information from early tokens must survive through every intermediate hidden state to influence later tokens, creating a bottleneck that degrades with sequence length.
Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) added gating mechanisms to selectively preserve or forget information, partially mitigating the vanishing gradient problem. But they could not solve the parallelisation bottleneck. Training an LSTM on a sequence of 1,000 tokens requires 1,000 sequential steps regardless of how many GPUs are available. The transformer processes all 1,000 tokens simultaneously.
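To make the contrast concrete, here is a minimal NumPy sketch (toy shapes, randomly initialised weights, nothing from a real model) of why RNN-style processing is inherently sequential while attention-style processing is a single batched operation:

```python
import numpy as np

seq_len, d = 1000, 64
x = np.random.randn(seq_len, d)              # toy input: one vector per token
W_h, W_x = np.random.randn(d, d), np.random.randn(d, d)

# RNN-style processing: each step depends on the previous hidden state,
# so the loop cannot be parallelised across the sequence dimension.
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)        # 1,000 strictly sequential steps

# Attention-style processing: one matrix product relates every token to
# every other token at once, so all positions are handled in parallel.
scores = x @ x.T                             # (1000, 1000) pairwise interactions
```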
With the limits of recurrence established, the next question is how self-attention relates every token to every other.
Self-attention computes a weighted sum of all tokens in a sequence for each output position. For a given token, the mechanism asks: "How relevant is every other token in the sequence to understanding this one?" The answer is computed through three learned linear projections:
- Query (Q): what this token is looking for in the rest of the sequence
- Key (K): what this token offers for other tokens to match against
- Value (V): the information that is passed along once the attention weights are assigned
The attention score between two tokens is the dot product of the query of the first and the key of the second, scaled by the square root of the key dimension (to prevent dot products from growing too large and pushing softmax into saturation). The scores across all keys are passed through softmax to produce a probability distribution, then used to weight the values. The formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
The whole computation reduces to a few dense matrix multiplications and a softmax, making it fully parallelisable across all positions in the sequence. An RNN would need to process the same relationships one step at a time.
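As an illustration, here is a minimal NumPy sketch of that computation; the sequence length, model dimension, and weight matrices are arbitrary placeholders rather than values from any real model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V for one sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Toy example: 6 tokens, model dimension 8, random projections.
n, d = 6, 8
X = np.random.randn(n, d)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (6, 8): one contextualised vector per input token
```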
Self-attention on its own, however, has no notion of word order, which is where positional encoding comes in.
Self-attention treats its input as a set, not a sequence. Without positional information, the sentence "the cat sat on the mat" and "mat the on sat cat the" would produce identical attention patterns. Positional encoding injects sequence order by adding a position-dependent vector to each token embedding before it enters the attention layers.
The original transformer used fixed sinusoidal encodings with different frequencies for each dimension: sine functions for even dimensions, cosine for odd dimensions. Each position gets a unique signature, and the model can learn to attend to relative positions because the encoding of position p+k can be expressed as a linear function of the encoding at position p. Modern variants like RoPE (Rotary Position Embedding) encode relative position directly into the attention computation, enabling better length generalisation.
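A short NumPy sketch of the fixed sinusoidal scheme described above (sequence length and model dimension are arbitrary here):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine on even dimensions, cosine on odd dimensions, with
    geometrically spaced frequencies, as in the original paper."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the token embeddings before the first attention layer.
embeddings = np.random.randn(16, 64)          # 16 tokens, model dimension 64
inputs = embeddings + sinusoidal_positional_encoding(16, 64)
```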
With positional information injected, the next refinement is multi-head attention.
A single attention head captures one type of relationship between tokens. The word "bank" in "river bank" and "bank account" requires different contextual cues. Multi-head attention runs several independent attention computations in parallel, each with its own Q, K, V projections, then concatenates and linearly projects the results.
If the model dimension is 512 and there are 8 heads, each head operates on a 64-dimensional subspace. One head might learn to attend to syntactic relationships (subject-verb agreement), another to semantic similarity, another to positional proximity. The concatenated output captures all these patterns simultaneously. The original transformer used 8 heads; GPT-3 uses 96 heads across a 12,288-dimensional model.
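The head-splitting arithmetic can be seen in a short NumPy sketch; the 512-dimensional model with 8 heads follows the numbers above, while the inputs and weight matrices are random placeholders:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split the model dimension into n_heads subspaces, attend in each
    independently, then concatenate and apply the output projection W_o."""
    n, d_model = X.shape
    d_head = d_model // n_heads                              # 512 / 8 = 64
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                # softmax per head
    heads = weights @ Vh                                     # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # re-concatenate
    return concat @ W_o

d_model, n_heads = 512, 8
X = np.random.randn(10, d_model)
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)   # shape (10, 512)
```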
These pieces come together in the original encoder-decoder architecture.
“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.”
Vaswani et al., 'Attention Is All You Need', 2017 - Abstract
The original transformer has two halves. The encoder processes the entire input sequence with bidirectional self-attention (every token attends to every other token). The decoder generates output tokens one at a time using masked self-attention (each token can only attend to previous tokens) plus cross-attention to the encoder output.
This encoder-decoder design suits sequence-to-sequence tasks like translation, where the full source sentence must be understood before generating the target. Three architectural variants emerged:
- Encoder-only (BERT): bidirectional attention over the whole input, suited to classification and other understanding tasks
- Decoder-only (GPT): causal attention for autoregressive generation, predicting one token at a time
- Encoder-decoder (T5): the original two-half design, suited to translation and other input-to-output transformations
The decoder-only variant dominates current frontier models (GPT-4, Claude, Gemini, Llama) because next-token prediction scales efficiently and generalises well when combined with instruction tuning and reinforcement learning from human feedback.
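Since both the original decoder and today's decoder-only models rely on masked self-attention, a minimal sketch of the causal mask may help; the sequence length is arbitrary and the scores are random stand-ins for QK^T / sqrt(d_k):

```python
import numpy as np

n = 5
scores = np.random.randn(n, n)                    # stand-in attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf                            # future tokens get zero weight

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular: row i spreads weight over tokens 0..i
```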
Two further ingredients make deep stacks of these layers trainable: residual connections and layer normalisation.
Common misconception
“Transformers understand language the way humans do.”
Transformers learn statistical patterns in token sequences. Self-attention computes weighted relevance scores between tokens based on learned projections, not semantic comprehension. A transformer can produce fluent text about quantum physics without any physical understanding. The attention mechanism is a mathematical operation on numerical vectors, not a reasoning process. This distinction matters when evaluating model outputs for factual reliability.
Each sub-layer in the transformer (self-attention and feed-forward) is wrapped with a residual connection and followed by layer normalisation. The residual connection adds the sub-layer input to its output, creating a shortcut path that allows gradients to flow directly through the network during backpropagation. This is critical for training deep networks: the original transformer has 6 encoder and 6 decoder layers; GPT-3 has 96 layers.
Layer normalisation stabilises training by normalising activations across the feature dimension for each token independently. Pre-norm variants (normalise before the sub-layer rather than after) have become standard in large models because they produce more stable training dynamics at scale.
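A pre-norm block can be sketched in a few lines of NumPy; the attention and feed-forward sub-layers are passed in as placeholder callables, and the learned scale and bias of real layer normalisation are omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's activations across the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attention_fn, feed_forward_fn):
    """Pre-norm wiring: normalise first, apply the sub-layer, then add the
    original input back (the residual connection)."""
    x = x + attention_fn(layer_norm(x))      # residual around self-attention
    x = x + feed_forward_fn(layer_norm(x))   # residual around feed-forward
    return x

# Placeholder sub-layers stand in for real attention and MLP weights.
x = np.random.randn(10, 64)
for _ in range(6):                           # the original transformer stacks 6 such layers
    x = pre_norm_block(x, attention_fn=lambda h: 0.5 * h,
                       feed_forward_fn=lambda h: np.maximum(h, 0.0))
```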
The remaining constraint is computational: self-attention's cost grows quadratically with sequence length.
Self-attention has O(n²) computational complexity with respect to sequence length because every token attends to every other token. For a sequence of 1,000 tokens, the attention matrix has 1,000,000 entries. For 100,000 tokens, it has 10 billion entries. This quadratic scaling is the primary constraint on context window size in current models.
Several approaches address this: sparse attention patterns (Longformer, BigBird) that restrict attention to local windows and selected global tokens; linear attention approximations (Performer) that use kernel methods to avoid computing the full attention matrix; and sliding window attention with sink tokens (Mistral) that maintains a fixed memory budget. Flash Attention optimises the memory access pattern of standard attention rather than changing the algorithm, achieving 2-4x speedups by reducing reads and writes to GPU high-bandwidth memory.
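To make the scaling concrete, the snippet below prints how the attention matrix grows with sequence length and builds a sliding-window mask of the kind described above; the window width and sizes are illustrative only:

```python
import numpy as np

# Full attention: n tokens produce an n-by-n score matrix.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {n * n:>15,} attention entries")

# Sliding-window attention caps each query at the last w keys, so cost
# grows as O(n * w) instead of O(n^2).
n, w = 12, 4
i = np.arange(n)[:, None]
j = np.arange(n)[None, :]
window_mask = (j <= i) & (j > i - w)   # causal and within the last w positions
print(window_mask.astype(int))
```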
In the self-attention formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, what is the purpose of dividing by the square root of d_k?
An encoder-only transformer like BERT uses bidirectional self-attention, while a decoder-only model like GPT uses causal (masked) attention. What practical consequence does this have for the tasks each architecture suits?
Standard self-attention has O(n^2) time and memory complexity with respect to sequence length n. A model needs to process a document of 100,000 tokens. What architectural approach addresses the quadratic bottleneck?
You now understand the architecture that underpins every frontier AI system. The transformer is a general-purpose sequence processor, but it was the application of massive scale that turned it into something qualitatively different. How do you train a transformer with 175 billion parameters, and what emergent capabilities appear at scale? Module 10 covers large language models.
Vaswani et al., 'Attention Is All You Need' (2017)
The foundational paper introducing the transformer architecture and self-attention mechanism.
Devlin et al., 'BERT: Pre-training of Deep Bidirectional Transformers' (2018)
Encoder-only transformer that established bidirectional pre-training for NLP understanding tasks.
Jay Alammar, 'The Illustrated Transformer' (2018)
Visual walkthrough of the transformer architecture that remains the most accessible technical explanation available.
Dao et al., 'FlashAttention: Fast and Memory-Efficient Exact Attention' (2022)
Hardware-aware attention algorithm that achieves 2-4x speedups without approximation.