Understand GPT architecture, byte-pair encoding tokenisation, context windows, Chinchilla scaling laws, emergent capabilities, and the instruction tuning pipeline that turns a base model into an assistant.
By the end of this module you will be able to:

- Describe the decoder-only GPT architecture and its next-token prediction objective.
- Explain how byte-pair encoding builds a sub-word vocabulary and how tokenisation affects the effective context window.
- Reason about context window limits and the quadratic cost of self-attention.
- Apply the Chinchilla scaling laws to judge whether a model's size and training data are balanced.
- Discuss emergent capabilities and the debate over whether they are real or measurement artefacts.
- Outline the instruction tuning pipeline (pre-training, SFT, RLHF) that turns a base model into an assistant.
OpenAI ChatGPT launch, 30 November 2022
On 30 November 2022, OpenAI released ChatGPT as a free research preview. Within five days it had one million users. By January 2023, it had over 100 million monthly active users, making it the fastest-growing consumer product in history. TikTok had taken nine months to reach the same milestone. Instagram took two and a half years.
ChatGPT was not a new model architecture. It was GPT-3.5, a decoder-only transformer fine-tuned with reinforcement learning from human feedback (RLHF). The base model had been trained on hundreds of billions of tokens of internet text. What changed was the interface: a simple chat box that made the model accessible to anyone who could type a question.
The launch forced every major technology company to accelerate its own LLM programme. Google declared a "code red" internally. Microsoft invested $10 billion in OpenAI. Meta open-sourced LLaMA. Within 18 months, the number of frontier-class language models went from one to more than a dozen.
GPT (Generative Pre-trained Transformer) uses the decoder half of the original transformer architecture. It processes text left-to-right with causal masking: each token can attend to itself and all previous tokens but not future tokens. The training objective is next-token prediction: given all preceding tokens, predict the probability distribution over the vocabulary for the next token.
A GPT model consists of an embedding layer (token embeddings plus positional embeddings), a stack of transformer blocks (each containing masked multi-head self-attention, a feed-forward network, layer normalisation, and residual connections), and a final linear layer that projects hidden states back to vocabulary size. GPT-3 has 96 transformer blocks, a model dimension of 12,288, 96 attention heads, and 175 billion parameters. The feed-forward network in each block has an inner dimension of 49,152 (4x the model dimension), which is where most of the parameters reside.
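To make the block structure concrete, here is a minimal pre-norm decoder block in PyTorch. It is an illustrative sketch rather than OpenAI's implementation (the `DecoderBlock` class and its exact layout are assumptions), but the ingredients match the description above: causal masking, multi-head self-attention, a 4x feed-forward network, layer normalisation, and residual connections.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm decoder block: masked self-attention + feed-forward network."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # inner dimension is 4x d_model, as in GPT-3
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.size(1)
        # Causal mask: True entries are blocked, so position i attends only to 0..i.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out               # residual connection
        x = x + self.ffn(self.ln2(x))  # residual connection
        return x
```

GPT-3 stacks 96 such blocks at a model dimension of 12,288 with 96 heads. The two FFN projections alone contribute roughly 2 × 12,288 × 49,152 ≈ 1.2 billion parameters per block, which is why most of the 175 billion parameters live in the feed-forward layers.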
With the decoder-only architecture established, the next question is how raw text becomes the tokens the model processes: byte-pair encoding.
Language models do not process raw characters or whole words. They operate on tokens: sub-word units derived by a compression algorithm called Byte-Pair Encoding (BPE). BPE starts with individual bytes and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size (typically 32,000 to 100,000 tokens).
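The merge loop can be written in a few lines. The sketch below works at character level for readability (real byte-level BPE starts from the 256 byte values), and `bpe_merges` is a hypothetical helper, not a production tokeniser.

```python
from collections import Counter

def bpe_merges(corpus: list[str], target_vocab: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    words = [list(w) for w in corpus]          # start from individual characters
    vocab = {c for w in words for c in w}
    merges = []
    while len(vocab) < target_vocab:
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        vocab.add(best[0] + best[1])           # the merged pair becomes a new token
        new_words = []
        for w in words:                        # apply the merge across the corpus
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

merges = bpe_merges(["low", "lower", "lowest", "newest"] * 25, target_vocab=12)
print(merges)  # e.g. ('l', 'o') first, then ('lo', 'w'): frequent pairs merge first
```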
Common words become single tokens: "the" is one token. Uncommon words are split into sub-word pieces: "tokenisation" might become "token" + "isation". This means the model never encounters an out-of-vocabulary word because any string can be decomposed into known sub-word tokens. The trade-off: rare words consume more tokens, reducing the effective context window.
Vocabulary size has practical implications. A larger vocabulary means more words are single tokens (better compression, longer effective context) but the embedding table grows. GPT-2 used 50,257 tokens; GPT-4 uses approximately 100,000. For non-English languages, models trained primarily on English text produce inefficient tokenisations: a Japanese sentence might require 3-5x more tokens than the English equivalent, effectively shrinking the context window for those languages.
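You can observe these effects directly with OpenAI's open-source tiktoken library. The exact splits depend on the vocabulary, so treat the comments below as indicative rather than guaranteed.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the ~100K-token vocabulary used by GPT-4

print(len(enc.encode("the")))           # common word: expect a single token
print(len(enc.encode("tokenisation")))  # rarer word: expect several sub-word pieces
# The same sentence usually costs more tokens outside English,
# shrinking the effective context window for those languages.
print(len(enc.encode("The cat sat on the mat.")))
print(len(enc.encode("猫がマットの上に座った。")))
```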
Tokenisation determines how much text fits into a single input, which leads directly to context windows and the cost of attention.
The context window is the maximum number of tokens a model can process in a single forward pass. GPT-3 had a 2,048-token context window. GPT-4 launched with 8,192 tokens and later offered a 128,000-token variant. Claude 3 offers 200,000 tokens. Gemini 1.5 Pro supports up to 1 million tokens.
Longer context windows enable the model to reference more information within a single conversation, but the computational cost of self-attention scales quadratically with sequence length. Doubling the context window quadruples the attention computation. Techniques like sliding window attention, sparse attention, and Flash Attention make longer contexts tractable but do not eliminate the fundamental scaling relationship.
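A back-of-the-envelope calculation shows the quadratic blow-up. The sketch counts only the multiply-adds in attention's two n × n matrix products (QKᵀ and scores·V) and ignores projections, heads, and softmax; `attention_flops` is an illustrative helper, not a profiler.

```python
def attention_flops(n_tokens: int, d_model: int) -> int:
    """Rough cost of one attention layer's score computation:
    QK^T and scores.V each take ~n^2 * d multiply-adds."""
    return 2 * n_tokens ** 2 * d_model

d = 12_288  # GPT-3's model dimension
for n in (2_048, 4_096, 8_192):
    print(f"{n:>5} tokens: {attention_flops(n, d):.3e} multiply-adds")
# Each doubling of the context quadruples the attention cost.
```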
A common misconception is that context window size equals memory. Information at the beginning of a long context receives less attention weight on average than information near the query (the "lost in the middle" phenomenon). Retrieval-Augmented Generation, covered in Module 11, addresses this by placing only the most relevant information into the context.
Context length is one axis of scale. How model size, dataset size, and compute should be traded off against one another is the subject of scaling laws and the Chinchilla result.
“We find that the performance of language models scales as a power law with model size, dataset size, and the amount of compute used for training.”
Kaplan et al., 'Scaling Laws for Neural Language Models', 2020
The Kaplan scaling laws (2020) showed that language model loss decreases predictably as a power law of model size, dataset size, and compute. This meant performance improvements could be planned in advance by allocating more compute. However, Kaplan's analysis suggested allocating most compute to larger models rather than more data.
The Chinchilla paper (Hoffmann et al., 2022) corrected this. DeepMind trained over 400 language models ranging from 70 million to 16 billion parameters on varying amounts of data and found that models and datasets should scale in proportion: for every doubling of model parameters, training tokens should also double. By this rule, GPT-3's 175 billion parameters should have been trained on approximately 3.5 trillion tokens rather than the 300 billion used.
Chinchilla (70B parameters, 1.4 trillion tokens) outperformed the much larger Gopher (280B parameters, 300 billion tokens) on most benchmarks despite using the same compute budget. The practical consequence: many early large models were undertrained. Meta's LLaMA (2023) applied Chinchilla scaling and trained a 65B-parameter model on 1.4 trillion tokens, achieving competitive performance with GPT-3 at less than half the parameter count.
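The rule of thumb commonly read off the Chinchilla results is roughly 20 training tokens per parameter. A quick sketch reproduces the numbers above; the ratio is an approximation, and `chinchilla_optimal_tokens` is a hypothetical helper.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla heuristic: at the compute optimum, training tokens
    scale roughly linearly with parameters, at ~20 tokens per parameter."""
    return tokens_per_param * n_params

print(f"{chinchilla_optimal_tokens(175e9):.1e}")  # ~3.5e+12: GPT-3 'should' have seen ~3.5T tokens
print(f"{chinchilla_optimal_tokens(70e9):.1e}")   # ~1.4e+12: matches Chinchilla's actual 1.4T tokens
```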
Scaling laws predict how loss falls with scale, but not which capabilities appear along the way. That is the question of emergent capabilities.
Common misconception
“Bigger models are always better. The most parameters wins.”
Chinchilla scaling laws show that a smaller model trained on proportionally more data outperforms a larger model trained on less data, given the same compute budget. GPT-3 (175B params, 300B tokens) was likely undertrained by Chinchilla standards. LLaMA-65B (65B params, 1.4T tokens) matched GPT-3's performance. The optimal strategy balances model size and training data quantity for a given compute budget.
Emergence refers to capabilities that appear at scale without being explicitly trained. A model trained only to predict the next token develops abilities in arithmetic, translation, reasoning, and code generation that are absent in smaller versions trained the same way. These capabilities appear to emerge discontinuously: below a certain scale, performance on a task is near-random; above it, performance jumps sharply.
Examples include: few-shot learning (GPT-3 can perform tasks given only a few examples in the prompt, without any gradient updates), chain-of-thought reasoning (sufficiently large models produce more accurate answers when asked to show their working), and multilingual transfer (models trained predominantly on English text can answer questions in languages they saw relatively little of during training).
The existence and nature of emergence is debated. Schaeffer et al. (2023) argued that apparent emergence is partly an artefact of evaluation metrics: tasks scored with exact-match metrics appear to show sharp transitions, while the same tasks scored with continuous metrics show smooth improvement. The practical implication: do not assume a capability is absent just because a model of a given size fails at it; the capability may appear at the next scale increment.
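Schaeffer et al.'s point can be illustrated with a toy calculation (a sketch of the argument, not their actual analysis): if per-token accuracy improves smoothly with scale, exact-match accuracy on a k-token answer is that accuracy raised to the k-th power, which looks like a sudden jump.

```python
# Smooth per-token accuracy vs. sharp exact-match accuracy on a k-token answer.
k = 10  # answer length in tokens
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):  # per-token accuracy, rising smoothly
    print(f"per-token accuracy {p:.2f} -> exact match {p**k:.4f}")
# 0.5 -> 0.0010, 0.8 -> 0.1074, 0.95 -> 0.5987: near-zero for a long stretch,
# then a steep climb, despite perfectly smooth underlying progress.
```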
Emergence concerns what a base model can do on its own. The final topic is how that base model becomes an assistant: the instruction tuning pipeline of pre-training, SFT, and RLHF.
A base language model trained with next-token prediction is a text completer, not an assistant. It will continue any text pattern it is given, including harmful, inaccurate, or off-topic continuations. The instruction tuning pipeline transforms a base model into a useful assistant through three stages:

1. Pre-training: next-token prediction over hundreds of billions to trillions of tokens of text produces a base model with broad knowledge of language and the world.
2. Supervised fine-tuning (SFT): the base model is fine-tuned on curated instruction-response pairs, teaching it to follow instructions rather than merely continue text.
3. Reinforcement learning from human feedback (RLHF): human raters compare pairs of model responses; a reward model is trained on these comparisons, and the language model is then optimised against that reward model (typically with PPO).
Alternatives to RLHF include DPO (Direct Preference Optimisation), which skips the separate reward model and optimises the language model directly on preference data, and RLAIF (RL from AI Feedback), where a stronger model provides the preference signal instead of human raters.
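At the heart of RLHF is the reward model's pairwise preference loss. Here is a minimal sketch in PyTorch, assuming the reward model emits a single scalar per response; the `reward_model_loss` helper is illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the reward of the response the
    human preferred above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scalar rewards for a batch of three comparison pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, 1.5])
print(reward_model_loss(chosen, rejected))  # lower when chosen responses score higher
```

DPO folds this same pairwise objective directly into the policy's log-probabilities, which is how it avoids training a separate reward model.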
The Chinchilla scaling laws found that GPT-3 (175B parameters, 300B tokens) was likely undertrained. According to Chinchilla-optimal scaling, approximately how many training tokens should a 175B parameter model use?
During RLHF, human raters compare pairs of model responses. What is the role of the reward model trained on these comparisons?
GPT-4 has a context window of 128K tokens. A developer sends a prompt with 120K tokens of context and asks a question. The response quality is poor despite relevant information being in the context. What is the most likely explanation?
You now understand how large language models are built, scaled, and aligned. But a model's usefulness depends on what you put in the prompt. How do you structure prompts for reliable outputs, and how does RAG ground model responses in real data? Module 11 covers prompt engineering and retrieval-augmented generation.
Brown et al., 'Language Models are Few-Shot Learners' (GPT-3, 2020)
The GPT-3 paper demonstrating that scale enables few-shot learning without fine-tuning.
Hoffmann et al., 'Training Compute-Optimal Large Language Models' (Chinchilla, 2022)
DeepMind's scaling law analysis showing optimal data-to-parameter ratios.
Ouyang et al., 'Training Language Models to Follow Instructions with Human Feedback' (InstructGPT, 2022)
The RLHF methodology paper that underpins ChatGPT and modern instruction-tuned models.
Sennrich et al., 'Neural Machine Translation of Rare Words with Subword Units' (BPE, 2016)
The original BPE tokenisation paper that became the standard approach for language model vocabularies.