Understand GPT architecture, byte-pair encoding tokenisation, context windows, Chinchilla scaling laws, emergent capabilities, and the instruction tuning pipeline that turns a base model into an assistant.
By the end of this module you will be able to:

- Describe the decoder-only GPT architecture and its next-token prediction objective.
- Explain how byte-pair encoding builds a sub-word vocabulary and how tokenisation affects the effective context window.
- Reason about context window limits and the quadratic cost of self-attention.
- Apply the Chinchilla scaling laws to judge whether a model's size and training data are balanced.
- Discuss emergent capabilities and the debate over whether they are real or measurement artefacts.
- Outline the instruction tuning pipeline (pre-training, SFT, RLHF) that turns a base model into an assistant.
OpenAI ChatGPT launch, 30 November 2022
On 30 November 2022, OpenAI released ChatGPT as a free research preview. Within five days it had one million users. By January 2023, it had over 100 million monthly active users, making it the fastest-growing consumer product in history. TikTok had taken nine months to reach the same milestone. Instagram took two and a half years.
ChatGPT was not a new model architecture. It was GPT-3.5, a decoder-only transformer fine-tuned with reinforcement learning from human feedback (RLHF). The base model had been trained on hundreds of billions of tokens of internet text. What changed was the interface: a simple chat box that made the model accessible to anyone who could type a question.
The launch forced every major technology company to accelerate its own LLM programme. Google declared a "code red" internally. Microsoft invested $10 billion in OpenAI. Meta open-sourced LLaMA. Within 18 months, the number of frontier-class language models went from one to more than a dozen.
GPT (Generative Pre-trained Transformer) uses the decoder half of the original transformer architecture. It processes text left-to-right with causal masking: each token can attend to itself and all previous tokens but not future tokens. The training objective is next-token prediction: given all preceding tokens, predict the probability distribution over the vocabulary for the next token.
A GPT model consists of an embedding layer (token embeddings plus positional embeddings), a stack of transformer blocks (each containing masked multi-head self-attention, a feed-forward network, layer normalisation, and residual connections), and a final linear layer that projects hidden states back to vocabulary size. GPT-3 has 96 transformer blocks, a model dimension of 12,288, 96 attention heads, and 175 billion parameters. The feed-forward network in each block has an inner dimension of 49,152 (4x the model dimension), which is where most of the parameters reside.
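To make the block structure concrete, here is a minimal pre-norm decoder block in PyTorch. It is an illustrative sketch rather than OpenAI's implementation (the `DecoderBlock` class and its exact layout are assumptions), but the ingredients match the description above: causal masking, multi-head self-attention, a 4x feed-forward network, layer normalisation, and residual connections.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm decoder block: masked self-attention + feed-forward network."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # inner dimension is 4x d_model, as in GPT-3
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.size(1)
        # Causal mask: True entries are blocked, so position i attends only to 0..i.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out               # residual connection
        x = x + self.ffn(self.ln2(x))  # residual connection
        return x
```

GPT-3 stacks 96 such blocks at a model dimension of 12,288 with 96 heads. The two FFN projections alone contribute roughly 2 × 12,288 × 49,152 ≈ 1.2 billion parameters per block, which is why most of the 175 billion parameters live in the feed-forward layers.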
With the decoder-only architecture established, the next question is how raw text becomes the tokens the model processes: byte-pair encoding.
Language models do not process raw characters or whole words. They operate on tokens: sub-word units derived by a compression algorithm called Byte-Pair Encoding (BPE). BPE starts with individual bytes and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size (typically 32,000 to 100,000 tokens).
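The merge loop can be written in a few lines. The sketch below works at character level for readability (real byte-level BPE starts from the 256 byte values), and `bpe_merges` is a hypothetical helper, not a production tokeniser.

```python
from collections import Counter

def bpe_merges(corpus: list[str], target_vocab: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    words = [list(w) for w in corpus]          # start from individual characters
    vocab = {c for w in words for c in w}
    merges = []
    while len(vocab) < target_vocab:
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        vocab.add(best[0] + best[1])           # the merged pair becomes a new token
        new_words = []
        for w in words:                        # apply the merge across the corpus
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

merges = bpe_merges(["low", "lower", "lowest", "newest"] * 25, target_vocab=12)
print(merges)  # e.g. ('l', 'o') first, then ('lo', 'w'): frequent pairs merge first
```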
Common words become single tokens: "the" is one token. Uncommon words are split into sub-word pieces: "tokenisation" might become "token" + "isation". This means the model never encounters an out-of-vocabulary word because any string can be decomposed into known sub-word tokens. The trade-off: rare words consume more tokens, reducing the effective context window.
Vocabulary size has practical implications. A larger vocabulary means more words are single tokens (better compression, longer effective context) but the embedding table grows. GPT-2 used 50,257 tokens; GPT-4 uses approximately 100,000. For non-English languages, models trained primarily on English text produce inefficient tokenisations: a Japanese sentence might require 3-5x more tokens than the English equivalent, effectively shrinking the context window for those languages.
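You can observe these effects directly with OpenAI's open-source tiktoken library. The exact splits depend on the vocabulary, so treat the comments below as indicative rather than guaranteed.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the ~100K-token vocabulary used by GPT-4

print(len(enc.encode("the")))           # common word: expect a single token
print(len(enc.encode("tokenisation")))  # rarer word: expect several sub-word pieces
# The same sentence usually costs more tokens outside English,
# shrinking the effective context window for those languages.
print(len(enc.encode("The cat sat on the mat.")))
print(len(enc.encode("猫がマットの上に座った。")))
```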
Tokenisation determines how much text fits into a single input, which leads directly to context windows and the cost of attention.
The context window is the maximum number of tokens a model can process in a single forward pass. GPT-3 had a 2,048-token context window. GPT-4 launched with 8,192 tokens and later offered a 128,000-token variant. Claude 3 offers 200,000 tokens. Gemini 1.5 Pro supports up to 1 million tokens.
Longer context windows enable the model to reference more information within a single conversation, but the computational cost of self-attention scales quadratically with sequence length. Doubling the context window quadruples the attention computation. Techniques like sliding window attention, sparse attention, and Flash Attention make longer contexts tractable but do not eliminate the fundamental scaling relationship.
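A back-of-the-envelope calculation shows the quadratic blow-up. The sketch counts only the multiply-adds in attention's two n × n matrix products (QKᵀ and scores·V) and ignores projections, heads, and softmax; `attention_flops` is an illustrative helper, not a profiler.

```python
def attention_flops(n_tokens: int, d_model: int) -> int:
    """Rough cost of one attention layer's score computation:
    QK^T and scores.V each take ~n^2 * d multiply-adds."""
    return 2 * n_tokens ** 2 * d_model

d = 12_288  # GPT-3's model dimension
for n in (2_048, 4_096, 8_192):
    print(f"{n:>5} tokens: {attention_flops(n, d):.3e} multiply-adds")
# Each doubling of the context quadruples the attention cost.
```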
A common misconception is that context window size equals memory. Information at the beginning of a long context receives less attention weight on average than information near the query (the "lost in the middle" phenomenon). Retrieval-Augmented Generation, covered in Module 11, addresses this by placing only the most relevant information into the context.
Context length is one axis of scale. How model size, dataset size, and compute should be traded off against one another is the subject of scaling laws and the Chinchilla result.
“We find that the performance of language models scales as a power law with model size, dataset size, and the amount of compute used for training.”
Kaplan et al., 'Scaling Laws for Neural Language Models', 2020
The Kaplan scaling laws (2020) showed that language model loss decreases predictably as a power law of model size, dataset size, and compute. This meant performance improvements could be planned in advance by allocating more compute. However, Kaplan's analysis suggested allocating most compute to larger models rather than more data.
The Chinchilla paper (Hoffmann et al., 2022) corrected this. DeepMind trained over 400 language models ranging from 70 million to 16 billion parameters on varying amounts of data and found that models and datasets should scale in proportion: for every doubling of model parameters, training tokens should also double. By this rule, GPT-3's 175 billion parameters should have been trained on approximately 3.5 trillion tokens rather than the 300 billion used.
Chinchilla (70B parameters, 1.4 trillion tokens) outperformed the much larger Gopher (280B parameters, 300 billion tokens) on most benchmarks despite using the same compute budget. The practical consequence: many early large models were undertrained. Meta's LLaMA (2023) applied Chinchilla scaling and trained a 65B-parameter model on 1.4 trillion tokens, achieving competitive performance with GPT-3 at less than half the parameter count.
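The rule of thumb commonly read off the Chinchilla results is roughly 20 training tokens per parameter. A quick sketch reproduces the numbers above; the ratio is an approximation, and `chinchilla_optimal_tokens` is a hypothetical helper.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla heuristic: at the compute optimum, training tokens
    scale roughly linearly with parameters, at ~20 tokens per parameter."""
    return tokens_per_param * n_params

print(f"{chinchilla_optimal_tokens(175e9):.1e}")  # ~3.5e+12: GPT-3 'should' have seen ~3.5T tokens
print(f"{chinchilla_optimal_tokens(70e9):.1e}")   # ~1.4e+12: matches Chinchilla's actual 1.4T tokens
```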
Scaling laws predict how loss falls with scale, but not which capabilities appear along the way. That is the question of emergent capabilities.
Common misconception
“Bigger models are always better. The most parameters wins.”
Chinchilla scaling laws show that a smaller model trained on proportionally more data outperforms a larger model trained on less data, given the same compute budget. GPT-3 (175B params, 300B tokens) was likely undertrained by Chinchilla standards. LLaMA-65B (65B params, 1.4T tokens) matched GPT-3's performance. The optimal strategy balances model size and training data quantity for a given compute budget.
Emergence refers to capabilities that appear at scale without being explicitly trained. A model trained only to predict the next token develops abilities in arithmetic, translation, reasoning, and code generation that are absent in smaller versions trained the same way. These capabilities appear to emerge discontinuously: below a certain scale, performance on a task is near-random; above it, performance jumps sharply.
Examples include: few-shot learning (GPT-3 can perform tasks given only a few examples in the prompt, without any gradient updates), chain-of-thought reasoning (sufficiently large models produce more accurate answers when asked to show their working), and multilingual transfer (models trained predominantly on English text can answer questions in languages they saw relatively little of during training).
The existence and nature of emergence is debated. Schaeffer et al. (2023) argued that apparent emergence is partly an artefact of evaluation metrics: tasks scored with exact-match metrics appear to show sharp transitions, while the same tasks scored with continuous metrics show smooth improvement. The practical implication: do not assume a capability is absent just because a model of a given size fails at it; the capability may appear at the next scale increment.
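Schaeffer et al.'s point can be illustrated with a toy calculation (a sketch of the argument, not their actual analysis): if per-token accuracy improves smoothly with scale, exact-match accuracy on a k-token answer is that accuracy raised to the k-th power, which looks like a sudden jump.

```python
# Smooth per-token accuracy vs. sharp exact-match accuracy on a k-token answer.
k = 10  # answer length in tokens
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):  # per-token accuracy, rising smoothly
    print(f"per-token accuracy {p:.2f} -> exact match {p**k:.4f}")
# 0.5 -> 0.0010, 0.8 -> 0.1074, 0.95 -> 0.5987: near-zero for a long stretch,
# then a steep climb, despite perfectly smooth underlying progress.
```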
Emergence concerns what a base model can do on its own. The final topic is how that base model becomes an assistant: the instruction tuning pipeline of pre-training, SFT, and RLHF.
A base language model trained with next-token prediction is a text completer, not an assistant. It will continue any text pattern it is given, including harmful, inaccurate, or off-topic continuations. The instruction tuning pipeline transforms a base model into a useful assistant through three stages:

1. Pre-training: next-token prediction over hundreds of billions to trillions of tokens of text produces a base model with broad knowledge of language and the world.
2. Supervised fine-tuning (SFT): the base model is fine-tuned on curated instruction-response pairs, teaching it to follow instructions rather than merely continue text.
3. Reinforcement learning from human feedback (RLHF): human raters compare pairs of model responses; a reward model is trained on these comparisons, and the language model is then optimised against that reward model (typically with PPO).
Alternatives to RLHF include DPO (Direct Preference Optimisation), which skips the separate reward model and optimises the language model directly on preference data, and RLAIF (RL from AI Feedback), where a stronger model provides the preference signal instead of human raters.
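At the heart of RLHF is the reward model's pairwise preference loss. Here is a minimal sketch in PyTorch, assuming the reward model emits a single scalar per response; the `reward_model_loss` helper is illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the reward of the response the
    human preferred above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scalar rewards for a batch of three comparison pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, 1.5])
print(reward_model_loss(chosen, rejected))  # lower when chosen responses score higher
```

DPO folds this same pairwise objective directly into the policy's log-probabilities, which is how it avoids training a separate reward model.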
The Chinchilla scaling laws found that GPT-3 (175B parameters, 300B tokens) was likely undertrained. According to Chinchilla-optimal scaling, approximately how many training tokens should a 175B parameter model use?
During RLHF, human raters compare pairs of model responses. What is the role of the reward model trained on these comparisons?
GPT-4 has a context window of 128K tokens. A developer sends a prompt with 120K tokens of context and asks a question. The response quality is poor despite relevant information being in the context. What is the most likely explanation?
You now understand how large language models are built, scaled, and aligned. But a model's usefulness depends on what you put in the prompt. How do you structure prompts for reliable outputs, and how does RAG ground model responses in real data? Module 11 covers prompt engineering and retrieval-augmented generation.
Brown et al., 'Language Models are Few-Shot Learners' (GPT-3, 2020)
The GPT-3 paper demonstrating that scale enables few-shot learning without fine-tuning.
Hoffmann et al., 'Training Compute-Optimal Large Language Models' (Chinchilla, 2022)
DeepMind's scaling law analysis showing optimal data-to-parameter ratios.
Ouyang et al., 'Training Language Models to Follow Instructions with Human Feedback' (InstructGPT, 2022)
The RLHF methodology paper that underpins ChatGPT and modern instruction-tuned models.
Sennrich et al., 'Neural Machine Translation of Rare Words with Subword Units' (BPE, 2016)
The original BPE tokenisation paper that became the standard approach for language model vocabularies.