Transformer Architecture: Attention Is All You Need
12 June 2017 | Artificial intelligence | Paradigm shift | Date precision: exact | Evidence grade: primary | 2 primary sources
Drivers:
Attention mechanisms had shown promise. Hardware advances made parallel training practical. The desire for better machine translation drove research investment.
The Transformer is a type of AI architecture introduced in 2017 that can process language much more efficiently than previous methods. Instead of reading words one at a time, it can look at all words simultaneously and figure out how they relate to each other. This breakthrough made possible the chatbots and AI assistants we use today, including ChatGPT.
Figure: Transformer Architecture: Attention Is All You Need event plate. Structured atlas record showing date, domain, evidence grade, source count, and predecessor and successor links. Forecasts and counterfactuals stay labelled as opinion in the event data. Source: Computer History Museum.
Before
Sequence models (RNNs, LSTMs) processed input sequentially, limiting parallelisation and making it difficult to capture long-range dependencies. Training on long sequences was slow and gradient flow was problematic. Machine translation quality had plateaued.
What changed
The Transformer architecture replaced recurrence with self-attention, enabling parallel processing of entire sequences. This dramatically improved training speed and model quality. The architecture became the foundation for GPT, BERT, and virtually all modern large language models.
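A minimal sketch of the self-attention step described above, using scaled dot-product attention in plain Python. It is illustrative only: the function names and the toy input are assumptions, and the queries, keys, and values are taken to be the raw token vectors, whereas a real Transformer first passes them through learned linear projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for every position in the sequence.

    Each query is compared with every key, so no position has to wait
    for an earlier one to be processed (unlike a recurrent model).
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this position to every position, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Toy sequence of three 2-dimensional token vectors (hypothetical data).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x, x, x)
```

Each row of `attn` sums to 1 and records how strongly one position attends to every other position. Because the rows are computed independently of one another, they can be evaluated in parallel.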
How it happened
Researchers at Google published 'Attention Is All You Need' in June 2017. The paper introduced multi-head self-attention, positional encoding, and the encoder-decoder Transformer structure. The model reported 28.4 BLEU on WMT 2014 English-to-German and 41.8 BLEU on WMT 2014 English-to-French, state-of-the-art results at the time, while training faster than RNN-based systems.
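Because self-attention has no built-in notion of word order, the paper adds a positional encoding to each token embedding. The sketch below follows the sinusoidal scheme from the paper, in which even dimensions use a sine and odd dimensions a cosine of the position scaled by a power of 10000. The function name and the example sizes are assumptions chosen for illustration.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as a seq_len x d_model list of lists.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pe = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            # Each sine/cosine pair of dimensions shares one frequency.
            angle = pos / (10000 ** ((2 * (dim // 2)) / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


# Encodings for a 3-token sequence with a 4-dimensional model (toy sizes).
pe = positional_encoding(3, 4)
```

Each row of `pe` would be added element-wise to the embedding of the token at that position before the first attention layer.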
Outcomes
- Enabled training of very large language models
- Became foundation for GPT, BERT, and successors
- Revolutionised NLP and extended to vision, audio, and other domains
- Enabled practical conversational AI systems
Limitations
- Self-attention compute and memory grow quadratically with sequence length
- Large models require massive compute and data
- Positional encoding limits length generalisation
- Attention patterns can be difficult to interpret
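The quadratic cost noted in the first limitation can be made concrete with a rough count. The sketch below is a back-of-the-envelope estimate under stated assumptions: a single attention head, one layer, and only the two large matrix products (query-key scores and the weighted sum of values), ignoring the linear projections and the softmax. The function name is hypothetical.

```python
def attention_cost(seq_len, d_model):
    """Rough cost of one single-head self-attention layer.

    Returns (score_entries, multiply_adds):
    - score_entries: size of the seq_len x seq_len attention matrix
    - multiply_adds: multiply-accumulate operations for Q @ K^T plus
      weights @ V, each costing seq_len * seq_len * d_model
    """
    score_entries = seq_len * seq_len
    multiply_adds = 2 * seq_len * seq_len * d_model
    return score_entries, multiply_adds


# Doubling the sequence length quadruples both counts.
for n in (512, 1024, 2048):
    entries, macs = attention_cost(n, 64)
    print(n, entries, macs)
```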
Lessons learnt
- Architectural innovations can transform entire fields
- Parallelisation enables scale
- Attention mechanisms are remarkably versatile
- Simple, general architectures can outperform specialised ones
Stakeholders and artefacts
Organisations
- Google Brain (vendor): Research team
- Google Research (vendor): Research team
Individuals
- Ashish Vaswani: First-listed author, Google Brain (the paper credits all authors as equal contributors). Co-invented Transformer architecture
- Noam Shazeer: Co-author, Google. Co-invented Transformer, later co-founded Character.AI
- Illia Polosukhin: Co-author, Google. Co-invented Transformer, later co-founded NEAR Protocol
Artefacts
- Transformer (specification): Neural architecture based on self-attention
- Self-attention (methodology): Mechanism relating different positions in a sequence
- Multi-head attention (methodology): Parallel attention with multiple learned projections
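As a rough illustration of the multi-head attention artefact, the sketch below runs several attention heads side by side, each with its own query, key, and value projection, then concatenates the results and applies an output projection. All names are hypothetical, and the projection matrices are filled with random numbers as a stand-in for weights that a real model would learn during training.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    # Placeholder for a learned projection (assumption: random values).
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own projections into a smaller subspace.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        heads.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the head outputs for each position, then mix them.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)


# Three tokens, model width 4, two heads (toy sizes).
x = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.7], [1.0, 1.0, 0.1, 0.9]]
out = multi_head_attention(x, num_heads=2, rng=random.Random(0))
```

The output has the same shape as the input (one `d_model`-wide vector per token), which is what lets attention layers be stacked.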
Causality
Preceded by: AlexNet Wins ImageNet: Deep Learning Revolution Begins.
Made possible: Large Language Models: GPT and the Scaling Era.