Transformer Architecture: Attention Is All You Need

12 June 2017. Artificial intelligence. Paradigm shift. Date precision: exact. Evidence grade: primary. 2 primary sources.

Drivers:

Research breakthrough
Technological capability

Attention mechanisms had already shown promise when added to recurrent translation models. Hardware advances made large-scale parallel training practical. Demand for better machine translation drove research investment.

The Transformer is a type of AI architecture introduced in 2017 that can process language much more efficiently than previous methods. Instead of reading words one at a time, it can look at all words simultaneously and figure out how they relate to each other. This breakthrough made possible the chatbots and AI assistants we use today, including ChatGPT.

Forecasts and counterfactuals stay labelled as opinion in the event data. Source: Computer History Museum.

Before

Sequence models (RNNs, LSTMs) processed input one token at a time, which limited parallelisation and made long-range dependencies difficult to capture. Training on long sequences was slow, and gradients tended to vanish or explode across many time steps. Machine translation quality had plateaued.

What changed

The Transformer architecture replaced recurrence with self-attention, enabling parallel processing of entire sequences. This dramatically improved training speed and model quality. The architecture became the foundation for GPT, BERT, and virtually all modern large language models.
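
The idea of relating every position to every other position in one step can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product self-attention, not the paper's implementation: it assumes queries, keys, and values are all the raw token vectors (no learned projections), and the names `softmax`, `self_attention`, and `tokens` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    Assumption for illustration: no learned projection matrices.
    x is a list of n token vectors, each of dimension d.
    Returns (outputs, weights): n output vectors and the n-by-n weight matrix.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for query in x:
        # Every query is scored against every key; no step depends on
        # the result of a previous position, unlike a recurrent model.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in x]
        weights.append(softmax(scores))
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1 and describes how strongly one position attends to every other position. Because the rows are computed independently, they could be evaluated in parallel.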

How it happened

Researchers at Google published 'Attention Is All You Need' in June 2017. The paper introduced multi-head self-attention, positional encoding, and the encoder-decoder Transformer structure. The model achieved state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation tasks while training faster than RNN-based systems.
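
Because self-attention has no built-in notion of word order, the model needs position information added to its inputs. The sketch below follows the fixed sinusoidal scheme described in the cited paper, where even feature indices use a sine and odd indices use a cosine of the same frequency. The function name `positional_encoding` and the small sizes in the usage line are assumptions chosen for illustration.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings.

    Index 2i holds sin(pos / 10000^(2i / d_model)) and index 2i + 1 holds
    the cosine of the same angle.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Pair up indices (0, 1), (2, 3), ... so both share one frequency.
            exponent = (2 * (i // 2)) / d_model
            angle = pos / (10000 ** exponent)
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical sizes: 4 positions, 6 features per position.
pe = positional_encoding(num_positions=4, d_model=6)
```

In the paper's design these vectors are added to the token embeddings before the first layer, so identical words at different positions produce different inputs.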

Outcomes

Enabled training of very large language models
Became foundation for GPT, BERT, and successors
Revolutionised NLP and extended to vision, audio, and other domains
Enabled practical conversational AI systems

Limitations

Quadratic time and memory complexity in sequence length
Large models require massive compute and data
Positional encoding limits length generalisation
Attention patterns can be difficult to interpret
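
The quadratic-cost limitation above can be made concrete with simple arithmetic: every position is scored against every other position, so the number of attention scores grows with the square of the sequence length. The helper `attention_score_entries` below is a hypothetical name, and the count ignores constant factors such as the feature dimension.

```python
def attention_score_entries(seq_len, num_heads=1):
    """Number of pairwise attention scores computed in one attention layer."""
    return num_heads * seq_len * seq_len


# Doubling the sequence length quadruples the number of scores.
for n in (512, 1024, 2048):
    print(n, attention_score_entries(n))
```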

Lessons learnt

Architectural innovations can transform entire fields
Parallelisation enables scale
Attention mechanisms are remarkably versatile
Simple, general architectures can outperform specialised ones

Stakeholders and artefacts

Organisations

Google Brain (vendor): research team
Google Research (vendor): research team

Individuals

Ashish Vaswani (lead author, Google Brain): co-invented the Transformer architecture
Noam Shazeer (co-author, Google): co-invented the Transformer, later co-founded Character.AI
Illia Polosukhin (co-author, Google): co-invented the Transformer, later co-founded NEAR Protocol

Artefacts

Transformer (specification): neural architecture based on self-attention
Self-attention (methodology): mechanism relating different positions in a sequence
Multi-head attention (methodology): parallel attention with multiple learned projections
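
As a rough sketch of the multi-head artefact listed above, the code below runs several attention heads side by side, each with its own query, key, and value projections, then concatenates the results and applies an output projection. Random matrices stand in for the learned projections, which is an assumption made purely for illustration, and all function names here are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Placeholder for a learned projection (assumption: random values)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    """Run num_heads attention heads on x and merge their outputs."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head projects the same input into its own smaller subspace.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        heads.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate head outputs along the feature axis, then mix them.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)


x = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.7], [0.4, 0.4, 0.1, 0.9]]
out = multi_head_attention(x, num_heads=2, rng=random.Random(0))
```

The output has the same shape as the input (one `d_model`-sized vector per position), which is what lets such layers be stacked.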

Key terms

Transformer, attention, self-attention, BERT, GPT, encoder-decoder

Causality

Preceded by: AlexNet Wins ImageNet: Deep Learning Revolution Begins.

Made possible: Large Language Models: GPT and the Scaling Era.

On this course

Read in the path AI: From Turing to Transformers.

Sources

1. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. "Attention Is All You Need". Google, 2017-06-12. Peer reviewed. arxiv.org/abs/1706.03762

2. Google Research. "Attention is All You Need". Google Research, 2017. Authoritative. research.google/pubs/pub46201