Transformer Architecture: Attention Is All You Need

12 June 2017. Artificial intelligence. Paradigm shift. Date precision: exact. Evidence grade: primary. 2 primary sources.

Drivers:

Research breakthrough, technological capability

Attention mechanisms had shown promise. Hardware advances made parallel training practical. The desire for better machine translation drove research investment.

The Transformer is a type of AI architecture introduced in 2017 that can process language much more efficiently than previous methods. Instead of reading words one at a time, it can look at all words simultaneously and figure out how they relate to each other. This breakthrough made possible the chatbots and AI assistants we use today, including ChatGPT.

Event plate: Transformer architecture, Attention Is All You Need (2017, publication milestone). Source: arxiv.org/abs/1706.03762. Evidence: primary, peer-reviewed. Domain: AI and machine learning. Era band: E6, AI-scale systems.

Predecessors (absorbed)

  • 2014: Sequence to sequence learning
  • 2014: Neural machine translation by joint learning

Successors (spawned)

  • 2018: BERT, bidirectional transformers
  • 2020: GPT-3 large language model
  • 2022: Diffusion models reach production

Key terms introduced: self-attention, encoder-decoder, positional encoding, multi-head.

Forecasts and counterfactuals stay labelled as opinion in the event data. Source: Computer History Museum.

Before

Sequence models (RNNs, LSTMs) processed input sequentially, limiting parallelisation and making it difficult to capture long-range dependencies. Training on long sequences was slow and gradient flow was problematic. Machine translation quality had plateaued.

What changed

The Transformer architecture replaced recurrence with self-attention, enabling parallel processing of entire sequences. This dramatically improved training speed and model quality. The architecture became the foundation for GPT, BERT, and virtually all modern large language models.
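As a rough illustration of how self-attention handles a whole sequence in one pass, the sketch below computes scaled dot-product self-attention with the Python standard library only. The function names, the toy input and the identity weight matrices are assumptions made for this example and are not taken from the paper's code; a practical implementation would use a tensor library and learned weights.

```python
import math

def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # List-of-lists matrix product: (n x k) times (k x m) gives (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over every position at once.

    x is (seq_len x d_model); w_q, w_k, w_v are hypothetical
    (d_model x d_k) projection matrices.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # One score per pair of positions: a (seq_len x seq_len) matrix.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted mix of all value vectors.
    return matmul(weights, v), weights

# Toy sequence of three 2-dimensional token vectors (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
```

No step here depends on the result for an earlier position, which is what allows the computation to be parallelised across the sequence, unlike a recurrent update.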

How it happened

Researchers at Google published 'Attention Is All You Need' on 12 June 2017. The paper introduced multi-head self-attention, positional encoding, and the encoder-decoder Transformer structure. The model achieved state-of-the-art translation quality, reporting 28.4 BLEU on the WMT 2014 English-to-German task, while training faster than RNN-based systems.
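Because self-attention itself ignores word order, the paper adds a positional encoding to each token embedding. A minimal sketch of the sinusoidal variant described in the paper follows; the function name and the small sizes are chosen for illustration only.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: even dimensions use sine, odd dimensions cosine."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Illustrative sizes; real models use a much larger d_model.
pe = positional_encoding(seq_len=4, d_model=6)
```

Each row would be added element-wise to the embedding of the token at that position, so that identical words at different positions produce different inputs.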

Outcomes

  • Enabled training of very large language models
  • Became foundation for GPT, BERT, and successors
  • Revolutionised NLP and extended to vision, audio, and other domains
  • Enabled practical conversational AI systems

Limitations

  • Self-attention cost grows quadratically with sequence length
  • Large models require massive compute and data
  • Positional encoding limits length generalisation
  • Attention patterns can be difficult to interpret
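The quadratic cost in the first limitation can be made concrete with simple arithmetic: full self-attention compares every position with every other position, so each head produces seq_len × seq_len scores. The helper below is a hypothetical illustration that counts scores only; it is not a measurement of any real model.

```python
def attention_score_count(seq_len, num_heads=1):
    # Every head scores each position against every position.
    return num_heads * seq_len * seq_len

for n in (512, 1024, 2048):
    print(n, attention_score_count(n))
# 512 262144
# 1024 1048576
# 2048 4194304
```

Doubling the sequence length quadruples the number of scores, which is why long inputs become expensive in both time and memory.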

Lessons learnt

  • Architectural innovations can transform entire fields
  • Parallelisation enables scale
  • Attention mechanisms are remarkably versatile
  • Simple, general architectures can outperform specialised ones

Stakeholders and artefacts

Organisations

  • Google Brain (vendor): Research team
  • Google Research (vendor): Research team

Individuals

  • Ashish Vaswani (first-listed author, Google Brain): Co-invented the Transformer architecture. The paper credits its authors with equal contribution.
  • Noam Shazeer (co-author, Google): Co-invented the Transformer, later co-founded Character.AI
  • Illia Polosukhin (co-author, Google): Co-invented the Transformer, later co-founded NEAR Protocol

Artefacts

  • Transformer (specification): Neural architecture based on self-attention
  • Self-attention (methodology): Mechanism relating different positions in a sequence
  • Multi-head attention (methodology): Parallel attention with multiple learned projections
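For reference, the artefacts above can be written compactly in the notation of the source paper, where Q, K and V are the query, key and value matrices, d_k is the key dimension, h is the number of heads, and the W matrices are learned projections:

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
\end{align}
```

Each head applies the same attention function to a differently projected copy of the inputs, and the concatenated results are projected once more by W^O.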

Key terms

Transformer, attention, self-attention, BERT, GPT, encoder-decoder

Causality

Preceded by: AlexNet Wins ImageNet: Deep Learning Revolution Begins.

Made possible: Large Language Models: GPT and the Scaling Era.

On this course

Read in the path AI: From Turing to Transformers.

Sources

1. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. "Attention Is All You Need". Google, 2017-06-12. Peer reviewed. arxiv.org/abs/1706.03762
2. Google Research. "Attention is All You Need". Google Research, 2017. Authoritative. research.google/pubs/pub46201