
AI: From Turing to Transformers

The complete history of artificial intelligence from the 1950 Turing Test to modern large language models. Understand the breakthroughs, winters, and paradigm shifts.

AI: From Turing to Transformers guided path map

An ordered sequence of 8 events covering 60 minutes.

[Figure: guided path map. 7 dated stops in chronological order; each card lists a stop number, the year, the title and the publisher, and the final stop is emphasised in brand red. Tagline: "From the imitation game to the architecture that scaled." Stops: (1) 1950, Turing's imitation game paper, turingarchive.org; (2) 1956, Dartmouth summer workshop, stanford.edu; (3) 1986, Backpropagation popularised, nature.com; (4) 1997, LSTM by Hochreiter and Schmidhuber, mit.edu; (5) 2012, AlexNet wins ImageNet, image-net.org; (6) 2014, Sequence to sequence learning, arxiv.org/abs/1409.3215; (7) 2017, Transformer architecture, arxiv.org/abs/1706.03762.]

Arrows mark the next step in the path, not direct historical causality. Source: Computer History Museum.

1. Computing Machinery and Intelligence

October 1950. Artificial intelligence. Paradigm shift.

There was no rigorous framework for discussing machine intelligence. The question 'Can machines think?' seemed philosophical rather than scientific. No criteria existed for evaluating claims about machine intelligence.

Alan Turing proposed the 'imitation game' (later called the Turing Test) as an operational definition of machine intelligence. Rather than asking 'Can machines think?', Turing reframed the question in terms of observable behaviour: can a machine's responses be indistinguishable from a human's?

Turing published 'Computing Machinery and Intelligence' in the journal Mind in October 1950. The paper addressed objections to machine intelligence and proposed the imitation game as a practical test. This paper is considered one of the founding documents of artificial intelligence. [1]

2. Dartmouth Summer Project

18 June 1956 to 17 August 1956. Artificial intelligence. Paradigm shift.

Research on machine intelligence was scattered across different disciplines with no unifying identity. There was no common terminology, no shared research agenda, and no recognition of a distinct field.

The Dartmouth Summer Research Project on Artificial Intelligence established AI as a distinct academic discipline. The term 'artificial intelligence' was coined in the workshop proposal. Key researchers gathered to define the field's scope and approach, creating a research community that would shape the next decades.

John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon proposed a two-month workshop at Dartmouth College. The proposal, submitted in August 1955, outlined ambitious goals including language use, abstraction, and self-improvement. The workshop ran in summer 1956, though attendance was sporadic. [2, 1]

3. Lighthill report, first winter

1974 to 1993. Artificial intelligence. Paradigm shift.

Early AI research promised rapid progress towards human-level intelligence. Government and industry invested heavily based on optimistic predictions. Initial successes in narrow domains fuelled expectations of general breakthroughs.

Two major 'AI winters' saw dramatic reductions in funding and interest. The first (mid-1970s) followed the Lighthill Report and DARPA cuts. The second (late 1980s-early 1990s) followed the collapse of the expert systems market. Research continued but with reduced resources and tempered expectations.

The 1973 Lighthill Report criticised AI's failure to achieve ambitious goals, leading to UK funding cuts. DARPA reduced AI funding after projects failed to meet milestones. The second winter followed the collapse of specialised AI hardware companies (Lisp machines) and disillusionment with expert systems' limitations. [3, 4]

4. Backpropagation revival

9 October 1986. Artificial intelligence. Publication.

The Perceptrons book (1969) had demonstrated limitations of single-layer neural networks, contributing to reduced interest in connectionist approaches. Multi-layer networks could theoretically overcome these limitations but there was no efficient training algorithm.

Rumelhart, Hinton, and Williams published a clear description of backpropagation for training multi-layer neural networks. While the algorithm had been discovered earlier, this paper made it accessible and demonstrated its power, reviving interest in neural networks.

The paper 'Learning representations by back-propagating errors' was published in Nature in October 1986. It showed how to efficiently compute gradients through multiple layers using the chain rule, enabling networks to learn internal representations. The clear exposition and compelling results sparked renewed interest in connectionism. [5, 6, 7]
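
The chain-rule computation described above can be illustrated with a deliberately tiny network. The following is a minimal sketch, not the 1986 paper's formulation: it assumes one sigmoid hidden unit, one sigmoid output unit, a squared-error loss, and hypothetical names (`forward`, `backprop`, `numeric_grad`, `train`) chosen for illustration. The finite-difference check is there only to show that the analytic gradients agree with a brute-force estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Toy network (an assumption for illustration): x -> hidden unit -> output."""
    h = sigmoid(w1 * x)   # hidden activation
    y = sigmoid(w2 * h)   # network output
    return h, y


def loss(x, target, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - target) ** 2


def backprop(x, target, w1, w2):
    """Return (dL/dw1, dL/dw2) using the chain rule, output layer first."""
    h, y = forward(x, w1, w2)
    delta_out = (y - target) * y * (1.0 - y)          # dL/d(output pre-activation)
    grad_w2 = delta_out * h
    delta_hidden = delta_out * w2 * h * (1.0 - h)     # error passed back to the hidden unit
    grad_w1 = delta_hidden * x
    return grad_w1, grad_w2


def numeric_grad(x, target, w1, w2, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    g1 = (loss(x, target, w1 + eps, w2) - loss(x, target, w1 - eps, w2)) / (2 * eps)
    g2 = (loss(x, target, w1, w2 + eps) - loss(x, target, w1, w2 - eps)) / (2 * eps)
    return g1, g2


def train(x, target, w1, w2, lr=0.5, steps=200):
    """Plain gradient descent on a single training example."""
    for _ in range(steps):
        g1, g2 = backprop(x, target, w1, w2)
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2
```

Calling `backprop(1.0, 0.9, 0.5, -0.3)` and `numeric_grad(1.0, 0.9, 0.5, -0.3)` should give closely matching pairs; the analytic version needs one forward and one backward pass regardless of how many weights there are, which is the efficiency the paper highlighted.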

5. Deep Blue defeats Kasparov

11 May 1997. Artificial intelligence. Major incident.

Chess had been considered a benchmark for machine intelligence since the field's inception. Despite decades of progress, no computer had defeated a reigning world champion in a match. Many believed human intuition and creativity gave an insurmountable advantage.

IBM's Deep Blue defeated Garry Kasparov, the reigning world chess champion, in a six-game match. This was the first time a computer won a match against a reigning world champion under standard tournament time controls. The victory demonstrated that machines could outperform the best human players in a complex, well-defined cognitive task.

Deep Blue was a specialised chess computer capable of evaluating 200 million positions per second. It used alpha-beta search with sophisticated evaluation functions refined by grandmasters. After losing the 1996 match to Kasparov, the team improved both hardware and software. The 1997 rematch ended 3.5-2.5 in Deep Blue's favour. [8]
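
For readers unfamiliar with alpha-beta search, the following is a minimal textbook sketch of the idea, not Deep Blue's implementation (which ran on custom hardware with many refinements). It assumes a hypothetical game tree given as nested Python lists whose leaves are already-evaluated scores, and an optional `visited` list that records which leaves were actually examined.

```python
def minimax(node, maximizing):
    """Exhaustive search, shown for comparison with the pruned version."""
    if isinstance(node, (int, float)):   # leaf: a position's evaluation score
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf"), visited=None):
    """Minimax with alpha-beta pruning over a nested-list game tree."""
    if isinstance(node, (int, float)):
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, best)
            if alpha >= beta:            # the minimising opponent will avoid this branch
                break
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, best)
        if alpha >= beta:                # the maximising player already has a better option
            break
    return best


# Hypothetical two-ply tree: the root maximises, its children minimise.
tree = [[3, 5], [2, 9], [0, 1]]
seen = []
best_score = alphabeta(tree, True, visited=seen)
```

On this tree both functions return the same value, but `seen` ends up holding only four of the six leaves: once a branch is known to be worse than an option already found, its remaining leaves are skipped. That saving is what makes deep search over very many positions tractable.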

6. AlexNet wins ImageNet

30 September 2012. Artificial intelligence. Paradigm shift.

Computer vision relied on hand-crafted features (SIFT, HOG) combined with classifiers. Progress on image recognition had plateaued. Neural networks were considered too slow to train on large datasets. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) error rates had stagnated.

AlexNet, a deep convolutional neural network, won the 2012 ImageNet challenge with a top-5 error rate of 15.3%, compared to 26.2% for the second-place entry. This dramatic improvement demonstrated the power of deep learning and GPU-accelerated training, triggering a revolution in AI research.
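
The top-5 error quoted above counts a prediction as correct when the true label appears among the model's five highest-scoring classes. The following is a minimal sketch of that metric; the function name `top_k_error` and the toy scores are assumptions for illustration, not the challenge's official evaluation code.

```python
def top_k_error(score_rows, true_labels, k=5):
    """Fraction of examples whose true label is not among the k top-scoring classes."""
    misses = 0
    for scores, label in zip(score_rows, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)


# Toy data (hypothetical): two examples, three classes.
toy_scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
toy_labels = [2, 1]
```

With these toy values the top-1 error is 1.0 (both first guesses are wrong) while the top-2 error is 0.0, which shows why top-k figures are always at least as forgiving as top-1.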

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton trained a deep CNN on 1.2 million images using two GTX 580 GPUs. Key innovations included ReLU activations (which do not saturate for positive inputs, so gradients shrink less than with sigmoid or tanh units and training is faster), dropout regularisation, and data augmentation. The decisive victory prompted the computer vision community to adopt deep learning. [9, 10]
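
Two of the ingredients named above are simple enough to sketch in a few lines. This is an illustrative sketch, not AlexNet's code: the helper names are hypothetical, and the dropout shown is the "inverted" variant common today (rescaling kept units during training), which is an assumption rather than a claim about how the original network scaled its outputs.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    """Gradient is exactly 1 for any positive input, so it does not shrink."""
    return 1.0 if z > 0 else 0.0


def sigmoid_grad(z):
    """Gradient of the sigmoid; it approaches 0 as |z| grows (saturation)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout (assumed variant): zero each unit with probability p_drop
    during training and rescale the survivors so the expected value is unchanged."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Comparing `relu_grad(5.0)` with `sigmoid_grad(5.0)` shows the difference in gradient size for a strongly activated unit. Passing a seeded `random.Random` instance as `rng` keeps the dropout mask reproducible.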

7. Attention Is All You Need

12 June 2017. Artificial intelligence. Paradigm shift.

Sequence models (RNNs, LSTMs) processed input sequentially, limiting parallelisation and making it difficult to capture long-range dependencies. Training on long sequences was slow, and gradients tended to vanish or explode as they were propagated through many time steps. Machine translation quality had plateaued.

The Transformer architecture replaced recurrence with self-attention, enabling parallel processing of entire sequences. This dramatically improved training speed and model quality. The architecture became the foundation for GPT, BERT, and virtually all modern large language models.

Researchers at Google published 'Attention Is All You Need' in June 2017. The paper introduced multi-head self-attention, positional encoding, and the encoder-decoder Transformer structure. The model achieved state-of-the-art translation quality while training faster than RNN-based systems. [11, 12]
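
The core operation behind self-attention is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The following is a minimal single-head sketch in plain Python. It deliberately omits the learned projection matrices, multiple heads, masking and positional encoding, and the function names and toy inputs are assumptions for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights): one output vector and one row of
    attention weights per query.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Self-attention on a toy sequence of three 2-d token vectors (hypothetical data):
# queries, keys and values all come from the same sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
self_out, self_weights = attention(tokens, tokens, tokens)
```

Each query is handled independently of the others, so no position has to wait for the previous one to finish. That independence is what lets the whole sequence be processed in parallel, in contrast to a recurrent model.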

8. Generative pre-training

June 2018 to November 2022. Artificial intelligence. Paradigm shift.

NLP systems required task-specific architectures and training. Transfer learning was limited. No single model could handle diverse language tasks. Conversational AI remained stilted and narrow.

Large language models (LLMs) demonstrated that scaling Transformer models on vast text corpora yields emergent capabilities. GPT-3 (2020) showed few-shot learning across diverse tasks. ChatGPT (2022) made conversational AI accessible to the public, triggering widespread AI adoption and debate.

OpenAI released GPT (2018), GPT-2 (2019), and GPT-3 (2020), each dramatically larger. GPT-3's 175 billion parameters showed remarkable few-shot capabilities. Google's BERT (2018) demonstrated bidirectional pretraining. ChatGPT (November 2022) combined GPT-3.5 with RLHF, achieving unprecedented public adoption and sparking global conversation about AI. [13]

Sources

1. Alan M. Turing. "Computing Machinery and Intelligence". University of Manchester, 1950-10. Peer reviewed. academic.oup.com/mind/article/LIX/236/433/986238
2. John McCarthy, Marvin L. Minsky, Nathaniel Rochester, Claude E. Shannon. "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence". Dartmouth College, 1955-08-31. Authoritative. www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
3. James Lighthill. "Artificial Intelligence: A General Survey". Science Research Council (UK), 1973. Authoritative.
4. Marvin Minsky, Seymour Papert. "Perceptrons: An Introduction to Computational Geometry". MIT Press, 1969. Peer reviewed.
5. David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams. "Learning representations by back-propagating errors". Nature, 1986-10-09. Peer reviewed. www.nature.com/articles/323533a0
6. Paul J. Werbos. "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences". Harvard University, 1974. Reputable. www.researchgate.net/publication/35657389_Beyond_regression_new_tools_for_prediction_and_analysis_in_the_behavioral_sciences
7. David E. Rumelhart, James L. McClelland. "Parallel Distributed Processing: Explorations in the Microstructure of Cognition". MIT Press, 1986. Reputable. mitpress.mit.edu/9780262680530/parallel-distributed-processing/
8. Murray Campbell, A. Joseph Hoane Jr., Feng-hsiung Hsu. "Deep Blue". IBM, 2002. Peer reviewed. www.sciencedirect.com/science/article/pii/S0004370201001291
9. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". University of Toronto, 2012. Peer reviewed. papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
10. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, Li Fei-Fei. "ImageNet Large Scale Visual Recognition Challenge". Stanford University, 2015-04-11. Peer reviewed. link.springer.com/article/10.1007/s11263-015-0816-y
11. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. "Attention Is All You Need". Google, 2017-06-12. Peer reviewed. arxiv.org/abs/1706.03762
12. Google Research. "Attention is All You Need". Google Research, 2017. Authoritative. research.google/pubs/pub46201
13. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell. "Language Models are Few-Shot Learners". OpenAI, 2020-05-28. Peer reviewed. arxiv.org/abs/2005.14165