Modules 1-3 covered what AI is, why data matters, and how machines learn. This module goes inside the machine: how does a neural network actually work? You will trace data through a network from input to output, then follow the error signal backwards as the network learns.

Real-world history · 1958 & the AI winter that followed
In 1958, Frank Rosenblatt, a psychologist at the Cornell Aeronautical Laboratory, built the Mark I Perceptron. It was a physical machine, wired to a 20×20 grid of photocells, that could learn to classify simple visual patterns. The New York Times reported it as the "embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
The Perceptron was genuinely innovative. It was the first machine that could learn from examples by adjusting weights automatically. Rosenblatt proved mathematically that if a linear solution to a classification problem exists, the Perceptron learning algorithm will find it (the Perceptron Convergence Theorem).
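Rosenblatt's learning rule is simple enough to sketch in a few lines. The following is a minimal illustration in Python, not his original implementation; the learning rate, the number of passes, and the AND function as the target (which, unlike XOR, is linearly separable) are all arbitrary choices for the example:

```python
# A minimal sketch of the Perceptron learning rule, trained on AND.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1 = w2 = b = 0.0
lr = 0.1

for _ in range(20):  # repeated passes over the examples
    for (x1, x2), target in data:
        pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
        # nudge each weight in proportion to the error on this example
        w1 += lr * (target - pred) * x1
        w2 += lr * (target - pred) * x2
        b  += lr * (target - pred)

preds = [1 if w1 * x1 + w2 * x2 + b > 0 else 0 for (x1, x2), _ in data]
print(preds)  # the network has learned AND: [0, 0, 0, 1]
```

Because AND is linearly separable, the Convergence Theorem guarantees this loop finds a correct set of weights.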
In 1969, Marvin Minsky and Seymour Papert published Perceptrons, which proved that a single-layer Perceptron cannot solve the XOR problem (a simple logical function where the output is true if exactly one input is true). This limitation was real, but the conclusion many drew, that neural networks were a dead end, was premature. Multi-layer networks can solve XOR easily. But funding dried up, and the first "AI winter" lasted through much of the 1970s.
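To see why the limitation applies only to single-layer networks, here is a sketch of a two-layer network that computes XOR using hand-picked weights. These particular weights and thresholds are illustrative, one of many valid choices:

```python
# A two-layer network that computes XOR with hand-picked weights.

def step(z):
    """Rosenblatt-style step activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden neuron 1: fires for OR
    h_and = step(x1 + x2 - 1.5)      # hidden neuron 2: fires for AND
    return step(h_or - h_and - 0.5)  # output: OR but not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

The hidden layer recodes the inputs into a representation (OR, AND) in which the problem becomes linearly separable, which is exactly what a single layer cannot do on its own.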
Rosenblatt's Perceptron could learn, but it could not solve a simple logical problem. Was the criticism fair, or did it set the field back unnecessarily?
The Perceptron story contains two lessons relevant today. First, a single neuron (one computation unit) is limited, but networks of neurons can solve problems that individual neurons cannot. Second, hype followed by disappointment is a recurring pattern in AI history. Understanding how neural networks actually work, rather than what headlines claim they do, protects you from both failure modes.
If you already understand forward propagation and backpropagation, use the knowledge checks to confirm and skip to Module 5: Evaluating AI.
This module begins by examining the single neuron in depth: the weighted sum, the bias, and the activation function.
A single artificial neuron performs three operations:
1. Weighted sum: multiply each input by its weight and add the results together.
2. Bias: add a constant offset that shifts the neuron's threshold.
3. Activation: pass the result through an activation function to produce the output.
Rosenblatt's Perceptron used a step function as its activation: output 1 if the weighted sum exceeds a threshold, output 0 otherwise. Modern networks use smooth functions (ReLU, sigmoid) that allow gradients to flow during training.
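The three operations can be sketched in a few lines of Python. The input values, weights, bias, and the choice of ReLU here are illustrative, not special:

```python
# The three operations of a single artificial neuron.

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs))  # 1. weighted sum
    z = z + bias                                     # 2. add bias
    return relu(z)                                   # 3. activation

out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(out)  # relu(0.5 - 0.5 + 0.1) = 0.1
```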
With the single neuron in place, the discussion can now turn to layers and forward propagation, which build directly on it.
“The key insight is that a multi-layer network of simple computing units can learn complex functions, provided there is a way to adjust the weights based on the error of the network's output.”
David Rumelhart, Geoffrey Hinton, Ronald Williams - 'Learning representations by back-propagating errors', Nature, Volume 323 (October 1986)
This 1986 paper revived neural networks after the AI winter by demonstrating backpropagation, the algorithm that allows multi-layer networks to learn by propagating error gradients backwards through the network. Hinton later won the 2024 Nobel Prize in Physics for foundational work on neural networks.
A neural network arranges neurons in layers:
1. The input layer receives the raw data, one value per input feature.
2. One or more hidden layers transform the data step by step, each neuron computing its own weighted sum, bias, and activation.
3. The output layer produces the final prediction.
Forward propagation is the process of passing data through the network from input to output. Each neuron in each layer computes its weighted sum + bias + activation, and passes the result to the next layer. At the end, the output layer produces a prediction. The loss function then measures how far that prediction is from the correct answer.
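As a sketch, a forward pass through a tiny network with 2 inputs, 3 hidden neurons, and 1 output might look like this. All weights, biases, the target value, and the squared-error loss are illustrative choices:

```python
# A forward pass through a 2-3-1 network, small enough to trace by hand.

def relu(z):
    return max(0.0, z)

def layer(inputs, weights, biases, activation):
    # one row of weights per neuron in the layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [1.0, 0.5]                                        # input layer
h = layer(x, [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.6]],  # hidden layer (ReLU)
          [0.0, 0.1, 0.0], relu)
y = layer(h, [[1.0, -1.0, 0.5]], [0.0], lambda z: z)  # output layer (linear)

loss = (y[0] - 1.0) ** 2   # squared error against a target of 1.0
print(h, y, loss)
```

Note that two of the three hidden neurons output exactly zero here: ReLU cuts off any negative weighted sum, which is worth watching for when tracing a network by hand.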
The term "deep" in deep learning refers to the depth of the network: the number of hidden layers. Modern networks can have tens, hundreds, or even thousands of layers. GPT-4 is believed to have on the order of 120 transformer layers (the exact architecture is not publicly documented).
With layers and forward propagation in place, the discussion can now turn to backpropagation: how the network learns from its mistakes.
Forward propagation gives us a prediction. The loss function tells us how wrong it is. Backpropagation answers the question: which weights contributed most to the error, and in which direction should we adjust them?
The algorithm works backwards through the network:
1. Compute the gradient of the loss with respect to the output layer's weights.
2. Use the chain rule to propagate that gradient backwards, layer by layer, so that every weight in the network receives a gradient.
3. Update each weight by a small step in the direction opposite its gradient, scaled by the learning rate.
This process, forward pass → compute loss → backward pass → update weights, is one training step. It repeats thousands or millions of times until the loss converges to a minimum.
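The loop can be made concrete with the smallest possible example: a model with a single weight, trained by gradient descent. The input, target, learning rate, and step count are illustrative choices:

```python
# One training step, repeated: forward pass -> loss -> gradient -> update.

w = 0.0                # the single trainable weight
lr = 0.1               # learning rate
x, target = 2.0, 6.0   # the loss is minimised at w = 3.0

for step in range(20):
    pred = w * x                    # forward pass
    loss = (pred - target) ** 2     # compute loss
    grad = 2 * (pred - target) * x  # backward pass: d(loss)/d(w)
    w -= lr * grad                  # update weight against the gradient

print(round(w, 4))  # converges to 3.0
```

With one weight the "backward pass" is a single derivative; in a real network the chain rule distributes this same computation across every layer.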
With backpropagation in place, the discussion can now turn to the learning rate: too fast, too slow, or just right.
Common misconception
“Artificial neural networks work like biological brains”
The analogy is loose at best. Biological neurons communicate through complex electrochemical signals with timing, inhibition, and neuromodulation. Artificial neurons compute simple weighted sums with fixed activation functions. The architectures that work best in practice (transformers, convolutions) bear little resemblance to known brain architecture. The most successful AI advances have come from engineering and mathematical insights, not from copying biology.
Common misconception
“You need a powerful GPU to learn about neural networks”
Understanding neural networks requires tracing the mathematics: weighted sums, activation functions, gradients, weight updates. A network with 2 inputs, 3 hidden neurons, and 1 output can be computed entirely by hand. Production-scale training requires GPUs, but learning the concepts does not. Start with small networks you can trace manually. Once the concepts are clear, scaling to large networks is a matter of computation, not comprehension.
The learning rate is arguably the most important hyperparameter in neural network training. It controls how large each weight update step is:
1. Too high, and each update overshoots the minimum; the loss oscillates or diverges.
2. Too low, and training is stable but slow; the loss may barely move in a reasonable number of steps.
3. Just right, and the loss decreases steadily towards a minimum.
Common starting values are 0.001 or 0.01. Learning rate schedulers reduce the rate over time: large steps early (to make rapid progress) and smaller steps later (to fine-tune). The gradient step tool below lets you experiment with this directly.
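The three regimes can be sketched with a one-parameter model trained at three different rates. The model and all numbers are illustrative choices:

```python
# The same one-weight problem trained at three learning rates.

def train(lr, steps=20):
    w, x, target = 0.0, 2.0, 6.0  # loss (w*x - target)**2, minimum at w = 3
    for _ in range(steps):
        grad = 2 * (w * x - target) * x
        w -= lr * grad
    return w

print(train(0.001))  # too low: after 20 steps, still far from 3
print(train(0.1))    # about right: converges to ~3
print(train(10.0))   # too high: each step overshoots and the weight explodes
```

The divergent case mirrors the knowledge check below: an overly large rate does not just slow training down, it actively drives the loss towards infinity.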
“What I cannot create, I do not understand.”
Richard Feynman - Written on his blackboard at the time of his death (February 1988)
Feynman's principle applies directly to learning neural networks. Tracing forward and backward passes through a small network by hand builds understanding that reading equations alone does not. The gradient step tool below lets you 'create' the learning process and observe it directly.
A neural network with two hidden layers uses ReLU activation functions. If you removed all activation functions (making every neuron a pure linear transformation), what would happen?
You set the learning rate to 10.0 and observe that the loss jumps to infinity after a few training steps. What happened?
During backpropagation, the gradient of the loss with respect to a particular weight is -0.05. The learning rate is 0.01. What happens to this weight?
Frank Rosenblatt, 'The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain', Psychological Review, Volume 65 (1958)
Full paper (Sections I-IV)
The original Perceptron paper. Introduced the first learning algorithm for artificial neural networks and proved the Perceptron Convergence Theorem. Used as the opening case study.
Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry (1969)
Chapter 12 (the XOR problem), Expanded Edition (1988) epilogue
Proved that single-layer Perceptrons cannot solve XOR. The book's influence triggered the first AI winter by discouraging neural network research. The 1988 expanded edition acknowledged that multi-layer networks could overcome this limitation.
David Rumelhart, Geoffrey Hinton and Ronald Williams, 'Learning representations by back-propagating errors', Nature, Volume 323 (1986)
Full paper (3 pages)
The paper that popularised backpropagation for multi-layer neural networks, ending the first AI winter. Demonstrated that error gradients could be propagated backwards through layers to train deep networks. Hinton received the 2024 Nobel Prize in Physics partly for this work.
Diederik P. Kingma and Jimmy Ba, 'Adam: A Method for Stochastic Optimization', ICLR 2015
Algorithm 1 (Adam update rule)
Introduced the Adam optimiser, now the default for most neural network training. Adam adapts learning rates per-parameter using estimates of first and second moments of the gradient. Referenced in Section 4.4 as the standard modern optimiser.
Michael Nielsen, Neural Networks and Deep Learning (2015, free online book)
Chapter 1 (Using neural nets to recognize handwritten digits), Chapter 2 (How backpropagation works)
The most accessible introduction to backpropagation available. Nielsen traces the mathematics step by step with worked examples. Recommended as supplementary reading for this module.
You now understand how a neural network computes (forward propagation), how it learns (backpropagation), and how the learning rate controls the process. The next question is: how do you know if a trained model is actually good? Module 5 introduces evaluation metrics: accuracy, precision, recall, F1, confusion matrices, and why the right metric depends on the problem.
Module 4 of 24 · AI Foundations