Modules 1-3 covered what AI is, why data matters, and how machines learn. This module goes inside the machine: how does a neural network actually work? You will trace data through a network from input to output, then follow the error signal backwards as the network learns.

Real-world history · 1958 & the AI winter that followed
In 1958, Frank Rosenblatt, a psychologist at the Cornell Aeronautical Laboratory, built the Mark I Perceptron. It was a physical machine, wired to a 20×20 grid of photocells, that could learn to classify simple visual patterns. The New York Times reported it as the "embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
The Perceptron was genuinely innovative. It was the first machine that could learn from examples by adjusting weights automatically. Rosenblatt proved mathematically that if a linear solution to a classification problem exists, the Perceptron learning algorithm will find it (the Perceptron Convergence Theorem).
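Rosenblatt's learning rule is simple enough to sketch in a few lines. The following is a minimal illustration in Python, not his original implementation; the learning rate, the number of passes, and the AND function as the target (which, unlike XOR, is linearly separable) are all arbitrary choices for the example:

```python
# A minimal sketch of the Perceptron learning rule, trained on AND.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1 = w2 = b = 0.0
lr = 0.1

for _ in range(20):  # repeated passes over the examples
    for (x1, x2), target in data:
        pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
        # nudge each weight in proportion to the error on this example
        w1 += lr * (target - pred) * x1
        w2 += lr * (target - pred) * x2
        b  += lr * (target - pred)

preds = [1 if w1 * x1 + w2 * x2 + b > 0 else 0 for (x1, x2), _ in data]
print(preds)  # the network has learned AND: [0, 0, 0, 1]
```

Because AND is linearly separable, the Convergence Theorem guarantees this loop finds a correct set of weights.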
In 1969, Marvin Minsky and Seymour Papert published Perceptrons, which proved that a single-layer Perceptron cannot solve the XOR problem (a simple logical function where the output is true if exactly one input is true). This limitation was real, but the conclusion many drew, that neural networks were a dead end, was premature. Multi-layer networks can solve XOR easily. But funding dried up, and the first "AI winter" lasted through much of the 1970s.
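To see why the limitation applies only to single-layer networks, here is a sketch of a two-layer network that computes XOR using hand-picked weights. These particular weights and thresholds are illustrative, one of many valid choices:

```python
# A two-layer network that computes XOR with hand-picked weights.

def step(z):
    """Rosenblatt-style step activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden neuron 1: fires for OR
    h_and = step(x1 + x2 - 1.5)      # hidden neuron 2: fires for AND
    return step(h_or - h_and - 0.5)  # output: OR but not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

The hidden layer recodes the inputs into a representation (OR, AND) in which the problem becomes linearly separable, which is exactly what a single layer cannot do on its own.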
Rosenblatt's Perceptron could learn, but it could not solve a simple logical problem. Was the criticism fair, or did it set the field back unnecessarily?
The Perceptron story contains two lessons relevant today. First, a single neuron (one computation unit) is limited, but networks of neurons can solve problems that individual neurons cannot. Second, hype followed by disappointment is a recurring pattern in AI history. Understanding how neural networks actually work, rather than what headlines claim they do, protects you from both failure modes.
If you already understand forward propagation and backpropagation, use the knowledge checks to confirm and skip to Module 5: Evaluating AI.
This module begins by examining the single neuron in depth: the weighted sum, the bias, and the activation function.
A single artificial neuron performs three operations:
1. Weighted sum: multiply each input by its weight and add the results together.
2. Bias: add a constant offset that shifts the neuron's threshold.
3. Activation: pass the result through an activation function to produce the output.
Rosenblatt's Perceptron used a step function as its activation: output 1 if the weighted sum exceeds a threshold, output 0 otherwise. Modern networks use smooth functions (ReLU, sigmoid) that allow gradients to flow during training.
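The three operations can be sketched in a few lines of Python. The input values, weights, bias, and the choice of ReLU here are illustrative, not special:

```python
# The three operations of a single artificial neuron.

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs))  # 1. weighted sum
    z = z + bias                                     # 2. add bias
    return relu(z)                                   # 3. activation

out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(out)  # relu(0.5 - 0.5 + 0.1) = 0.1
```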
With the single neuron in place, the discussion can now turn to layers and forward propagation, which build directly on it.
“The key insight is that a multi-layer network of simple computing units can learn complex functions, provided there is a way to adjust the weights based on the error of the network's output.”
David Rumelhart, Geoffrey Hinton, Ronald Williams - 'Learning representations by back-propagating errors', Nature, Volume 323 (October 1986)
This 1986 paper revived neural networks after the AI winter by demonstrating backpropagation, the algorithm that allows multi-layer networks to learn by propagating error gradients backwards through the network. Hinton later won the 2024 Nobel Prize in Physics for foundational work on neural networks.
A neural network arranges neurons in layers:
1. The input layer receives the raw data, one value per input feature.
2. One or more hidden layers transform the data step by step, each neuron computing its own weighted sum, bias, and activation.
3. The output layer produces the final prediction.
Forward propagation is the process of passing data through the network from input to output. Each neuron in each layer computes its weighted sum + bias + activation, and passes the result to the next layer. At the end, the output layer produces a prediction. The loss function then measures how far that prediction is from the correct answer.
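As a sketch, a forward pass through a tiny network with 2 inputs, 3 hidden neurons, and 1 output might look like this. All weights, biases, the target value, and the squared-error loss are illustrative choices:

```python
# A forward pass through a 2-3-1 network, small enough to trace by hand.

def relu(z):
    return max(0.0, z)

def layer(inputs, weights, biases, activation):
    # one row of weights per neuron in the layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [1.0, 0.5]                                        # input layer
h = layer(x, [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.6]],  # hidden layer (ReLU)
          [0.0, 0.1, 0.0], relu)
y = layer(h, [[1.0, -1.0, 0.5]], [0.0], lambda z: z)  # output layer (linear)

loss = (y[0] - 1.0) ** 2   # squared error against a target of 1.0
print(h, y, loss)
```

Note that two of the three hidden neurons output exactly zero here: ReLU cuts off any negative weighted sum, which is worth watching for when tracing a network by hand.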
The term "deep" in deep learning refers to the depth of the network: the number of hidden layers. Modern networks can have tens, hundreds, or even thousands of layers. GPT-4 is believed to have on the order of 120 transformer layers (the exact architecture is not publicly documented).
With layers and forward propagation in place, the discussion can now turn to backpropagation: how the network learns from its mistakes.
Forward propagation gives us a prediction. The loss function tells us how wrong it is. Backpropagation answers the question: which weights contributed most to the error, and in which direction should we adjust them?
The algorithm works backwards through the network:
1. Compute the gradient of the loss with respect to the output layer's weights.
2. Use the chain rule to propagate that gradient backwards, layer by layer, so that every weight in the network receives a gradient.
3. Update each weight by a small step in the direction opposite its gradient, scaled by the learning rate.
This process, forward pass → compute loss → backward pass → update weights, is one training step. It repeats thousands or millions of times until the loss converges to a minimum.
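The loop can be made concrete with the smallest possible example: a model with a single weight, trained by gradient descent. The input, target, learning rate, and step count are illustrative choices:

```python
# One training step, repeated: forward pass -> loss -> gradient -> update.

w = 0.0                # the single trainable weight
lr = 0.1               # learning rate
x, target = 2.0, 6.0   # the loss is minimised at w = 3.0

for step in range(20):
    pred = w * x                    # forward pass
    loss = (pred - target) ** 2     # compute loss
    grad = 2 * (pred - target) * x  # backward pass: d(loss)/d(w)
    w -= lr * grad                  # update weight against the gradient

print(round(w, 4))  # converges to 3.0
```

With one weight the "backward pass" is a single derivative; in a real network the chain rule distributes this same computation across every layer.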
With backpropagation in place, the discussion can now turn to the learning rate: too fast, too slow, or just right.
Common misconception
“Artificial neural networks work like biological brains”
The analogy is loose at best. Biological neurons communicate through complex electrochemical signals with timing, inhibition, and neuromodulation. Artificial neurons compute simple weighted sums with fixed activation functions. The architectures that work best in practice (transformers, convolutions) bear little resemblance to known brain architecture. The most successful AI advances have come from engineering and mathematical insights, not from copying biology.
Common misconception
“You need a powerful GPU to learn about neural networks”
Understanding neural networks requires tracing the mathematics: weighted sums, activation functions, gradients, weight updates. A network with 2 inputs, 3 hidden neurons, and 1 output can be computed entirely by hand. Production-scale training requires GPUs, but learning the concepts does not. Start with small networks you can trace manually. Once the concepts are clear, scaling to large networks is a matter of computation, not comprehension.
The learning rate is arguably the most important hyperparameter in neural network training. It controls how large each weight update step is:
1. Too high, and each update overshoots the minimum; the loss oscillates or diverges.
2. Too low, and training is stable but slow; the loss may barely move in a reasonable number of steps.
3. Just right, and the loss decreases steadily towards a minimum.
Common starting values are 0.001 or 0.01. Learning rate schedulers reduce the rate over time: large steps early (to make rapid progress) and smaller steps later (to fine-tune). The gradient step tool below lets you experiment with this directly.
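The three regimes can be sketched with a one-parameter model trained at three different rates. The model and all numbers are illustrative choices:

```python
# The same one-weight problem trained at three learning rates.

def train(lr, steps=20):
    w, x, target = 0.0, 2.0, 6.0  # loss (w*x - target)**2, minimum at w = 3
    for _ in range(steps):
        grad = 2 * (w * x - target) * x
        w -= lr * grad
    return w

print(train(0.001))  # too low: after 20 steps, still far from 3
print(train(0.1))    # about right: converges to ~3
print(train(10.0))   # too high: each step overshoots and the weight explodes
```

The divergent case mirrors the knowledge check below: an overly large rate does not just slow training down, it actively drives the loss towards infinity.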
“What I cannot create, I do not understand.”
Richard Feynman - Written on his blackboard at the time of his death (February 1988)
Feynman's principle applies directly to learning neural networks. Tracing forward and backward passes through a small network by hand builds understanding that reading equations alone does not. The gradient step tool below lets you 'create' the learning process and observe it directly.
A neural network with two hidden layers uses ReLU activation functions. If you removed all activation functions (making every neuron a pure linear transformation), what would happen?
You set the learning rate to 10.0 and observe that the loss jumps to infinity after a few training steps. What happened?
During backpropagation, the gradient of the loss with respect to a particular weight is -0.05. The learning rate is 0.01. What happens to this weight?
Frank Rosenblatt, 'The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain', Psychological Review, Volume 65 (1958)
Full paper (Sections I-IV)
The original Perceptron paper. Introduced the first learning algorithm for artificial neural networks and proved the Perceptron Convergence Theorem. Used as the opening case study.
Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry (1969)
Chapter 12 (the XOR problem), Expanded Edition (1988) epilogue
Proved that single-layer Perceptrons cannot solve XOR. The book's influence triggered the first AI winter by discouraging neural network research. The 1988 expanded edition acknowledged that multi-layer networks could overcome this limitation.
David Rumelhart, Geoffrey Hinton and Ronald Williams, 'Learning representations by back-propagating errors', Nature, Volume 323 (1986)
Full paper (3 pages)
The paper that popularised backpropagation for multi-layer neural networks, ending the first AI winter. Demonstrated that error gradients could be propagated backwards through layers to train deep networks. Hinton received the 2024 Nobel Prize in Physics partly for this work.
Diederik P. Kingma and Jimmy Ba, 'Adam: A Method for Stochastic Optimization', ICLR 2015
Algorithm 1 (Adam update rule)
Introduced the Adam optimiser, now the default for most neural network training. Adam adapts learning rates per-parameter using estimates of first and second moments of the gradient. Referenced in Section 4.4 as the standard modern optimiser.
Michael Nielsen, Neural Networks and Deep Learning (2015, free online book)
Chapter 1 (Using neural nets to recognize handwritten digits), Chapter 2 (How backpropagation works)
The most accessible introduction to backpropagation available. Nielsen traces the mathematics step by step with worked examples. Recommended as supplementary reading for this module.
You now understand how a neural network computes (forward propagation), how it learns (backpropagation), and how the learning rate controls the process. The next question is: how do you know if a trained model is actually good? Module 5 introduces evaluation metrics: accuracy, precision, recall, F1, confusion matrices, and why the right metric depends on the problem.
Module 4 of 24 · AI Foundations