Understand image classification with convolutional neural networks, real-time object detection with YOLO, semantic segmentation, and generative image models including diffusion (DDPM, Stable Diffusion).
By the end of this module you will be able to:
- explain how convolutional neural networks extract hierarchical features for image classification
- describe how YOLO performs real-time object detection in a single forward pass
- distinguish classification, object detection, and semantic segmentation as separate tasks
- explain how diffusion models (DDPM, Stable Diffusion) generate images by reversing a noising process
Uber self-driving fatality, Tempe AZ, 18 March 2018
On 18 March 2018, an Uber autonomous test vehicle struck and killed Elaine Herzberg as she walked her bicycle across a road in Tempe, Arizona. The NTSB investigation revealed that the vehicle's perception system detected Herzberg approximately 5.6 seconds before impact but repeatedly reclassified her: first as an unknown object, then as a vehicle, then as a bicycle, each reclassification resetting the prediction of her path.
The system's object detection model could detect her presence but could not maintain a consistent classification across frames. Worse, the software had been configured to suppress false positives by requiring a stable classification before initiating emergency braking. The safety driver was watching a video on her phone. The car never braked.
The incident exposed critical failures in computer vision system design: classification confidence thresholds were tuned for comfort rather than safety, object tracking did not persist across reclassification events, and the system had no fallback for ambiguous detections. Uber suspended autonomous testing for nine months. The NTSB cited inadequate safety culture as the probable cause.
Convolutional neural networks (CNNs) are the foundation of modern computer vision. A CNN processes an image through a series of convolutional layers, each applying learned filters (kernels) that detect increasingly complex patterns. Early layers detect edges and textures. Middle layers detect parts (eyes, wheels, corners). Deep layers detect complete objects and scenes.
A convolutional layer slides a small filter (typically 3x3 or 5x5 pixels) across the input image, computing a dot product at each position. This produces a feature map that highlights where the filter's pattern appears. Pooling layers (typically max pooling) downsample feature maps, reducing spatial dimensions while preserving the strongest activations. The alternation of convolution and pooling creates a hierarchy from local features to global understanding.
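To make these mechanics concrete, here is a minimal sketch of a single convolution-plus-pooling stage. PyTorch is assumed purely for illustration (the module does not prescribe a framework); the printed shapes show how each of the 16 learned 3x3 filters produces its own feature map and how 2x2 max pooling halves the spatial resolution.

```python
# Minimal sketch of one convolution + pooling stage (PyTorch assumed for illustration).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)   # batch of one RGB image, 32x32 pixels

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)  # 2x2 max pooling halves height and width

feature_maps = conv(image)                    # 16 feature maps, one per learned 3x3 filter
downsampled = pool(torch.relu(feature_maps))  # keep only the strongest local activations

print(feature_maps.shape)   # torch.Size([1, 16, 32, 32])
print(downsampled.shape)    # torch.Size([1, 16, 16, 16])
```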
Key architectures include AlexNet (2012, proved deep CNNs work on ImageNet), VGGNet (2014, very deep with small 3x3 filters), ResNet (2015, residual connections enabling 152+ layers), and EfficientNet (2019, neural architecture search for optimal width/depth/resolution scaling). Transfer learning from pre-trained ImageNet models is standard practice: fine-tune a ResNet that has already learned general visual features from ImageNet rather than training from scratch.
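The transfer-learning recipe can be sketched in a few lines. The sketch below assumes torchvision is available; the ten-class downstream task and the decision to freeze the backbone are illustrative assumptions, not part of the lesson.

```python
# Hedged sketch: fine-tuning an ImageNet pre-trained ResNet (torchvision assumed).
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical downstream task

# Load a ResNet-50 with ImageNet pre-trained weights instead of training from scratch.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Optionally freeze the convolutional backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a head for the new task.
model.fc = nn.Linear(model.fc.in_features, num_classes)
```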
With convolutional neural networks in place, we can now turn to image classification, which builds directly on them.
Image classification assigns a single label to an entire image. The CNN extracts features through convolutional layers, then a fully connected classification head maps the final feature vector to a probability distribution over classes using softmax. The model outputs the class with the highest probability as its prediction.
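A small sketch of that classification head: a fully connected layer maps the feature vector to logits, and softmax turns the logits into a probability distribution. The 512-dimensional feature vector and 1,000 classes are assumed for illustration.

```python
# Illustrative only: mapping a feature vector to class probabilities with softmax.
import torch
import torch.nn as nn

feature_vector = torch.randn(1, 512)   # output of the convolutional feature extractor
classifier = nn.Linear(512, 1000)      # fully connected head for 1,000 classes

logits = classifier(feature_vector)
probs = torch.softmax(logits, dim=1)   # probability distribution over classes
predicted_class = probs.argmax(dim=1)  # the label with the highest probability
```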
ImageNet, the benchmark dataset that drove the deep learning revolution, contains over 14 million labelled images; the ILSVRC classification challenge uses a subset of roughly 1.2 million training images across 1,000 classes. Human top-5 error rate on this benchmark is approximately 5.1%. ResNet surpassed this in 2015 with 3.6% top-5 error, and modern architectures such as EfficientNet and Vision Transformers (ViT) achieve below 2%.
Vision Transformers (ViT, Dosovitskiy et al., 2020) apply the transformer architecture to images by splitting the image into fixed-size patches (e.g., 16x16 pixels), treating each patch as a token, and processing the sequence with standard transformer self-attention. ViT demonstrates that the transformer architecture generalises beyond text when given sufficient training data and compute.
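As a rough sketch of the patch-embedding step (PyTorch assumed; the 224x224 input and 768-dimensional embedding follow the common ViT-Base configuration but are assumptions here), a strided convolution extracts and projects each 16x16 patch into a token:

```python
# Sketch of ViT-style patch embedding: 16x16 patches become a sequence of tokens.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)
patch_size, embed_dim = 16, 768

# A convolution with stride equal to the patch size extracts and projects each patch.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = patch_embed(image)                 # shape (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # shape (1, 196, 768): 196 patch tokens
# These tokens (plus a class token and position embeddings) feed a standard transformer.
```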
With image classification covered, the discussion turns to object detection and YOLO, which build directly on these foundations.
Object detection locates and classifies multiple objects within an image, outputting bounding boxes with class labels and confidence scores. Earlier approaches like R-CNN used a two-stage pipeline: propose regions, then classify each region. This was accurate but slow (approximately 0.5 FPS).
YOLO (You Only Look Once) reformulated object detection as a single regression problem. The network divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell in a single forward pass. YOLOv1 (2016) ran at 45 FPS on a GPU. Later versions (up to YOLOv8, released in 2023) deliver substantially higher accuracy while still running at over 100 FPS.
Each prediction includes: bounding box coordinates (x, y, width, height), an objectness score (probability that the box contains any object), and class probabilities. Non-maximum suppression (NMS) removes duplicate detections by suppressing overlapping boxes with lower confidence scores. The mean Average Precision (mAP) metric evaluates detection accuracy across all classes at different intersection-over-union (IoU) thresholds.
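A simplified, framework-free sketch of non-maximum suppression follows. It assumes boxes in corner format (x1, y1, x2, y2), which differs from YOLO's centre/width/height output, and implements the textbook greedy procedure rather than any particular YOLO codebase.

```python
# Simplified non-maximum suppression (NMS); boxes use (x1, y1, x2, y2) corner format.
def iou(a, b):
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop overlapping lower-confidence duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```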
From object detection with YOLO we move to semantic segmentation, which takes localisation down to the level of individual pixels.
“We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.”
Redmon et al., 'You Only Look Once: Unified, Real-Time Object Detection', 2016
Semantic segmentation assigns a class label to every pixel in an image. Where object detection draws bounding boxes, segmentation produces pixel-precise masks. This is critical for applications where shape matters: autonomous driving (distinguish road from pavement from obstacle), medical imaging (delineate tumour boundaries), and satellite analysis (land use mapping).
Fully Convolutional Networks (FCN, Long et al., 2015) were the first to achieve end-to-end pixel-wise prediction by replacing the fully connected classification head with convolutional layers and using transposed convolutions to upsample feature maps back to the original image resolution. U-Net (2015) added skip connections between the encoder (downsampling) and decoder (upsampling) paths, preserving fine spatial details that are lost during pooling.
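A toy sketch of the U-Net idea, assuming PyTorch: one encoder stage, one decoder stage, a transposed convolution for upsampling, and a single skip connection that concatenates encoder features into the decoder so fine spatial detail survives pooling. It is illustrative only and far smaller than a real U-Net.

```python
# Toy U-Net-style encoder/decoder with one skip connection (illustrative sketch).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = nn.Conv2d(3, 16, 3, padding=1)           # encoder stage
        self.down = nn.MaxPool2d(2)                          # downsample
        self.bottleneck = nn.Conv2d(16, 32, 3, padding=1)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)    # transposed conv upsamples
        self.dec = nn.Conv2d(32, 16, 3, padding=1)           # 32 = 16 upsampled + 16 skip
        self.head = nn.Conv2d(16, num_classes, 1)            # per-pixel class logits

    def forward(self, x):
        e = torch.relu(self.enc(x))
        b = torch.relu(self.bottleneck(self.down(e)))
        u = self.up(b)
        u = torch.cat([u, e], dim=1)                         # skip connection preserves detail
        return self.head(torch.relu(self.dec(u)))

logits = TinyUNet()(torch.randn(1, 3, 64, 64))               # (1, 2, 64, 64): one score per pixel per class
```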
Instance segmentation (Mask R-CNN, 2017) extends semantic segmentation by distinguishing individual objects of the same class. Panoptic segmentation combines semantic and instance segmentation, classifying every pixel while separating individual object instances.
With semantic segmentation covered, the discussion turns to generative image models: diffusion, from DDPM to Stable Diffusion.
Common misconception
“Object detection and image classification are the same task.”
Classification assigns one label to the entire image: 'this image contains a cat'. Object detection locates multiple objects with bounding boxes: 'there is a cat at coordinates (120, 45, 280, 210) with 94% confidence and a dog at (350, 100, 520, 300) with 87% confidence'. Semantic segmentation goes further, classifying every single pixel. These are three distinct tasks with different architectures, loss functions, and evaluation metrics.
Diffusion models generate images by learning to reverse a gradual noising process. The forward process adds Gaussian noise to a training image over many timesteps until it becomes pure noise. The model learns to predict and remove the noise at each step, progressively recovering the original image structure.
Denoising Diffusion Probabilistic Models (DDPM, Ho et al., 2020) formalised this framework. The model is a U-Net conditioned on the timestep, trained to predict the noise that was added at each step. At inference time, the model starts from random noise and iteratively denoises it over hundreds of steps, producing a novel image from the learned distribution.
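A hedged sketch of the DDPM training objective, assuming PyTorch and a linear noise schedule as in the original paper; the name "model" stands in for the timestep-conditioned U-Net and is a placeholder, not a specific implementation.

```python
# Sketch of the DDPM training objective: add noise at a random timestep, predict it.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule, as in DDPM
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_loss(model, x0):
    """x0: batch of clean images; model: placeholder U-Net taking (noisy image, timestep)."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Forward process in closed form: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    predicted_noise = model(x_t, t)               # the model learns to predict the added noise
    return F.mse_loss(predicted_noise, noise)
```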
Stable Diffusion (Rombach et al., 2022) made diffusion practical by operating in a compressed latent space rather than pixel space. An encoder compresses the image to a lower-dimensional latent representation, diffusion occurs in this latent space (much cheaper computationally), and a decoder reconstructs the full image. Text conditioning is added through cross-attention with CLIP text embeddings, enabling text-to-image generation.
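In practice, Stable Diffusion is usually run through a library such as Hugging Face diffusers. The sketch below assumes that library and a publicly released checkpoint; the model identifier, precision, and step count are illustrative assumptions rather than recommendations.

```python
# Hedged usage sketch with the Hugging Face diffusers library (assumed installed).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # latent-space diffusion is cheaper than pixel space, but still wants a GPU

# The text prompt conditions generation through cross-attention with CLIP text embeddings.
image = pipe("a watercolour painting of a lighthouse at dusk", num_inference_steps=50).images[0]
image.save("lighthouse.png")
```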
Classifier-free guidance (Ho and Salimans, 2022) controls the strength of text conditioning: higher guidance scales produce images that more closely match the text prompt at the cost of reduced diversity. A guidance scale of 7.5 is typical. At scale 1.0, the model largely ignores the text; at 20+, images become oversaturated and distorted.
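Classifier-free guidance itself reduces to a single interpolation between the unconditional and conditional noise predictions at each sampling step. The sketch below uses illustrative names; the model signature and embeddings are assumptions, not any specific library's API.

```python
# Sketch of classifier-free guidance at sampling time (names are illustrative).
def guided_noise_prediction(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
    """Blend conditional and unconditional noise predictions."""
    eps_uncond = model(x_t, t, null_embedding)  # prediction ignoring the prompt
    eps_cond = model(x_t, t, text_embedding)    # prediction conditioned on the prompt
    # Higher guidance_scale pushes the sample towards the prompt, at the cost of diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```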
With diffusion models (DDPM and Stable Diffusion) covered, the module closes with the ethical considerations that computer vision systems raise.
Computer vision systems inherit and amplify biases present in training data. Facial recognition systems have demonstrated significantly higher error rates for darker-skinned individuals and women (Buolamwini and Gebru, 2018). ImageNet itself contains biased label distributions and harmful category labels. These biases propagate through transfer learning to downstream applications.
Generative models raise additional concerns: deepfakes can produce convincing fabricated images and video of real people, diffusion models can reproduce copyrighted content from training data, and text-to-image systems can generate harmful or misleading content at scale. Watermarking and provenance tracking (C2PA standard) are emerging as technical countermeasures, but the fundamental challenge of distinguishing synthetic from real visual content remains unsolved.
In the diffusion process used by Stable Diffusion, what happens during the forward process and what does the model learn to do during training?
The Uber autonomous vehicle detected Elaine Herzberg 5.6 seconds before impact but repeatedly reclassified her. Which computer vision failure mode does this represent?
You can now explain how machines interpret and generate visual content, from classification to diffusion. Building AI systems is only half the challenge: getting them into production is the other half. How do you deploy, monitor, and maintain ML models in production? Module 13 covers deployment and MLOps.
Redmon et al., 'You Only Look Once: Unified, Real-Time Object Detection' (2016)
The YOLO paper that reformulated object detection as a single-pass regression problem, enabling real-time inference.
Ho et al., 'Denoising Diffusion Probabilistic Models' (DDPM, 2020)
Formalised the diffusion framework for image generation that underpins Stable Diffusion, DALL-E, and Midjourney.
Rombach et al., 'High-Resolution Image Synthesis with Latent Diffusion Models' (Stable Diffusion, 2022)
Introduced latent-space diffusion, making high-resolution image generation computationally practical.
NTSB, 'Collision Between Vehicle Controlled by Developmental ADS and Pedestrian, Tempe AZ' (2019)
The NTSB investigation report on the Uber autonomous vehicle fatality, documenting the perception system failures.
He et al., 'Deep Residual Learning for Image Recognition' (ResNet, 2015)
ResNet introduced residual connections enabling training of 152+ layer CNNs, still the backbone of most transfer learning.