Understand image classification with convolutional neural networks, real-time object detection with YOLO, semantic segmentation, and generative image models including diffusion (DDPM, Stable Diffusion).
By the end of this module you will be able to:
- explain how convolutional neural networks extract hierarchical features for image classification
- describe how YOLO performs real-time object detection in a single forward pass
- distinguish classification, object detection, and semantic segmentation as separate tasks
- explain how diffusion models (DDPM, Stable Diffusion) generate images by reversing a noising process
Uber self-driving fatality, Tempe AZ, 18 March 2018
On 18 March 2018, an Uber autonomous test vehicle struck and killed Elaine Herzberg as she walked her bicycle across a road in Tempe, Arizona. The NTSB investigation revealed that the vehicle's perception system detected Herzberg approximately 5.6 seconds before impact but repeatedly reclassified her: first as an unknown object, then as a vehicle, then as a bicycle, each reclassification resetting the prediction of her path.
The system's object detection model could detect her presence but could not maintain a consistent classification across frames. Worse, the software had been configured to suppress false positives by requiring a stable classification before initiating emergency braking. The safety driver was watching a video on her phone. The car never braked.
The incident exposed critical failures in computer vision system design: classification confidence thresholds were tuned for comfort rather than safety, object tracking did not persist across reclassification events, and the system had no fallback for ambiguous detections. Uber suspended autonomous testing for nine months. The NTSB cited inadequate safety culture as the probable cause.
Convolutional neural networks (CNNs) are the foundation of modern computer vision. A CNN processes an image through a series of convolutional layers, each applying learned filters (kernels) that detect increasingly complex patterns. Early layers detect edges and textures. Middle layers detect parts (eyes, wheels, corners). Deep layers detect complete objects and scenes.
A convolutional layer slides a small filter (typically 3x3 or 5x5 pixels) across the input image, computing a dot product at each position. This produces a feature map that highlights where the filter's pattern appears. Pooling layers (typically max pooling) downsample feature maps, reducing spatial dimensions while preserving the strongest activations. The alternation of convolution and pooling creates a hierarchy from local features to global understanding.
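To make these mechanics concrete, here is a minimal sketch of a single convolution-plus-pooling stage. PyTorch is assumed purely for illustration (the module does not prescribe a framework); the printed shapes show how each of the 16 learned 3x3 filters produces its own feature map and how 2x2 max pooling halves the spatial resolution.

```python
# Minimal sketch of one convolution + pooling stage (PyTorch assumed for illustration).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)   # batch of one RGB image, 32x32 pixels

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)  # 2x2 max pooling halves height and width

feature_maps = conv(image)                    # 16 feature maps, one per learned 3x3 filter
downsampled = pool(torch.relu(feature_maps))  # keep only the strongest local activations

print(feature_maps.shape)   # torch.Size([1, 16, 32, 32])
print(downsampled.shape)    # torch.Size([1, 16, 16, 16])
```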
Key architectures include AlexNet (2012, proved deep CNNs work on ImageNet), VGGNet (2014, very deep with small 3x3 filters), ResNet (2015, residual connections enabling 152+ layers), and EfficientNet (2019, neural architecture search for optimal width/depth/resolution scaling). Transfer learning from pre-trained ImageNet models is standard practice: fine-tune a ResNet that has already learned general visual features from ImageNet rather than training from scratch.
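The transfer-learning recipe can be sketched in a few lines. The sketch below assumes torchvision is available; the ten-class downstream task and the decision to freeze the backbone are illustrative assumptions, not part of the lesson.

```python
# Hedged sketch: fine-tuning an ImageNet pre-trained ResNet (torchvision assumed).
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical downstream task

# Load a ResNet-50 with ImageNet pre-trained weights instead of training from scratch.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Optionally freeze the convolutional backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a head for the new task.
model.fc = nn.Linear(model.fc.in_features, num_classes)
```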
With convolutional neural networks in place, we can now turn to image classification, which builds directly on them.
Image classification assigns a single label to an entire image. The CNN extracts features through convolutional layers, then a fully connected classification head maps the final feature vector to a probability distribution over classes using softmax. The model outputs the class with the highest probability as its prediction.
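A small sketch of that classification head: a fully connected layer maps the feature vector to logits, and softmax turns the logits into a probability distribution. The 512-dimensional feature vector and 1,000 classes are assumed for illustration.

```python
# Illustrative only: mapping a feature vector to class probabilities with softmax.
import torch
import torch.nn as nn

feature_vector = torch.randn(1, 512)   # output of the convolutional feature extractor
classifier = nn.Linear(512, 1000)      # fully connected head for 1,000 classes

logits = classifier(feature_vector)
probs = torch.softmax(logits, dim=1)   # probability distribution over classes
predicted_class = probs.argmax(dim=1)  # the label with the highest probability
```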
ImageNet, the benchmark dataset that drove the deep learning revolution, contains over 14 million labelled images; the ILSVRC classification challenge uses a subset of roughly 1.2 million training images across 1,000 classes. Human top-5 error rate on this benchmark is approximately 5.1%. ResNet surpassed this in 2015 with 3.6% top-5 error, and modern architectures such as EfficientNet and Vision Transformers (ViT) achieve below 2%.
Vision Transformers (ViT, Dosovitskiy et al., 2020) apply the transformer architecture to images by splitting the image into fixed-size patches (e.g., 16x16 pixels), treating each patch as a token, and processing the sequence with standard transformer self-attention. ViT demonstrates that the transformer architecture generalises beyond text when given sufficient training data and compute.
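As a rough sketch of the patch-embedding step (PyTorch assumed; the 224x224 input and 768-dimensional embedding follow the common ViT-Base configuration but are assumptions here), a strided convolution extracts and projects each 16x16 patch into a token:

```python
# Sketch of ViT-style patch embedding: 16x16 patches become a sequence of tokens.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)
patch_size, embed_dim = 16, 768

# A convolution with stride equal to the patch size extracts and projects each patch.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = patch_embed(image)                 # shape (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # shape (1, 196, 768): 196 patch tokens
# These tokens (plus a class token and position embeddings) feed a standard transformer.
```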
With image classification covered, the discussion turns to object detection and YOLO, which build directly on these foundations.
Object detection locates and classifies multiple objects within an image, outputting bounding boxes with class labels and confidence scores. Earlier approaches like R-CNN used a two-stage pipeline: propose regions, then classify each region. This was accurate but slow (approximately 0.5 FPS).
YOLO (You Only Look Once) reformulated object detection as a single regression problem. The network divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell in a single forward pass. YOLOv1 (2016) ran at 45 FPS on a GPU. Later versions (up to YOLOv8, released in 2023) deliver substantially higher accuracy while still running at over 100 FPS.
Each prediction includes: bounding box coordinates (x, y, width, height), an objectness score (probability that the box contains any object), and class probabilities. Non-maximum suppression (NMS) removes duplicate detections by suppressing overlapping boxes with lower confidence scores. The mean Average Precision (mAP) metric evaluates detection accuracy across all classes at different intersection-over-union (IoU) thresholds.
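A simplified, framework-free sketch of non-maximum suppression follows. It assumes boxes in corner format (x1, y1, x2, y2), which differs from YOLO's centre/width/height output, and implements the textbook greedy procedure rather than any particular YOLO codebase.

```python
# Simplified non-maximum suppression (NMS); boxes use (x1, y1, x2, y2) corner format.
def iou(a, b):
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop overlapping lower-confidence duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```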
From object detection with YOLO we move to semantic segmentation, which takes localisation down to the level of individual pixels.
“We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.”
Redmon et al., 'You Only Look Once: Unified, Real-Time Object Detection', 2016
Semantic segmentation assigns a class label to every pixel in an image. Where object detection draws bounding boxes, segmentation produces pixel-precise masks. This is critical for applications where shape matters: autonomous driving (distinguish road from pavement from obstacle), medical imaging (delineate tumour boundaries), and satellite analysis (land use mapping).
Fully Convolutional Networks (FCN, Long et al., 2015) were the first to achieve end-to-end pixel-wise prediction by replacing the fully connected classification head with convolutional layers and using transposed convolutions to upsample feature maps back to the original image resolution. U-Net (2015) added skip connections between the encoder (downsampling) and decoder (upsampling) paths, preserving fine spatial details that are lost during pooling.
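A toy sketch of the U-Net idea, assuming PyTorch: one encoder stage, one decoder stage, a transposed convolution for upsampling, and a single skip connection that concatenates encoder features into the decoder so fine spatial detail survives pooling. It is illustrative only and far smaller than a real U-Net.

```python
# Toy U-Net-style encoder/decoder with one skip connection (illustrative sketch).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = nn.Conv2d(3, 16, 3, padding=1)           # encoder stage
        self.down = nn.MaxPool2d(2)                          # downsample
        self.bottleneck = nn.Conv2d(16, 32, 3, padding=1)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)    # transposed conv upsamples
        self.dec = nn.Conv2d(32, 16, 3, padding=1)           # 32 = 16 upsampled + 16 skip
        self.head = nn.Conv2d(16, num_classes, 1)            # per-pixel class logits

    def forward(self, x):
        e = torch.relu(self.enc(x))
        b = torch.relu(self.bottleneck(self.down(e)))
        u = self.up(b)
        u = torch.cat([u, e], dim=1)                         # skip connection preserves detail
        return self.head(torch.relu(self.dec(u)))

logits = TinyUNet()(torch.randn(1, 3, 64, 64))               # (1, 2, 64, 64): one score per pixel per class
```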
Instance segmentation (Mask R-CNN, 2017) extends semantic segmentation by distinguishing individual objects of the same class. Panoptic segmentation combines semantic and instance segmentation, classifying every pixel while separating individual object instances.
With semantic segmentation covered, the discussion turns to generative image models: diffusion, from DDPM to Stable Diffusion.
Common misconception
“Object detection and image classification are the same task.”
Classification assigns one label to the entire image: 'this image contains a cat'. Object detection locates multiple objects with bounding boxes: 'there is a cat at coordinates (120, 45, 280, 210) with 94% confidence and a dog at (350, 100, 520, 300) with 87% confidence'. Semantic segmentation goes further, classifying every single pixel. These are three distinct tasks with different architectures, loss functions, and evaluation metrics.
Diffusion models generate images by learning to reverse a gradual noising process. The forward process adds Gaussian noise to a training image over many timesteps until it becomes pure noise. The model learns to predict and remove the noise at each step, progressively recovering the original image structure.
Denoising Diffusion Probabilistic Models (DDPM, Ho et al., 2020) formalised this framework. The model is a U-Net conditioned on the timestep, trained to predict the noise that was added at each step. At inference time, the model starts from random noise and iteratively denoises it over hundreds of steps, producing a novel image from the learned distribution.
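A hedged sketch of the DDPM training objective, assuming PyTorch and a linear noise schedule as in the original paper; the name "model" stands in for the timestep-conditioned U-Net and is a placeholder, not a specific implementation.

```python
# Sketch of the DDPM training objective: add noise at a random timestep, predict it.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule, as in DDPM
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_loss(model, x0):
    """x0: batch of clean images; model: placeholder U-Net taking (noisy image, timestep)."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Forward process in closed form: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    predicted_noise = model(x_t, t)               # the model learns to predict the added noise
    return F.mse_loss(predicted_noise, noise)
```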
Stable Diffusion (Rombach et al., 2022) made diffusion practical by operating in a compressed latent space rather than pixel space. An encoder compresses the image to a lower-dimensional latent representation, diffusion occurs in this latent space (much cheaper computationally), and a decoder reconstructs the full image. Text conditioning is added through cross-attention with CLIP text embeddings, enabling text-to-image generation.
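In practice, Stable Diffusion is usually run through a library such as Hugging Face diffusers. The sketch below assumes that library and a publicly released checkpoint; the model identifier, precision, and step count are illustrative assumptions rather than recommendations.

```python
# Hedged usage sketch with the Hugging Face diffusers library (assumed installed).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # latent-space diffusion is cheaper than pixel space, but still wants a GPU

# The text prompt conditions generation through cross-attention with CLIP text embeddings.
image = pipe("a watercolour painting of a lighthouse at dusk", num_inference_steps=50).images[0]
image.save("lighthouse.png")
```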
Classifier-free guidance (Ho and Salimans, 2022) controls the strength of text conditioning: higher guidance scales produce images that more closely match the text prompt at the cost of reduced diversity. A guidance scale of 7.5 is typical. At scale 1.0, the model largely ignores the text; at 20+, images become oversaturated and distorted.
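Classifier-free guidance itself reduces to a single interpolation between the unconditional and conditional noise predictions at each sampling step. The sketch below uses illustrative names; the model signature and embeddings are assumptions, not any specific library's API.

```python
# Sketch of classifier-free guidance at sampling time (names are illustrative).
def guided_noise_prediction(model, x_t, t, text_embedding, null_embedding, guidance_scale=7.5):
    """Blend conditional and unconditional noise predictions."""
    eps_uncond = model(x_t, t, null_embedding)  # prediction ignoring the prompt
    eps_cond = model(x_t, t, text_embedding)    # prediction conditioned on the prompt
    # Higher guidance_scale pushes the sample towards the prompt, at the cost of diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```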
With diffusion models (DDPM and Stable Diffusion) covered, the module closes with the ethical considerations that computer vision systems raise.
Computer vision systems inherit and amplify biases present in training data. Facial recognition systems have demonstrated significantly higher error rates for darker-skinned individuals and women (Buolamwini and Gebru, 2018). ImageNet itself contains biased label distributions and harmful category labels. These biases propagate through transfer learning to downstream applications.
Generative models raise additional concerns: deepfakes can produce convincing fabricated images and video of real people, diffusion models can reproduce copyrighted content from training data, and text-to-image systems can generate harmful or misleading content at scale. Watermarking and provenance tracking (C2PA standard) are emerging as technical countermeasures, but the fundamental challenge of distinguishing synthetic from real visual content remains unsolved.
In the diffusion process used by Stable Diffusion, what happens during the forward process and what does the model learn to do during training?
The Uber autonomous vehicle detected Elaine Herzberg 5.6 seconds before impact but repeatedly reclassified her. Which computer vision failure mode does this represent?
You can now explain how machines interpret and generate visual content, from classification to diffusion. Building AI systems is only half the challenge: getting them into production is the other half. How do you deploy, monitor, and maintain ML models in production? Module 13 covers deployment and MLOps.
Redmon et al., 'You Only Look Once: Unified, Real-Time Object Detection' (2016)
The YOLO paper that reformulated object detection as a single-pass regression problem, enabling real-time inference.
Ho et al., 'Denoising Diffusion Probabilistic Models' (DDPM, 2020)
Formalised the diffusion framework for image generation that underpins Stable Diffusion, DALL-E, and Midjourney.
Rombach et al., 'High-Resolution Image Synthesis with Latent Diffusion Models' (Stable Diffusion, 2022)
Introduced latent-space diffusion, making high-resolution image generation computationally practical.
NTSB, 'Collision Between Vehicle Controlled by Developmental ADS and Pedestrian, Tempe AZ' (2019)
The NTSB investigation report on the Uber autonomous vehicle fatality, documenting the perception system failures.
He et al., 'Deep Residual Learning for Image Recognition' (ResNet, 2015)
ResNet introduced residual connections enabling training of 152+ layer CNNs, still the backbone of most transfer learning.