This is the fifth of 8 Foundations modules. You have trained a model from scratch in Module 4. Now the question shifts from "how do I build it?" to "how do I know it works?" The evaluation methods in this module apply to every model type you will encounter in the Applied and Practice & Strategy stages that follow (24 modules total).

Real-world failure · 2013
In 2008, Google launched Flu Trends, a system that predicted influenza outbreaks by analysing search queries. It was lauded as a breakthrough in real-time epidemiology, published in Nature, and covered by every major news outlet.
By the 2012-2013 flu season, the model was estimating flu prevalence at nearly 2.4 times the rate reported by the Centers for Disease Control and Prevention (CDC). Researchers at Harvard and Northeastern later found that the model had been overfitting to seasonal search terms that correlated with flu during training but were actually driven by media coverage and winter search behaviour, not actual illness.
The model's training accuracy had been excellent. Its real-world performance was not. This gap, between how a model performs on data it has already seen and how it performs on new data, is the central problem of model evaluation.
A model shows 97% accuracy on training data. Should you deploy it?
Google Flu Trends is a case study in what happens when evaluation is an afterthought. The team had a model that fit the training data beautifully. What they did not have was a rigorous evaluation protocol that would have revealed the model's inability to generalise. This module gives you that protocol.
If the terms accuracy, precision, recall, and F1 are already familiar, use the knowledge checks to confirm your understanding and skip to Module 6: Deep learning architectures.
The module begins where evaluation itself begins: with the confusion matrix.
Every classification model produces four types of outcome when it makes a prediction. Suppose you build a spam filter. For each email, the model predicts either "spam" or "not spam," and each prediction is either correct or incorrect. That gives you four cells: true positives (spam correctly flagged), false positives (legitimate email wrongly flagged as spam), false negatives (spam that slips through), and true negatives (legitimate email correctly left alone).
These four numbers are the raw material for every evaluation metric that follows. The confusion matrix is simply a 2×2 table arranging them. Before calculating any percentage, you should always look at the raw counts first. They tell you where the model is making mistakes and how often.
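A minimal sketch shows how those four counts might be tallied in code for the spam-filter example. The label convention (1 = spam, 0 = not spam) and the toy data are illustrative assumptions, not part of the lesson:

```python
# Tallying the four confusion-matrix cells for a binary classifier.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # flagged as spam, actually spam
        elif pred == positive and truth != positive:
            fp += 1          # flagged as spam, actually legitimate
        elif pred != positive and truth == positive:
            fn += 1          # spam that slipped through
        else:
            tn += 1          # legitimate email correctly passed
    return tp, fp, fn, tn

# Ten example emails: 1 = spam, 0 = not spam.
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")   # TP=3  FP=1  FN=1  TN=5
```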
With the confusion matrix in place, the discussion can turn to accuracy, the metric built most directly from those four counts, and to why it so often misleads.
Accuracy is the simplest metric: the proportion of all predictions that were correct. It equals (TP + TN) / (TP + TN + FP + FN). If your spam filter correctly classifies 950 out of 1,000 emails, accuracy is 95%.
The problem arises when classes are imbalanced. Consider a medical test for a rare disease affecting 1% of the population. A model that always predicts "no disease" achieves 99% accuracy while catching zero actual cases. Accuracy tells you the model is almost always right; it hides the fact that it is useless for the task you care about.
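A short sketch makes the point concrete. The patient counts below are assumed for illustration and follow the 1%-prevalence scenario above:

```python
# Accuracy computed from the confusion-matrix counts, and why it misleads
# on imbalanced data.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# A model that always predicts "no disease" on 1,000 patients, 10 of whom
# actually have the disease:
tp, fp = 0, 0          # it never predicts "disease", so no positives at all
fn = 10                # every actual case is missed
tn = 990               # every healthy patient is "correctly" cleared

print(accuracy(tp, tn, fp, fn))   # 0.99 -- yet the model catches zero cases
```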
“Accuracy is not a useful metric when the class distribution is highly skewed.”
Provost, F. & Fawcett, T., Data Science for Business (2013) - Chapter 8: Visualizing Model Performance
This observation underpins why practitioners moved beyond accuracy to metrics that separately evaluate performance on each class. The spam filter example and the rare disease scenario both demonstrate the same fundamental limitation.
Common misconception
“A model with 99% accuracy is a good model.”
Accuracy alone says nothing about which errors the model makes. In imbalanced datasets, a model can achieve very high accuracy by always predicting the majority class. You need precision and recall to understand performance on the class that matters. A cancer screening tool with 99% accuracy that misses every actual cancer patient is worse than useless: it is dangerous.
Because accuracy hides which errors a model makes, practitioners turn to precision, recall, and the F1 score, which evaluate each kind of error separately.
Precision answers: of all the items the model flagged as positive, how many were actually positive? Precision = TP / (TP + FP). High precision means few false alarms. If your spam filter has 98% precision, almost every email it sends to the spam folder genuinely is spam.
Recall (also called sensitivity or true positive rate) answers: of all the items that were actually positive, how many did the model catch? Recall = TP / (TP + FN). High recall means few misses. If your cancer screening has 95% recall, it catches 95 out of every 100 actual cancer cases.
There is an inherent tension between precision and recall. Making the model more aggressive (lowering its decision threshold) catches more positives (higher recall) but also flags more negatives incorrectly (lower precision). The reverse is also true. Which matters more depends entirely on the domain.
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It penalises extreme imbalances between the two. An F1 of 0.90 means both precision and recall are reasonably high; an F1 of 0.60 means at least one of them is poor.
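The three formulas translate directly into code. This is a minimal sketch; the zero-division guards and the example numbers are assumptions added for illustration:

```python
# Precision, recall, and F1 from the confusion-matrix counts.

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0   # of everything flagged, how much was right?

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0   # of everything real, how much was caught?

def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0  # harmonic mean punishes imbalance

# The always-"no disease" model from the previous example:
p, r = precision(tp=0, fp=0), recall(tp=0, fn=10)
print(p, r, f1(p, r))              # 0.0 0.0 0.0 -- the 99% accuracy vanishes

# A model with decent but unequal precision and recall:
p, r = 0.95, 0.60
print(round(f1(p, r), 3))          # 0.735 -- dragged towards the weaker of the two
```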
Knowing how to compute precision, recall, and F1 is not the same as knowing which one to optimise. That choice depends on the problem, which is where the discussion turns next.
The choice between optimising for precision or recall is a domain decision, not a technical one. Two scenarios illustrate the difference:
Spam filtering (precision matters more). A false positive means a legitimate email disappears into the spam folder. The user misses a job offer, a medical appointment confirmation, or a legal notice. The cost of a false alarm is high. You want the model to be very confident before flagging something as spam.
Cancer screening (recall matters more). A false negative means a patient with cancer is told they are healthy. They do not receive treatment. The cost of a miss is catastrophic. You want the model to catch every possible case, even if it means some healthy patients receive follow-up tests they did not need.
There is no universal answer. The metric you optimise must reflect the real-world consequences of each type of error. This is why understanding the confusion matrix at a granular level matters: it forces you to confront the trade-offs explicitly rather than hiding behind a single accuracy number.
Common misconception
“The F1 score is always the best metric to use.”
F1 weights precision and recall equally. In many real applications, one type of error is far more costly than the other. A cancer screening system should optimise for recall even at the expense of precision. A criminal justice risk assessment should consider the asymmetric harm of false positives. F1 is a useful default when you have no reason to favour one over the other, but it is rarely the final answer.
Choosing the right metric is only half the evaluation problem. The other half is making sure the number you measure will hold up on data the model has never seen, which brings the discussion to overfitting and underfitting.
Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations, and fails to generalise to new data. The model memorises rather than learns. Training accuracy is high; validation accuracy is significantly lower. Google Flu Trends overfit to search patterns that happened to correlate with flu during training but did not hold up in subsequent years.
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. Both training accuracy and validation accuracy are low. A linear model trying to fit a curved relationship underfits. It has not learned enough.
The goal is a model that sits between these extremes: complex enough to capture genuine patterns, simple enough to generalise. The primary diagnostic tool is comparing training loss and validation loss over training epochs. When training loss continues to decrease but validation loss starts rising, the model has begun overfitting.
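The diagnostic can be as simple as printing the two loss curves side by side and noting where they diverge. The loss values in this sketch are invented to illustrate the pattern, not output from a real training run:

```python
# The overfitting diagnostic described above: watch training and validation
# loss per epoch and note where they diverge.

train_loss = [0.90, 0.60, 0.42, 0.31, 0.24, 0.19, 0.15, 0.12, 0.10, 0.08]
val_loss   = [0.92, 0.65, 0.50, 0.41, 0.37, 0.36, 0.38, 0.42, 0.47, 0.53]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

for epoch, (tr, va) in enumerate(zip(train_loss, val_loss)):
    marker = "  <-- validation loss bottoms out here" if epoch == best_epoch else ""
    print(f"epoch {epoch:2d}  train={tr:.2f}  val={va:.2f}{marker}")

# Training loss keeps falling after the marked epoch while validation loss
# rises: the model has started memorising noise. Early stopping would keep
# the weights from the marked epoch.
```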
“The most important single measure of how well a learning machine works is its generalisation performance, i.e. the prediction error over an independent test sample.”
Hastie, T., Tibshirani, R. & Friedman, J., The Elements of Statistical Learning (2009) - Chapter 7: Model Assessment and Selection
This principle, established in the foundational statistical learning text, is why we never evaluate a model solely on training data. Generalisation to unseen data is the only performance that matters in practice.
Detecting overfitting depends on held-out data, which raises a practical question: how do you obtain a performance estimate on held-out data that you can actually trust? Cross-validation is the standard answer.
A single train/test split is fragile. The particular examples that end up in the test set can make a model look better or worse than it truly is. Cross-validation addresses this by rotating which data serves as the test set.
In k-fold cross-validation, the dataset is divided into k equally sized folds (commonly k=5 or k=10). The model is trained k times. Each time, one fold is held out as the test set and the remaining k-1 folds are used for training. The final performance estimate is the average across all k runs. This gives you both a mean performance and a measure of variance: if performance swings wildly across folds, the model may be unstable or the dataset too small.
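A minimal sketch of the mechanics follows. The "model" is a deliberately trivial stand-in (predict the majority class seen in the training folds) and the toy labels are assumptions; in practice you would plug in a real classifier and typically a library implementation such as scikit-learn's KFold:

```python
# k-fold cross-validation: rotate which fold is held out, average the scores.

import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once, then cut them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        folds.append(idx[start:stop])
    return folds

def majority_class(labels):
    return max(set(labels), key=labels.count)

# Toy labels: 1 = positive class, 0 = negative class.
labels = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0]

k = 5
folds = k_fold_indices(len(labels), k)
accuracies = []
for i in range(k):
    test_idx = folds[i]                                       # fold i is held out
    train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
    prediction = majority_class([labels[j] for j in train_idx])
    correct = sum(labels[j] == prediction for j in test_idx)
    accuracies.append(correct / len(test_idx))

print(f"mean accuracy = {statistics.mean(accuracies):.2f} "
      f"± {statistics.stdev(accuracies):.2f} across {k} folds")
```

The mean across folds is the headline estimate; the spread across folds is the stability check described above.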
Cross-validation is more computationally expensive (you train k models instead of one), but it is the standard in any serious evaluation. When someone reports a result from a single train/test split, treat it with appropriate scepticism.
A cancer screening model has 99% accuracy on a dataset where 1% of patients have cancer. It predicts 'no cancer' for every patient. What is its recall for the cancer class?
During training, you observe that training loss keeps decreasing but validation loss starts increasing after epoch 20. What is happening?
You are building a fraud detection system for a bank. Fraudulent transactions are 0.1% of all transactions. Which evaluation approach is most appropriate?
Lazer, D. et al., 'The Parable of Google Flu: Traps in Big Data Analysis', Science (2014)
Full article
Primary academic analysis of Google Flu Trends' failure. Demonstrates how overfitting to correlated search terms produced predictions 140% higher than CDC estimates. Used as the opening case study.
Provost, F. & Fawcett, T., Data Science for Business (2013)
Chapter 8: Visualizing Model Performance
Establishes that accuracy is unreliable on skewed class distributions. Introduces the confusion matrix, precision, recall, and ROC curves in a business context accessible to non-specialists.
Hastie, T., Tibshirani, R. & Friedman, J., The Elements of Statistical Learning, 2nd ed. (2009)
Chapter 7: Model Assessment and Selection
Foundational statistical learning text. Provides the theoretical basis for cross-validation, the bias-variance trade-off, and why generalisation performance is the only metric that matters.
Sections 3-5
A thorough survey of 24 classification metrics. Establishes when each metric is appropriate and the mathematical relationships between them. Used for the precision-recall-F1 framework.
Kohavi, R., 'A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection', IJCAI (1995)
Full paper
Empirical comparison of cross-validation strategies. Establishes stratified 10-fold cross-validation as the recommended default based on bias and variance analysis across multiple datasets.
You can now evaluate any model rigorously. You know that accuracy lies on imbalanced data, that precision and recall capture different types of error, and that cross-validation produces estimates you can trust. The next question is: what kinds of model architectures produce the best results on different problems? Module 6 introduces convolutional and recurrent neural networks, the architectures that transformed computer vision and sequence modelling.
Module 5 of 24 · AI Foundations