Applied · Module 3
Evaluation, metrics and failure analysis
Accuracy is an easy number to like because it feels clean.
What you will be able to do
1. Explain evaluation, metrics and failure analysis in your own words and apply it to a realistic scenario.
2. Explain why evaluation is not one metric, but evidence that the system is safe enough for the decision.
3. Check the assumption "Metrics match harms" and explain what changes if it is false.
4. Check the assumption "You inspect failures" and explain what changes if it is false.
Before you begin
- Foundations-level vocabulary and concepts
- Confidence with basic diagrams and section terminology
Common ways people get this wrong
- Good average, bad edge cases. The average score can hide failures that matter most.
- Threshold optimism. Thresholds set too aggressively create a system that looks good on paper and harms users.
Main idea at a glance
Model evaluation in practice
Separate data, test honestly, then monitor in the real world.
Stage 1: Training set. The data the model learns from. The model will always look good here. I think treating training loss as proof of quality is a common mistake that leads to production failures.
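The separation these stages describe can be sketched as a simple shuffled holdout split. The function name and split fraction here are illustrative, not a library API:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle and split rows so evaluation uses data the model never saw."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 80 20
```

The fixed seed matters: an unreproducible split makes before/after comparisons meaningless.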
Accuracy is an easy number to like because it feels clean. The problem is that it hides what you actually care about. In a spam filter, you can get high accuracy by declaring "not spam" for almost everything, because most email is not spam. The model looks great on paper and useless in practice.
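The spam example above is easy to make concrete. The class balance below is made up to illustrate the trap:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that are correct overall."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 95 legitimate emails, 5 spam: a realistic-feeling imbalance (invented numbers)
y_true = ["ham"] * 95 + ["spam"] * 5
always_ham = ["ham"] * 100  # a "classifier" that never flags anything

print(accuracy(y_true, always_ham))  # 0.95, yet it catches zero spam
```

Great on paper, useless in practice, exactly as the paragraph says.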
Interactive lab
This module includes an interactive practice component. Open the tool or workspace step when you want to test the idea rather than only read it.
What you want depends on the job. If you are blocking fraud, a false negative means you miss a bad transaction. If you are flagging innocent customers, a false positive means you cause real harm. Evaluation is choosing what kind of mistake is acceptable and proving the system is making the right trade.
Two scores matter. One is how the model behaves on the data it learned from, and the other is how it behaves on data it has never seen. Training performance is often optimistic because the model can memorise. Real world performance is harder because the inputs change and the environment changes. Good evaluation separates these on purpose.
For classification problems, two practical metrics are precision and recall. Precision answers the question "when the model says positive, how often is it right". Recall answers "of the real positives, how many did it catch". A spam filter with high recall catches most spam, but it might also block legitimate email. A fraud detector with high precision avoids annoying customers, but it might miss attacks.
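A minimal sketch of the two metrics, computed from scratch on a made-up spam example:

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Precision: of the flags, how many were right.
    Recall: of the real positives, how many we caught."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = ["spam"] * 5 + ["ham"] * 5
y_pred = ["spam", "spam", "spam", "ham", "ham",   # catches 3 of 5 spam
          "spam", "ham", "ham", "ham", "ham"]     # 1 false alarm on ham

print(precision_recall(y_true, y_pred))  # (0.75, 0.6)
```

Note the guard against dividing by zero: a model that flags nothing has undefined precision, which is itself a signal worth noticing.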
For regression problems, you are predicting a number, like delivery time or house price. Here the question becomes "how far off are we". Metrics like mean absolute error are popular because they map to a simple story of average miss distance. Even without the formula, the idea is to measure error in the same units your users experience.
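The "average miss distance" idea translates directly to code. The delivery times below are invented for illustration:

```python
def mean_absolute_error(actual, predicted):
    """Average miss distance, in the same units your users experience."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# delivery times in minutes (made-up values)
actual = [30, 45, 60, 25]
predicted = [35, 40, 70, 25]

print(mean_absolute_error(actual, predicted))  # 5.0, i.e. off by 5 minutes on average
```

"Off by 5 minutes on average" is a sentence a stakeholder can act on; a unitless score usually is not.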
Evaluation also needs a sanity check for overfitting and underfitting. Overfitting is when training looks great and real performance drops. Underfitting is when both are poor because the model cannot learn the pattern. The fix is rarely "more metrics". It is usually better data, better representation, or a simpler model that is easier to trust.
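One way to sketch that sanity check. The thresholds are illustrative rules of thumb, not universal constants:

```python
def diagnose(train_score, holdout_score, gap_threshold=0.10, floor=0.70):
    """Rough label for the train/holdout pattern; thresholds are illustrative."""
    if train_score < floor and holdout_score < floor:
        return "underfitting: both scores are poor"
    if train_score - holdout_score > gap_threshold:
        return "overfitting: training looks great, holdout drops"
    return "no obvious red flag"

print(diagnose(0.99, 0.72))  # overfitting: training looks great, holdout drops
print(diagnose(0.62, 0.60))  # underfitting: both scores are poor
print(diagnose(0.85, 0.83))  # no obvious red flag
```

The point is the pattern, not the exact numbers: a large train/holdout gap points one way, uniformly poor scores the other.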
Finally, production systems fail quietly. You can ship a model that passes every offline test and still break in the real world because the input distribution shifts. A spam campaign changes writing style. A fraud ring adapts. A new product changes customer behaviour. The model is not wrong in a dramatic way. It is just slowly less useful.
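A crude sketch of how to notice quiet failure: compare a live window of one input feature against the window seen at evaluation time. The windows, statistic, and threshold are all illustrative, and real monitoring would track many features:

```python
from statistics import mean, stdev

def mean_shift_alert(reference, live, z_threshold=3.0):
    """Flag when the live window's mean has moved many reference
    standard deviations away. A deliberately crude drift check."""
    mu, sigma = mean(reference), stdev(reference)
    z = abs(mean(live) - mu) / sigma
    return z > z_threshold

reference = [9.0, 10.0, 11.0] * 10   # feature values seen at evaluation time
stable    = [9.5, 10.0, 10.5] * 10   # production window, similar inputs
shifted   = [13.0, 14.0, 15.0] * 10  # e.g. a spam campaign changes the inputs

print(mean_shift_alert(reference, stable))   # False
print(mean_shift_alert(reference, shifted))  # True
```

Even this naive check catches the "slowly less useful" failure that offline tests never see, because offline tests only ever look at the reference window.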
This is why evaluation is not a one time exam. It is a lifecycle. You validate before release, you test on untouched data, and you keep watching after deployment. If the system starts drifting, you want to notice early and know what to do next.
The human cost of metrics (yes, it matters)
Technical people sometimes treat metrics like they are morally neutral. They are not. A false positive can be an annoyed customer, a delayed service, or a person wrongly flagged. A false negative can be money lost, harm missed, or abuse allowed through. When I say “match the metric to the decision”, I mean match it to the harm profile and the workload you are creating for humans.
I also want you to respect baselines. If a human process already catches 95 percent of bad cases, your model must beat that in a way that reduces harm, not just moves it around. If the model makes the review queue unbearable, it will be disabled. I have seen this happen more times than I would like.
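The baseline point can be made concrete with a harm-weighted comparison. Every rate and cost below is invented for illustration:

```python
def expected_cost(fn_rate, fp_rate, fn_cost, fp_cost):
    """Compare systems by the harm they create, not by raw accuracy.
    All inputs here are made-up illustrations."""
    return fn_rate * fn_cost + fp_rate * fp_cost

# human process: misses 5% of bad cases, wrongly flags 2%
human_baseline = expected_cost(fn_rate=0.05, fp_rate=0.02, fn_cost=100, fp_cost=10)
# model: misses less, but flags five times as many innocent cases
model = expected_cost(fn_rate=0.03, fp_rate=0.10, fn_cost=100, fp_cost=10)

print(human_baseline)  # 5.2
print(model)           # 4.0
```

On this toy arithmetic the model "wins", yet its false-positive rate is five times the baseline's: the harm has been moved onto the review queue, which is exactly how models get disabled.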
Evaluation: good, bad, best practice
- Good practice: pick a baseline and measure against it. Then write what you will do when the model is wrong. That sentence forces clarity.
- Bad practice: optimising a metric because it looks good on a slide. Slides do not deal with angry users or regulators. Your team does.
- Best practice: create a small failure catalogue. A handful of concrete examples of bad outcomes, plus the mitigation, plus the monitoring signal. This becomes an operational asset, not a one-off report.
CPD evidence prompt (copy friendly)
Write this as a short note for your CPD log. Keep it honest and specific.
CPD note template
- What I studied: training dynamics, feature representation, evaluation metrics, and why offline accuracy is not enough for production decisions.
- What I practised: I used at least one practice activity to observe training behaviour and wrote down one failure mode I could now recognise earlier.
- What changed in my practice: I now define the decision and the failure cost before selecting metrics, and I keep a baseline comparison so improvements are real.
- Evidence artefact: one evaluation plan with metrics, thresholds, a baseline, and a fallback action if performance drifts.
Mental model
Evaluation is evidence
Evaluation is not one metric. It is evidence that the system is safe enough for the decision.
1. Goal
2. Metrics
3. Thresholds
4. Review failures
Assumptions to keep in mind
- Metrics match harms. Accuracy can be the wrong measure. Choose measures that match the harm model.
- You inspect failures. A score alone hides what breaks. Look at errors in context.
Failure modes to notice
- Good average, bad edge cases. The average score can hide failures that matter most.
- Threshold optimism. Thresholds set too aggressively create a system that looks good on paper and harms users.
Check yourself
Quick check on evaluation and metrics
Why can accuracy be misleading in a spam filter?
Because most email is not spam, so a model can get high accuracy while missing the spam you care about.
What does accuracy measure?
The fraction of predictions that are correct overall.
What is precision in plain terms?
When the model flags something, how often it is truly a positive.
What is recall in plain terms?
Of the real positives, how many the model successfully catches.
Scenario: false positives are expensive and embarrassing. Which direction do you usually tune, precision or recall?
Precision. You would rather flag fewer items but be right more often, then use human review or a second stage for borderline cases.
Why can training performance look better than real world performance?
The model can memorise training data, and real inputs can differ from what it saw.
What is a common sign of overfitting?
Strong training results but worse results on new or held-out data.
What is distribution shift?
When real inputs differ from the training and test data the model was evaluated on.
Why must evaluation match the real world use?
Because different mistakes have different costs, and the right metric depends on the job.
Name one silent production failure mode.
Drift in inputs, changing user behaviour, or attackers adapting over time.
Artefact and reflection
Artefact
A one-page decision note with assumption, evidence, and chosen action
Reflection
Where in your work would a clearer grasp of evaluation, metrics and failure analysis change a decision, and what evidence would make you trust that change?
Optional practice
Pick a baseline and measure against it. Then write what you will do when the model is wrong. That sentence forces clarity.