Applied Data · Module 5

Probability and distributions (uncertainty without the panic)

Data work is mostly uncertainty management.

20 min · 4 outcomes · Data · Intermediate

Why this matters

If a pipeline that runs once a day succeeds 99% of the time, it still fails on about 1 day in 100.

What you will be able to do

  • 1 Explain probability and distributions in your own words and apply them to a realistic scenario.
  • 2 Explain why a mean is not a system, and how distributions show variability and risk.
  • 3 Check the assumption "Variation matters" and explain what changes if it is false.
  • 4 Check the assumption "Outliers are explained" and explain what changes if it is false.

Before you begin

  • Foundations-level vocabulary and concepts
  • Confidence with basic diagrams and section terminology

Common ways people get this wrong

  • Mean worship. Optimising for the mean can harm people in the tails.
  • Ignoring skew. Skewed distributions make intuitive assumptions wrong.

Main idea at a glance

Stage 1. Define event

State clearly what event you are measuring and in what population. Vague event definitions hide the actual risk you are running.

Data work is mostly uncertainty management. Probability is how we stay honest about that. You do not need to love maths to use probability well. You need to be disciplined about what you are claiming.

Worked example. “It usually works” is not a reliability statement

If a pipeline that runs once a day succeeds 99% of the time, it still fails on about 1 day in 100. Over a year that is an expected 3 to 4 failures. The question is not “is 99% good”. The question is “what happens on the failure days, and what does it cost”.
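To put a number on "multiple failures", here is a minimal sketch. It assumes one run per day and that each run fails independently with the same probability. Both are simplifying assumptions that the module does not state, and the function names are illustrative only.

```python
def expected_failures(p_fail: float, runs: int) -> float:
    """Expected number of failures across `runs` independent runs."""
    return p_fail * runs


def prob_at_least_one_failure(p_fail: float, runs: int) -> float:
    """Chance of seeing one or more failures across `runs` independent runs."""
    return 1 - (1 - p_fail) ** runs


# Assumption: one run per day for a year, 99% success per run.
p_fail = 0.01
runs = 365

print(f"expected failures: {expected_failures(p_fail, runs):.2f}")
print(f"P(at least one failure): {prob_at_least_one_failure(p_fail, runs):.3f}")
```

Under these assumptions the expected count is about 3.65 failures a year, and the chance of getting through the year with no failure at all is under 3%. If failures cluster (for example, around releases), the independence assumption breaks and the real pattern can be worse.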

Common mistakes with probability

Probability failure patterns

These errors make systems look more reliable than they are.

  1. Percentage and probability mixed

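    12% and 0.12 are the same quantity on two scales. Mixing the scales in one calculation, such as multiplying by 12 where 0.12 was meant, gives results that are wrong by a factor of 100.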

  2. Rare treated as impossible

    Low frequency events still dominate impact in many operational systems.

  3. Normality assumed by default

    Heavy-tail behaviour is common in outages, latency, and fraud patterns.
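The third pattern can be illustrated with a small simulation. This is a sketch, not a model of any real system: it assumes a log-normal distribution as a stand-in for a right-skewed, long-tailed metric such as latency, and the parameters and units are arbitrary.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Assumption: values drawn from a log-normal distribution, used here only
# as a stand-in for a right-skewed, long-tailed metric.
samples = [random.lognormvariate(0, 1) for _ in range(10_000)]

mean = statistics.fmean(samples)
median = statistics.median(samples)
# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
p99 = statistics.quantiles(samples, n=100)[98]

print(f"median={median:.2f} mean={mean:.2f} p99={p99:.2f}")
```

In a skewed sample like this the mean sits above the median, and the 99th percentile sits far above both. A normal-distribution assumption would predict a much thinner upper tail than the data shows.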

Verification. A simple sanity check

Probability sanity checks

Answer these before accepting reliability claims.

  1. Expected count check

    If the failure probability is 1%, estimate the expected number of failures over 10,000 runs (about 100).

  2. Sampling blind-spot check

    If monitoring samples 1% of events, identify what failure types might be missed.
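Both checks reduce to short calculations. The sketch below assumes that runs are independent and that monitoring samples each event independently at a fixed rate. The function names are hypothetical and exist only for this example.

```python
def expected_events(probability: float, runs: int) -> float:
    """Expected count of an event with the given per-run probability."""
    return probability * runs


def prob_sampler_misses_all(sample_rate: float, occurrences: int) -> float:
    """Chance that independent sampling at `sample_rate` captures
    none of `occurrences` events of a given failure type."""
    return (1 - sample_rate) ** occurrences


# Expected count check: 1% probability over 10,000 runs.
print(f"expected events: {expected_events(0.01, 10_000):.0f}")

# Sampling blind-spot check: 1% sampling against failure types of varying rarity.
for occurrences in (1, 10, 100):
    missed = prob_sampler_misses_all(0.01, occurrences)
    print(f"{occurrences} occurrences -> P(never sampled) = {missed:.3f}")
```

Under these assumptions, a failure type that occurs 100 times still has roughly a 37% chance of never appearing in a 1% sample. Rarer failure types are almost certain to be missed.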

Mental model

Distributions shape decisions

A mean is not a system. Distributions show variability and risk.

  1. Data
  2. Distribution
  3. Tails and outliers
  4. Risk

Assumptions to keep in mind

  • Variation matters. Variation is often the story. Averages can hide what users experience.
  • Outliers are explained. Outliers can be errors or reality. Either way, they need attention.
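A small example shows how an average can hide what users experience. The two lists below are hypothetical response times in milliseconds, invented for illustration. They share the same mean but describe very different services.

```python
import statistics

# Hypothetical response times in milliseconds; both lists are made up.
steady = [200] * 10
spiky = [100] * 9 + [1100]

for name, values in (("steady", steady), ("spiky", spiky)):
    print(
        name,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "max:", max(values),
    )
```

Both means are 200 ms. In the spiky list, nine requests are faster than that and one takes 1100 ms, so reporting only the mean describes neither the typical request nor the worst one. Whether that slow request is a measurement error or a real event is exactly the question the "Outliers are explained" assumption asks you to answer.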

Check yourself

Quick check. Probability and distributions

What does probability help you do in data work?

Stay honest about uncertainty and avoid overconfident claims from limited observations.

Scenario. A pipeline succeeds 99% of the time. Over a year, why can that still be painful?

Because a 1% failure rate still means multiple failures across many runs, and those failures can land on high-impact days.

What is a distribution?

A description of how values are spread, not just the average.

Why can the mean be misleading?

Outliers and skew can make the mean hide the typical experience.

What is one reason heavy tails matter in services?

Rare slow or failing events can dominate user experience and cost, even if the average looks fine.

Artefact and reflection

Artefact

A one-page decision note with assumption, evidence, and chosen action

Reflection

Where in your work would thinking in terms of probability and distributions change a decision, and what evidence would make you trust that change?

Optional practice

Work through one scenario and justify the decision with evidence

Source DAMA DMBOK 2 (Data Management Body of Knowledge, 2nd Edition)
Source ISO/IEC 11179 metadata registries
Source ISO/IEC 27701:2025 privacy information management
Source ICO data protection principles and UK GDPR guidance