Applied Data · Module 5
Probability and distributions (uncertainty without the panic)
Data work is mostly uncertainty management.
Previously
Data analysis and insight generation
Analysis is asking good questions of data and checking that the answers hold up.
Next
Inference, sampling, and experiments
Inference is the art of learning about a bigger reality from limited observations.
Progress
Mark this module complete when you can explain it without rereading every paragraph.
Why this matters
If a daily pipeline succeeds 99% of the time, it still fails about 1 day in 100.
What you will be able to do
- 1 Explain probability and distributions in your own words and apply them to a realistic scenario.
- 2 Explain why a mean is not a system and how distributions show variability and risk.
- 3 Check the assumption "Variation matters" and explain what changes if it is false.
- 4 Check the assumption "Outliers are explained" and explain what changes if it is false.
Before you begin
- Foundations-level vocabulary and concepts
- Confidence with basic diagrams and section terminology
Common ways people get this wrong
- Mean worship. Optimising for the mean can harm people in the tails.
- Ignoring skew. Skewed distributions make intuitive assumptions wrong.
Main idea at a glance
Diagram
Stage 1. Define event
State clearly which event you are measuring and in which population. Vague event definitions hide the risk you are actually running.
Data work is mostly uncertainty management. Probability is how we stay honest about that. You do not need to love maths to use probability well. You need to be disciplined about what you are claiming.
Worked example. “It usually works” is not a reliability statement
If a pipeline that runs once a day succeeds 99% of the time, it still fails about 1 day in 100. Over a year that is roughly three or four failures. The question is not “is 99% good?”. The question is “what happens on the failure days, and what does it cost?”.
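The arithmetic behind this example can be sketched in a few lines. This is a minimal sketch under stated assumptions: one run per day, 365 runs a year, and failures that are independent of each other. None of these assumptions come from the module, and the function names are illustrative.

```python
def expected_failures(p_success: float, runs: int) -> float:
    """Expected number of failed runs, assuming each run fails independently."""
    return (1.0 - p_success) * runs


def prob_at_least_one_failure(p_success: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - p_success ** runs


if __name__ == "__main__":
    runs_per_year = 365  # assumption: one pipeline run per day
    p = 0.99
    print(f"expected failures per year: {expected_failures(p, runs_per_year):.2f}")
    print(f"P(at least one failure):    {prob_at_least_one_failure(p, runs_per_year):.3f}")
```

Under these assumptions the expected count is about 3.65 failures a year, and the chance of getting through a year with no failure at all is under 3%. If failures cluster (for example, around releases), the independence assumption breaks and the numbers change.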
Common mistakes with probability
Probability failure patterns
These errors make reliable systems look safer than they are.
- Percentage and probability mixed. 12% and 0.12 are the same quantity. Putting 12 into a formula that expects 0.12, or the reverse, produces wrong calculations.
- Rare treated as impossible. Low-frequency events still dominate impact in many operational systems.
- Normality assumed by default. Heavy-tail behaviour is common in outages, latency, and fraud patterns.
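The first pattern is easy to reproduce. The sketch below uses a hypothetical helper, `to_probability`, that forces the unit to be stated before any arithmetic happens. The helper name and its `unit` argument are illustrative choices, not something the module prescribes.

```python
def to_probability(value: float, unit: str) -> float:
    """Convert a rate to a probability in [0, 1].

    `unit` must be "percent" (12 means 12%) or "probability" (0.12).
    """
    if unit == "percent":
        p = value / 100.0
    elif unit == "probability":
        p = value
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"{value} ({unit}) is not a valid probability")
    return p


# 12 percent and 0.12 describe the same failure rate.
assert to_probability(12, "percent") == to_probability(0.12, "probability")

runs = 1_000
# Correct: expected failures with the rate expressed as a probability.
print(to_probability(12, "percent") * runs)  # about 120 failures
# Bug: using the percentage figure directly overstates the count 100-fold.
print(12 * runs)  # 12000, which is more failures than there are runs
```

The range check is the useful part: a "probability" of 12 is rejected immediately instead of flowing silently into later calculations.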
Verification. A simple sanity check
Probability sanity checks
Answer these before accepting reliability claims.
- Expected count check. If probability is 1%, estimate expected events over 10,000 runs.
- Sampling blind-spot check. If monitoring samples 1% of events, identify what failure types might be missed.
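Both checks reduce to short calculations. The sketch below assumes events are independent and that monitoring picks each event uniformly at random at the stated rate. Real sampling schemes may differ, and the function names are hypothetical.

```python
def expected_events(probability: float, runs: int) -> float:
    """Expected number of events across `runs` independent runs."""
    return probability * runs


def prob_never_sampled(sample_rate: float, occurrences: int) -> float:
    """Chance that none of `occurrences` events lands in the monitored sample."""
    return (1.0 - sample_rate) ** occurrences


# Expected count check: a 1% event over 10,000 runs.
print(f"expected events: {expected_events(0.01, 10_000):.0f}")

# Sampling blind-spot check: 1% sampling, failure types of varying frequency.
for occurrences in (1, 10, 100, 500):
    missed = prob_never_sampled(0.01, occurrences)
    print(f"{occurrences:>4} occurrences -> P(never sampled) = {missed:.3f}")
```

Under these assumptions a 1% event is expected about 100 times in 10,000 runs, and a failure type that occurs 100 times still has roughly a 37% chance of never appearing in a 1% sample. Failure types that occur only a handful of times are almost certain to be missed.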
Mental model
Distributions shape decisions
A mean is not a system. Distributions show variability and risk.
- 1 Data
- 2 Distribution
- 3 Tails and outliers
- 4 Risk
Assumptions to keep in mind
- Variation matters. Variation is often the story. Averages can hide what users experience.
- Outliers are explained. Outliers can be errors or reality. Either way, they need attention.
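A small numeric example shows how an average can hide what users experience. The latency figures below are invented for illustration, and `percentile` is a hypothetical nearest-rank helper written for this sketch, not a library function.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 95 fast requests and 5 very slow ones, in milliseconds.
latencies_ms = [100] * 95 + [5000] * 5

print(f"mean:   {statistics.mean(latencies_ms)} ms")
print(f"median: {statistics.median(latencies_ms)} ms")
print(f"p99:    {percentile(latencies_ms, 99)} ms")
```

With this data the mean is 345 ms, a value no request actually experienced. The median (100 ms) describes the typical request and the 99th percentile (5000 ms) describes the tail, so reporting them together says more about risk than the mean alone.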
Check yourself
Quick check. Probability and distributions
What does probability help you do in data work?
Stay honest about uncertainty and avoid overconfident claims from limited observations.
Scenario. A pipeline succeeds 99% of the time. Over a year, why can that still be painful?
Because a 1% failure rate still means multiple failures across many runs, and those failures can hit on high-impact days.
What is a distribution?
A description of how values are spread, not just the average.
Why can the mean be misleading?
Outliers and skew can make the mean hide the typical experience.
What is one reason heavy tails matter in services?
Rare slow or failing events can dominate user experience and cost, even if the average looks fine.
Artefact and reflection
Artefact
A one-page decision note with assumption, evidence, and chosen action
Reflection
Where in your work would thinking in terms of probability and distributions change a decision, and what evidence would make you trust that change?
Optional practice
Work through one scenario and justify the decision with evidence