Module 17 of 26 · Applied Data

Inference and experiments

15 min read · 3 outcomes · Interactive + drag challenge · 5 standards cited

By the end of this module you will be able to:

  • Design a valid A/B test with proper controls
  • Explain p-values and their limitations
  • Describe common experimental pitfalls
Email inbox on a screen representing A/B testing of messaging

Real-world experiment · 2012

Obama's campaign raised an estimated $60 million in additional donations by A/B testing email subject lines.

During the 2012 US presidential campaign, the Obama digital team ran rigorous A/B tests on email subject lines. The informal "Hey" achieved a 49% higher open rate than polished, professional alternatives, generating an estimated $60 million in additional donations.

The team tested 18 variations with randomised assignment to segments of equal size. Without controlled experimentation, they would have chosen the "professional" option and left tens of millions of dollars on the table. The previous module covered how to interpret statistical patterns. This module covers how to create them deliberately through experiments.

The winning subject line was 'Hey.' It outperformed polished alternatives by 49%. Would you have predicted that?

Experimentation is how we move from correlation to causation. An A/B test (randomised controlled trial in a business context) isolates the effect of a single variable by comparing a treatment group to a control group. Everything else is held constant. The difference in outcomes can be attributed to the variable you changed.

With the learning outcomes established, the module begins by examining how to design a valid A/B test.

17.1 Designing a valid A/B test

Five requirements make an A/B test valid:

  1. Random assignment: participants are randomly assigned to treatment or control. Non-random assignment introduces selection bias.
  2. Single variable: only one thing differs between groups. Changing the headline AND the button colour simultaneously means you cannot attribute the effect to either one.
  3. Adequate sample size: small samples produce noisy results. A power calculation before the test determines the minimum sample needed to detect a meaningful effect (a sample-size sketch follows this list).
  4. Pre-registered hypothesis: decide what you are measuring and what counts as success before running the test. Changing the metric after seeing results is p-hacking.
  5. Sufficient duration: run the test long enough to capture natural variation (weekday vs weekend, payday effects, seasonal patterns). Stopping early because one variant looks good is a common error.
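
The power calculation in requirement 3 can be run before any traffic is allocated. Below is a minimal sketch using statsmodels; the baseline conversion rate and the minimum lift worth detecting are hypothetical values chosen for illustration.

```python
# Sample-size (power) calculation for a two-proportion A/B test.
# Baseline rate and minimum detectable lift are assumed for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.12        # control conversion rate (assumed)
target = 0.132         # smallest lift worth detecting: +10% relative (assumed)

effect_size = proportion_effectsize(target, baseline)  # Cohen's h

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,        # tolerated false positive rate
    power=0.80,        # probability of detecting the lift if it is real
    alternative="two-sided",
)
print(f"Minimum sample size per group: {n_per_group:.0f}")
```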

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Ronald Fisher, Presidential Address to the First Indian Statistical Congress (1938)

Fisher established the foundations of experimental design, including randomisation, replication, and blocking. Modern A/B testing is a direct descendant of Fisher's agricultural experiments at Rothamsted.

Common misconception

We ran the test for two days and variant B is winning. Let's launch it.

Two days is rarely long enough to capture natural variation. Weekend vs weekday behaviour, payday effects, and seasonal patterns can all produce misleading short-term results. A power calculation before the test determines the minimum duration. Stopping early based on interim results inflates the false positive rate.
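
As a rough guide, the sample size from the power calculation can be converted into a minimum run length and rounded up to whole weeks so that weekday and weekend behaviour are both captured. The figures below are assumptions for illustration.

```python
# Converting a required sample size into a minimum test duration.
# n_per_group and daily_visitors are assumed figures.
import math

n_per_group = 6_500      # from the power calculation (assumed)
daily_visitors = 1_800   # traffic split across both variants (assumed)

days_needed = math.ceil(2 * n_per_group / daily_visitors)
weeks_needed = math.ceil(days_needed / 7)  # round up to whole weeks
print(f"Run for at least {weeks_needed} weeks ({days_needed}+ days)")
```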

With the requirements for a valid A/B test in place, the discussion turns to p-values, which build directly on those foundations.

A/B test results require statistical significance before acting. Stopping early based on interim results inflates false positives.

17.2 Understanding p-values

A p-value answers the question: "If there were no real difference between groups, how likely is it that we would observe a result this extreme by chance alone?" A p-value of 0.03 means there is a 3% probability of seeing this result (or more extreme) if the null hypothesis (no real difference) were true.
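
As an illustration, the sketch below computes a p-value for hypothetical A/B conversion counts using a two-proportion z-test from statsmodels; the counts are invented.

```python
# Two-proportion z-test for an A/B conversion experiment.
# Conversion counts and group sizes below are hypothetical.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

conversions = np.array([480, 540])   # control, treatment (assumed)
visitors = np.array([4000, 4000])    # visitors per group (assumed)

z_stat, p_value = proportions_ztest(conversions, visitors, alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# The p-value is the probability of a gap at least this large if both
# variants convert at the same underlying rate (the null hypothesis).
```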

The conventional threshold is p < 0.05, but this is arbitrary, not sacred. A p-value of 0.049 is not meaningfully different from 0.051. The American Statistical Association published a statement in 2016 warning against the binary interpretation of p-values and the practice of p-hacking (running many tests and reporting only the significant ones).

Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

American Statistical Association (2016) - ASA Statement on Statistical Significance and P-Values, Principle 3

The ASA statement was unprecedented: a professional body publicly correcting widespread misuse of one of its discipline's core tools. It emphasised that p-values do not measure the probability that the hypothesis is true, and that context, study design, and effect size all matter alongside p-values.

Common misconception

A p-value of 0.01 means there is a 1% chance the result is wrong.

A p-value of 0.01 means: if there were no real effect, there is a 1% chance of observing a result this extreme. It does not tell you the probability that the effect is real. That requires Bayesian analysis, which incorporates prior knowledge. The distinction matters: a p-value of 0.01 from a poorly designed experiment with a tiny sample is much less convincing than a p-value of 0.04 from a well-powered, pre-registered study.
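
A back-of-the-envelope calculation makes the distinction concrete. If only a minority of tested ideas are genuinely effective, a "significant" result is far less certain than the p-value suggests; the prior and power below are assumptions for illustration.

```python
# How often is a significant result a real effect?
# Prior (10% of tested ideas work) and power (80%) are assumed.
prior = 0.10    # share of tested hypotheses that are truly effective (assumed)
power = 0.80    # chance of detecting a real effect when it exists (assumed)
alpha = 0.05    # false positive rate when there is no effect

true_positives = prior * power          # 0.08
false_positives = (1 - prior) * alpha   # 0.045

p_real_given_significant = true_positives / (true_positives + false_positives)
print(f"P(effect is real | significant) ≈ {p_real_given_significant:.2f}")  # ≈ 0.64
```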

P-values are widely used but frequently misinterpreted. The ASA's 2016 statement corrected decades of binary thinking about statistical significance.
17.3 Check your understanding

  1. A product team runs an A/B test on two homepage designs. After three days, Design B has a 12% higher click-through rate than Design A. The team has 4,000 visitors in each group. The product manager wants to launch Design B immediately. What is the problem?
  2. A study reports a p-value of 0.04. The researcher says this means there is a 96% probability that the treatment works. What is the error?
  3. A marketing team tests five different email subject lines simultaneously. One achieves p = 0.03. They declare it the winner. What experimental pitfall is this?


Key takeaways

  • A valid A/B test requires: random assignment, a single variable changed, adequate sample size (from a power calculation), a pre-registered hypothesis, and sufficient duration to capture natural variation.
  • P-values measure how likely the observed data is under the null hypothesis. They do not measure the probability that your hypothesis is true. The ASA's 2016 statement warned against binary interpretation.
  • Stopping a test early because interim results look good inflates the false positive rate from 5% to approximately 25%. Complete the pre-planned duration.
  • Testing multiple variants simultaneously creates the multiple comparisons problem. With five variants at alpha = 0.05, the chance of at least one false positive is approximately 23%. Apply the Bonferroni correction (the arithmetic is sketched below).
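
The multiple comparisons figure and the Bonferroni correction follow from simple arithmetic, sketched below.

```python
# Arithmetic behind the multiple comparisons takeaway.
alpha = 0.05
n_variants = 5

# Probability of at least one false positive across five independent tests
p_any_false_positive = 1 - (1 - alpha) ** n_variants
print(f"P(at least one false positive) ≈ {p_any_false_positive:.2f}")  # ≈ 0.23

# Bonferroni correction: divide the significance threshold by the number of tests
bonferroni_alpha = alpha / n_variants
print(f"Bonferroni-adjusted threshold: {bonferroni_alpha}")  # 0.01
```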

Standards and sources cited in this module

  1. American Statistical Association (2016). 'ASA Statement on Statistical Significance and P-Values'

    Full statement

    Corrects widespread p-value misinterpretation. Principle 3 cited directly in Section 17.2.

  2. Fisher, R.A. (1956). Statistical Methods and Scientific Inference

    Chapters 2-3

    Foundational text establishing randomisation, replication, and experimental design principles.

  3. Kohavi, R., Tang, D., and Xu, Y. (2020). Trustworthy Online Controlled Experiments

    Chapters 1-5

    Modern reference for A/B testing at scale. Covers power analysis, early stopping, and multiple comparisons.

  4. Obama 2012 campaign A/B testing, reported by Optimizely (2013)

    Case study

    Opening case: the 'Hey' subject line that generated $60 million in additional donations through rigorous A/B testing.

  5. DAMA-DMBOK2 (2017)

    Chapter 14, Data Science

    Framework for experimental analytics within data management practice.
