p-values

In the last entry, your team's redesigned action alert template got a 7.5% click-to-action rate versus the old template's 6.8% baseline. We said the 7.5% result "sits right at the upper boundary" of what you'd expect under the null hypothesis. That language was deliberately vague. "Right at the boundary" isn't precise enough to make a decision. Should you roll out the new template or not? You need a number, and that number is the p-value.

The p-value is the probability of seeing a result at least as extreme as yours, assuming the null hypothesis is true. That phrasing matters, so read it once more. It's not the probability that the null hypothesis is true. It's not the probability that your result is real. It's the probability of seeing data like yours in a world where nothing changed.
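One way to pin that definition down is to write it as a conditional probability, using the same notation we leaned on when discussing Bayes' theorem. This is just shorthand for the sentence above, nothing more:

```latex
% The data sits on the left of the bar, the hypothesis on the right.
% Flipping the two is the classic misreading we return to below.
p\text{-value} = P(\text{result at least as extreme as yours} \mid H_0 \text{ is true})
```

Keep an eye on the direction of that bar. Much of what follows turns on not flipping it.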

Back to the action alert. If the true click rate is 6.8% and you send to 5,000 people, random variation alone will produce different observed rates every time. Sometimes 6.5%, sometimes 7.1%, occasionally something more extreme. The p-value asks a specific question about that randomness. In the world where both templates perform identically, how often would you see a gap as large as the one between 6.8% and 7.5%? The answer, calculated with a normal approximation to that random variation, is about 0.049. That means if nothing had actually changed, you'd see a difference this large roughly 5 times out of 100.
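If you want to check that number yourself, here's a minimal sketch in Python (assuming NumPy is installed; the variable names and the 100,000-trial count are our own illustrative choices). It answers the question both ways: first by simulating a world where the true rate never moves off 6.8%, then with the normal approximation described above.

```python
import numpy as np
from math import sqrt, erf

baseline, observed, n = 0.068, 0.075, 5_000

# Simulate 100,000 sends in a world where nothing changed: every campaign
# has the same true 6.8% click rate, and all variation is chance.
rng = np.random.default_rng(0)
rates = rng.binomial(n, baseline, size=100_000) / n

# How often does chance alone land at least as far from 6.8% as our
# observed 7.5% did? (Both directions, i.e. a two-sided test.)
p_simulated = np.mean(np.abs(rates - baseline) >= abs(observed - baseline))

# The normal approximation, done by hand.
se = sqrt(baseline * (1 - baseline) / n)            # standard error under the null
z = (observed - baseline) / se                      # gap measured in standard errors
p_normal = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))   # area in both tails

print(f"simulated: {p_simulated:.3f}")   # close to 0.05
print(f"normal:    {p_normal:.3f}")      # about 0.049, matching the text
```

The two answers agree: in a world where the templates perform identically, a gap this large shows up around one time in twenty.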

Is that rare enough? The conventional threshold is 0.05, and scientists call results below it statistically significant. Your result of 0.049 just barely clears that bar. This is not a resounding triumph. It's a photo finish. If the observed rate had been 8.2% instead, the p-value would drop below 0.001, meaning you'd see a difference that large less than once in a thousand tests. That's the difference between "probably real" and "almost certainly real."
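For that hypothetical 8.2%, the same recipe gives (a quick standalone sketch with the article's numbers):

```python
# Same normal-approximation steps, now with an observed rate of 8.2%.
from math import sqrt, erf

se = sqrt(0.068 * (1 - 0.068) / 5_000)
z = (0.082 - 0.068) / se
print(2 * (1 - 0.5 * (1 + erf(z / sqrt(2)))))  # roughly 0.00008, well below 0.001
```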

The 0.05 threshold is a convention, not a law of nature. It was popularized by Ronald Fisher in the 1920s and stuck around because everyone agreed to use it, not because it captures anything fundamental about evidence. Some fields use 0.01. Particle physics demands roughly 0.0000003, the famous five-sigma standard. The threshold you choose should depend on the stakes. If you're deciding which email subject line to use next Tuesday, a p-value of 0.04 might be good enough. If you're recommending that your organization invest six months in rebuilding its entire action alert infrastructure, you'd want much stronger evidence.

Here's where people get tripped up. A small p-value tells you a result is unlikely under the null hypothesis. It does not tell you the result is large, important, or worth acting on. If you send a campaign email to 500,000 supporters, even a tiny improvement of 0.05 percentage points can produce a highly significant p-value. Statistical significance is not the same as practical significance. The p-value answers "is this likely real?" not "is this worth caring about?" We'll tackle that second question when we get to effect size.
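A quick sketch makes the point. Here we assume a low 0.5% baseline click rate, a number chosen for illustration rather than taken from the running example, and apply the same 0.05-point lift to audiences of different sizes:

```python
from math import sqrt, erf

def normal_p_value(p0, p_obs, n):
    """Two-sided p-value for an observed rate vs. a fixed baseline,
    using the same normal approximation as the earlier sketch."""
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_obs - p0) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# A hypothetical 0.5% baseline lifted to 0.55% -- a 0.05-point improvement
# nobody would call practically meaningful.
for n in (5_000, 50_000, 500_000):
    print(f"n={n:>7,}  p={normal_p_value(0.005, 0.0055, n):.4f}")
# The identical lift drifts from clearly non-significant to vanishingly
# small p as the list grows; only the sample size changed.
```

The lift a supporter actually experiences is the same in every row. The only thing that made it "significant" was the size of the audience.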

The other common mistake is treating the p-value as the probability that your hypothesis is wrong. A p-value of 0.03 does not mean there's a 3% chance the null is true. That reasoning runs the conditional backward: the p-value is the probability of the data given the null hypothesis, not the probability of the null hypothesis given the data. As we saw with Bayes' theorem, getting from one to the other requires more information than the p-value alone can provide.
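A back-of-the-envelope Bayes calculation shows how far apart those two probabilities can be. Every number here is an assumption picked for illustration:

```python
# Suppose (hypothetically) only 1 in 10 template tweaks has a real effect,
# and your test detects a real effect 60% of the time (its "power").
prior_real = 0.10   # P(real effect) before seeing any data -- assumed
power      = 0.60   # P(significant result | real effect)   -- assumed
alpha      = 0.05   # P(significant result | null is true)

# Bayes' theorem: of all significant results, what share come from
# tests where the null was actually true?
p_significant = power * prior_real + alpha * (1 - prior_real)
p_null_given_significant = alpha * (1 - prior_real) / p_significant
print(p_null_given_significant)  # about 0.43 -- far from the naive 3-5%
```

Under these made-up but not implausible numbers, nearly half of all "significant" results come from tests where nothing changed. The p-value alone can't tell you which half you're in.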

In A/B testing of campaign emails, the p-value tells you whether the difference in click rates between two versions is likely real or just random variation in who opened when. When you see a spike in petition signatures after a social media push, the p-value quantifies whether that spike exceeds what you'd expect from normal daily fluctuation. In online fundraising, if average donation size increased after you redesigned your donation page, the p-value separates a genuine design effect from the natural ups and downs of who donates on any given day.

A small p-value means chance alone would rarely produce a difference this large. It says nothing about whether the difference is big enough to matter. "Statistically significant" tells you something probably happened, not that you should care.


See It

Drag the observed result line to see how the p-value changes. Use the slider to adjust sample size and watch how larger samples make the same difference more significant.


Reflect

Think about the last A/B test or campaign comparison your team ran. Did anyone report a p-value, or was the decision based on which number looked bigger? What might have changed if you'd calculated how often that result could appear by chance?

When you hear that a result is "statistically significant," do you ask how significant? A p-value of 0.049 and a p-value of 0.0001 are both "significant," but they represent very different levels of evidence. How would that distinction change your confidence in acting on the result?