Type I and Type II Errors

Your digital campaigning team just finished an A/B test. They tried a new action page layout against the existing one, sending 3,000 supporters to each version. The new page had a 5.6% conversion rate compared to the old page's 4.4%. The p-value came back at 0.03, below the 0.05 threshold, so the team declared a winner and rolled out the new layout across all campaigns.

Three months later, the overall conversion rate hasn't budged. The "improvement" was a mirage. The test produced a statistically significant result, but the significance was a fluke. This is what statisticians call a Type I error, sometimes known as a false positive. You rejected the null hypothesis when you shouldn't have. You thought something changed, but nothing did.
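A result like this typically comes from a two-proportion z-test. Here is a minimal, stdlib-only sketch of that calculation, with a pooled standard error and a normal approximation; the counts are illustrative (a 5.6% vs 4.4% split over 3,000 supporters per arm):

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for a difference in conversion rates, pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Illustrative counts: 5.6% vs 4.4% conversion, 3,000 supporters per arm
z, p = two_proportion_ztest(168, 3000, 132, 3000)
print(f"z = {z:.2f}, p = {p:.3f}")  # z ≈ 2.13, p ≈ 0.03
```

The point of the sketch is that p = 0.03 is a statement about noise: even when both pages convert identically, a gap this large appears by chance about 3% of the time, and sometimes your test is one of those times.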

Now picture the opposite scenario. A different team tested a new fundraising appeal against their standard version. The new appeal actually does raise more per donor, to the tune of about €3 extra on average. But the sample was small, just 200 donors per group, and the natural variation in donation amounts is large. The p-value comes back at 0.18. The team concludes there's no meaningful difference and shelves the new appeal. They just threw away a version that would have raised thousands more over the next year. This is a Type II error, sometimes known as a false negative. You failed to reject the null hypothesis when you should have. Something real was there, and you missed it.
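You can watch this failure mode happen in simulation. The sketch below assumes donation amounts are roughly normal with a standard deviation of €20 around illustrative means of €25 and €28 (so the €3 lift is genuinely there), then runs the test many times at 200 donors per group:

```python
import math
import random

random.seed(42)  # reproducible runs

def p_value_mean_diff(a, b):
    """Two-sided z-test for a difference in means (large-sample normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return math.erfc(abs(z) / math.sqrt(2))

trials = 2000
misses = 0
for _ in range(trials):
    standard = [random.gauss(25.0, 20.0) for _ in range(200)]  # standard appeal
    new = [random.gauss(28.0, 20.0) for _ in range(200)]       # truly 3 euros better
    if p_value_mean_diff(new, standard) > 0.05:                # "not significant"
        misses += 1

miss_rate = misses / trials
# Under these assumptions, the real effect is missed in roughly two out of three runs
print(f"Type II error rate at n=200 per group: {miss_rate:.2f}")
```

Under these (assumed) numbers, p = 0.18 isn't evidence of no effect; it's the expected behavior of an underpowered test pointed at a real one.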

These two errors pull in opposite directions. The harder you try to avoid one, the more likely you become to commit the other. If you set a very strict threshold for declaring significance, say p < 0.01 instead of p < 0.05, you'll rarely cry wolf. But you'll also miss a lot of genuine improvements because you're demanding too much evidence. If you loosen the threshold to p < 0.10, you'll catch more real effects, but you'll also act on noise more often.

Think of it as a tradeoff between caution and sensitivity. A cautious tester rarely makes false claims but frequently walks past real opportunities. A sensitive tester catches more real effects but also chases more phantoms. There's no free lunch. You can't eliminate both errors simultaneously unless you get much more data or the true effect is enormous.

The probability of a Type I error, given that no real effect exists, is simply your significance threshold, often called alpha. If you use the conventional 0.05 cutoff, you accept a 5% chance of a false positive on any test where the null is actually true. The probability of a Type II error is called beta. The flip side of beta is statistical power, defined as 1 minus beta: the probability of correctly detecting a real effect when it exists. We'll explore power in detail in an upcoming entry. For now, the key point is that alpha and beta are connected. Lowering alpha (being stricter about false positives) raises beta (makes false negatives more likely) unless you compensate with a larger sample.
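The alpha-beta seesaw takes only a few lines of arithmetic to see. The sketch below assumes a true effect sitting 1.5 standard errors from zero (an illustrative value, roughly the fundraising scenario above) and computes beta at three common thresholds:

```python
from statistics import NormalDist

nd = NormalDist()
effect_z = 1.5  # true effect in standard-error units (illustrative assumption)

betas = {}
for alpha in (0.10, 0.05, 0.01):
    crit = nd.inv_cdf(1 - alpha / 2)                        # two-sided critical value
    power = nd.cdf(effect_z - crit) + nd.cdf(-effect_z - crit)
    betas[alpha] = 1 - power
    print(f"alpha = {alpha:.2f}  beta = {betas[alpha]:.2f}")
# beta climbs from about 0.56 at alpha=0.10 to about 0.86 at alpha=0.01
```

Tightening alpha from 0.10 to 0.01 buys fewer false alarms but, for this fixed effect and sample, pushes the miss rate from roughly half to almost nine in ten. Only a bigger sample or a bigger true effect moves both numbers down together.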

In campaign A/B testing, Type I errors lead you to adopt changes that don't actually help. Over time, you accumulate a false sense of optimization progress while your actual performance flatlines. Type II errors are quieter but just as costly. They cause you to discard genuinely better subject lines, petition page layouts, or donation form designs because your test wasn't powerful enough to detect the improvement. In grant evaluation, a Type I error means claiming a program worked when the evidence was just noise, which is embarrassing if a funder later scrutinizes the analysis. A Type II error means failing to demonstrate impact that genuinely exists, potentially costing future funding for a program that actually delivers results.

Every test is a balancing act between two kinds of mistakes. Understanding both, rather than obsessing over just the p-value, is what turns A/B testing from a ritual into a real decision-making tool.


See It

Drag the decision threshold left and right to see how Type I and Type II error rates change in opposite directions. Use the slider to adjust the effect size and watch how a bigger real difference makes both errors easier to avoid.


Reflect

Think about the A/B tests or campaign evaluations your organization has run. Which error would be more damaging in your context: acting on a result that turned out to be noise, or missing an improvement that was genuinely there? Does your team's testing practice lean more toward one type of mistake?

When you set the bar for "good enough evidence," are you explicitly choosing how much risk of each error type you're willing to accept, or are you just using 0.05 because that's what everyone uses?