Multiple Comparison Corrections
Your advocacy organization just wrapped a big campaign push. The digital team tested a new action alert template across 20 different audience segments: long-time supporters, new sign-ups, lapsed contacts, people who only open emails on weekends, people who previously signed petitions but never donated, and so on. They ran a separate test for each segment comparing the old template to the new one. Eighteen segments showed no difference. But two came back with p-values below 0.05. The team is thrilled and wants to roll out the new template for those two groups. Before they do, ask this question: if nothing had actually changed, how many "significant" results would you expect to find by chance alone?
On average, one. With 20 tests at a threshold of 0.05, you'd expect 5% of them to come back significant purely from random variation. That's one out of twenty. Finding two isn't much more than what chance would predict. The problem gets worse as the number of tests grows. Run 20 tests where nothing is really going on, and the probability that at least one test crosses the significance threshold is about 64%. At 50 tests, it climbs above 92%. This isn't a subtle statistical footnote. It means that the more questions you ask of the same data, the more likely you are to "discover" something that isn't there.
This is the multiple comparisons problem, sometimes called the "look-elsewhere effect." Every hypothesis test carries a small risk of a Type I error, a false positive. When you run one test, that risk is whatever threshold you set, typically 5%. But when you run many tests, those individual risks accumulate. It's like rolling a twenty-sided die once and being surprised if it lands on one. Not very likely. But roll it twenty times, and the odds that you see at least one "one" are much higher.
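The numbers above follow from a short calculation. Under the null hypothesis, a p-value is equally likely to land anywhere between 0 and 1, so a quick sketch can both compute the "at least one false positive" probability exactly and confirm it by simulation (the function names here are ours, for illustration):

```python
import random

ALPHA = 0.05

def prob_at_least_one_false_positive(n_tests, alpha=ALPHA):
    """Exact probability that at least one of n independent null tests
    comes back 'significant': 1 minus the chance that all n stay quiet."""
    return 1 - (1 - alpha) ** n_tests

def simulate_false_positives(n_tests, n_rounds=10_000, alpha=ALPHA, seed=42):
    """Simulate many rounds of n null tests. Under the null, p-values are
    uniform on [0, 1], so a 'hit' is just a uniform draw below alpha."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(n_rounds)
    )
    return hits / n_rounds

print(round(prob_at_least_one_false_positive(20), 3))  # 0.642
print(round(prob_at_least_one_false_positive(50), 3))  # 0.923
```

The simulation result hovers near the exact value, which is the twenty-sided-die intuition made concrete: each individual test is unlikely to mislead, but the batch as a whole almost certainly will.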
Multiple comparison corrections are methods that adjust your significance threshold to account for the number of tests you're running, keeping the overall false positive rate under control. The simplest and most conservative is the Bonferroni correction. You divide your significance threshold by the number of tests. Running 20 tests? Your new threshold becomes 0.05 divided by 20, which is 0.0025. Only results with p-values below that stricter bar count as significant. It's straightforward and easy to explain to stakeholders, but it can be too aggressive. When you have many tests, the adjusted threshold becomes so strict that you might miss real effects, trading away statistical power for safety.
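The Bonferroni adjustment is simple enough to fit in a few lines. Here is a sketch applied to the 20-segment scenario from the opening, with invented p-values: two fall below 0.05, but neither survives the adjusted threshold of 0.05 / 20 = 0.0025.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant only if it beats the
    Bonferroni-adjusted threshold: alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values for 20 segment tests: the two "winners" from the
# campaign story (0.03 and 0.041) plus 18 clearly null results.
p_values = [0.03, 0.041] + [0.2 + 0.04 * i for i in range(18)]
print(bonferroni_significant(p_values))  # all False: neither winner survives
```

Under the naive 0.05 threshold the team would have rolled out the new template to two segments; under Bonferroni, nothing clears the bar, which matches what chance alone would have produced.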
A more balanced alternative is the Benjamini-Hochberg procedure, which controls the false discovery rate rather than the overall error rate. Instead of preventing any false positive at all costs, it aims to keep the proportion of false positives among your significant results below a target, usually 5%. You rank all your p-values from smallest to largest, then compare each one to a threshold that gradually increases. The smallest p-value gets compared to 0.05 divided by the total number of tests. The second smallest gets compared to 0.05 times 2 divided by the total. And so on. You then find the largest-ranked p-value that falls below its own threshold, and everything ranked at or below it counts as significant. This approach is less conservative than Bonferroni and tends to retain more genuine findings, making it popular in fields where you're testing many hypotheses simultaneously.
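The step-up logic reads more clearly in code than in prose. A minimal sketch in pure Python, using invented p-values for ten hypothetical tests:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: returns one significance
    flag per p-value, in the original order."""
    m = len(p_values)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is declared significant.
    flags = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            flags[idx] = True
    return flags

# Ten hypothetical p-values, already sorted for readability.
p_vals = [0.001, 0.004, 0.012, 0.03, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
print(sum(benjamini_hochberg(p_vals)))  # 3 discoveries
```

On these p-values, Bonferroni's flat threshold of 0.05 / 10 = 0.005 keeps only the first two, while Benjamini-Hochberg also keeps the third (0.012 is below its rising threshold of 3/10 × 0.05 = 0.015), illustrating how the procedure trades a small, controlled share of false discoveries for extra power.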
These corrections matter whenever your digital team is testing across multiple segments, comparing results across several campaigns, or running many A/B tests from a single experimental round. If your online fundraising team tests five different email subject lines and picks the "winner" without adjusting for multiple comparisons, they may just be selecting noise. If your campaign analytics team compares petition conversion rates across a dozen landing pages, the top performer may not actually be better. Any time a report highlights the "best" or "worst" result from a batch of comparisons, ask whether the significance threshold accounts for how many comparisons were made. In grant reporting, showing that a program worked for one particular subgroup out of many tested is much less convincing if the analysis didn't correct for looking at multiple subgroups.
A single test is a focused question. Running many tests without correction turns it into a fishing expedition where you're almost guaranteed to catch something, even when the pond is empty.
See It
Drag the slider to increase the number of tests, then click "Run Tests" to simulate. Toggle the Bonferroni correction to see how it changes which results cross the threshold. Run it several times and watch how often false positives appear.
Reflect
Think about the last time your team compared results across multiple segments, campaigns, or time periods and highlighted the "winner." Was the significance threshold adjusted for the number of comparisons, or was the team unknowingly fishing for noise?
If you applied Bonferroni correction to your most recent batch of A/B tests, would any of the "significant" results survive the stricter threshold? What would that tell you about how confidently you should act on those findings?