Statistical Power
Your campaigning team spent two weeks crafting a new petition page. They swapped the headline, rewrote the call to action, and tightened the supporter testimony. Then they ran an A/B test, splitting 300 visitors between the old page and the new one. The old page converted at 4.2%. The new page hit 5.1%. That's nearly a full percentage point of improvement. But the p-value came back at 0.31, well above the conventional 0.05 threshold. "No significant difference," the analyst reported. The team shelved the new page and went back to the drawing board.
Here's the thing. The new page might genuinely be better. A 0.9 percentage-point improvement on a petition page that sees 20,000 visitors a month would mean 180 more signatures every month. But with only 150 people per group, the test never had a realistic chance of detecting an effect that size. The test wasn't powerful enough.
Statistical power is the probability that your test will correctly detect a real effect when one exists. It's the flip side of the Type II error we talked about yesterday. If beta is the chance of missing a real effect, power is one minus beta. A test with 80% power will catch a real effect 80% of the time and miss it 20% of the time. That petition page test? With 150 per group and a true difference of 0.9 percentage points, the power was only about 10%. The team was essentially flipping a coin weighted against them.
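If you want to check a number like that yourself, a few lines of Python will do it. The sketch below uses statsmodels (one library among several; nothing in this lesson depends on the tool) and treats the test as one-sided, since the only question is whether the new page beats the old one.

```python
# Power of the petition test: 4.2% vs 5.1%, 150 visitors per group.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

old_rate, new_rate = 0.042, 0.051   # observed conversion rates
n_per_group = 150                   # visitors in each arm

# Cohen's h: a standardized effect size for comparing two proportions
effect = proportion_effectsize(new_rate, old_rate)

power = NormalIndPower().power(
    effect_size=effect,
    nobs1=n_per_group,
    alpha=0.05,
    ratio=1.0,              # equal group sizes
    alternative="larger",   # one-sided: is the new page better?
)
print(f"Power: {power:.2f}")  # comes out near 0.10, nowhere near the 80% target
```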
Four things determine how much power a test has. The first is sample size. More data means less noise, which makes it easier to spot a real signal. The second is the true effect size, meaning how big the actual difference is. A petition page that's 5 percentage points better than the old one will show up clearly even in a small sample. A page that's 0.5 points better needs thousands of visitors to detect. The third is the significance threshold. If you use a stricter threshold like 0.01 instead of 0.05, you need more evidence to declare significance, which reduces your power unless you increase your sample. The fourth is the natural variability in the data. If donation amounts range from €5 to €5,000, you'll need far more donors to detect a difference in average gifts than if everyone gives between €20 and €50.
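To see those four levers move, here is a small sketch built around a hypothetical donation test. Every number in it is illustrative: a true difference of €2, gift amounts with a standard deviation of €25, 500 donors per group, and the usual 0.05 threshold.

```python
# How each lever changes power, using statsmodels' two-sample t-test power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

def power(true_diff, sd, n_per_group, alpha):
    # Cohen's d folds the true difference and the variability into one number
    return analysis.power(effect_size=true_diff / sd, nobs1=n_per_group, alpha=alpha)

print("baseline scenario:       ", round(power(2, 25, 500, 0.05), 2))
print("more data (n=2000):      ", round(power(2, 25, 2000, 0.05), 2))
print("bigger effect (4 euros): ", round(power(4, 25, 500, 0.05), 2))
print("stricter alpha (0.01):   ", round(power(2, 25, 500, 0.01), 2))
print("less variability (sd=15):", round(power(2, 15, 500, 0.05), 2))
```

More data, a bigger true effect, and less variability all push power up; tightening the threshold to 0.01 pulls it down.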
The practical lever you have the most control over is sample size. You can't make the true effect bigger (that's up to reality). You usually don't want to loosen your significance threshold (that raises your Type I error rate). And you often can't reduce natural variability. But you can decide how long to run a test and how many supporters to include.
This is why power analysis should happen before you run a test, not after. Before launching an A/B test on your fundraising appeal, ask yourself what the smallest meaningful improvement would be. Maybe you'd only bother switching if the new version raises at least €2 more per donor. Then calculate how many donors you need per group to have an 80% chance of detecting that difference. If the answer is 2,000 per group and you only have 400 donors to work with, you know in advance that the test isn't worth running at that sample size. You'll almost certainly get a "not significant" result regardless of whether the new version works.
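Here is what that pre-test calculation might look like. The €2 threshold comes from the example above; the €22.50 spread in gift amounts is an assumption standing in for whatever your own donation history shows.

```python
# How many donors per group for an 80% chance of detecting a 2-euro lift
# in average gift size?
from statsmodels.stats.power import TTestIndPower

smallest_meaningful_diff = 2.0   # euros: the smallest change worth acting on
assumed_sd = 22.5                # euros: spread in gift sizes, from your own data

n_per_group = TTestIndPower().solve_power(
    effect_size=smallest_meaningful_diff / assumed_sd,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Donors needed per group: {n_per_group:.0f}")  # close to 2,000 with these assumptions
```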
In email A/B testing, underpowered tests are everywhere. A campaign team splits a 1,000-person list to test two subject lines, sees a 1-point difference in open rates, and concludes "no winner." But the test had less than 10% power for that effect size. They'd need tens of thousands of recipients per group to reliably detect a 1-point difference. In petition page optimization, running tests with a few hundred visitors per variant means you'll only have a realistic shot at detecting large effects, on the order of 3 or more percentage points. Subtler improvements, the kind that compound over months, will slip through. In grant-funded work, an underpowered program evaluation can fail to show impact even when the program genuinely works, putting future funding for effective interventions at risk.
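The subject-line numbers can be checked the same way. The sketch below assumes a 25% baseline open rate, a made-up but plausible figure; swap in your own list's history.

```python
# Subject-line test: a 1-point lift in open rate, assumed 25% baseline.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, lifted = 0.25, 0.26          # assumed open rates
effect = proportion_effectsize(lifted, baseline)
analysis = NormalIndPower()

# Power of the test that was actually run: a 1,000-person list, 500 per group
power_run = analysis.power(effect_size=effect, nobs1=500, alpha=0.05)
print(f"Power with 500 per group: {power_run:.2f}")        # well under 0.10

# Recipients per group needed to reach 80% power
n_needed = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.80)
print(f"Needed per group for 80% power: {n_needed:,.0f}")  # tens of thousands
```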
A "not significant" result doesn't mean nothing happened. It might mean your test wasn't looking hard enough. Power analysis before the test tells you whether the question is even worth asking at your current sample size.
See It
Drag the effect size slider to see how the required sample size changes. The curve shows how power grows with more data. The dashed line marks 80% power, the conventional target.
Reflect
Think about the last A/B test your team ran that came back "not significant." How many people were in each group, and how large was the observed difference? Is it possible the test simply didn't have enough power to detect a real improvement?
Before your next test, try estimating the smallest difference that would actually matter for your organization. Then look at whether your available sample size gives you a realistic chance of detecting it. Would the answer change what you decide to test?