Effect Size

Your digital campaigning team just finished an A/B test on an email action alert. They sent two versions to 250,000 supporters each. Version B got a click-to-action rate of 3.50%, compared to Version A's 3.38%. The p-value came back at 0.02. Statistically significant. The team lead is ready to declare Version B the winner and roll it out for the next campaign cycle. But stop for a moment. That difference is 0.12 percentage points. On a list of 250,000, that's 300 extra clicks. Is that worth the effort of maintaining two templates and running future tests against this new "champion"?

This is the gap that p-values can't close on their own. A p-value tells you how unlikely a difference this large would be if there were no real effect. It says nothing about whether the difference is large enough to act on. With a huge sample, even a trivially small difference can reach significance. Effect size is the measurement that fills this gap. It quantifies how big the difference is, not just whether it exists.
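To make the gap concrete, here is a minimal sketch in Python (the function name and the use of scipy for the normal tail are mine) that runs the opening scenario through a standard two-proportion z-test. The p-value clears the usual 0.05 bar, but the raw lift it certifies is a fraction of a percentage point.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in proportions, plus the raw lift."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * norm.sf(abs(z))   # raw lift, two-sided p-value

# The opening scenario: 3.38% vs 3.50% click-to-action on 250,000 sends each.
lift, p = two_proportion_test(clicks_a=8_450, n_a=250_000,
                              clicks_b=8_750, n_b=250_000)
print(f"raw lift: {lift:.2%}")                              # ~0.12 percentage points
print(f"p-value: {p:.3f}")                                  # ~0.020, "significant"
print(f"extra clicks on this list: {lift * 250_000:.0f}")   # ~300
```

Both numbers fall out of the same arithmetic; the p-value just isn't the one that tells you whether 300 extra clicks justifies a new champion template.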

The simplest version of effect size is the raw difference itself. Version B beats Version A by 0.04 percentage points. That's an effect size expressed in the original units of your measurement. It's intuitive and easy to communicate, but it has a limitation. If someone tells you their fundraising appeal increased average donations by €3, you can't immediately tell whether that's impressive or negligible without knowing whether the typical donation is €15 or €500. A €3 increase on a €15 average is a 20% jump. The same €3 on a €500 average is barely a rounding error.
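If it helps to see that arithmetic spelled out, here is a two-line sketch of the €3 example:

```python
def relative_lift(baseline, increase):
    """Express a raw difference as a share of the baseline it sits on."""
    return increase / baseline

print(f"{relative_lift(15, 3):.0%}")    # 20% of a €15 average -- a big jump
print(f"{relative_lift(500, 3):.1%}")   # 0.6% of a €500 average -- a rounding error
```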

Standardized effect sizes solve this problem by expressing the difference relative to the variability in the data. The most common one is Cohen's d, which divides the difference between two group means by the pooled standard deviation. If the average donation in Group A is €42 and in Group B is €48, and the standard deviation across both groups is €20, then Cohen's d is 6 divided by 20, which gives you 0.3. The difference is three-tenths of a standard deviation. That single number lets you compare effects across completely different contexts. An effect of 0.3 in email open rates and an effect of 0.3 in petition conversion rates represent the same relative magnitude, even though the raw numbers look nothing alike.
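A sketch of that calculation in Python, assuming you already have each group's mean and standard deviation (the group sizes of 1,000 below are invented for illustration; with equal standard deviations they don't change the answer):

```python
from math import sqrt

def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    pooled_sd = sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
    return (mean_b - mean_a) / pooled_sd

# The donation example: €42 vs €48 with a spread of €20 in both groups.
print(cohens_d(mean_a=42, mean_b=48, sd_a=20, sd_b=20, n_a=1_000, n_b=1_000))  # 0.3
```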

Jacob Cohen, the statistician who popularized these measures, offered rough benchmarks. A d of 0.2 is considered small, 0.5 is medium, and 0.8 is large. These benchmarks are widely used but they're guidelines, not rules. In digital campaigning, where you're optimizing at scale, a "small" effect of 0.2 applied to 500,000 email recipients can translate into thousands of extra petition signatures or action completions. Context determines what counts as meaningful.

This connects directly to the statistical power and sample size calculations from the last two entries. When you plan an A/B test, the effect size you want to detect is one of the key inputs. If you're hoping to detect a small effect (d = 0.2), you'll need a much larger sample than if you're looking for a large one (d = 0.8). An honest conversation about effect size before launching a test forces your team to articulate what "meaningful improvement" actually means. Not just "any improvement" but an improvement large enough to justify the costs of implementation.
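To see how sharply the required sample grows as the target effect shrinks, here is a rough sketch using the normal approximation for a two-sided, two-group comparison at 80% power (the exact t-test calculation comes out a participant or two higher):

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect effect size d
    in a two-sided, two-group test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(0.2))  # ~393 per group to detect a "small" effect
print(n_per_group(0.8))  # ~25 per group to detect a "large" effect
```

Roughly a sixteen-fold difference in sample size, purely because the target effect shrank from 0.8 to 0.2.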

In email A/B testing, reporting the raw difference alongside a standardized measure gives your team both the intuitive number ("0.8 more percentage points in open rate") and a comparison that travels across tests (Cohen's d, or its analogue for proportions, Cohen's h). In program evaluation for grant reporting, effect sizes tell funders how much impact the program had in a way that sample size can't inflate. A program that shifts civic engagement by d = 0.6 among 80 participants is more impressive than one that achieves d = 0.05 among 50,000, even though the latter might have a smaller p-value. In online petition optimization, tracking effect sizes across multiple tests over time reveals whether your improvements are getting larger or whether you're hitting diminishing returns, something raw conversion rates alone won't show.
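For open and click rates, Cohen's h compares the two proportions on an arcsine scale and uses the same small/medium/large benchmarks as d. A sketch, with a hypothetical 20.0% baseline open rate (that figure is assumed, not from the example above):

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Cohen's h: standardized difference between two proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

# Hypothetical open rates: 20.0% for the control, 20.8% for the variant.
print(f"raw lift: {0.208 - 0.200:.1%}")            # 0.8 percentage points
print(f"Cohen's h: {cohens_h(0.200, 0.208):.3f}")  # ~0.020
```

On the standardized scale that 0.8-point lift is tiny, which is exactly the situation from the benchmarks discussion: a small standardized effect can still be worth shipping when the list is large.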

Statistical significance tells you the difference you saw probably isn't just noise. Effect size tells you whether it's big enough to matter. Always report both.


See It

Drag the slider to move the two group distributions apart. Watch how Cohen's d changes and see when the overlap between groups starts to shrink meaningfully.


Reflect

Think about the last A/B test or campaign comparison where your team celebrated a "significant" result. How large was the actual difference in practical terms? Would you have made the same decision if you'd also reported the effect size?

When planning your next experiment, try defining the smallest effect size that would justify changing your approach. How does that number compare to the effects you've seen in past tests?