Mann-Whitney U Test

Your organization just ran two versions of an online fundraising appeal tied to an upcoming vote on environmental legislation. The standard version opened with policy language and a "Donate Now" button. The story-led version featured a first-person account from a community organizer affected by the issue. After two weeks, the standard appeal brought in donations clustering between €10 and €30, with a few gifts above €40. The story-led appeal produced a messier spread, with donations scattered between €25 and €90, heavily skewed toward the lower end.

The campaigner wants to know which appeal performed better. Your instinct might be to compare average donation amounts using a t-test. But when you plot the data, neither group looks anything like a bell curve. The donation amounts are right-skewed with long tails, exactly the shape we explored when discussing skewness. With only a few dozen gifts per group, the central limit theorem hasn't kicked in enough to save you. The t-test's assumptions are on shaky ground.

This is where the Mann-Whitney U test comes in. Instead of comparing averages, it asks a different question altogether. Does one group tend to produce larger values than the other? It answers this by ignoring the actual euro amounts and working entirely with ranks. You take all the values from both groups and combine them into a single list. Then you sort that list from smallest to largest and assign each value a rank, with 1 for the smallest, 2 for the next, and so on. If two values are tied, they share the average of the ranks they would have occupied. Once every value has a rank, you add up the ranks separately for each group. If one group's donations are genuinely larger, its values will cluster toward the higher ranks and its rank sum will be disproportionately large. If the groups are similar, their ranks will be interleaved and the sums will be close to what you'd expect by chance.
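The ranking procedure above can be sketched in a few lines of plain Python. The donation amounts here are made up for illustration; only their ordering matters.

```python
# Hypothetical donation amounts (euros) for the two appeals.
standard = [12, 18, 22, 25, 30, 45]   # standard appeal
story = [25, 28, 40, 55, 70, 90]      # story-led appeal

# Step 1: combine both groups into one sorted list.
combined = sorted(standard + story)

# Step 2: assign ranks 1..N, giving tied values the average
# of the ranks they would have occupied.
rank_of = {}
i = 0
while i < len(combined):
    j = i
    while j < len(combined) and combined[j] == combined[i]:
        j += 1
    rank_of[combined[i]] = (i + 1 + j) / 2  # average of positions i+1 .. j
    i = j

# Step 3: sum the ranks separately for each group.
# (The tied value 25 appears in both groups and shares rank 4.5.)
rank_sum_standard = sum(rank_of[v] for v in standard)  # 26.5
rank_sum_story = sum(rank_of[v] for v in story)        # 51.5

# Sanity check: the two rank sums must add up to N(N+1)/2.
n = len(combined)
assert rank_sum_standard + rank_sum_story == n * (n + 1) / 2
```

With these numbers the story-led appeal's rank sum (51.5) is well above the standard appeal's (26.5), which is the pattern the test looks for.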

The U statistic formalizes this comparison. It counts, in essence, how many times a value from one group exceeds a value from the other, with ties counted as half. A U near zero means every observation in one group is smaller than every observation in the other, which is about as different as two groups can get. A U near the midpoint, half the product of the two sample sizes, means the groups are thoroughly mixed. Just like the t-test, you convert U into a p-value to judge whether the observed difference in ranks is large enough to be unlikely under the null hypothesis that the two groups come from the same distribution.
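A quick sketch of how U relates to the rank sums, assuming scipy is available and reusing the same made-up donation figures: a group's U is its rank sum minus the smallest rank sum that group could possibly have, n(n+1)/2.

```python
from scipy.stats import mannwhitneyu, rankdata

standard = [12, 18, 22, 25, 30, 45]  # hypothetical donations (euros)
story = [25, 28, 40, 55, 70, 90]

# Rank the pooled data (rankdata averages ranks across ties),
# then derive U from the first group's rank sum.
ranks = rankdata(standard + story)
r1 = ranks[: len(standard)].sum()
n1, n2 = len(standard), len(story)
u_manual = r1 - n1 * (n1 + 1) / 2

# scipy computes the same U (reported for the first sample) and a
# p-value under the null that both groups share one distribution.
u_stat, p_value = mannwhitneyu(standard, story, alternative="two-sided")

assert u_manual == u_stat  # both 5.5 for these numbers
```

Here U can range from 0 to n1 × n2 = 36, with a midpoint of 18, so 5.5 sits well toward the "standard appeal tends to be smaller" end.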

The beauty of this approach is what it doesn't require. Because it works with ranks rather than raw values, the Mann-Whitney U test makes no assumptions about the shape of your data. It doesn't care whether donations follow a bell curve, whether they're skewed, or whether they contain extreme values. That €500 gift that would pull the mean of a small group sideways? In the ranking, it simply gets the highest rank, carrying no more influence than the next-largest value. This makes the Mann-Whitney U test nonparametric, meaning it doesn't depend on any assumptions about the underlying distribution.
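The outlier resistance is easy to check directly. In this sketch (same hypothetical figures, scipy assumed), inflating the largest story-led gift to €500 moves that group's mean dramatically but leaves U untouched, because the ordering of the values never changes.

```python
from statistics import mean
from scipy.stats import mannwhitneyu

standard = [12, 18, 22, 25, 30, 45]    # hypothetical donations (euros)
story = [25, 28, 40, 55, 70, 90]
story_big = [25, 28, 40, 55, 70, 500]  # the top gift balloons to €500

u_before, _ = mannwhitneyu(standard, story, alternative="two-sided")
u_after, _ = mannwhitneyu(standard, story_big, alternative="two-sided")

assert u_before == u_after                 # U is identical: order is unchanged
assert mean(story_big) > 2 * mean(story)   # the mean more than doubles
```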

You'll find yourself reaching for this test throughout digital advocacy work. Comparing donation amounts between two appeal versions when sample sizes are small and the data is heavily skewed is the classic case, and it's exactly the scenario the t-tests entry flagged as problematic. It's equally useful for comparing engagement scores between two supporter segments when those scores aren't bell-shaped. If you're evaluating lobby meeting outcomes between two regional chapters using ordinal ratings on a 1-to-5 scale, the Mann-Whitney U handles ordinal data gracefully because ordinal data is already about ordering. And when you're comparing petition signature counts between two campaign strategies with small samples, this test gives you a reliable answer where the t-test might not.
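Because the test only uses order, ordinal data drops straight in. A sketch with hypothetical 1-to-5 meeting-outcome ratings for two chapters (scipy assumed); the heavy ties are handled by average ranks and scipy's tie correction.

```python
from scipy.stats import mannwhitneyu

# Hypothetical lobby-meeting outcome ratings on a 1-to-5 scale.
chapter_a = [2, 3, 3, 4, 2, 3, 4, 5]
chapter_b = [1, 2, 2, 3, 1, 2, 3, 2]

# Only the ranks drive the result, so the test never pretends
# that a rating of 4 is "twice as good" as a rating of 2.
u_stat, p_value = mannwhitneyu(chapter_a, chapter_b, alternative="two-sided")
```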

The tradeoff is worth understanding. By converting your data to ranks, you throw away some information. The gap between a €10 and a €50 donation is treated the same as the gap between €50 and €51, because ranking only cares about order, not magnitude. When your data genuinely is bell-shaped and your samples are large enough, the t-test will have more statistical power to detect differences because it uses all the information in the raw numbers. The Mann-Whitney U is the right tool when you can't trust that your data meets the t-test's assumptions, or when you're working with ordinal measurements that have no meaningful distances between values.

When your data is skewed, your samples are small, or your measurements are ordinal, the Mann-Whitney U test lets you compare two groups without pretending the data is something it's not. It trades a little power for a lot of robustness.


See It

Drag the slider to shift the Story-Led Appeal donations up or down. Watch how the combined ranking changes and see the U statistic and p-value respond as the groups separate or overlap.


Reflect

Think about the last time your organization compared a metric across two groups, whether donation amounts, petition signatures, or engagement scores. Was the data roughly bell-shaped, or was it skewed? If it was skewed, would a rank-based comparison have been more appropriate than comparing averages?

When you see "average donation amount" in a campaign report, consider what information the average might be hiding. Would knowing which campaign tends to produce higher-ranked gifts tell you something different from knowing which campaign has the higher average?