Wilcoxon Signed-Rank Test

Your organization just finished a three-month digital campaigning skills program for 14 regional coordinators. Before the program, each coordinator ran an email action alert and you recorded the click-to-action rate, the percentage of recipients who clicked through and completed the petition. After the program, each coordinator ran another alert on the same topic to a comparable audience segment, and you recorded the rate again. Coordinator by coordinator, you can see who improved and who didn't. Some jumped from 4% to 9%. Others barely moved. One actually dropped from 7% to 5%. The campaign director wants to know whether the program made a measurable difference.

Your first instinct might be a paired t-test, which is designed for exactly this situation: the same people, measured twice. But with only 14 pairs, the t-test needs the differences between before and after scores to follow a roughly normal distribution. When you look at those differences, a few coordinators improved dramatically while most improved modestly, creating a lopsided, skewed set of differences. The t-test's assumptions start to wobble.

The Wilcoxon signed-rank test handles this gracefully. Think of it as the paired sibling of the Mann-Whitney U test. Where the Mann-Whitney compares two independent groups by ranking their combined values, the Wilcoxon signed-rank test works with paired observations and ranks the differences within each pair. It keeps the pairing structure that makes before-and-after comparisons powerful while dropping the assumption that those differences need to be bell-shaped.

Here is how it works. For each coordinator, you calculate the difference between the after score and the before score. If someone went from 4% to 9%, their difference is +5 percentage points. If someone dropped from 7% to 5%, their difference is -2. Any pair with a difference of exactly zero gets set aside, because a zero difference carries no information about direction. Next, you ignore the signs temporarily and rank the absolute values of the remaining differences from smallest to largest. The smallest absolute difference gets rank 1, the next gets rank 2, and so on. If two differences share the same absolute value, they split the ranks they would have occupied, just as in the Mann-Whitney. Once every nonzero difference has a rank, you restore the original signs. Now each rank is either positive (the coordinator improved) or negative (the coordinator got worse).
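The steps above can be sketched in a few lines of Python. The before-and-after rates here are made-up illustrations, not data from the scenario:

```python
def signed_ranks(before, after):
    """Difference each pair, drop zeros, rank the absolute values
    (ties share the average rank), then restore the signs."""
    diffs = [a - b for a, b in zip(after, before) if a != b]  # zeros set aside
    abs_sorted = sorted(abs(d) for d in diffs)

    def avg_rank(v):
        # average of the 1-based positions this absolute value occupies
        positions = [i + 1 for i, x in enumerate(abs_sorted) if x == v]
        return sum(positions) / len(positions)

    return [avg_rank(abs(d)) if d > 0 else -avg_rank(abs(d)) for d in diffs]

before = [4, 7, 5, 6, 3, 6]   # hypothetical pre-training rates (%)
after  = [9, 5, 5, 8, 6, 7]   # hypothetical post-training rates (%)
print(signed_ranks(before, after))  # [5.0, -2.5, 2.5, 4.0, 1.0]
```

Notice the tied pair: the -2 and +2 differences split ranks 2 and 3, so each gets 2.5, and the coordinator whose rate didn't change (5% to 5%) drops out entirely.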

The test statistic, often called W, is the sum of the positive ranks. If the training program genuinely helped, most differences will be positive and the larger improvements will carry the higher ranks, pushing W toward a large value. If the program had no effect, you'd expect positive and negative ranks to be roughly balanced, with W landing near the middle of its possible range. You compare W against what chance alone would produce under the null hypothesis that the program made no difference. That comparison yields a p-value, telling you how often you'd see a W this extreme if the training truly had zero impact.
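With only a handful of pairs, that comparison against chance can be done by brute force: under the null hypothesis, each rank is equally likely to be positive or negative, so you can enumerate every possible sign pattern. A sketch with five hypothetical signed ranks (in practice, a library routine such as scipy.stats.wilcoxon handles this for you):

```python
from itertools import product

signed = [5.0, -2.5, 2.5, 4.0, 1.0]   # hypothetical signed ranks

# W is the sum of the positive ranks
W = sum(r for r in signed if r > 0)

# Under the null, each rank's sign is a fair coin flip, so enumerate
# all 2^n sign patterns and compute W for each one
abs_ranks = [abs(r) for r in signed]
null_ws = [sum(r for r, s in zip(abs_ranks, signs) if s > 0)
           for signs in product([1, -1], repeat=len(abs_ranks))]

# One-sided p-value: how often chance alone produces a W this large
p = sum(w >= W for w in null_ws) / len(null_ws)
print(W, p)  # 12.5 0.125
```

Exact enumeration grows as 2^n, which is why statistical software typically switches to a normal approximation once the sample gets larger.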

The strength of this approach is what it sidesteps. Because it works with ranks rather than raw values, the Wilcoxon signed-rank test doesn't care whether the differences are normally distributed, whether a few coordinators improved wildly more than others, or whether your metric is on a true interval scale. That one coordinator who jumped 12 percentage points doesn't dominate the result the way it would in a paired t-test, because in the ranking it simply gets the top rank regardless of how far it is from the next-largest improvement. This makes the test nonparametric, just like its independent-groups cousin the Mann-Whitney U.
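A tiny demonstration of that insensitivity, using two hypothetical sets of differences that are identical except for the size of the largest jump:

```python
def ranks_of_abs(diffs):
    # rank of each absolute difference, smallest = 1 (no ties here)
    ordered = sorted(abs(d) for d in diffs)
    return [ordered.index(abs(d)) + 1 for d in diffs]

modest  = [1, 2, 3, 4, 12]   # largest improvement: 12 points
extreme = [1, 2, 3, 4, 50]   # largest improvement: 50 points

print(ranks_of_abs(modest))   # [1, 2, 3, 4, 5]
print(ranks_of_abs(extreme))  # [1, 2, 3, 4, 5] -- same ranks, so same W
```

A paired t-test would treat these two datasets very differently, because the 50-point jump inflates both the mean and the variance of the differences; the signed-rank test treats them identically.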

You'll find uses for this test throughout digital advocacy work. When you measure email open rates for the same campaign segments before and after a subject line overhaul, you have paired data that's often skewed. When coordinators rate their confidence in running digital campaigns on a 1-to-10 scale before and after training, those ordinal ratings don't have guaranteed equal spacing between values, making rank-based methods a natural fit. If you track petition conversion rates on the same landing pages before and after a redesign, each page serves as its own pair. And when you compare the number of lobby actions completed by the same group of supporters before and after a targeted re-engagement campaign, the Wilcoxon signed-rank test gives you a trustworthy answer even with modest sample sizes and messy distributions.

The tradeoff mirrors the one we saw with the Mann-Whitney. By converting differences to ranks, you lose information about the magnitude of those differences. When your paired data genuinely produces normally distributed differences and your sample is large enough, the paired t-test will have slightly more statistical power to detect a real change. The Wilcoxon signed-rank test is the right choice when you can't trust that normality assumption, when your sample is small, or when your measurements are ordinal rather than truly continuous.

When you measure the same group before and after a change and the differences are skewed, small in number, or ordinal, the Wilcoxon signed-rank test tells you whether the shift is real without forcing your data into assumptions it can't meet.


See It

Drag any "After" dot up or down to change that coordinator's post-training score. Watch the signed ranks, W statistic, and p-value update as the pattern of improvement shifts.


Reflect

Think about a recent before-and-after comparison at your organization, whether it was engagement scores, action rates, or survey responses. Were the same people or units measured both times? If so, did the analysis account for that pairing, or were the two time points treated as independent groups?

When you see improvement across most but not all members of a group, how do you weigh the few who got worse against the many who got better? Does knowing that a rank-based test handles this naturally change how you'd report the results?