t-tests (One-Sample, Two-Sample, Paired)
Your organization sent an action alert to two randomly selected groups of 80 supporters each. Group A received the standard template with a policy summary and a "Sign Now" button. Group B got a rewritten version leading with a personal story from someone affected by the issue. Group A averaged 1.9 actions per person over the following week. Group B averaged 2.4. That's a difference of 0.5 actions per supporter. The campaigner sees it as proof the new template works. But the data analyst has seen gaps like this vanish before. With only 80 people in each group, random variation can produce differences that look convincing but mean nothing.
The chi-squared test you learned about recently handles categorical outcomes, things like "signed" versus "didn't sign." But here you're comparing averages, continuous numbers like actions per person or average donation amount. For that, you need the t-test.
The t-test is the most commonly used tool in applied statistics for comparing averages. It asks a simple question: given the difference you observed and the amount of variation in your data, is this gap large enough to be unlikely under the null hypothesis? Or could samples of this size easily produce a gap this big just through chance?
The mechanics work like this. You take the difference between your two group averages and divide it by a measure of how much uncertainty surrounds that difference. That measure of uncertainty depends on two things: the standard deviation within each group (how spread out individual results are) and the number of people in each group (more people means less uncertainty). The result is a single number called the t-statistic. A t-statistic close to zero means the difference is small relative to the noise. A large t-statistic, positive or negative, means the gap stands out above the variability. You then convert that t-statistic into a p-value to decide whether the result is statistically significant.
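The arithmetic above can be sketched directly for the action-alert example. The means and group sizes come from the scenario; the standard deviations are assumed for illustration (the text doesn't give them), using Welch's version of the two-sample formula, which doesn't require equal spread in the two groups:

```python
import math
from scipy import stats

# Means and group sizes from the action-alert example; the standard
# deviations are ASSUMED for illustration -- the text doesn't give them.
mean_a, sd_a, n_a = 1.9, 1.6, 80  # standard template
mean_b, sd_b, n_b = 2.4, 1.8, 80  # personal-story template

# Standard error of the difference (Welch, unequal variances allowed)
se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
t_stat = (mean_b - mean_a) / se

# Welch-Satterthwaite approximation for the degrees of freedom
df = se**4 / ((sd_a**2 / n_a) ** 2 / (n_a - 1) + (sd_b**2 / n_b) ** 2 / (n_b - 1))

# Two-sided p-value: probability of a gap this large under the null
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(f"t = {t_stat:.2f}, df = {df:.0f}, p = {p_value:.4f}")
```

With these assumed spreads, the 0.5-action gap yields t ≈ 1.86 and p ≈ 0.065: suggestive, but not significant at the conventional 0.05 level, which is exactly the analyst's caution in action.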
What makes the t-test special is that it works well even with small samples, which is exactly the situation most nonprofit teams face. Unlike the normal distribution approach that requires large samples, the t-test uses a slightly different bell curve called the t-distribution that accounts for the extra uncertainty you carry when estimating variability from limited data. With 20 or 30 people per group, that correction matters. With hundreds, the t-distribution looks nearly identical to the normal curve and the distinction fades.
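You can see that convergence numerically by comparing two-sided 95% critical values. This short check uses scipy's distribution functions:

```python
from scipy import stats

# Two-sided 95% cutoff under the normal curve
z_crit = stats.norm.ppf(0.975)  # about 1.96

# The t-distribution demands a larger cutoff at small df,
# reflecting the extra uncertainty from estimating the spread.
for df in (10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df = {df:>4}: t critical = {t_crit:.3f} (normal: {z_crit:.3f})")
```

At 10 degrees of freedom the cutoff is about 2.23; by 1000 it is essentially the normal value of 1.96, which is the "distinction fades" point in practice.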
There are three flavors of t-test, and picking the right one depends on your data structure. The one-sample t-test compares a single group's average against a known benchmark. Your organization's historical email open rate is 22%. You redesigned the subject lines last quarter and the 50 most recent sends average 24.8%. Is that improvement real, or just a lucky stretch? The one-sample t-test answers that by checking whether 24.8% is far enough from 22% to be unlikely under normal variation.
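A minimal sketch of the one-sample version, using `scipy.stats.ttest_1samp`. The 50 open rates here are simulated to average roughly 24.8%; they are illustrative, not real campaign data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# SIMULATED open rates (%) for the 50 most recent sends --
# generated to hover near 24.8%, not real campaign numbers.
open_rates = rng.normal(loc=24.8, scale=6.0, size=50)

benchmark = 22.0  # historical open rate from the example
t_stat, p_value = stats.ttest_1samp(open_rates, popmean=benchmark)
print(f"sample mean = {open_rates.mean():.1f}%, t = {t_stat:.2f}, p = {p_value:.4f}")
```

Under the hood this is just the sample mean minus the benchmark, divided by the standard error of the mean, which you can verify by hand.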
The two-sample t-test (also called the independent samples t-test) compares the averages of two separate groups, like the action alert example above. The people in Group A and Group B are different individuals, randomly assigned. This is the version you'll use most often in A/B testing for campaign experiments, comparing petition sign rates between two landing page designs, or evaluating whether supporters recruited through paid ads engage differently from those who found you organically.
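A sketch of the two-sample version with `scipy.stats.ttest_ind`, again on simulated data: the draws are centered on the averages from the action-alert example, but the individual values are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# SIMULATED "actions per supporter" counts, centered on the example's
# averages -- illustrative draws, not the real experiment's data.
group_a = rng.poisson(lam=1.9, size=80)  # standard template
group_b = rng.poisson(lam=2.4, size=80)  # personal-story template

# equal_var=False selects Welch's test, which doesn't assume the two
# groups have the same spread -- a safe default for campaign data.
t_stat, p_value = stats.ttest_ind(group_b, group_a, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Passing Group B first means a positive t-statistic corresponds to the new template outperforming the standard one.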
The paired t-test handles situations where the same people appear in both conditions. Say you surveyed 60 supporters about their engagement with your campaigns, then ran a six-week digital mobilization program, then surveyed the same 60 people again. Each person has a "before" and "after" score. Because the same individual contributes to both measurements, their scores are linked. The paired t-test works with the differences within each person rather than comparing two separate groups, which makes it more sensitive to genuine changes because it removes person-to-person variability from the equation.
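The paired version is `scipy.stats.ttest_rel`. The before/after scores below are simulated with an assumed average lift of 0.6 points; the second call shows that the paired test is literally a one-sample test on the within-person differences:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# SIMULATED before/after engagement scores for 60 supporters.
# Each "after" score is tied to the same person's "before" score;
# the 0.6-point average lift is an assumption for illustration.
before = rng.normal(loc=5.0, scale=1.5, size=60)
after = before + rng.normal(loc=0.6, scale=1.0, size=60)

t_stat, p_value = stats.ttest_rel(after, before)

# Equivalent view: a one-sample t-test on the per-person differences,
# which is why pairing strips out person-to-person variability.
t_diff, p_diff = stats.ttest_1samp(after - before, popmean=0.0)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```

If you ran `ttest_ind` on the same data instead, you would be throwing away the pairing and typically get a weaker result, which is the mistake the Reflect prompt below asks you to look for.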
Knowing which flavor to use shows up constantly in digital advocacy work. When you compare this quarter's average petition signatures per campaign against last quarter's benchmark, that's a one-sample test. When you randomly split your email list to test two versions of an action alert, that's a two-sample test. When you measure each regional chapter's lobby meeting attendance before and after a new training program, that's a paired test because you're tracking the same chapters across time.
One important caution: the t-test assumes your data is roughly normally distributed, or that your samples are large enough for the central limit theorem to kick in. For most nonprofit metrics with 30 or more observations per group, this is usually fine. With very small samples or heavily skewed data like donation amounts, you might need a non-parametric alternative such as the Mann-Whitney U test, which compares groups using ranks instead of raw values. But for the bread-and-butter comparisons that fill your campaign reports, the t-test is the right starting point.
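For the skewed-donations case, a rank-based test like Mann-Whitney U sidesteps the normality assumption. The lognormal draws below are a stand-in for right-skewed donation amounts; the group parameters are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# SIMULATED donation amounts: lognormal draws mimic the heavy right
# skew of real giving data. Group parameters are assumed.
donations_a = rng.lognormal(mean=3.0, sigma=1.0, size=40)
donations_b = rng.lognormal(mean=3.3, sigma=1.0, size=40)

# Mann-Whitney U compares the rank ordering of the two groups,
# so a few huge gifts can't dominate the result.
u_stat, p_value = stats.mannwhitneyu(
    donations_b, donations_a, alternative="two-sided"
)
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```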
Always pair your t-test result with an effect size and a confidence interval. The p-value tells you how unlikely a gap this large would be if there were no real difference. The effect size tells you whether the gap is big enough to matter. The confidence interval tells you the plausible range for the true difference. Together, those three numbers give you everything you need to make a sound decision.
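Those two companion numbers are easy to compute by hand. This sketch uses simulated data shaped like the action-alert example (the spread of 1.7 is assumed), with Cohen's d as the effect size and a pooled-variance 95% interval for the true difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# SIMULATED data shaped like the action-alert example; the shared
# spread of 1.7 actions is an assumption for illustration.
group_a = rng.normal(loc=1.9, scale=1.7, size=80)
group_b = rng.normal(loc=2.4, scale=1.7, size=80)

diff = group_b.mean() - group_a.mean()
n_a, n_b = len(group_a), len(group_b)

# Cohen's d: the mean difference scaled by the pooled standard
# deviation, so it's comparable across metrics and campaigns.
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)
cohens_d = diff / np.sqrt(pooled_var)

# 95% confidence interval for the true difference in means.
se = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_crit = stats.t.ppf(0.975, df=n_a + n_b - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"difference = {diff:.2f}, d = {cohens_d:.2f}, "
      f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

A rough convention reads d around 0.2 as small, 0.5 as medium, and 0.8 as large, but your own threshold of practical impact should drive the decision.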
The t-test is your go-to tool whenever you need to compare averages between groups or against a benchmark. Pick the right flavor for your data structure, and let it cut through the noise to tell you whether the difference you see is worth acting on.
See It
Drag the sliders to change each group's average and spread. Watch the t-statistic and p-value update in real time as the groups overlap more or less.
Reflect
Look at the last A/B test or before-and-after comparison your organization ran. Was the right type of t-test used for the data structure, or were results from paired data analyzed as if the groups were independent? If you aren't sure, check whether the same people appeared in both conditions.
When you see a "statistically significant" difference in a campaign report, do you also see the effect size? A significant result with a tiny effect might not be worth changing your strategy over. What threshold of practical impact would make you act?