Imputation Methods

You're preparing a report on donor giving for the board. You pull up the spreadsheet and find that 40 out of 200 records are missing the "last gift amount" field. Maybe the data wasn't entered. Maybe the CRM migration dropped some values. Whatever the reason, you now have a choice. What do you put in those empty cells?

This question matters more than it seems. In the previous entry on missing data, we looked at why data goes missing and how the reason affects whether your remaining numbers are trustworthy. Now comes the practical follow-up. Once you know you have gaps, what do you actually do about them?

The simplest option is deletion. Just remove the 40 incomplete rows and analyze the remaining 160. This is what most spreadsheet users do by instinct, and it works fine when the missing data is truly random and you have plenty of rows left. But deletion shrinks your dataset, which means less precision and wider uncertainty around your estimates. And if the missing rows aren't random (if, say, lapsed donors are the ones with blank gift amounts), then deleting them biases your analysis. You end up studying only the donors who stuck around, which paints a rosier picture than reality.
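As a minimal sketch with hypothetical numbers (a ten-row toy list standing in for the 200-record spreadsheet), deletion is just filtering out the blanks, and the shrinkage is visible immediately:

```python
import statistics

# Hypothetical gift amounts; None marks a missing value.
gifts = [50, None, 30, 80, None, 45, 60, None, 25, 70]

# Listwise deletion: keep only the complete records.
complete = [g for g in gifts if g is not None]

print(len(gifts), len(complete))            # dataset shrinks from 10 rows to 7
print(round(statistics.mean(complete), 2))  # mean of the survivors only
```

If the three missing donors were systematically smaller givers, that mean of the survivors overstates typical giving, and no amount of arithmetic on the remaining rows can reveal it.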

Mean imputation fills every gap with the average of the values you do have. If the 160 known gift amounts average €45, you plug €45 into all 40 blanks. The upside is that your overall mean stays the same. The downside is that you've just told the data that 40 donors all gave exactly the same amount. That artificially compresses your spread. The standard deviation drops because you've replaced genuine variation with copies of a single number. Any analysis that depends on how spread out the data is (and most do) will be distorted. Relationships between variables get muted too. If gift amount is connected to event attendance, those connections weaken because the imputed values carry no information about attendance at all.

Median imputation works the same way but uses the middle value instead of the mean. It's more robust when your data is skewed, which donor data almost always is. A handful of large gifts pull the mean upward, so filling blanks with the mean places them higher than most real donors actually gave. The median puts the imputed values closer to where the typical donor sits. But it shares the same fundamental flaw as mean imputation. You're still replacing unknown variation with a single repeated number, and the spread still shrinks.
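A small hypothetical example shows why the median is the safer fill value for skewed giving data. One large gift drags the mean far above the typical donor, while the median stays put:

```python
import statistics

# Hypothetical skewed gifts: one major donor pulls the mean up.
gifts = [20, 25, 30, None, 35, 40, None, 500]
known = [g for g in gifts if g is not None]

print(statistics.mean(known))    # inflated by the single large gift
print(statistics.median(known))  # sits near the typical donor

# Median imputation: same mechanics as mean imputation, sturdier center.
filled = [statistics.median(known) if g is None else g for g in gifts]
```

Here the mean is over €100 while the median is €32.50, so mean imputation would credit every missing donor with roughly triple what a typical donor gave. The spread still shrinks either way.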

Regression imputation is smarter. Instead of using one number for every blank, it uses the other columns in your data to make a tailored guess. If you know each donor's tenure, number of events attended, and communication preferences, a regression model can predict what their gift amount likely was based on donors who look similar. Each imputed value is different, reflecting the donor's own characteristics. This preserves more of the natural variation and keeps relationships between variables intact. The risk is that regression imputation can be too neat. Every imputed value lands exactly on the predicted line, making the data look more orderly than it really is. Real data has noise, and regression imputation strips that noise out.
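A stripped-down sketch of the idea, using one hypothetical predictor (events attended) and a hand-rolled least-squares line rather than a statistics library, looks like this:

```python
import statistics

# Hypothetical paired data: events attended (always known) vs gift amount.
events = [1, 2, 3, 4, 5, 6, 7, 8]
gifts  = [22, 28, None, 41, 50, None, 68, 75]

# Fit y = a + b*x on the complete pairs only (ordinary least squares).
pairs = [(x, y) for x, y in zip(events, gifts) if y is not None]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
mx, my = statistics.mean(xs), statistics.mean(ys)
b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Each blank gets its own tailored prediction from the donor's events count.
filled = [a + b * x if y is None else y for x, y in zip(events, gifts)]
```

Unlike mean or median imputation, the two filled values differ from each other because the two donors differ. But both land exactly on the fitted line, which is the overconfidence the next method corrects.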

Multiple imputation addresses this by adding randomness back in. Instead of generating one set of filled-in values, it generates several (typically five to twenty), each with slightly different imputed values drawn from a plausible range. You analyze each completed dataset separately, then combine the results. This approach accounts for the fact that you genuinely don't know what the missing values were. The combined estimates are wider, reflecting your honest uncertainty. Multiple imputation is the gold standard in research for a reason, though it requires statistical software that supports it.
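A deliberately simplified sketch of the workflow (real multiple imputation draws from a fitted model, often via chained equations in statistical software; here the blanks are drawn from a normal distribution matching the observed values, purely to show the generate-analyze-pool loop):

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical gifts with blanks, plus the center and spread of known values.
gifts = [50, None, 30, 80, None, 45, 60, None, 25, 70]
known = [g for g in gifts if g is not None]
mu, sigma = statistics.mean(known), statistics.stdev(known)

# Generate m completed datasets, each with different random draws
# for the blanks; analyze each one; then pool the results.
m = 10
pooled = []
for _ in range(m):
    filled = [random.gauss(mu, sigma) if g is None else g for g in gifts]
    pooled.append(statistics.mean(filled))

estimate = statistics.mean(pooled)   # pooled point estimate
between = statistics.stdev(pooled)   # between-imputation variability
```

The spread across the m estimates (`between`) is what mean, median, and regression imputation all throw away: it measures how much your answer depends on values you never actually observed.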

Every imputation method involves a trade-off between simplicity and accuracy. Deletion is easy but wasteful. Mean and median imputation are quick but distort spread. Regression imputation is tailored but overconfident. Multiple imputation is thorough but complex. The right choice depends on how much data is missing, why it's missing, and what you plan to do with the results. For a quick board summary, median imputation might be fine. For a grant evaluation where precision matters, multiple imputation is worth the effort.

Filling in missing data is never neutral. Every method encodes an assumption about what the gaps would have contained, and that assumption shapes every conclusion you draw afterward.


See It

Switch between imputation methods to see how each one fills the gaps (orange dots) and reshapes the distribution's center and spread.


Reflect

Think about a dataset your organization uses regularly. How many fields have missing values? When you run reports, are those rows silently dropped, or is something else happening? Do you know which approach your tools use by default?

If you filled in the blanks with the average, how would that change the story the data tells about the spread of your donors' behavior? Would it make things look more uniform than they really are?