Measuring Statistical Validity in Direct Response Fundraising

Statisticians are equipped with a broad range of detailed tests and tightly controlled procedures to determine the probability of an outcome.

But direct response testing isn’t done in a laboratory. Our lab is the messy, busy, ever-changing world … which means it’s difficult to design a perfect test.

In this climate, there are always variables beyond our control that affect our results, whether it’s the mood of our donors or the media attention our cause is attracting. We saw those external variables at work last spring, when one client was about to open a major new facility. In the months before the big day, this organization was the focus of much excitement and nearly daily articles and blog posts. And the direct mail package this organization mailed to acquire new donors performed incredibly well. But was this an accurate test of how this package will perform over the long term? Probably not.

In the real world, we also face limits on the resources available to us – including the quantity of names we can test.

Although we can’t always create absolutely perfect tests, we can still design ones that generate practical, useful and actionable results – and statistics can help. Here are a few tips to keep in mind to improve the statistical validity of your testing:

1. It’s not how many people you solicit; it’s how many responses you receive. In order to have a statistically valid test, you’ll need 100 responses for each test cell – 200 gifts for a simple A/B test. For a donor renewal effort with a projected 5% response rate, this means soliciting 4,000 names (2,000 per cell) for a valid test. In a new donor acquisition effort with a 1% response rate, you’d need to solicit 20,000 names (10,000 per cell).

Does this mean you shouldn’t test if you can’t get 100 responses? Not at all. We often test smaller cells – when we’re testing a new list in acquisition, for example.

Just knowing when your test isn’t statistically valid can help ensure that you’re not relying on flawed data. When your quantities are too low to be valid …

Test fewer elements. Is your appeal only going to get 100 responses? Ditch the four-way test and try a simple split test for more reliable results.

Don’t extrapolate. When you don’t test a statistically valid quantity, you can’t assume a larger group will behave the same way. Don’t make the mistake of ordering 50,000 of that new list just because your first order of 5,000 performed well.

Plan to retest. Always retest. More on this later.
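The sample-size arithmetic behind the 100-responses-per-cell rule of thumb in tip 1 can be sketched in a few lines. The helper name and the 100-response threshold come from the guideline above; treating it as a parameter is our choice.

```python
import math

def names_needed(projected_response_rate, responses_per_cell=100, cells=2):
    """Total names to solicit so each test cell can be expected to
    reach the target number of responses (rule of thumb: 100/cell)."""
    per_cell = math.ceil(responses_per_cell / projected_response_rate)
    return per_cell * cells

# Donor renewal at a projected 5% response:
# 2,000 names per cell, 4,000 total for an A/B test.
print(names_needed(0.05))   # 4000

# New-donor acquisition at a projected 1% response:
# 10,000 names per cell, 20,000 total.
print(names_needed(0.01))   # 20000
```

The same helper extends to a four-way test by setting `cells=4` – which is exactly why low-quantity efforts should stick to a two-way split.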

2. Beware of anomalies. As you review your results, keep an eye out for outliers. Did you get a $5,000 gift to your regular membership appeal? A $1,000 gift to your new member acquisition? Large gifts will artificially increase your average gift and skew the results of your test.

To get a better sense of which results are repeatable, remove gifts from your analysis that fall outside of what is normal for your organization. As with all things statistical, there are fancy ways to identify outliers. But since we’re talking practical models for direct response, you can also look at your average gift and make some reasonable determinations. For an organization with an average acquisition gift of $75, we’d look carefully at all gifts above $200 to see if our test results changed when those gifts were excluded.
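The with-and-without comparison described above is easy to script. A minimal sketch, where the gift amounts and the $200 cutoff are hypothetical examples, not real results:

```python
def average_gift(gifts):
    """Simple mean gift amount."""
    return sum(gifts) / len(gifts)

# Hypothetical acquisition gifts, including one $5,000 outlier.
gifts = [50, 75, 60, 100, 85, 75, 5000]

cutoff = 200  # review/exclude gifts above what's normal for the org
typical = [g for g in gifts if g <= cutoff]

print(f"Average with outlier:    ${average_gift(gifts):,.2f}")
print(f"Average without outlier: ${average_gift(typical):,.2f}")
```

Here a single large gift inflates the average from roughly $74 to nearly $780 – a reminder of how much one anomaly can skew a test cell.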

3. Mind the LARGE differences. Don’t just look at whether your test gets a higher response rate than your control. Pay attention to the magnitude of difference, as this can indicate how likely you are to get the same result again.

Small differences, such as a lift in response from 4% to 4.25%, are more likely to be caused by random variation. When we see a larger lift – a response rate increased from 4% to 5% – we can have more confidence that the result would hold if the test were repeated.

Even with a huge lift, we can’t assume we’ll always get the same result. When segment A generates a 4% response and segment B generates a 9% response, we can’t know that those response rates will stay the same. What we can say is that B is highly likely to outperform A if the test were repeated.
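One standard way statisticians put a number on "how likely is this lift to be real" is a two-proportion z-test – our suggestion here, not something the tips above prescribe. A sketch using the article’s 4% vs. 4.25% and 4% vs. 5% examples, with illustrative cell sizes of 2,000 names each:

```python
import math

def two_proportion_z(responses_a, n_a, responses_b, n_b):
    """z-statistic for the difference between two response rates.
    Values above ~1.96 correspond to conventional 95% confidence."""
    p_a, p_b = responses_a / n_a, responses_b / n_b
    p_pool = (responses_a + responses_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Small lift: 4% vs. 4.25% on 2,000 names per cell
print(round(two_proportion_z(80, 2000, 85, 2000), 2))

# Larger lift: 4% vs. 5% on the same cell sizes
print(round(two_proportion_z(80, 2000, 100, 2000), 2))
```

Note that at these quantities even the 4%-to-5% lift falls short of the 1.96 threshold – which is exactly why the next tip, retesting, matters.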

4. Test, Test and Retest. Testing often raises as many questions as it answers. Perhaps a new list looks promising, but doesn’t yield enough responses to make a statistically valid assessment. Or maybe a change to your package increases response by an amount that’s meaningful to the organization but small enough that you can’t be confident that it is repeatable.

That’s where retesting comes into play. If initial tests indicate a winner that’s not statistically valid, retesting can affirm – or make you rethink – testing results.

As you design and analyze your next test, keep these guidelines in mind to ensure you’re drawing the most accurate conclusions you can from your test results. And if you have any other tips you use, post your comments here or email us at nthfactor@nthfactor.com.
