The right and wrong way to do email A/B testing

Most email software is designed to mislead and skew your results. In this post, we'll cover why that is and the right way to think about email A/B testing.

VP of Global Marketing, Dynamic Yield

Stop me if you’ve heard this one before…

To get the most out of your email list, you’ll want to start A/B testing. And with just a few tweaks to your subject lines, layouts, and copy, your emails will start generating results you never thought possible.

These are the key talking points you’ll hear from email service providers (ESPs) and marketing automation vendors as they subtly push you towards buying their add-on modules for email A/B testing.

Unfortunately, I’ve got some bad news: most A/B testing software for email is designed to actively mislead you and skew your results.

Chances are that your email marketing software has at least tried to sell you on the idea that they have great tools for email A/B testing. If you’ve used these tools, you might have even been impressed by what they appear to do. A quick look under the hood, however, reveals these “tests” aren’t exactly what they seem. Combine poor statistical methodology with underpowered software, and you get misleading results.

How your email marketing software is misleading you

Most marketers are impressed by (and sold on) a feature where you can run a test using a small sample of your full email list. It runs the test until one of the two variants pulls ahead on a metric such as open rate, and then it declares that email to be the winner of the A/B test. At this point, it auto-deploys the winning email to the rest of your list. Sounds simple enough, but from a statistical point of view, this is a terrible idea.

A quick primer on statistical significance

The problem is a lack of statistical significance. You might be asking, “statistical what?”

Simply put, statistical significance measures how confident you can be that an experiment’s result reflects a real effect rather than random chance.

To achieve an acceptable level of statistical significance, a data set needs to be large enough, and generally the smaller the effect being measured, the larger the data set needs to be.

Statistical significance is expressed as a confidence level or as what’s called a p-value. So, let’s say you want to be 95% sure that the change you’re making to your emails will have a positive effect. Then, you’d want to see test results that show a confidence level of at least 95%, which corresponds to a p-value of 5% or less.

As an example, let’s say you have a segment of 4,000 people, and you decide to run an A/B test between two subject lines:

[Sale] Get Our Products For 50% Off


Sale: Get Our Products For 50% Off

The subject line using brackets goes to 2,000 people and is opened by 265 of them, while the non-bracket version goes to the other 2,000 and is opened by 250. That’s an open rate of 13.25% and 12.5% respectively.

So, the bracket subject line with a 13.25% open rate is the winner, right?

According to most email software providers, it is the winner and should be automatically deployed to the remainder of your list.

BUT, if you run these numbers through a statistical significance calculator (using a one-sided test), you’ll see that there is only about 76% confidence that the bracket version is an improvement over the non-bracket version. That’s a p-value of a whopping 24%.
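For readers who want to check the arithmetic themselves, here is a minimal sketch of the calculation behind that figure. It assumes the 4,000-person segment was split evenly (2,000 recipients per subject line, which is what the quoted open rates imply) and uses a pooled, one-sided two-proportion z-test. The function name is made up for illustration, and your calculator of choice may use a slightly different method.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(opens_a, sent_a, opens_b, sent_b):
    """P-value for "A's open rate beats B's" (pooled two-proportion z-test)."""
    rate_a = opens_a / sent_a
    rate_b = opens_b / sent_b
    # Pooled open rate under the assumption that A and B perform the same
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    z = (rate_a - rate_b) / std_err
    # Probability of seeing a gap at least this large by chance alone
    return 1 - NormalDist().cdf(z)


# Bracket version: 265 opens of 2,000. Non-bracket version: 250 opens of 2,000.
p = one_sided_p_value(265, 2000, 250, 2000)
print(f"p-value: {p:.2f}, confidence: {1 - p:.0%}")
# prints "p-value: 0.24, confidence: 76%"
```

A two-sided test, which asks only whether the two subject lines differ at all, would report an even weaker result on the same numbers.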

To put it another way, if you were to automatically deploy the “winner” to the remainder of your list, there’s roughly a 1 in 4 chance that you’re actually hurting your response rates.

And that’s why it’s so dangerous when these A/B testing tools don’t take into account statistical significance. The “winning” email is the one with the higher open rate, and it will automatically get blasted out to the rest of your list. A more sophisticated test might reveal that the winning email actually performs far worse, but there’s no way to know using most of these systems. The software itself enables and rationalizes bad decision making.

It might seem strange that these ESPs actively promote this kind of flawed testing. The thing is, most ESPs and marketing automation vendors are not really interested in providing the best email testing tools. Generally speaking, their clients have small email lists. Those clients have heard that email marketing is important, and as a result, it’s pretty easy to sell those clients on an email testing product.

To these companies, what matters most isn’t providing rigorous statistical evidence to their customers – heck, I’d wager that most buyers of this software don’t even know that they should be looking for statistical significance.

As long as the software generates a “winning” and “losing” result, most of their users will accept those results as meaningful, because what they’re really looking for is any result.

Think of it this way: when you’re a marketer performing A/B tests, what you’ve been tasked with generating is insight. And any result, winner or loser, is still insightful. But an inconclusive result? Too many of those and you’ll stop performing A/B tests and, by extension, stop paying for that A/B testing add-on.

So, most of these software vendors know that any result is better than no result (because you’ll feel like you’re making progress). And, they’re banking on the fact that most marketers aren’t trained in statistics. Net result? It’s like you’re going to a casino where you always win, so you keep coming back.

The next question is, how do you squeeze real results out of your email A/B testing software?

The right way to think about email A/B testing

First of all, if your email marketing platform doesn’t provide a measure of statistical significance, you MUST plug your results into a third-party statistical significance calculator.

Next, never set your email marketing software to auto-deploy “winners.” This must be done manually.

And the truth is that, realistically, you may have a hard time getting significance out of any test with a list of fewer than 50,000 people. Only a fraction of that list will open the email (10% is typical), giving you roughly 5,000 opens in total. If it’s a true 50/50 test, any difference between the two variants is likely to be well within the margin of error at that scale.
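To put a rough number on that margin of error, here is a small sketch. It assumes a 50,000-person list split 50/50 (25,000 per variant), a 10% baseline open rate, and a two-sided 95% confidence level with the usual normal approximation. The function name and parameters are illustrative, not taken from any particular tool.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(base_rate, per_variant, confidence=0.95):
    """Two-sided margin of error for the gap between two open rates."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the difference when both variants share base_rate
    std_err = sqrt(2 * base_rate * (1 - base_rate) / per_variant)
    return z * std_err


moe = margin_of_error(base_rate=0.10, per_variant=25_000)
print(f"Gaps under {moe * 100:.2f} percentage points are within the noise")
# prints "Gaps under 0.53 percentage points are within the noise"
```

Under these assumptions, a subject line has to move the open rate from 10% to better than about 10.5% before the test can tell it apart from chance.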

It’s not that you can’t test 50,000 email addresses, it’s just that the test itself might not deliver statistically significant results. To see meaningful results, you may need to run a given test multiple times. This approach has its own problems — most statisticians will frown upon the idea — but there’s a way to do this that works.

Repeated A/B email tests are most effective when they are used to explore thematic concepts or broad categories. You wouldn’t want to test the same two subject lines over and over, but you could test similar kinds of subject lines. For instance:

  • Do emojis in the subject line increase open rates?
  • Do brackets like [Sale!] or [Special Offer] increase open rates?
  • Will open rates increase if certain words are EMPHASIZED in caps?
  • Does using the recipient’s first name increase open rates?

These are general concepts that are easy to apply to multiple tests, and the results should be fairly consistent across those tests. By aggregating the results, it’s possible to determine winning and losing approaches even with a relatively small email list.
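One way to do that aggregation is sketched below. It uses Stouffer’s method, which combines the z-scores of several independent tests into a single score. This is one common choice rather than the only valid one, and the four test results are hypothetical numbers invented purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_score(opens_a, sent_a, opens_b, sent_b):
    """Pooled two-proportion z-score for "A's open rate beats B's"."""
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    return (opens_a / sent_a - opens_b / sent_b) / std_err


# Hypothetical results from four separate "brackets vs. no brackets" sends:
# (opens with brackets, sent, opens without brackets, sent)
tests = [
    (265, 2000, 250, 2000),
    (281, 2000, 252, 2000),
    (270, 2000, 244, 2000),
    (259, 2000, 240, 2000),
]

z_scores = [z_score(*t) for t in tests]
for z in z_scores:
    print(f"single test p-value: {1 - NormalDist().cdf(z):.2f}")

# Stouffer's method with equal weights: sum the z-scores, divide by sqrt(k)
combined_z = sum(z_scores) / sqrt(len(z_scores))
combined_p = 1 - NormalDist().cdf(combined_z)
print(f"combined p-value: {combined_p:.3f}")
```

With these made-up numbers, no single test clears the 5% bar, but the combined result does. The approach only holds up if the tests are independent, all probe the same idea, and are chosen before you look at the results. Cherry-picking the tests that happened to go your way brings back the problem you were trying to avoid.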

While it is possible to A/B test every aspect of your emails, I also suggest focusing most of your efforts on the subject line. If your list is small, it’s your best bet for generating meaningful results quickly. Whereas an open rate could be 20%, the typical click-through rate (CTR) is more like 2.5%. Because far fewer recipients click than open, it is much harder to reach statistical significance on CTR.
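The gap between opens and clicks can be estimated with a standard sample-size formula for comparing two proportions. The sketch below assumes you want to detect a 10% relative lift with a one-sided 5% significance level and 80% power. All three settings are assumptions chosen for illustration, as is the function name.

```python
from math import ceil
from statistics import NormalDist


def required_per_variant(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate recipients needed per variant to detect the given lift."""
    new_rate = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + new_rate * (1 - new_rate)
    return ceil((z_alpha + z_beta) ** 2 * variance / (new_rate - base_rate) ** 2)


n_open = required_per_variant(base_rate=0.20, relative_lift=0.10)
n_click = required_per_variant(base_rate=0.025, relative_lift=0.10)
print(f"Open rate test: about {n_open:,} recipients per variant")
print(f"Click-through test: about {n_click:,} recipients per variant")
print(f"Ratio: {n_click / n_open:.1f}x")
```

Under these assumptions, the click-through test needs roughly ten times as many recipients as the open rate test to detect the same relative improvement.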

Along the same lines, many companies want to test for emails that lead directly to purchases. This kind of testing is even more complex and has a much lower activity rate than clicks. To generate statistically significant results, you’re going to need a truly massive email list.

Here’s a useful tactic: Focus on increasing your open rate first. You need to be careful with this approach because it’s also the same basic strategy that spammers and scammers use. As long as you’re ethical about it, however, it can be a highly effective approach for increasing your open rates. Generally speaking, the more people that open your emails, the higher your click-through and purchase rates will be.

In my experience, the “best” subject lines are the ones that — without being misleading — pique the reader’s curiosity. They also tend to be somewhat vague…

A good example of this would be changing a subject line from “Quick question about your social media monitoring” to “Quick question.”

There’s going to be some blowback from that kind of subject line, and you may even lose a few subscribers from it. Those losses will be more than offset by the number of people who open the email and only THEN get persuaded to click through. This makes it an ideal strategy for getting more people into the funnel as efficiently as possible.

When people start email testing, it doesn’t take long to pluck the lowest-hanging fruit. It’s easy to tweak subject lines and boost open rates by 5% at the start. Over time, however, your tests will start to become less exciting, often yielding improvements of 1% or less. This can result in an attitude of “When are we going to be done with all this A/B testing? When do we figure out exactly what to put into the email to get this to work?”

A/B testing isn’t really about being “done.” There’s no such thing as a perfect strategy, tool, or weapon. Success always depends on context, and your goal should be to create a varied arsenal of techniques that will provide you with the best option in any given situation.