# Why Reaching and Protecting Statistical Significance is So Important in A/B Tests

Delve deeper into the mechanics of a solid A/B test. To understand if your test is significant enough to declare a winner, please use our free Bayesian A/B Testing Calculator.

A/B testing is one of the most widely used techniques for web page optimization. It allows marketers and website operators to make smarter, more data-driven decisions about their creative instincts. Proposed changes are not judged purely by intuition (or by someone having better arguing skills at some work meeting…), but rather based on specific desired goals – whether immediate (such as a button’s CTR) or longer-term (such as becoming a paying customer). At the same time, tests protect against making a major mistake that would dramatically harm engagement. Let’s delve deeper into the mechanics of “classic” A/B testing, understand the actual meaning of statistical significance and discover the pitfalls threatening the validity of your test results. I’ll finish by presenting a few of the optimization alternatives to classic A/B testing offered by Dynamic Yield.

Intuitively, the procedure is simple: first we create some variation of our original web page (a.k.a. the baseline), then we split the visitors’ traffic randomly between two variations (randomly allocate visitors according to some probability) and finally we collect data regarding our web page performance (metrics). After a while, we look at the data, pick the variation that performed best and cancel the one that performed poorly. Sounds intuitive and straightforward? No! We have to keep in mind that at the moment we pick a variation, we are generalizing the measures we collected so far to the entire population of potential visitors. This is a significant leap of faith, and we have to do it in a valid way, otherwise, we are often bound to make bad decisions that will harm our web page in the long run. The process of gaining validity is called hypothesis testing, and the validity we seek is called statistical significance.

In hypothesis testing, we begin by phrasing a claim called “the null hypothesis,” which states the status quo, such as “the original page (baseline) has the same CTR as our newly designed page.” The procedure then sets out to see whether we can reject that claim as being highly improbable given the data we collected.

## How do we do that?

First, we have to understand where we can go wrong. There are two ways. In the first, we wrongly reject the null hypothesis: after a brief look at the data, we claim that there is a difference in performance between the new and the old variation of our web page, while there actually is no such difference, and the difference we observed is due to pure chance. This type of mistake is called a “type I error” or “false positive”.

The second possible pitfall is that, after a brief look at the performance of each variation so far, we’ll see no major difference and wrongly conclude that there really is none. This error is called “type II error” or “false negative.”

## How do we avoid these mistakes?

The short answer is: set a proper sample size. In order to determine the proper sample size, we have to predefine a few parameters for our test:

To avoid “false positive” mistakes, we need to set the significance level (often loosely called “statistical significance”). This should be a small positive number, often set to 0.05, which means that, given a valid model, there is only a 5% chance of making a type I mistake. In plain words, if no difference in performance actually exists between the two variations, there is a 5% chance that we will nevertheless detect one. The corresponding confidence level is 1 − 0.05 = 95%, which is why this setting is commonly referred to as having “95% confidence”.

Many people conducting tests (and most tools available today) would stop with that first parameter, but to avoid “false negative” mistakes we actually need to define two more parameters: one is the minimal difference in performance we wish to detect (if indeed one exists), and the other is the probability of detecting that difference, if it exists. This last quantity is called “statistical power,” and it is often set to a default of 80%. The required sample size is then calculated using these three quantities (an online calculator can be found here). Although this may seem exhausting, and the resulting sample size often seems way too high, the standard approach to testing requires following this procedure carefully; otherwise we are bound to fail. In fact, even if we follow the above to the letter, we might still observe an incorrect outcome. Let’s better understand why.
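To make the role of the three quantities concrete, here is a minimal sketch of a sample size calculation. It assumes the common normal-approximation formula for comparing two proportions with a two-sided test; the article does not say which formula the linked calculator uses, so treat this as one reasonable choice. The function and parameter names (`required_sample_size`, `baseline_rate`, `min_detectable_diff`) are illustrative only.

```python
import math
from statistics import NormalDist


def required_sample_size(baseline_rate, min_detectable_diff,
                         alpha=0.05, power=0.80):
    """Approximate visitors needed PER VARIATION (two-sided z-test).

    Assumption: normal approximation for the difference of two proportions.
    baseline_rate        -- conversion rate of the original page, e.g. 0.10
    min_detectable_diff  -- smallest absolute lift worth detecting, e.g. 0.02
    alpha                -- significance level (type I error rate)
    power                -- probability of detecting the lift if it exists
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_diff
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # critical value for power

    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical numbers: 10% baseline CTR, detect an absolute lift of 2%.
    print(required_sample_size(0.10, 0.02))
```

With these hypothetical inputs the result is a few thousand visitors per variation. Halving the minimal detectable difference roughly quadruples the requirement, which is why the numbers often feel "way too high".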

## So…I did all the calculations; can I trust my results completely?

While hypothesis testing looks promising, it is in reality often far from bulletproof, because it relies on certain hidden assumptions that are often not satisfied in real-life scenarios. The first assumption is usually pretty solid: we assume that the “samples” (namely the visitors we expose to the variations) are independent of each other, and their behavior is not inter-dependent. This assumption is usually valid, unless we expose the same visitor repeatedly and count these occurrences as different exposures.

The second assumption is that the samples are identically distributed. Simply stated, this means that the probability of converting is the same for all visitors. This, of course, is not the case. The probability of converting may depend on time, location, user preferences, referrer, etc. For example, if some marketing campaign is running during the experiment, it may cause a surge of traffic from Facebook. This may cause a drastic and sudden change in CTRs (click-through rates), because people coming from that particular campaign have different characteristics than the general visitor population. In fact, some of the more advanced optimization techniques we provide at Dynamic Yield depend on these differences.

The last assumption is that the measures we sample, e.g. the CTR or conversion rate, are normally distributed. This might sound like an obscure mathematical term to some, but the “magic” confidence level formulas depend on this assumption, which is much shakier and often does not hold. In general, the bigger the sample size and the higher the number of conversions we have, the better this assumption holds – thanks to a mathematical theorem called the central limit theorem.

## OK, I get that the math is not 100% bulletproof, but what are the pitfalls I should really watch out for?

There are two main pitfalls. First, A/B testing platforms often offer a real-time display of the test results collected so far. On one hand, this gives the test operator transparency and a feeling of control. On the other hand, it may cause us to overreact to results that are not yet fully cooked. Repeatedly looking at the intermediate results and stopping the test the first moment the required significance level is reached, before the pre-defined sample size has been collected, is a sure recipe for making a mistake. Not only does this cause a statistical bias toward detecting a difference that is not there, but that bias is also not something we can easily quantify in advance and correct for (for example, by demanding a higher significance level).
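The effect of peeking can be illustrated with a small simulation. The sketch below runs many "A/A" tests, in which both variations share the same true conversion rate, so every detected difference is a false positive. It compares stopping at the first significant peek with testing only once at the planned sample size. All numbers and names (`false_positive_rates`, `peeks`, `runs`) are hypothetical, and a pooled two-proportion z-test is assumed as the significance test.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all converted): no evidence either way
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(rate=0.10, n_per_arm=2000, peeks=20,
                         runs=500, alpha=0.05, seed=0):
    """Return (rate when stopping at first significant peek,
               rate when testing once at the planned sample size)."""
    rng = random.Random(seed)
    step = n_per_arm // peeks
    peeking_hits = 0
    final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        significant_at_some_peek = False
        p_value = 1.0
        for peek in range(1, peeks + 1):
            # Both arms convert at the SAME true rate (an A/A test).
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = peek * step
            p_value = z_test_p_value(conv_a, n, conv_b, n)
            if p_value < alpha:
                significant_at_some_peek = True
        peeking_hits += significant_at_some_peek
        final_hits += p_value < alpha  # p-value at the planned sample size
    return peeking_hits / runs, final_hits / runs


if __name__ == "__main__":
    with_peeking, without_peeking = false_positive_rates()
    print("stop at first significant peek:", with_peeking)
    print("single test at planned size:  ", without_peeking)
```

In runs of this kind, the single final test stays near the nominal 5%, while stopping at the first significant peek produces false positives several times as often.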

The second pitfall is having expectations that are too high regarding the performance of the winning variation well after the test is over. Due to a statistical effect called “regression toward the mean,” the performance of the winning variation over time is often not as good as it was during the test. Simply put, the winning variation may have won not just because it really is somewhat better, but also because it was to some extent “lucky.” That luck often ends up being averaged out over time, and as a result performance seems to decline.
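The "luck" component can be isolated in a toy simulation. In the sketch below, all variations have exactly the same true conversion rate, so any winner won purely by chance, yet the winner's measured rate during the test is systematically above the true rate it will revert to afterwards. The function name and all numbers are hypothetical.

```python
import random


def average_winner_rate(true_rate=0.10, visitors=1000, variations=2,
                        runs=1000, seed=1):
    """Average OBSERVED rate of the best-looking variation across many tests
    in which every variation has the same true conversion rate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        observed = [
            sum(rng.random() < true_rate for _ in range(visitors)) / visitors
            for _ in range(variations)
        ]
        total += max(observed)  # we "declare" the best-looking one the winner
    return total / runs


if __name__ == "__main__":
    print("true rate:                   0.10")
    print("winner's rate during tests: ", average_winner_rate())
```

The gap between the two printed numbers is pure selection effect. It shrinks as the number of visitors per variation grows, and widens as more variations compete.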


## Do we have an alternative to A/B testing?

We sure do. The world of web page optimization has a very wide range of use cases, and A/B testing fits only some of them. It suits the case where we test a web page with a long-standing performance baseline, and the change we want to make is supposed to stay in place and keep performing long after the test is over, and to work the same for all users. In that case it is important, as mentioned, to measure performance deeply and exactly, in a fair and balanced manner between the different alternatives. But that is often simply not possible. For example, if we wish to test which of three titles works best for an article on a news website, we know that the article will be featured on our front page for only a few hours. This means we have to quickly detect the better-performing title and also exploit that knowledge within the timeframe of a few hours. In this case, we can use a multi-armed bandit approach, which helps us do just that. In short, this approach continuously splits the traffic between the variations according to the performance measured so far and the level of certainty gained each step of the way. It trades off some of the certainty about which is “really” the best variation in exchange for quicker convergence. This is the first level of optimization we offer at Dynamic Yield.
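One well-known way to split traffic by both measured performance and remaining uncertainty is Thompson sampling. The sketch below is a generic illustration of that technique, not a description of Dynamic Yield's actual algorithm. The class name, the uniform Beta(1, 1) prior, and the three hypothetical title CTRs are all assumptions made for the example.

```python
import random


class ThompsonSamplingBandit:
    """Beta-Bernoulli Thompson sampling over a fixed set of variations."""

    def __init__(self, n_variations, seed=None):
        self.clicks = [0] * n_variations      # successes per variation
        self.no_clicks = [0] * n_variations   # failures per variation
        self.rng = random.Random(seed)

    def choose(self):
        # Draw one plausible CTR per variation from its Beta posterior
        # (uniform Beta(1, 1) prior) and serve the highest draw. Variations
        # with little data have wide posteriors, so they still get explored.
        draws = [self.rng.betavariate(c + 1, n + 1)
                 for c, n in zip(self.clicks, self.no_clicks)]
        return draws.index(max(draws))

    def update(self, variation, clicked):
        if clicked:
            self.clicks[variation] += 1
        else:
            self.no_clicks[variation] += 1


def simulate(true_ctrs, visitors=5000, seed=42):
    """Serve `visitors` impressions; return impressions per variation."""
    env = random.Random(seed + 1)  # separate stream for simulated visitors
    bandit = ThompsonSamplingBandit(len(true_ctrs), seed=seed)
    served = [0] * len(true_ctrs)
    for _ in range(visitors):
        v = bandit.choose()
        served[v] += 1
        bandit.update(v, env.random() < true_ctrs[v])
    return served


if __name__ == "__main__":
    # Hypothetical CTRs for three article titles.
    print(simulate([0.03, 0.05, 0.10]))
```

Unlike a fixed 50/50 split, the share of traffic sent to the weaker titles shrinks as evidence accumulates, which is the trade-off described above: less certainty about the exact ranking in exchange for exploiting the likely winner sooner.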

A deeper level of optimization aims toward true personalization. Either manually or automatically, we might wish to expose different users to different variations. As an example of manual control, we may want users visiting from the Netherlands on Queen’s Day to get an orange background on the web page. Such manual control makes a lot of sense in many cases, but doesn’t scale very well: say I also have a special promotion for visitors coming in from Facebook; what should I show a Dutch visitor coming from Facebook? Rules can get complex fast as you begin to think up new variations and audience groups, and there’s an element of educated guessing here: you think they should hit the mark, but who knows?

Thus, a different approach that may work better in some cases is automated personalization. For example, suppose we are running a cooking recipes website, and we have a box on our web page where we wish to show the user a recommended recipe based on his or her past visits to recipe pages, segments, or tags. In this case, Dynamic Yield’s personalized optimization engine can do the job.

## To conclude:

As effective as A/B tests can be, they can be misleading if not properly conducted. Reaching proper statistical significance is critical for reliable results. To get there, we need to set the required parameters, estimate the required sample size, and stick to it. Checking the results frequently before reaching that sample size, and acting on them, is a classic recipe for losing validity. If you are looking for shorter-term optimization, you should look into other optimization methods, such as multi-armed bandits or machine-learning-based personalization, both featured by Dynamic Yield. In fact, the approach we take is multi-layered: tests start with automatic traffic tweaks for all visitors, then personalization kicks in when there’s enough data. All the while, you can apply manual rules, as simple or as complex as you need, to get better control of who gets what and when.


• brianlang

Hey Idan,

Great article! I had just a few points:

I don’t think “regression toward the mean” should be used in
this context, as there are all sorts of reasons besides chance/luck as to why a
“winning” variation’s performance may degrade over time, such as change in % of
a site’s marketing channel mix, bias from returning customers, etc.

I wouldn’t agree that all MABs split traffic by the “level
of certainty gained” – while this may be true with UCB (for example), I don’t
think the same could be said about a more basic MAB such as Epsilon-Greedy.

You’ve noted MAB’s as an alternative approach to running an
A/B test – I’d also recommend an alternative, Bayesian approach to interpreting
results – using a Monte Carlo approach, you can answer the question, “What is
the probability that the Treatment beats the Control given the data collected
so far.”
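A minimal sketch of that Monte Carlo idea, assuming binary conversions and a uniform Beta(1, 1) prior on each conversion rate (both assumptions, as is the function name): draw plausible rates for Control and Treatment from their posteriors and count how often the Treatment draw is higher.

```python
import random


def prob_treatment_beats_control(conv_c, n_c, conv_t, n_t,
                                 draws=20000, seed=0):
    """Estimate P(treatment rate > control rate | data) by simulation.

    conv_c, n_c -- conversions and visitors for the Control
    conv_t, n_t -- conversions and visitors for the Treatment
    Assumption: Beta(1, 1) prior, so each posterior is
    Beta(conversions + 1, non-conversions + 1).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        control = rng.betavariate(conv_c + 1, n_c - conv_c + 1)
        treatment = rng.betavariate(conv_t + 1, n_t - conv_t + 1)
        wins += treatment > control
    return wins / draws


if __name__ == "__main__":
    # Hypothetical data: 100/1000 conversions vs. 120/1000 conversions.
    print(prob_treatment_beats_control(100, 1000, 120, 1000))
```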

Another peril worth mentioning is the issue arising from
multiple comparisons, either through multiple Treatments being compared to the
Control, or through segmenting the results by different channel sources, etc,
and not accounting for the FWER / FDR.
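As a rough illustration of what accounting for FWER or FDR can look like, here is a sketch of two standard corrections applied to a list of p-values, one per Treatment or segment: Bonferroni (controls the FWER) and Benjamini-Hochberg (controls the FDR). The p-values in the example are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / (number of comparisons)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns reject flags in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or before that cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values from five Treatment-vs-Control comparisons.
    p_values = [0.001, 0.012, 0.025, 0.035, 0.2]
    print(bonferroni(p_values))
    print(benjamini_hochberg(p_values))
```

Bonferroni is the stricter of the two, so it declares fewer winners; Benjamini-Hochberg tolerates a controlled share of false discoveries in exchange for more power.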

Thanks again for the great article!

Brian

• Idan Michaeli

Hey Brian,

As for regression to the mean, I agree that there are a number of other reasons why performance
of a variation can degrade over time, some of which I mentioned later on in the article.
However, regression to the mean is a reason which is inherent, and is expected to be there even in an ideal case, where all other conditions and assumptions remain valid
(samples are iid, meaning no returning visitors, variations remain as relevant for new visitors over time,
and population remains the same).

As for your other two remarks, I tried to keep things simple 🙂
I may discuss this in future posts.

Thanks again!

Idan

• Petar

Hi Idan, very nice article!
I have one question: how do you analyze the revenue gained in an A/B test? If we assume a power-law distribution of revenue, how can we compare the groups and declare a winner?

Thanks,
Petar

• Idan Michaeli

Hey Petar,
I noticed that there are not many discussions out there about how to perform revenue-based A/B testing.
There is a good reason for that. Doing A/B tests trying to directly optimize revenue is a tricky business and more complicated than traditional CTR-based experiments. One of the main difficulties is the need to measure both the mean and the variance of your sample. I do not want to give you a partial answer, so I plan this to be the subject of a future blog post. Please stay tuned.
