Why reaching and protecting statistical significance is so important in A/B tests

Delve deeper into the mechanics of “classic” A/B testing, become familiar with the actual meaning of statistical significance, and discover the pitfalls threatening the validity of your test results.

Co-Founder & Chief Data Scientist at Perceptive AI

A/B testing is one of the most widely used techniques for web page optimization. It allows marketers and website operators to make smarter and more data-driven decisions about their creative instincts. Proposed changes are not judged purely by intuition (or by someone having better arguing skills at work) but rather based on specific desired goals – whether immediate (such as a button’s CTR) or longer-term (such as becoming a paying customer). At the same time, tests protect against making a major mistake that would dramatically harm engagement.

Let’s delve deeper into the mechanics of “classic” A/B testing, become more familiar with the actual meaning of statistical significance, and discover the pitfalls threatening the validity of your test results. I’ll finish by presenting a few of the optimization alternatives to classic A/B testing offered by Dynamic Yield.

Intuitively, the procedure is simple: First, we create some variation of our original web page (a.k.a. the baseline). Then, we split the traffic randomly between two variations (randomly allocating visitors according to some probability), and finally, we collect data regarding our web page performance (metrics). After a while, we look at the data, pick the variation that performed best, and cancel the one that performed poorly.

Sounds intuitive and straightforward? No!

We have to keep in mind that at the moment we pick a variation, we are generalizing the measures we collected so far to the entire population of potential visitors. This is a significant leap of faith, and we have to do it in a valid way; otherwise, we are bound to make bad decisions that will harm our web page in the long run. The process of gaining validity is called hypothesis testing, and the validity we seek is called statistical significance.

In hypothesis testing, we begin by phrasing a claim called the “null hypothesis,” which states the status quo, such as “the original page (baseline) has the same CTR as our newly designed page.” The procedure then is to see if we can reject that claim as being highly improbable.

How do we do that?

First, we have to understand where we can go wrong. There are two ways. In the first, we wrongly reject the null hypothesis: after a brief look at the data, we claim that there is a difference in performance between the new and the old variation of our web page, while there actually is no such difference, and the difference we observed is due to pure chance. This type of mistake is called a “type I error” or “false positive.”

The second possible pitfall is that after a brief look at the performance of each variation so far, we see no major difference and wrongly conclude that there really is none. This mistake is called a “type II error” or “false negative.”

How do we avoid these mistakes?

The short answer is: set a proper sample size. In order to determine the proper sample size, we have to predefine a few parameters for our test:

To avoid “false positive” mistakes, we need to set the significance level (the threshold for “statistical significance”). This should be a small positive number, often set to 0.05, which means that, given a valid model, if there is actually no difference in performance between the two variations, there is only a 5% chance of wrongly detecting one (a type I error). The complement of this number is the confidence level, which is why this setting is commonly referred to as having “95% confidence.”
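As an illustration of how the 0.05 threshold is applied, here is a minimal sketch of a two-sided two-proportion z-test, one standard way to test the null hypothesis that two variations share the same CTR. The function name and the example counts are made up for illustration, and the test relies on the normal approximation discussed later in this article.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-sided p-value for the null hypothesis that both CTRs are equal."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Under the null hypothesis both variations share one pooled rate.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 1.0  # no clicks (or only clicks) everywhere: nothing to tell apart
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: baseline 200/10,000 clicks, new variation 250/10,000.
alpha = 0.05
p = two_proportion_p_value(200, 10_000, 250, 10_000)
print(f"p-value = {p:.4f}, reject null hypothesis: {p < alpha}")
```

With these made-up counts the p-value comes out at roughly 0.017, which is below 0.05, so the null hypothesis would be rejected. That conclusion is only as trustworthy as the sample size planning behind it.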

Many people conducting tests (and most tools available today) would stop with that first parameter, but to avoid the “false negative”-type mistakes, we actually need to define two more parameters: one is the minimal difference in performance we wish to detect (if indeed one exists), and the other is the probability of detecting that difference if it exists. This last quantity is called “statistical power,” and it is often set to a default of 80%. The required sample size is then calculated using these three quantities (online calculators are available for this). Although this may seem exhausting, and the resulting sample size often seems way too high, the standard approach to testing requires following this procedure carefully; otherwise, we are bound to fail. In fact, even if we do follow the above to the letter, we might still observe an incorrect outcome.
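To make the three quantities concrete, here is a sketch of one common textbook approximation for the required sample size per variation when comparing two proportions with a two-sided test. It also needs the baseline rate as an input. The function and parameter names are my own, and real calculators may use slightly different formulas.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, min_detectable_diff,
                              alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variation (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_diff
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against type I errors
    z_power = NormalDist().inv_cdf(power)          # guards against type II errors
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / min_detectable_diff ** 2)

# Hypothetical goal: detect a lift from a 2% CTR to a 2.5% CTR.
print(sample_size_per_variation(0.02, 0.005))
```

For this hypothetical case the formula asks for roughly 13,800 visitors per variation, and halving the minimal detectable difference roughly quadruples that number. This is why the required sample size often feels “way too high.”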

Let’s better understand why.

So, I did all the calculations; can I trust my results completely?

While hypothesis testing looks promising, it is, in reality, often far from bulletproof because it relies on certain hidden assumptions that are often not satisfied in real-life scenarios. The first assumption is usually pretty solid: we assume that the “samples” (namely the visitors we expose to the variations) are independent of each other, and their behavior is not inter-dependent. This assumption is usually valid unless we expose the same visitor repeatedly and count these occurrences as different exposures.

The second assumption is that the samples are identically distributed. Simply stated, this means that the probability of converting is the same for all visitors. This, of course, is not the case. The probability of converting may depend on time, location, user preferences, referrer, and many other potential factors. For example, if a marketing campaign is running during the experiment, it may cause a surge of traffic from Facebook. This may cause a drastic and sudden change in CTRs (click-through rates), because people coming from that particular campaign have different characteristics than the general visitor population. In fact, some of the more advanced optimization techniques we provide at Dynamic Yield depend on these differences.

The last assumption is that the measures we sample, e.g., the CTR or conversion rate, are normally distributed. It might sound like some obscure mathematical term to some, but the “magic” confidence level formulas depend on this assumption, which is much shakier and often does not hold. In general, the bigger the sample size and the higher the number of conversions we have, the better this assumption holds, thanks to a mathematical theorem called the central limit theorem.

OK, I get that the math is not 100% bulletproof, but what are the pitfalls I should really watch out for?

There are two main pitfalls. First, A/B testing platforms often offer a real-time display of test results as collected so far. On the one hand, this gives the test operator transparency and a feeling of control. On the other hand, it may cause an overreaction to results that are not fully cooked yet. Repeatedly looking at the intermediate results and stopping the test at the first moment of statistical significance, before the predefined sample size is reached, is a sure recipe for making mistakes. This not only causes a statistical bias toward detecting a difference that is not there, but that bias is also not something we can easily quantify in advance and correct for (for example, by demanding a stricter significance level).
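The effect of peeking can be sketched with a small simulation. The code below runs many A/A tests, where both variations have the same true CTR, so every “significant” result is a false positive. It compares stopping at the first significant peek against looking only once at the planned sample size. All numbers (500 tests, 2,000 visitors per variation, a peek every 100 visitors, a 10% CTR) are arbitrary assumptions chosen to keep the run short.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(clicks_a, clicks_b, n):
    """Two-sided two-proportion z-test with n visitors in each variation."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 1.0
    z = (clicks_b - clicks_a) / n / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa_tests(num_tests=500, visitors=2000, peek_every=100,
                      ctr=0.10, alpha=0.05, seed=7):
    """Return (false positive rate with peeking, rate with a single final look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(num_tests):
        clicks_a = clicks_b = 0
        declared_winner = False
        for n in range(1, visitors + 1):
            clicks_a += rng.random() < ctr
            clicks_b += rng.random() < ctr
            # A peeking operator would stop at the first significant look.
            if n % peek_every == 0 and not declared_winner:
                declared_winner = p_value(clicks_a, clicks_b, n) < alpha
        peeking_hits += declared_winner
        fixed_hits += p_value(clicks_a, clicks_b, visitors) < alpha
    return peeking_hits / num_tests, fixed_hits / num_tests

peeking_rate, fixed_rate = simulate_aa_tests()
print(f"false positives with peeking: {peeking_rate:.1%}, "
      f"with one final look: {fixed_rate:.1%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate is typically several times higher, even though nothing about the two variations differs.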

The second pitfall is to have overly high expectations of the winning variation’s performance well after the test is over. Due to a statistical effect called “regression toward the mean,” the performance of the winning variation over time is often not as good as it was during the test. Simply put, the winning variation may have won not just because it is really somewhat better but also because it was, to some extent, “lucky.” That luck often ends up being averaged out over time, and as a result, performance seems reduced.
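A minimal simulation of this “luck” effect, under assumed numbers: both variations below share exactly the same true CTR, yet the observed CTR of whichever one happens to win each test is, on average, higher than the truth. The function name and all parameters are illustrative.

```python
import random

def average_winner_ctr(num_tests=1000, visitors=1000, true_ctr=0.10, seed=11):
    """Average OBSERVED CTR of the winning variation across many tests."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_tests):
        # Both variations have the same true CTR; only chance separates them.
        rate_a = sum(rng.random() < true_ctr for _ in range(visitors)) / visitors
        rate_b = sum(rng.random() < true_ctr for _ in range(visitors)) / visitors
        total += max(rate_a, rate_b)  # we always crown the better-looking one
    return total / num_tests

observed = average_winner_ctr()
print(f"true CTR: 10.0%, average observed CTR of the winner: {observed:.2%}")
```

After the test, the winner goes back to performing at its true rate, so the measured “drop” is just the selection luck wearing off. When one variation really is better, the same mechanism tends to inflate the measured size of its advantage.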

Do we have an alternative to A/B testing?

We sure do. The world of web page optimization has a very wide range of use cases, and A/B testing fits only some of them. Assume we run a test over a web page with a long-existing performance baseline, and the change we want to make is supposed to stay in place and keep performing long after the test is over, and to operate the same for all users. In this case, it’s important, as mentioned, to take a deep and exact measure of the performance in a fair and balanced manner between the different alternatives. In other use cases, however, that is often simply not possible.

For example, if we wish to test which of three titles works for an article on a news website, we know that the article will be featured on our front page for a few hours only. This means we have to quickly detect the better performing title and also exploit that knowledge within the timeframe of a few hours. In this case, we can use a multi-armed bandit approach, which helps us do just that.

In short, a multi-armed bandit continuously adjusts the traffic split between the variations according to the performance measured so far and the level of certainty gained each step of the way. This approach trades off some of the certainty about which is “really” the best variation in exchange for quicker convergence. This is the first level of optimization we offer at Dynamic Yield.
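To give a feel for the idea, here is a sketch of Thompson sampling, one well-known multi-armed bandit algorithm. It is shown purely as an illustration and is not necessarily the method used by any particular product. The three “true” CTRs stand in for the three article titles and are invented. In a real setting they are unknown, and the clicks come from actual visitors.

```python
import random

def thompson_sampling(true_ctrs, num_visitors=5000, seed=3):
    """Return how many visitors each variation received."""
    rng = random.Random(seed)
    clicks = [0] * len(true_ctrs)
    misses = [0] * len(true_ctrs)
    for _ in range(num_visitors):
        # Draw a plausible CTR for each variation from its Beta posterior
        # (uniform prior); uncertain variations still get explored sometimes.
        draws = [rng.betavariate(1 + c, 1 + m) for c, m in zip(clicks, misses)]
        chosen = draws.index(max(draws))
        # Simulated visitor: clicks with the chosen variation's true CTR.
        if rng.random() < true_ctrs[chosen]:
            clicks[chosen] += 1
        else:
            misses[chosen] += 1
    return [c + m for c, m in zip(clicks, misses)]

# Hypothetical titles with true CTRs of 2%, 3%, and 15%.
traffic = thompson_sampling([0.02, 0.03, 0.15])
print("visitors per title:", traffic)
```

Most of the traffic should end up on the best-performing title long before a classic test would have reached its sample size. The price is that the weaker titles receive too few visitors to estimate their CTRs precisely.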

A deeper level of optimization is aiming toward true personalization. Either manually or automatically, we might wish to expose different users to different variations. As an example of manual control, we may want users visiting from the Netherlands on Queen’s Day to see an orange background on the web page. Such manual control makes a lot of sense in many cases, but doesn’t scale very well: say I also have a special promotion for visitors coming in from Facebook. What should I show that Dutch visitor coming from Facebook? Rules might get complex fast as you begin to think up new variations and audience groups, and there’s an element of educated guessing here: you think they should hit the mark, but no one really knows.

Thus, a different approach that may work better in some cases is an automated personalization approach. Let’s say we are running a cooking recipes website, and we have a box on our web page where we wish to show the user a recommended recipe based on his or her past visits to recipe pages/segments/tags. In this case, a personalization engine can do the job.

To conclude:

As effective as A/B tests can be, they can be misleading if not properly conducted. Reaching proper statistical significance is critical for reliable results. To get there, we need to set the required parameters as well as estimate and stick to the required sample size.

Checking the results frequently before reaching the required sample size, and acting on them, is a classic recipe for losing validity. If you are looking for shorter-term optimization, you should look into other optimization methods, such as multi-armed bandit or machine-learning-based personalization.

To understand if your test is significant enough to declare a winner, please use our free Bayesian A/B Testing Calculator.