A/A testing and decision making in experimentation
A/A testing is meant to check that the underlying system is running properly. In this post, we'll cover some of the subtle issues to consider as you run them.
The customer experience, which encapsulates the holistic relationship between a brand and customer, is made up of various interactions across channels and touchpoints. It’s no surprise, then, that companies think very carefully about how they design these experiences to attract more business.
How does a marketer know, though, if a proposed change to the customer experience will be a successful one?
The use of A/B testing is an effective, data-driven method of finding out. The basic idea is simple: create two versions of a page, for example, show each version to a portion of the visitors entering the site, and compare the results.
Visitors are allocated to one of the two variations randomly. Following well-established scientific principles, this generates a “fair test” between the two versions of the experience. Then, visitor behavior is summarized via informative KPIs like click-through rate (CTR), conversion rate (CR), or revenue, and the average KPIs for the two experiences are compared. With A/B testing, you let customers’ behavior decide which version is the most successful, instead of relying on “intuition.” And, as the old adage has it, “the customer is always right.”
Although A/B tests are simple to describe, a number of important questions arise, which we’ll address below.
- How can I decide which variant (A or B) is better?
- If I declare a winner (B is better than A), how certain am I that this is correct?
- Can I extend the analysis to more than two options?
- How can I know that my testing engine is working properly?
- What sample size is needed for an A/B test?
- Can a test be terminated early if results are clear before the end of the planned testing period?
A/B tests serve the joint goals of learn and earn. The ultimate goal is to earn more from your site or digital property. To do so, you need to learn by collecting and analyzing the data on KPIs during the test. As you collect more data, you will gain confidence as to which experience is better.
An excellent summary of what you have learned is the Probability to Be Best (or P2BB). This is a simple statistic that states the probability that B is the better variant, i.e. has the larger KPI across the visitors to your site. For example, if P2BB is 0.8, then the probability that B is the better choice is 80%, and the probability that A is the better choice is 20%. P2BB is an ideal basis for guiding the business decisions that must be made following an A/B test.
How do we compute P2BB?
To answer that question, we need to give you some background on what is called Bayesian statistics, named after the British clergyman Thomas Bayes who lived in the 18th century. The Bayesian approach to statistics uses probability distributions to describe anything that we don’t know and want to learn from data.
That’s very different from the classical frequentist approach you might have studied in an introductory statistics course, where either A is better than B, B is better than A, or they are identical. The question “what is the probability that B is better?” is not part of the lexicon – you can’t give an answer, and in fact, you’re not permitted to ask the question!
The Bayesian view makes probability summaries the common currency. You will end the A/B test with probability distributions describing what you know about the KPI for A and the KPI for B. Combining those two distributions lets you compute P2BB.
Another advantage of P2BB is that it extends directly to tests with more than two variants. At the end of the experiment, you can compute the P2BB for each one of the variants. If there is a clear winner, it will have P2BB close to 1. Clear losers will have P2BB close to 0. You may have two variants that are essentially the same and are better than all the others. In that case, each of the two leading variants will have P2BB near 0.5.
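To make this concrete, here is a minimal sketch of how P2BB can be computed for a binary KPI like CTR. It is an illustration, not the algorithm any particular testing engine uses: it assumes uniform Beta(1, 1) priors on each conversion rate and estimates P2BB by sampling from the joint posterior.

```python
import numpy as np

def p2bb(successes, trials, n_draws=200_000, seed=0):
    """Probability to Be Best for each variant, for a binary KPI.
    Assumes a uniform Beta(1, 1) prior on each conversion rate."""
    rng = np.random.default_rng(seed)
    # The posterior for each variant is Beta(1 + successes, 1 + failures).
    draws = np.column_stack([
        rng.beta(1 + s, 1 + (n - s), size=n_draws)
        for s, n in zip(successes, trials)
    ])
    # P2BB = share of joint posterior draws in which that variant leads.
    best = np.argmax(draws, axis=1)
    return np.bincount(best, minlength=len(successes)) / n_draws

# Two variants: B converted 150 times out of 10,000 visitors, A only 120.
print(p2bb(successes=[120, 150], trials=[10_000, 10_000]))
```

The same function handles more than two variants directly: pass longer lists, and the returned probabilities across all variants sum to 1.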
Checking the System
You need a reliable test engine to run a valid A/B test so that when you make a decision, it reflects a real difference between the tested alternatives and is not due to chance, or worse, a bug in the engine.
There are many things that can go wrong:
- The advantage of one variant is momentary and due to unusual traffic fluctuations
- The randomization routine might be faulty
- Outcome data may not be recorded correctly
- The analysis routine might not be correct
For example, suppose the allocation was random for some visitors but gave variant A to all those who came from a particular geographic region. If the visitors from that region react differently than those in other regions, the allocation failure will bias your experiment.
You also need a representative sample of visitors. When you make a decision from an A/B test, you are always extrapolating from the site visitors involved in your experiment to all those who will visit your site in the future. For extrapolation to be valid, you need to make sure that all segments of those future visitors are covered in your test, and in roughly the same proportions. That’s what we mean by a representative sample.
As an example, web commerce is often subject to natural time trends in the outcome data. Diurnal effects or day-of-week effects may be present. If so, you want to make sure that (i) at least one full time cycle is included in the experiment, and (ii) that randomization continues to work properly throughout the cycle. As another example, some of your traffic may come from laptops and some from smartphones. You want to know that both of these groups get adequate exposure in your test.
To check that the underlying test engine is running properly, you may want to begin with a small number of A/A tests. The idea of A/A testing is to compare an experience to itself. As both experiences are identical, you obviously don’t expect to see any differences between them. That gives a basis for trusting the results of subsequent A/B tests: setting a benchmark for what happens when there are no differences between the two variants under test helps you appreciate the differences you see when comparing two genuinely different variants. Moreover, you can use your standard experience in the A/A test, so there is no risk of lost activity from using a new version.
The A/A test format sounds easy, but there are some subtle issues to think about. Data is variable, so even if both groups in your experiment got the same experience, you will still see some difference in outcome. You need some yardsticks to tell you whether those differences are big enough to indicate problems, or are just part of the natural variation in the data. You also need to think about how long to run an A/A test and how many A/A tests to run before moving ahead.
We have already suggested a good yardstick: Probability to Be Best (P2BB). This yardstick takes full account of the variability in the data. It is important to check the average KPIs as well. Since you are comparing an experience to itself, you expect these to be very similar to one another. You should also look at the fraction of traffic directed to each condition. The allocation should be close to 50% for each one – an allocation that deviates from 50% is a sign of trouble.
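As an illustration (not any engine's actual checking logic), here is a simulated A/A test that computes the three yardsticks above: the allocation fraction, the average KPIs, and P2BB. The traffic volume and conversion rate are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate one A/A test: an identical conversion rate on both sides,
# with random 50/50 allocation of 100,000 visitors. (The rate and the
# sample size are illustrative, not recommendations.)
true_rate = 0.01
arm = rng.integers(0, 2, size=100_000)           # 0 = "A", 1 = "B"
converted = rng.random(100_000) < true_rate

n = np.bincount(arm, minlength=2)
conv = np.bincount(arm, weights=converted, minlength=2)

alloc_frac = n[0] / n.sum()          # should sit close to 0.50
rates = conv / n                     # the average KPIs should be similar

# P2BB from Beta(1, 1) posteriors on each arm's conversion rate.
draws_a = rng.beta(1 + conv[0], 1 + n[0] - conv[0], size=100_000)
draws_b = rng.beta(1 + conv[1], 1 + n[1] - conv[1], size=100_000)
p2bb_b = (draws_b > draws_a).mean()

print(f"allocation to A: {alloc_frac:.3f}")
print(f"conversion rates: {rates.round(4)}")
print(f"P2BB for B: {p2bb_b:.3f}")
```

A large deviation of `alloc_frac` from 0.50, a large gap between the two rates, or an extreme P2BB would all be warning signs.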
Although P2BB is an excellent summary of what is learned from the data, we need to add a few words of caution.
First, data has natural variability. P2BB depends on the data, so it is also variable. Consequently, there will be A/A tests that give a P2BB close to 0 or 1, even though you know there is no difference. No A/B testing strategy is immune to occasionally declaring a winner in this case.
Second, data volumes are huge in many online experiments. With huge sample sizes, even very small differences in average KPI can lead to an extreme P2BB. This reiterates the importance of checking the KPI averages as well. You might have a clear indication of a winner from P2BB but associated with a KPI difference that makes almost no difference to your bottom line.
Third, beware of over-monitoring. Testing engines let you track the results from your experiment, so it is natural to check regularly to see if the data accumulated thus far points to a winner. P2BB will vary during the course of the experiment, reflecting the variation in visitor outcomes. If you check the “current P2BB” every hour in a two-week A/A test, that adds up to more than 300 P2BB values. Some may be close to 1, others may be close to 0. A common, but risky, practice is to stop a test as soon as P2BB first crosses a threshold (say, dropping below 0.05 or rising above 0.95). Why is that risky? Because it is not so surprising that the most extreme hourly P2BB value will cross one of these boundaries. Getting a final P2BB below 0.05, for example, will happen in about 5% of your A/A tests. But getting just one P2BB result below 0.05 will happen far more often. Data peeking can seriously inflate your false winner rate if it is not preplanned. You should use only the final P2BB (meaning you have gathered the full planned sample size), together with the average KPIs, to evaluate your engine.
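A small simulation makes the peeking problem tangible. It runs many A/A tests, evaluates a normal-approximation P2BB at each of 50 interim looks, and compares how often the final P2BB is extreme with how often some interim P2BB crossed the 0.05/0.95 boundaries. All rates, sizes, and look counts here are illustrative choices.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)

def approx_p2bb(conv_a, n_a, conv_b, n_b):
    """Normal approximation to P2BB for a binary KPI: with large counts,
    the posterior of the rate difference is roughly Gaussian."""
    pa, pb = conv_a / n_a, conv_b / n_b
    se = sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    if se == 0:
        return 0.5
    z = (pb - pa) / se
    return 0.5 * (1 + erf(z / sqrt(2)))

rate, per_check, n_checks, n_tests = 0.02, 2_000, 50, 400
stopped_early = 0     # tests where some interim P2BB crossed a boundary
extreme_at_end = 0    # tests where the *final* P2BB was extreme
for _ in range(n_tests):
    conv, n = np.zeros(2), np.zeros(2)
    crossed = False
    for _ in range(n_checks):
        # Each "hour", per_check new visitors arrive in each arm.
        conv += rng.binomial(per_check, rate, size=2)
        n += per_check
        p = approx_p2bb(conv[0], n[0], conv[1], n[1])
        if p < 0.05 or p > 0.95:
            crossed = True
    stopped_early += crossed
    final = approx_p2bb(conv[0], n[0], conv[1], n[1])
    extreme_at_end += (final < 0.05 or final > 0.95)

print(f"final P2BB extreme:   {extreme_at_end / n_tests:.1%}")  # roughly 10%
print(f"crossed at some peek: {stopped_early / n_tests:.1%}")   # much higher
```

Even though both arms are identical, stopping at the first boundary crossing declares far more false winners than waiting for the final P2BB.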
Further, if you want to thoroughly check the engine, run several A/A tests. We all know that it is dangerous to reach too many conclusions based on a sample of size 1. The resulting final P2BBs should be “spread out” between 0 and 1.
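A quick way to see what this “spread” should look like is to replicate many A/A tests in simulation and examine the distribution of final P2BB values: under a healthy engine they scatter roughly uniformly between 0 and 1. (The rate and per-arm sample size below are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(11)

# Replicate many A/A tests (identical 1% rate in both arms, 20,000
# visitors per arm) and collect the final P2BB from each one.
n_tests, n_arm, rate = 500, 20_000, 0.01
p2bbs = []
for _ in range(n_tests):
    conv = rng.binomial(n_arm, rate, size=2)
    a = rng.beta(1 + conv[0], 1 + n_arm - conv[0], size=20_000)
    b = rng.beta(1 + conv[1], 1 + n_arm - conv[1], size=20_000)
    p2bbs.append((b > a).mean())
p2bbs = np.array(p2bbs)

# A healthy engine gives final P2BBs spread across (0, 1), not bunched up.
print(f"mean {p2bbs.mean():.2f}, "
      f"quartiles {np.percentile(p2bbs, [25, 50, 75]).round(2)}")
```

If your replicated A/A tests instead pile up near 0 or 1, something in the allocation, logging, or analysis pipeline deserves a closer look.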
When data volume leads to an extreme P2BB alongside a small difference in average KPIs, it may be helpful to carry out a more refined analysis that reflects the business impact of the comparison. Consider defining a minimal critical difference in KPIs that has an impact on your business. For example, you might decide that a difference of less than 0.5% between two competing variants would not justify making a change. In that case, you can use the Bayesian analysis to break down the summary of an A/B test into three distinct outcomes: B is better, B is worse, or B is the same as A (i.e. the difference is smaller than the critical margin). For A/A tests, you will often find that the third option has high probability, meaning you have strong evidence that there is not a meaningful difference between the two variants.
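Here is one way such a three-outcome breakdown might be sketched. It uses the same Beta posteriors as before, with an absolute margin on the conversion-rate difference; the 0.5% figure above is a relative example, and the 0.2-percentage-point margin below is simply an illustrative absolute one.

```python
import numpy as np

rng = np.random.default_rng(3)

def three_way(conv_a, n_a, conv_b, n_b, margin, n_draws=200_000):
    """Split the posterior into 'B better', 'B worse', and 'practically
    the same', using a critical margin on the absolute rate difference.
    Assumes Beta(1, 1) priors on each conversion rate."""
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=n_draws)
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=n_draws)
    diff = b - a
    return {
        "B better": (diff > margin).mean(),
        "B worse": (diff < -margin).mean(),
        "same within margin": (np.abs(diff) <= margin).mean(),
    }

# A/A-like data: both arms near 1% conversion; margin of 0.2 points.
print(three_way(conv_a=1010, n_a=100_000, conv_b=995, n_b=100_000,
                margin=0.002))
```

With A/A-like data and enough traffic, almost all of the posterior mass lands in the “same within margin” outcome, which is exactly the strong evidence of no meaningful difference described above.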
How large should an A/A test be?
The idea of the A/A test is to provide a realistic dry run of the system. You want to check that about half the visitors are allocated to each option, that the KPIs stay close to one another, and that P2BB does not declare a clear winner.
The sample size will depend on several factors:
- The type of data (binary, like CTR; continuous, like revenue)
- The current performance
- The performance level that would be a significant improvement
- For continuous KPIs, the extent of variation between visitors
Below we will expand on each of these and illustrate them with an example.
The type of data is important, as different sample size formulas are relevant for each. The current performance must be known or estimated. The sample size is always set with the goal of identifying some target change. For an A/B test, you can set the target improvement to match your expectations from the new test page. That approach doesn’t work for an A/A test, where both versions should have the same KPI.
Instead, we recommend setting the target to be the smallest change that you would consider an important improvement. For example, with a KPI of 0.010, a 10% increase to 0.011 might be your threshold for a large change. If your average revenue is $1 per visitor, an increase of just 3%, to $1.03, might be important.
A simple guide is to ask, “what is the size of improvement that we would definitely want to identify?” The extent of variation is important for continuous KPIs. Typically, what you will want is the standard deviation (SD) of the outcome among converting visitors.
It is worth noting that the same inputs are relevant for an A/B test and they are used in essentially the same way. But there is one distinction – the goal of an A/B test is to find a difference, so it’s clearly important to provide a near guarantee that important differences will be identified. We achieve that sharper focus by increasing the sample sizes. In an A/A test, though, the goal is to check the system. There is no need for the tight comparison of an A/B test, so we can use a less stringent criterion, which leads to smaller sample sizes.
Further, we recommend running an A/B test for at least one complete time cycle. If weekday visitors prefer A but weekend visitors prefer B, we would want to have both of these groups in our data before making a decision. With an A/A test, there is no concern that preferences for one option versus the other will change over time, so there is no need to run a test for a full cycle. If you run several A/A tests on the same experience, it is a good idea to spread them out over time so that you do see the engine in operation across a full time cycle.
We can illustrate these ideas with an example.
Suppose your current CTR is 0.0060. An improvement to 0.0065 would be important. The test should include 191,000 visitors to each of the two options.
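The post does not show the calculation behind this figure, but a standard two-proportion sample-size formula reproduces it if we take a two-sided 5% criterion (z = 1.96) together with the relaxed A/A criterion discussed above, here interpreted as 50% power (z_beta = 0). That interpretation is an assumption on our part, not something the post states.

```python
from math import ceil

def aa_sample_size(p1, p2, z_alpha=1.96, z_beta=0.0):
    """Per-arm sample size for comparing two proportions.
    z_alpha = 1.96 is a two-sided 5% criterion; z_beta = 0.0 corresponds
    to 50% power, a relaxed criterion plausible for an A/A-style check.
    (The exact convention behind the quoted figure is an assumption.)"""
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2)

print(aa_sample_size(0.0060, 0.0065))  # roughly 191,000 per arm
```

For a full A/B test, the stricter criterion would add a nonzero z_beta (e.g. 0.84 for 80% power), roughly doubling the required sample size for the same target.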
The calculation should be tempered with practical wisdom. If you compute a sample size that will take weeks to achieve, then the target improvement, even if it is an important gain, is probably too difficult to achieve. In that case, you should reset the target so that your A/A test runs no more than two weeks.