A/A testing and decision making in experimentation
A/A testing is meant to check that the underlying system is running properly. In this post, we'll cover some of the subtle issues to consider as you run A/A tests.
Professor Ron Kenett and David Steinberg of KPA Group sat down to talk more candidly about some of the topics discussed in this article.
If you run an A/A test and you quickly identify that some A’s look better than other A’s, and you know they are the same, then you have reason to question the A/B testing methodology and technology. So the simplistic approach to the A/A test is just to give you some sense of confidence and to debug the system (maybe there is a problem with the random assignment), but nothing beyond that. You could think of more sophisticated approaches, where the A/A test is not just done as a precursor to the A/B test, but can also be done in parallel with A/B tests for ongoing monitoring. So you could develop more sophisticated systems, which give you control over what’s going on, and that would give you the ability to manage changes in traffic composition and all sorts of things that can happen that would impact your decision.
For example, it could happen that there’s some breakdown in the system. You’re running a test and your allocation is being handled by distributed servers; it’s no longer on a single server. One of the servers didn’t work, and in that geographic area, everyone was put into this bucket, and another one was put into that bucket, and the people in that geographic area happened to respond a bit differently than people in other geographic areas. Alright, well, now you’ve biased your experimental results, because all of those visitors went into one bucket. And those are exactly the kinds of problems and bugs that you’d like to be able to filter out of the system, and the A/A test is a very good way to do that. So when you do get to comparing different options, drawing conclusions, and making business decisions on them, you have confidence that those are working properly.
Another way that it can easily happen, and this is pretty common, is from over-monitoring the experiment. It’s very natural, if you’ve got data coming in, to not wait a week to look at them but to look at them every day, maybe several times a day. So, someone who comes into the site, maybe they will respond, maybe they won’t respond, maybe they’ll buy something, maybe they won’t. There’s gonna be variability in the data that you see. One of the two A’s, one of your two buckets, is gonna show a slightly better KPI. Is it sufficiently better? But if you watch it over time, what’s gonna happen? It’s gonna drift. So it might show that this bucket is better, it might show that that bucket is better, but you have to exercise some caution about trying to reach judgements too fast, and that’s one thing that, again, can lead to a situation that, instead of being a confidence builder, drives down your confidence. So, be a little bit careful about data peeking, because it’s gonna happen.
In clinical trials, there is something called interim analysis. You plan a trial and in some cases you plan, but it has to be planned ahead of time, that you’re going to have an analysis of the data halfway through. This has to be accounted for. You cannot just collect data and decide to look at it if this was not planned ahead of time. So what happens in A/B testing or A/A testing is that people might do that without having planned it ahead of time, and that has an effect. People might not be aware of that effect, and that might mislead them.
Almost everything that we’ve talked about in terms of A/A testing is also relevant to A/B tests. You do expect to see some differences – that’s the reason that you’re running the test to begin with. So you’re not gonna be as surprised to see differences in the case of the A/B test, but again, in order to make sure that you’re making the right business decisions, you do wanna make sure the differences that you’re seeing are real ones, so that if you now make a decision based on those differences, this is not just an artifact in the data, it’s something that’s really there, and you can really build a sound business decision on it that’s gonna increase your business activity.
This becomes even more complex when you have several variants. So it’s not A/B, it’s A, B one, B two, B three, B four. So the complexity can be, okay, we have A and we have five variants. So what do we want? We want to show which is the best, we want to show that they are better than A, we want to pick out the top two. And if you add to this the option to design the alternatives using what I mentioned, in terms of designing experiments with combinations of the factors, then you get into even more complexity. So the testing effort is not more difficult to do; it’s the design and the analysis that require a bit more.
Right, some of the early questions that were looked at were things like the question that Laplace studied: what’s the probability that the sun will rise tomorrow? Alright, we’ve seen the sun rise for many days in a row, what’s the probability that it will rise tomorrow? And does it make sense to even talk about a question like that in a probabilistic sense?
Many of these early discussions had this very metaphysical nature to them. In terms of modern statistical data analysis, the use of Bayesian technology and Bayesian thinking has become much, much more established. Part of the reason why there was resistance, and in some cases there is still some resistance, is, I think, sort of on the grounds of objective versus subjective science. And the idea that you’re bringing prior information to a Bayesian analysis can be regarded, and often has been regarded, as expressing subjectivity. So I might have a different prior than Ron does, and we might come to different conclusions from the same data. In data-rich problems, and this is the standard environment of course for A/B testing, this is much less critical, because again the data are so abundant that anything that you feed in through the prior is essentially going to be washed out by the data. And as a result, it doesn’t really make any difference what I thought in advance or what Ron thought in advance; we’re gonna agree with one another in the end, after we’ve seen the data from the experiment.
The customer experience, which encapsulates the holistic relationship between a brand and customer, is made up of various interactions across channels and touchpoints. It’s no surprise, then, that companies think very carefully about how they design these experiences to attract more business.
How does a marketer know, though, if a proposed change to the customer experience will be a successful one?
The use of A/B testing is an effective, data-driven method of finding out. The basic idea is simple: compare two versions of a page, for example, and show each one to a certain portion of the individuals entering the site and evaluate their results.
Visitors are allocated to one of the two variations randomly. Following well-established scientific principles, this generates a “fair test” between the two versions of the experience. Then, visitor behavior is summarized via informative KPIs like click-through rate (CTR), conversion rate (CR), or revenue, and the average KPIs for the two experiences are compared. With A/B testing, you let customers’ behavior decide which version is the most successful, instead of relying on “intuition.” And, as the old adage has it, “the customer is always right.”
Although A/B tests are simple to describe, a number of important questions arise, which we’ll address below.
- How can I decide which variant (A or B) is better?
- If I declare a winner (B is better than A), how certain am I that this is correct?
- Can I extend the analysis to more than two options?
- How can I know that my testing engine is working properly?
- What sample size is needed for an A/B test?
- Can a test be terminated early if results are clear before the end of the planned testing period?
A/B tests serve the joint goals of learn and earn. The ultimate goal is to earn more from your site or digital property. To do so, you need to learn by collecting and analyzing the data on KPIs during the test. As you collect more data, you will gain confidence as to which experience is better.
An excellent summary of what you have learned is the Probability to Be Best (or P2BB). This is a simple statistic that states the probability that B is the better variant, i.e. has the larger KPI across the visitors to your site. For example, if P2BB is 0.8, then the probability that B is the better choice is 80%, and the probability that A is the better choice is 20%. P2BB is an ideal basis for guiding the business decisions that must be made following an A/B test.
How do we compute P2BB?
To answer that question, we need to give you some background on what is called Bayesian statistics, named after the British clergyman Thomas Bayes who lived in the 18th century. The Bayesian approach to statistics uses probability distributions to describe anything that we don’t know and want to learn from data.
Some early questions that were looked at included “what’s the probability that the sun will rise tomorrow?” Back then, many wondered whether it even made sense to talk about a question like that in a probabilistic sense. The idea that you’re bringing prior information to a Bayesian analysis was often regarded as expressing subjectivity.
That’s very different from the classical frequentist approach you might have studied in an introductory statistics course, where either A is better than B, B is better than A, or they are identical. The question “what is the probability that B is better” is not part of the lexicon – you can’t give an answer, and in fact, you’re not permitted to ask the question!
The concern about subjective priors matters little in A/B testing, however: anything that you feed in through the prior is essentially going to be washed out by the abundance of data generated in the test. So it doesn’t really matter what different analysts thought in advance, as they will agree with one another once the data from the experiment has been seen.
The Bayesian view makes probability summaries the common currency. You will end the A/B test with probability distributions describing what you know about the KPI for A and the KPI for B. Combining those two distributions lets you compute P2BB.
Another advantage of P2BB is that it extends directly to tests with more than two variants. At the end of the experiment, you can compute the P2BB for each one of the variants. If there is a clear winner, it will have P2BB close to 1. Clear losers will have P2BB close to 0. You may have two variants that are essentially the same and are better than all the others. In that case, each of the two leading variants will have P2BB near 0.5.
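As a rough illustration of how P2BB can be computed for two or more variants, here is a minimal Python sketch. It assumes a binary KPI such as CTR and a flat Beta(1, 1) prior for each variant, and it estimates P2BB by simulation. The function name `p2bb`, the prior, and the visitor counts are illustrative assumptions, not a description of any particular testing engine.

```python
import random

def p2bb(successes, trials, draws=20000, seed=0):
    """Estimate Probability to Be Best for each variant by Monte Carlo.

    Assumes a binary KPI and a flat Beta(1, 1) prior per variant
    (illustrative choices for this sketch).
    """
    rng = random.Random(seed)
    wins = [0] * len(trials)
    for _ in range(draws):
        # One draw from each variant's Beta posterior over its true rate.
        sample = [rng.betavariate(1 + s, 1 + n - s)
                  for s, n in zip(successes, trials)]
        # Credit the variant with the highest sampled rate in this draw.
        wins[sample.index(max(sample))] += 1
    return [w / draws for w in wins]

# Hypothetical counts: clicks and visitors for variants A, B1 and B2.
probs = p2bb(successes=[120, 150, 151], trials=[10000, 10000, 10000])
print([round(p, 2) for p in probs])
```

With these made-up counts, A should come out with a P2BB close to 0, while the two nearly identical B variants should each land near 0.5, which is the pattern described above.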
Checking the System
You need a reliable test engine to run a valid A/B test so that when you make a decision, it reflects a real difference between the tested alternatives and is not due to chance, or worse, a bug in the engine.
There are many things that can go wrong:
- The advantage of one variant is momentary and due to unusual traffic fluctuations
- The randomization routine might be faulty
- Outcome data may not be recorded correctly
- The analysis routine might not be correct
For example, suppose the allocation was random for some visitors but gave variant A to all those who came from a particular geographic region. If the visitors from that region react differently than those in other regions, the allocation failure will bias your experiment.
You also need a representative sample of visitors. When you make a decision from an A/B test, you are always extrapolating from the site visitors involved in your experiment to all those who will visit your site in the future. For extrapolation to be valid, you need to make sure that all segments of those future visitors are covered in your test, and in roughly the same proportions. That’s what we mean by a representative sample.
As an example, web commerce is often subject to natural time trends in the outcome data. Diurnal effects or day-of-week effects may be present. If so, you want to make sure that (i) at least one full time cycle is included in the experiment, and (ii) randomization continues to work properly throughout the cycle. As another example, some of your traffic may come from laptops and some from smartphones. You want to know that both of these groups get adequate exposure in your test.
To check that the underlying test engine is running properly, you may want to begin with a small number of A/A tests. The idea of A/A testing is to compare an experience to itself. As both experiences are identical, you obviously don’t expect to see any differences between them – if you run an A/A test and quickly identify that some A’s look better than other A’s, and you know they are the same, then you have reason to question the A/B testing methodology and technology.
A/A testing gives a basis for trusting the results of subsequent A/B tests; setting a benchmark for what happens when there are no differences between the two variants under the test helps you appreciate the difference you see when comparing two different variants. Moreover, you can use your standard experience in the A/A test, so there is no risk of lost activity from using a new version.
In a more sophisticated approach, the A/A test might not just be done as a precursor to the A/B test, but also in parallel for ongoing monitoring. This would give you more control and the ability to manage changes in traffic composition as well as all sorts of things that can impact decision-making.
The A/A test format sounds easy, but there are some subtle issues to think about. Data is variable, so even if both groups in your experiment got the same experience, you will still see some difference in outcome. You need some yardsticks to tell you whether those differences are big enough to indicate problems, or are just part of the natural variation in the data. You also need to think about how long to run an A/A test and how many A/A tests you want to run before moving ahead.
We have already suggested a good yardstick: Probability to Be Best (P2BB). This yardstick takes full account of the variability in the data. It is important to check the average KPIs as well. Since you are comparing an experience to itself, you expect these to be very similar to one another. You should also look at the fraction of traffic directed to each condition. The allocation should be close to 50% for each one – an allocation that deviates from 50% by more than chance variation would explain is a sign of trouble.
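One simple way to judge whether an observed traffic split is "close to 50%" is a two-sided z-test on the share of visitors sent to one bucket. The sketch below is only an illustration: the function name `allocation_check` and the visitor counts are hypothetical, and the normal approximation assumes a reasonably large number of visitors.

```python
from math import sqrt
from statistics import NormalDist

def allocation_check(n_a, n_b, expected_share=0.5):
    """Two-sided z-test of the observed share of traffic sent to bucket A."""
    n = n_a + n_b
    share = n_a / n
    # Standard error of the share if allocation really follows expected_share.
    se = sqrt(expected_share * (1 - expected_share) / n)
    z = (share - expected_share) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return share, z, p_value

# Hypothetical visitor counts for the two buckets of an A/A test.
share, z, p_value = allocation_check(50_210, 49_790)
print(f"share to A = {share:.4f}, z = {z:.2f}, p = {p_value:.3f}")

# A split of 51% / 49% on the same total traffic is far outside chance variation.
_, z_bad, p_bad = allocation_check(51_000, 49_000)
print(f"z = {z_bad:.2f}, p = {p_bad:.2g}")
```

A very small p-value suggests that the randomization or logging deserves a closer look. A moderate one, as in the first call, is what ordinary chance variation looks like.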
Although P2BB is an excellent summary of what is learned from the data, we need to add a few words of caution.
First, data has natural variability. P2BB depends on the data, so it is also variable. Consequently, there will be A/A tests that give a P2BB close to 0 or 1, even though you know there is no difference. No A/B testing strategy is immune to occasionally declaring a winner in this case.
Second, data volumes are huge in many online experiments. With huge sample sizes, even very small differences in average KPI can lead to an extreme P2BB. This reiterates the importance of checking the KPI averages as well. You might have a clear indication of a winner from P2BB but associated with a KPI difference that makes almost no difference to your bottom line.
Third, beware of over-monitoring. Testing engines let you track the results from your experiment, so it is natural to check regularly to see if the data accumulated thus far points to a winner. P2BB will vary during the course of the experiment, reflecting the variation in visitor outcomes. If you check the “current P2BB” every hour in a two-week A/A test, that adds up to more than 300 P2BB values. Some may be close to 1, others may be close to 0. A common, but risky, practice is to stop a test as soon as P2BB first crosses a threshold (say, dropping below 0.05 or rising above 0.95). Why is that risky? Because it is not so surprising that the most extreme hourly P2BB value will cross one of these boundaries. Getting a final P2BB below 0.05, for example, will happen in about 5% of your A/A tests. But getting at least one hourly P2BB below 0.05 at some point during the test will happen far more often. Data peeking can seriously inflate your false winner rate if not preplanned. You should use only the final P2BB (computed once you’ve gathered the planned sample size), together with the average KPIs, to evaluate your engine.
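The effect of peeking can be seen in a small simulation. The sketch below is a simplified model rather than a replay of a real engine. It assumes equal hourly traffic to both buckets and a large-sample approximation in which P2BB is roughly Φ(z), where z is the standardized difference between the buckets. Under those assumptions, each simulated A/A test is a Gaussian random walk checked once per hour for two weeks (336 peeks). The function name and all parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_rates(n_tests=1000, n_peeks=336, low=0.05, high=0.95, seed=1):
    """Compare 'stop at the first extreme P2BB' with 'use only the final P2BB'.

    Large-sample approximation: P2BB ~ Phi(z), so P2BB < low or > high
    is the same event as z falling outside [z_low, z_high].
    """
    rng = random.Random(seed)
    z_low = NormalDist().inv_cdf(low)
    z_high = NormalDist().inv_cdf(high)
    any_cross = final_cross = 0
    for _ in range(n_tests):
        total = 0.0
        crossed_during = False
        for t in range(1, n_peeks + 1):
            total += rng.gauss(0.0, 1.0)   # one more hour of A/A noise
            z = total / sqrt(t)            # standardized difference so far
            extreme = z < z_low or z > z_high
            crossed_during = crossed_during or extreme
        any_cross += crossed_during        # some peek looked like a winner
        final_cross += extreme             # only the last peek counts here
    return any_cross / n_tests, final_cross / n_tests

peek_rate, final_rate = peeking_rates()
print(f"false winner rate when stopping at first crossing: {peek_rate:.2f}")
print(f"false winner rate using only the final P2BB:       {final_rate:.2f}")
```

Because both boundaries (0.05 and 0.95) are counted, the final-only rate should sit near 10%, while the stop-at-first-crossing rate should be several times higher.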
Further, if you want to thoroughly check the engine, run several A/A tests. We all know that it is dangerous to reach too many conclusions based on a sample of size 1. The resulting final P2BBs should be “spread out” between 0 and 1, rather than piling up near either extreme.
When data volume leads to an extreme P2BB alongside a small difference in average KPIs, it may be helpful to carry out a more refined analysis that reflects the business impact of the comparison. Consider defining a minimal critical difference in KPIs that has an impact on your business. For example, you might decide that a difference of less than 0.5% between two competing variants would not justify making a change. In that case, you can use the Bayesian analysis to break down the summary of an A/B test into three distinct outcomes: B is better, B is worse, or B is the same as A (i.e. the difference is smaller than the critical margin). For A/A tests, you will often find that the third option has high probability, meaning you have strong evidence that there is not a meaningful difference between the two variants.
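The three-way breakdown can be sketched with the same kind of posterior simulation used for P2BB. The code below is a minimal illustration under stated assumptions: a binary KPI, flat Beta(1, 1) priors, and a critical margin interpreted as a relative difference of 0.5% between B and A. The source does not say whether the margin is relative or absolute, so that reading is an assumption, and the function name and counts are hypothetical.

```python
import random

def three_way_summary(s_a, n_a, s_b, n_b, margin=0.005, draws=20000, seed=0):
    """Posterior probabilities that B is better, worse, or practically the same.

    'Same' means the relative difference (B - A) / A lies within +/- margin.
    Flat Beta(1, 1) priors and a relative margin are assumptions of this sketch.
    """
    rng = random.Random(seed)
    better = worse = 0
    for _ in range(draws):
        a = rng.betavariate(1 + s_a, 1 + n_a - s_a)
        b = rng.betavariate(1 + s_b, 1 + n_b - s_b)
        rel = (b - a) / a
        if rel > margin:
            better += 1
        elif rel < -margin:
            worse += 1
    same = draws - better - worse
    return better / draws, worse / draws, same / draws

# Hypothetical high-volume test: 10 million visitors per variant,
# conversion rates of 10.00% (A) and 10.03% (B).
p_better, p_worse, p_same = three_way_summary(
    1_000_000, 10_000_000, 1_003_000, 10_000_000)
print(f"better: {p_better:.3f}  worse: {p_worse:.3f}  same: {p_same:.3f}")
```

With volumes this large, B is very likely to be ahead of A in a plain P2BB sense, yet most of the posterior probability should fall in the "same" outcome because the gap is smaller than the critical margin.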
One should also be mindful of the added complexity of testing several variants – when it’s not A/B, but A, B1, B2, B3, B4. With A and four variants, we may want to show which variant is the best, show that the variants are better than A, or pick out the top two. Running such a test is no more difficult, but the design and analysis require a bit more work, especially when the alternatives are built from combinations of factors.
How large should an A/A test be?
The idea of the A/A test is to provide a realistic dry run of the system. You want to be sure that about half the visitors are allocated to each option, that the KPIs stay close to one another, and that P2BB does not declare a clear winner.
The sample size will depend on several factors:
- The type of data (binary, like CTR; continuous, like revenue)
- The current performance
- The performance level that would be a significant improvement
- For continuous KPIs, the extent of variation between visitors
Below we will expand on each of these and illustrate them with an example.
The type of data is important, as different sample size formulas are relevant for each. The current performance must be known or estimated. The sample size is always set with the goal of identifying some target change. For an A/B test, you can set the target improvement to match your expectations from the new test page. That approach doesn’t work for an A/A test, where both versions should have the same KPI.
Instead, we recommend setting the target to the smallest improvement that you would consider important. For example, with a CTR of 0.010, a 10% increase to 0.011 might be your threshold for a large improvement. If your average revenue is $1 per visitor, an increase of just 3%, to $1.03, might be important.
A simple guide is to ask, “what size of improvement would we definitely want to identify?” The extent of variation is important for continuous KPIs. Typically, what you will want is the standard deviation (SD) of the result among those who are converters.
It is worth noting that the same inputs are relevant for an A/B test and they are used in essentially the same way. But there is one distinction – the goal of an A/B test is to find a difference, so it’s clearly important to provide a near guarantee that important differences will be identified. We achieve that sharper focus by increasing the sample sizes. In an A/A test, though, the goal is to check the system. There is no need for the tight comparison of an A/B test, so we use a less stringent criterion on focus, which leads to smaller sample sizes.
Further, we recommend running an A/B test for at least one complete time cycle. If weekday visitors prefer A but weekend visitors prefer B, we would want to have both of these groups in our data before making a decision. With an A/A test, there is no concern that preferences for one option versus the other will change over time, so there is no need to run a test for a full cycle. If you run several A/A tests on the same experience, it is a good idea to spread them out over time so that you do see the engine in operation across a full time cycle.
We can illustrate these ideas with an example.
Suppose your current CTR is 0.0060. An improvement to 0.0065 would be important. The test should include 191,000 visitors to each of the two options.
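The article does not state the formula or the settings behind the 191,000 figure. As a hedged illustration, the standard normal-approximation sample size for comparing two proportions gives a value close to it when used with a two-sided 5% significance level and 50% power. Those settings are our assumption, chosen because they match the "less stringent" criterion described for A/A tests, and the function name `visitors_per_arm` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_arm(p_current, p_target, alpha=0.05, power=0.5):
    """Normal-approximation sample size per arm for comparing two proportions.

    alpha (two-sided) and power are assumed settings, not taken from the article.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)           # equals 0 when power = 0.5
    variance = p_current * (1 - p_current) + p_target * (1 - p_target)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_target - p_current) ** 2)

n_aa = visitors_per_arm(0.0060, 0.0065)              # relaxed, A/A-style criterion
n_ab = visitors_per_arm(0.0060, 0.0065, power=0.8)   # stricter, A/B-style criterion
print(n_aa, n_ab)
```

Under these assumptions the first call returns roughly 191,000 visitors per option. Raising the power to 80% about doubles the requirement, which matches the point that A/B tests need larger samples than A/A tests.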
The calculation should be tempered with practical wisdom. If you compute a sample size that will take weeks to achieve, then the target improvement, even if it is an important gain, is probably too small to detect in a practical time frame. In that case, you should reset the target so that your A/A test runs no more than two weeks.