An A/A test is an A/B test in which the two variations shown to users are identical. It helps marketers verify that the test is set up correctly and that the A/B testing platform is reliable.
WHAT IS AN AA TEST GOOD FOR?
AA testing is a good technique for checking the health of a tool's integration with a website. It is also useful for checking the quality of execution (which variation is chosen and whether the assignment is sticky), the tool's data collection and data integrity, and that no data is lost or altered.
For example, if the AA test has an even traffic split, one hopes that over time, the traffic will indeed be split evenly between the two variations and that the KPIs are roughly the same.
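One way to make "split evenly" concrete is a chi-square goodness-of-fit test on the visitor counts. The sketch below is only an illustration: the function name `split_check`, the visitor counts, and the choice of a chi-square test are assumptions, not something a particular tool prescribes.

```python
import math

def split_check(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) on the traffic split.

    Returns the chi-square statistic and its p-value. A very small p-value
    suggests the observed split differs from the configured one by more
    than chance alone would explain.
    """
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom, the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts for an A/A test configured with a 50/50 split.
chi2, p_value = split_check(50_210, 49_790)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}")
```

A large p-value here is consistent with an even split. A tiny one (for instance with counts such as 52,000 vs 48,000) is a hint to inspect the integration before looking at any KPI.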
As a side note, none of this concerns the statistical engine itself, which simply takes the collected data as its input.
AA tests can also be used to examine the reliability of the statistical engine itself, but in doing so, one needs to understand that an AA test is a very specific, atypical scenario for an AB testing stats engine to operate in, so it is of limited use for this goal. One should also understand in advance what to expect from an AB test stats engine in such a scenario.
BEFORE CONDUCTING AN AA TEST
First, do your best to make sure the integration with the tool is intact and that the tool is used properly.
In an AA test, the user sets up an AB test but injects two identical variations, A and B; since A=B, we call it an AA test. Data should be collected over a substantial period of time, after which the user should stop and evaluate the results using the tool.
WHAT SHOULD I EXPECT FROM AN A/A TESTING TOOL?
In such tests, after a not-so-long period of data collection, intuitively, the tool is expected to:
- Show that both variations have similar results (in metrics and in Probability to Be Best, or P2BB)
- Not declare a winner
- Declare a draw, if it has a “declare draw” feature
How do you run an A/A test?
Running an A/A test is much like running an A/B test, except that the two groups of users, randomly assigned to each variation, are given the exact same experience. Since both groups receive the same treatment, we expect the KPIs of the two variations to be roughly the same and the statistical engine driving the test to remain inconclusive indefinitely.
If something goes wrong with the setup of the test (for example, the populations of users reaching each variation are different), it may cause a false declaration of a ‘winner’, namely, that one of the variations was better than the other in a statistically significant manner. However, it is worth mentioning that a false declaration can also happen by pure chance in a small portion of A/A experiments (as determined by the required confidence level).
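The "pure chance" case can be illustrated with a small simulation. This is a sketch under stated assumptions: a fixed-sample two-proportion z-test stands in for the statistical engine, and the conversion rate, sample sizes, and function names are made up for the example.

```python
import math
import random

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both groups
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_winner_rate(n_experiments=2000, n_per_variation=500,
                      rate=0.10, alpha=0.05, seed=7):
    """Share of simulated A/A tests in which a 'winner' is falsely declared."""
    rng = random.Random(seed)
    false_winners = 0
    for _ in range(n_experiments):
        # Both variations draw from the SAME true conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(n_per_variation))
        conv_b = sum(rng.random() < rate for _ in range(n_per_variation))
        p_value = two_proportion_p_value(conv_a, n_per_variation,
                                         conv_b, n_per_variation)
        if p_value < alpha:
            false_winners += 1
    return false_winners / n_experiments

print(f"false winner rate: {false_winner_rate():.3f}")
```

With a 95% confidence level (`alpha=0.05`), the printed rate should land in the neighborhood of 5%, even though nothing is wrong with the setup.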
What do I do if I have a winner in an A/A test?
Check the experiment setup and work out how it may have broken the statistical assumptions of the model. Alternatively, check the reliability of the statistical engine used to conduct your test.
HOW DO I TEST THE RELIABILITY OF A STATISTICAL ENGINE?
The following technique is good for testing the reliability of any tool that produces probabilities of future events. As an example, think of a weatherman who tells you every day the probability of rain tomorrow.
Let’s say today he says there is an 11% chance of rain tomorrow. How can we evaluate whether his prediction is any good? Clearly, whether it rains or not, we cannot declare that our weatherman was right or wrong. It is impossible to do so from one trial. However, as many trials accumulate, we can evaluate the quality of the predictions. We do that by bucketing all the days on which he declared the probability to be between 10% and 12%. Over many trials, the share of those days on which it actually rained should approach 11%.
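The bucketing procedure can be sketched as follows. The forecaster here is simulated and perfectly calibrated by construction, so the data, the 2%-wide buckets, and the function name `calibration_table` are illustrative assumptions only.

```python
import random

def calibration_table(predictions, outcomes, n_buckets=50):
    """Bucket predicted probabilities and compare them with what happened.

    Returns {bucket_index: (count, mean_predicted, observed_rate)}.
    With n_buckets=50 each bucket is 2 percentage points wide, so
    bucket 5 holds predictions from 10% up to (not including) 12%.
    """
    sums = {}
    for p, happened in zip(predictions, outcomes):
        index = min(int(p * n_buckets), n_buckets - 1)
        count, p_total, hits = sums.get(index, (0, 0.0, 0))
        sums[index] = (count + 1, p_total + p, hits + int(happened))
    return {
        index: (count, p_total / count, hits / count)
        for index, (count, p_total, hits) in sorted(sums.items())
    }

# Simulated weatherman: each day he announces a probability p,
# and it then rains with exactly that probability.
rng = random.Random(0)
forecasts = [rng.random() for _ in range(200_000)]
rained = [rng.random() < p for p in forecasts]

table = calibration_table(forecasts, rained)
count, mean_forecast, observed = table[5]
print(f"10-12% bucket: {count} days, mean forecast {mean_forecast:.3f}, "
      f"rain observed on {observed:.3f} of them")
```

For a reliable forecaster, the mean forecast and the observed rate in each bucket stay close. The same comparison can be applied to an engine's probabilities, provided enough experiments with known outcomes are available.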
WHEN LOOKING AT THE RESULTS
Since AB tests are a random statistical process, pure randomness has its effects. For example, if the engine declares a winner when Probability to Be Best reaches 95%, then in about 5% of the cases where a variation reaches that threshold and the test is left running, the lead will eventually pass to another variation (that is, the declaration turns out to be wrong).
Secondly, a statistical engine works under certain assumptions. Think of them as a disclaimer that comes with the engine. If the assumptions are broken, the numbers and declarations produced by the engine become less reliable, and it is not clear to what extent.
TYPES OF A/A TESTS
Hypothesis testing vs Bayesian
In hypothesis-testing-based AB test tools, one needs to pre-determine a sample size and wait until each variation has enough samples. If everything goes well, once the sample size is reached, the user can see that the measured difference in KPI is not statistically significant, and the test can be stopped.
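As a rough illustration of what pre-determining a sample size involves, the sketch below uses the common normal-approximation formula for comparing two conversion rates. The baseline rate, the minimum rate worth detecting, and the 80% power setting are assumptions chosen for the example; a given tool may use a different formula.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, min_detectable_rate,
                              alpha=0.05, power=0.80):
    """Approximate visitors needed per variation for a two-sided test
    comparing two conversion rates (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + min_detectable_rate * (1 - min_detectable_rate))
    effect = min_detectable_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Hypothetical plan: detect a move from a 10% to a 12% conversion rate.
print(sample_size_per_variation(0.10, 0.12))
```

The smaller the difference one wants to be able to detect, the larger the required sample. In an A/A test the true difference is zero, so the planned sample size only fixes when the test may be read, not when a difference will appear.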
In such an extreme scenario as an A/A test, two forces are at play as data accumulates:
1. The randomness is averaged out, and the main-metric values of the two variations should get closer to each other.
2. The engine becomes more and more sensitive to the tiny difference in KPIs, like an ever-strengthening magnifying glass.
Pure chance or a tiny breakage of assumptions can cause force 2 to be stronger than force 1, in which case a winner will be declared. This does not mean that the engine is unreliable.
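The magnifying-glass effect (force 2) can be seen in a minimal sketch of a “Probability to Be Best” calculation. It assumes a Beta-Binomial model with uniform Beta(1, 1) priors and Monte Carlo sampling, which is one common way to compute such a number and not necessarily what any given engine does. The conversion counts are hypothetical.

```python
import random

def probability_to_be_best(conv_a, n_a, conv_b, n_b, draws=20_000, seed=1):
    """Monte Carlo estimate of P(rate of B > rate of A).

    Assumes independent Beta(1, 1) priors on both conversion rates,
    so each posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    b_wins = 0
    for _ in range(draws):
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if sample_b > sample_a:
            b_wins += 1
    return b_wins / draws

# The same observed gap (10.0% vs 10.3%) at two very different sample sizes.
small = probability_to_be_best(100, 1_000, 103, 1_000)
large = probability_to_be_best(10_000, 100_000, 10_300, 100_000)
print(f"1,000 users per variation:   P2BB(B) = {small:.3f}")
print(f"100,000 users per variation: P2BB(B) = {large:.3f}")
```

With little data, the gap barely moves the probability away from 50%. With a hundred times more data, the same gap yields a probability close to 1. In a healthy A/A test the observed gap normally shrinks as data accumulates (force 1), which is what keeps this sensitivity in check.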