Guidelines for running effective A/B tests


The internet today is one of the primary sources for bringing in customers and generating revenue. And often we have discussions about what might be the best ways to present ourselves, to present our companies. The A/B test lets you gain not just revenue and action from your website; it lets you gain information, by making comparisons among the different alternatives that you might be considering. You can present different options, the A and the B, to your customers, the visitors to your website, and see how they respond. Seeing their responses lets you make a comparison based not on what you thought might be the best, but on what you see works with your customers.

So the straightforward one is comparing two options, A and B. The more complex ones can compare combinations. So you could have A, which is a combination of two or three things, and B, which is a different combination, and C, which is still another combination. And if you apply a methodology based on the statistical approach called design of experiments, you can learn about the effects of several factors in one test.
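To make the idea of testing combinations concrete, here is a minimal Python sketch of a full factorial layout, one of the simplest design-of-experiments arrangements. The factor names and levels (`headline`, `button_color`, `hero_image`) are hypothetical and chosen only for illustration; the speakers do not name a specific design.

```python
from itertools import product

# Hypothetical factors and levels, chosen only for illustration.
factors = {
    "headline": ["short", "long"],
    "button_color": ["green", "orange"],
    "hero_image": ["photo", "illustration"],
}

def full_factorial(factors):
    """Return every combination of levels, one dict per page variant."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]

variants = full_factorial(factors)
for variant in variants:
    print(variant)
```

Three two-level factors give eight variants, and each level of each factor appears in half of them. That balance is what lets a single test say something about every factor.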

So when you’re thinking about setting up an A/B test, invariably you have some alternatives that you wanna compare, and the tests usually follow a methodology that’s been developed over many years in the statistical literature. What you’d like is to make sure that the people who see option A are, in some sense, as similar as possible to those who see option B. And it’s become well established in the scientific literature, going back to the work of the statistician Ronald Fisher about 100 years ago, that the best way to do that is to make sure that you randomly allocate who gets which of the options that you want to compare.

So if you’re gonna start an experiment, you have to think about what options are on the table. What do you want to compare? You have to have an engine at your disposal, a framework that’s gonna let you make those random allocations. So if someone comes into your website, they’re not just gonna get a landing page; you have to decide what landing page they’re gonna get. You control that, and you want to control that by allocating it at random between the groups. That’s what’s gonna guarantee a fair comparison.
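As a rough illustration of such an allocation engine, the Python sketch below assigns each visitor to a landing page. It is only one possible approach, and it rests on assumptions that are not in the talk: a hash of a visitor ID stands in for a random draw so that a returning visitor keeps seeing the same page, and the names `assign_variant`, `salt`, and the `visitor-N` IDs are hypothetical.

```python
import hashlib

def assign_variant(visitor_id, variants=("A", "B"), salt="landing-page-test"):
    """Map a visitor to a variant, pseudo-randomly but repeatably."""
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode("utf-8")).hexdigest()
    # Interpret the hash as an integer and reduce it to a variant index.
    return variants[int(digest, 16) % len(variants)]

# Check that the split across many hypothetical visitors is close to even.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}")] += 1
print(counts)
```

The key property is that nothing about the visitor (age, device, time of day) influences which page they get. A different salt per experiment keeps allocations in separate tests independent of each other.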

Now, in order to validate the data that you’re capturing, you must ensure, for example, that this random allocation actually occurs. Because if you give the young people version A and the older people version B, and you see a difference, you will never know if it’s due to the age or to the web page design. That’s called confounding. So the random allocation, in a sense, is what lets you establish causality, what is really impacting user behavior.
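The confounding problem can be seen in a small simulation. The Python sketch below is illustrative only: the conversion rates (10% for young visitors, 5% for older ones) are invented, and the page version is assumed to have no effect at all, so any gap between A and B is an artifact of how visitors were allocated.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed rates: age affects conversion, the page version does not.
RATE = {"young": 0.10, "older": 0.05}

def conversion_gap(allocate, n_visitors=20_000):
    """Return rate(A) - rate(B) under the given allocation rule."""
    shown = {"A": 0, "B": 0}
    converted = {"A": 0, "B": 0}
    for _ in range(n_visitors):
        age = rng.choice(["young", "older"])
        version = allocate(age)
        shown[version] += 1
        converted[version] += rng.random() < RATE[age]
    return converted["A"] / shown["A"] - converted["B"] / shown["B"]

# Allocation by age: version A looks better, but only because of who saw it.
by_age = conversion_gap(lambda age: "A" if age == "young" else "B")
# Random allocation: the spurious gap largely disappears.
at_random = conversion_gap(lambda age: rng.choice(["A", "B"]))
print(f"gap when allocated by age: {by_age:.3f}")
print(f"gap when allocated at random: {at_random:.3f}")
```

With allocation by age the gap sits near the difference between the two age groups, even though the pages are equivalent. With random allocation both groups contain the same mix of ages, so the gap stays close to zero.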

There are a number of ways that you can go about analyzing the data that you get from an A/B test. Classical statistics takes the view that there is some true value for the KPI you’re measuring. For example, if you’re comparing two options, it’s common to start with a framework that says both options are identical unless we get evidence that shows that they’re different. Bayesian analyses take a somewhat different viewpoint. Rather than saying there’s some value and we just don’t know what it is, they describe that value by putting a probability distribution on it. As you get more and more data, the probability distribution gets tighter and tighter, focusing on what we’ve now learned, so that we have better information. It often gives a much more intuitive way, in particular in terms of business decisions, to characterize what we know, rather than saying, “Well, it could be between this and this, and either it is, or it isn’t.” That’s sort of the classical way. The Bayesian approach gives a more gradated answer by saying, “Well, this is the most likely,” and then there’s a distribution that describes how unlikely values become as we move away from what we think is the most likely value.
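For the classical framing, where both options are treated as identical unless the data says otherwise, one common choice is a two-proportion z-test. The Python sketch below is an assumption about how that framing might be applied to conversion counts; the talk does not name a specific test, and the visitor and conversion numbers are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under H0 the two groups share one rate, so pool them to estimate it.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up data: 200 of 10,000 convert on A, 250 of 10,000 convert on B.
z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value is read as evidence against the starting assumption that A and B are identical. It is not the probability that B is better, which is the kind of statement the Bayesian view aims for.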

Bayesian statistics give you, I think, a much more intuitive and natural way to express what you understand at the end of an experiment, in terms of, say, what’s the probability that A is better, what’s the probability that B is better. You start with a prior, a distribution that describes what you think in advance. Then you bring in the data, and Bayesian inference provides standard, strictly mathematical rules for using that data to update your prior to get what’s called a posterior distribution.
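A minimal Python sketch of this update for conversion rates follows. It assumes a Beta prior on each rate, which combines with conversion counts to give a Beta posterior, and it estimates the probability that B is better by sampling from the two posteriors. The flat `prior=(1, 1)`, the function name `prob_b_beats_a`, and the counts are illustrative assumptions, not the speakers' stated method.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1), draws=20_000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors by Monte Carlo."""
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(prior_a + conversions, prior_b + non-conversions).
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Made-up data: 200 of 10,000 convert on A, 250 of 10,000 convert on B.
p_b_better = prob_b_beats_a(200, 10_000, 250, 10_000)
print(f"P(B is better than A) is roughly {p_b_better:.3f}")
```

The result reads directly as "the probability that B is better", which is the form of answer described above.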

The posterior is your current view. This is what you think after having seen the data; it describes your uncertainty about the KPIs after having seen the data. You combine the two, the prior and the data, in order to get your final statement about what it is you think. Can you really state a prior distribution? Is there a defensible way to state what you think in advance? Again, this has certainly been a source of friction between people in the different statistical camps. In the case of most A/B tests, you’re getting lots and lots of data, and the data is going to wash out anything that you might’ve thought in the prior. So for typical A/B tests, because you have very large data volumes, it is no longer, in my mind, very controversial: whatever you put in to get the Bayesian engine going is essentially gonna be washed out by the data. And the data is really going to be the component that dominates what you get in the end.
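The washing-out effect is easy to check with arithmetic. The Python sketch below compares posterior means under two quite different Beta priors, first with little data and then with a lot. Both priors and all the counts are assumptions chosen for illustration.

```python
def posterior_mean(prior_a, prior_b, conversions, visitors):
    """Mean of the Beta posterior after observing the data."""
    return (prior_a + conversions) / (prior_a + prior_b + visitors)

flat = (1, 1)          # assumed "no strong opinion" prior
optimistic = (20, 80)  # assumed prior belief in a rate of about 20%

# The same observed 3% conversion rate, at two very different data volumes.
small = [posterior_mean(*p, conversions=3, visitors=100) for p in (flat, optimistic)]
large = [posterior_mean(*p, conversions=3_000, visitors=100_000) for p in (flat, optimistic)]

gap_small = abs(small[0] - small[1])
gap_large = abs(large[0] - large[1])
print(f"100 visitors: the priors disagree by {gap_small:.4f}")
print(f"100,000 visitors: the priors disagree by {gap_large:.4f}")
```

With 100 visitors the choice of prior moves the answer by several percentage points. With 100,000 visitors the two posteriors are nearly indistinguishable, which is the sense in which the data dominates.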

So sample size is an important question in any study that’s going to be run, whenever you go out to get data. How long do I have to run my test? How many visitors to the website do I have to see before I can draw conclusions? In order to get a test that gives a complete, honest, and representative picture of all the people who are, long term, gonna be exposed to whatever decisions you make, you wanna make sure that you include everyone within a full time cycle, such as a week. You don’t wanna stop the test after having seen only visitors on Tuesday and Wednesday, because it may turn out that the decision you made is really catastrophic for those who come in on the weekends. So you always want to make sure that you pick up any potential time cycles like this. And in terms of the Bayesian descriptions that we were talking about earlier, I think this really gets to the heart of it: in the end, you’re gonna get these posterior distributions that describe your KPIs, and how narrow do you want them to be? The tighter the interval, the more you know.
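To connect interval width to sample size, here is a rough Python sketch. It uses the normal approximation for a conversion rate as a stand-in for the posterior interval, which is a simplifying assumption that is reasonable only for large samples and a weak prior. The baseline rate, the half-widths, and the function name `visitors_needed` are all hypothetical.

```python
from math import ceil
from statistics import NormalDist

def visitors_needed(baseline_rate, half_width, level=0.95):
    """Visitors per variant so the interval is about +/- half_width wide."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return ceil(z**2 * baseline_rate * (1 - baseline_rate) / half_width**2)

# Assumed 5% baseline conversion rate, two target precisions.
n_wide = visitors_needed(0.05, 0.005)     # +/- 0.5 percentage points
n_tight = visitors_needed(0.05, 0.0025)   # +/- 0.25 percentage points
print(n_wide, n_tight)
```

Halving the interval width takes roughly four times as many visitors, so it is worth deciding in advance how narrow the interval needs to be for the decision at hand.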

You can have a tight interval, but if the people who looked at the website and were part of the randomization are only a subgroup of your visitors, then whatever you get will not apply to the general group. So you really should work on both fronts. One is, how much information do you have? The tighter the interval, the more informative it is. But also, how generalizable is the information that you provide?

Both of these aspects are important, and either of them can be the real driver in dictating how large an experiment has to be. A website that has very heavy volume may quickly get to the needed sample size without having seen the full, representative picture of all the visitors. And of course, it can happen just the other way: by the time you get enough visitors into the site to reach the sample size you need, you’ve already gone through several of those time cycles. And so you have to see how both of these work out and balance against one another.
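The balance between the two drivers can be sketched as a simple rule: run long enough to reach the sample size, then round up to whole time cycles. The Python below is one hedged way to express that; the weekly `cycle_days=7` default, the traffic figures, and the function name `run_length_days` are assumptions for illustration.

```python
from math import ceil

def run_length_days(required_visitors, daily_visitors, cycle_days=7):
    """Days to run: enough for the sample size, rounded up to full cycles."""
    days_for_sample = ceil(required_visitors / daily_visitors)
    # Always cover at least one complete cycle, even with heavy traffic.
    full_cycles = max(1, ceil(days_for_sample / cycle_days))
    return full_cycles * cycle_days

heavy_traffic = run_length_days(20_000, daily_visitors=50_000)
light_traffic = run_length_days(20_000, daily_visitors=1_000)
print(heavy_traffic, light_traffic)
```

In this example the heavy-traffic site reaches its sample size within a day but still runs for 7 days to cover a full week, while the light-traffic site needs 20 days of data and so runs for 21 days, three full weeks.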

With careers marked by extensive research, academic, and business experience, Professors Ron Kenett and David Steinberg of KPA Group know how powerful testing and experimentation can be for digitally-savvy brands. During a candid conversation in our Tel Aviv office, they talk through the basics of A/B testing, what methodologies to employ, how to effectively set up a new experiment, as well as how to analyze and validate test results.

Read their full article on Guidelines for Effective A/B Testing.
