A primer on A/B testing and experimentation
A comprehensive guide to A/B testing, explaining the differences between A/B and multivariate testing, how to conduct tests in a structured and progressive way, and the thought process behind choosing the right experiment.
A/B testing is a method of comparing two versions of a webpage or app against each other to determine which one performs better against a specific objective. It is one of the most widely used techniques for maximizing the performance of digital assets such as websites, mobile applications, SaaS products, emails, and more.
Controlled experiments provide marketers, product managers, and engineers with the agility to iterate fast and at scale, leading to data-driven, thoroughly informed decisions about their creative ideas. With A/B tests, you can stop wondering why some things are not working, because the proof is in the pudding. It’s the perfect method to improve conversion rate, increase revenue, grow your subscriber base, and improve your customer acquisition and lead generation results.
Some of the most innovative companies, like Google, Amazon, Netflix, and Facebook, developed lean business approaches, allowing them to run thousands of experiments each year.
As Jeff Bezos once said: “Our success at Amazon is a function of how many experiments we do per year, per month, per week, per day.”
Netflix wrote in one of their technology blogs back in April 2016: “By following an empirical approach, we ensure that product changes are not driven by the most opinionated and vocal Netflix employees, but instead by actual data, allowing our members themselves to guide us toward the experiences they love.”
And Mark Zuckerberg once explained that one of the things he is most proud of, and that is really key to their success, is their testing framework: “At any given point in time, there isn’t just one version of Facebook running. There are probably 10,000.”
What is an A/B Test?
In a classic A/B testing procedure, we decide what we would like to test and what our objective is. Then, we create one or more variations of our original web element (the original is known as the control, or the baseline). Next, we split the website traffic randomly between the control and the variations (i.e., we randomly allocate visitors according to some probability), and finally, we collect data regarding our web page performance (metrics). After some time, we look at the data, pick the variation that performed best, and retire the ones that performed poorly.
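The random allocation step can be sketched in a few lines of Python. This is a minimal illustration, not how any particular testing platform works: the function name `assign_variation`, the experiment name, and the visitor IDs are all hypothetical. It hashes the visitor ID so that a returning visitor keeps seeing the same variation, which is one common way to make a random split stable.

```python
import hashlib


def assign_variation(visitor_id: str, experiment: str, weights: dict[str, float]) -> str:
    """Map a visitor to a variation according to traffic weights (hypothetical helper)."""
    # Hash the experiment name together with the visitor ID, so the same visitor
    # always lands in the same variation of a given experiment.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    # Turn the first 8 hex characters into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return name
    return name  # fallback in case the weights sum to slightly less than 1


# Assumed setup: a 50/50 split between the control and one variation.
weights = {"control": 0.5, "variation_b": 0.5}
counts = {"control": 0, "variation_b": 0}
for i in range(10_000):
    counts[assign_variation(f"visitor-{i}", "add-to-cart-badge", weights)] += 1

print(counts)  # the two counts should be close to 5,000 each
```

Adding a third entry to `weights` (with the weights still summing to 1) would split the traffic three ways instead.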
If not done correctly, tests can fail to produce meaningful, valuable results and can even mislead. Generally speaking, running controlled experiments can help organizations with:
- Solving UX issues and common visitor pain points
- Improving performance from existing traffic (higher conversions and revenue, lower customer acquisition costs)
- Increasing overall engagement (reducing bounce rate, improving click-through rate, and more)
We must keep in mind that the moment we pick a variation, we are generalizing the measures we collected up to that point to the entire population of potential visitors. This is a significant leap of faith, and it must be done in a valid way. Otherwise, we are eventually bound to make a bad decision that will harm the web page in the long run. The process of gaining validity is called hypothesis testing, and the validity we seek is called statistical significance.
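To make the idea of statistical significance concrete, here is a minimal sketch of one common hypothesis test for conversion rates, the two-proportion z-test. It is one possible choice rather than the method any specific platform uses, and the visitor and conversion numbers are invented for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variations convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up data: control converts 200 of 5,000 visitors, the variation 260 of 5,000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The difference is statistically significant at the 5% level.")
else:
    print("The difference could plausibly be due to chance.")
```

A small p-value means a difference this large would rarely appear if the two variations truly performed the same, which is what justifies generalizing from the sample to all future visitors.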
Some examples of A/B tests:
- Testing different sorting orders of the site’s navigation menu (Like in this example from a large electronics retailer in Germany)
- Testing and optimizing landing pages (Like in this example from a leading European airline passenger protection company)
- Testing promotional messages, like newsletter subscription overlays and banners (Like in this example from an international boutique retailer of natural bath products)
How an A/B test is born: Constructing a hypothesis
An A/B test starts by identifying a problem that you wish to resolve, or a user behavior you want to encourage or influence. Once it is identified, the marketer would typically formulate a hypothesis: an educated guess that the experiment’s results will either validate or invalidate.
For example, once the problem is identified (say, a low add-to-cart rate) and a hypothesis is worked out (adding a social proof badge will encourage more website visitors to add items to their carts), you are ready to test it on your site.
The classic approach to A/B testing
In a simple A/B test, traffic is split between two variations of content. One is considered the control and contains the original content and design. The other is a modified version of the control. The variation may differ from the control in many aspects. For example, we could test a variation with different headline text, call-to-action buttons, a new layout or design, and so on.
In a classic page-level experiment, you don’t necessarily need two different URLs to run a proper test. Most A/B testing solutions will let you create variations dynamically by modifying the content, layout, or design of the page.
However, if you have two (or more) sets of pages that you’re looking to include in a controlled test, you should probably consider using a split URL test.
When to use split URL tests
Split URL testing, sometimes referred to as “multi-page” or “multi-URL” testing, is similar to a standard A/B test, except that each variation is served from its own separate URL.
With this method, you can conduct tests between two existing URLs, which is especially useful when serving dynamic content. Run a split URL test when you already have two existing pages and want to test which one of them performs better.
For example, if you’re running a campaign and you have two different versions for potential landing pages, you can run a split URL test to examine which one will perform better for that particular campaign.
An A/B test is not limited to just two variations
If you want to test more than just two variations, you can run an A/B/n test. A/B/n tests allow you to measure the performance of three or more variations instead of testing only one variation against a control page. High-traffic sites can use this testing method to evaluate the performance of a much broader set of changes and maximize test time with faster results.
However, although A/B/n testing is useful for anything from minor to dramatic changes, I recommend not making too many changes between the control and each variation. Make just a few critical and prominent changes, so that you can understand the likely causes behind the results of the experiment. If you are looking to test changes to multiple elements on a web page, consider running a multivariate test.
What are Multivariate tests?
Multivariate tests, sometimes referred to as “multi-variant” tests, allow you to test changes to multiple sections on a single page. As an example, suppose you run a multivariate test on one of your landing pages and make two changes to it. In the first version, you add a contact form instead of the main image. In the second version, you add a video item. The system will then generate one more possible combination based on your changes, which includes both the video and the contact form:
Total test versions: 2 x 2 = 4
V1 – Control variation (no contact form and no video item)
V2 – Contact form version
V3 – Video item version
V4 – Contact form + video item version
Since multivariate tests generate all possible combinations of your changes, it is not recommended to create a large number of variations unless you’re running the test on a high-traffic site. Running multivariate tests on low-traffic sites spreads visitors too thinly across the combinations, which leads to poor results and insufficient data to draw any significant conclusions. Be sure to have at least a few thousand monthly visitors to your site before choosing to run a multivariate test.
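The way combinations multiply, and how quickly they dilute traffic, can be sketched with Python's `itertools.product`. The section names and the monthly visitor figure below are assumptions made up for this illustration.

```python
from itertools import product

# Each tested section and its possible states (False = original, True = changed).
sections = {
    "video_item": [False, True],
    "contact_form": [False, True],
}

# Every combination of section states becomes one test version.
combinations = [dict(zip(sections, values)) for values in product(*sections.values())]

for number, combo in enumerate(combinations, start=1):
    changed = [name for name, enabled in combo.items() if enabled]
    print(f"V{number}:", " + ".join(changed) if changed else "control")

# Hypothetical traffic figure: an even split leaves each version a fraction of it.
monthly_visitors = 4000
visitors_per_version = monthly_visitors // len(combinations)
print(f"{len(combinations)} versions, about {visitors_per_version} visitors each per month")
```

Adding a third section with two states would double the number of versions to eight and halve the visitors available to each, which is why multivariate tests call for high-traffic pages.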
Example of a multivariate test on an eCommerce product-listing page
When to use each test type
A/B tests will help you answer questions such as: which of the two versions of my page performs better in terms of the visitor’s response to it?
Multivariate tests will answer questions like:
- Do visitors respond better to a video item next to a contact form?
- Or to a webpage with just a contact form and no video item?
- Or to a webpage with a video item but no contact form?
How to measure the effectiveness of the A/B testing platform
One method of determining the effectiveness of an A/B testing platform is to perform an A/A test. This means that you create two or more identical variations and run an A/B test to see how the platform handles them. In a successful A/A test, all variations yield very similar results. You can read further about A/A tests here.
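"Very similar" does not mean identical: even with a perfectly healthy setup, random noise makes some A/A tests look significant. The simulation below is a rough sketch of that baseline, assuming a two-proportion z-test, a 5% significance level, and an invented 10% conversion rate. Under those assumptions, roughly 5% of A/A tests are expected to come out "significant" by chance alone.

```python
import math
import random


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value of a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no conversions (or all conversions) in both arms: no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(runs: int, visitors_per_arm: int, true_rate: float,
                           alpha: float = 0.05, seed: int = 0) -> float:
    """Simulate many A/A tests and return the share flagged as significant."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(runs):
        # Both arms are identical, so they share the same true conversion rate.
        conv_a = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        if p_value(conv_a, visitors_per_arm, conv_b, visitors_per_arm) < alpha:
            flagged += 1
    return flagged / runs


rate = aa_false_positive_rate(runs=400, visitors_per_arm=1000, true_rate=0.10)
print(f"Share of A/A tests flagged as significant: {rate:.1%}")
```

If a real platform flags identical variations as different far more often than the chosen significance level, that is a hint that something in the traffic split or the tracking deserves a closer look.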
The road to A/B test success
“I didn’t fail the test, I just found 100 ways to do it wrong.” / Benjamin Franklin
When running an A/B test, using a valid methodology is crucial if we are to rely on the test results to produce better performance long after the test is over. In other words, we try to understand whether the differences we observe in visitor behavior are caused by the tested changes or are due to random chance. A/B testing provides a framework that allows us to measure the difference in visitor response between variations and, if a difference is detected, to establish its statistical significance and, to some extent, causation.