At the dawn of experimentation, statisticians provided a very basic framework for statistical inference in A/B testing scenarios. Still commonly used today, The Frequentist Approach, under which Hypothesis Testing was developed, allowed the industry to further investigate theories of behavior and determine whether there is enough statistical evidence to support a specific theory.
Later, differences in how to most effectively interpret probability would give rise to a new, Bayesian Approach – a method many say is less restrictive, highly intuitive, and more reliable (a sentiment also shared by us at Dynamic Yield). For the purpose of this post, I will dive into the numbers behind the Frequentists school of thought, performing the appropriate derivations where necessary.
Maximum Likelihood Estimation:
At a very high level, when performing a traditional A/B test, one starts with a hypothesis – a predictive statement about the impact of a change. Following this, a sample size is calculated to achieve Statistical Significance (95%), the traffic is split among the variations, results measured, and a conclusion reached, either validating or failing to validate the original hypothesis.
Reaching a hypothesis is usually done after an internal brainstorming session, or discovered through focus group research. I will talk about calculating sample size later. The third step in the process, measuring the results, is where things start to get complicated.
If you had to arrive at a clickthrough rate for each variation, can you show each variation to a single user, calculate the clickthrough rate, and conclude which variation is better? Obviously not!
We know intuitively that one click out of 10 impressions is not as reliable as ten clicks out of 100, even though both scenarios have the same clickthrough rates. So, extending this logic, in an ideal scenario, one has to run the test for an infinite amount of time, reach the true conversion rates of both variations, and conclude the test with 100% confidence.
Since this is not an option, we work with Likelihood functions, answering the question: which clickthrough rate of a variation is closest to the ‘true’ clickthrough rate?
Let’s take a quick dive into the mathematics of it:
When you initially start collecting data from your variations, it’s normal to see numbers falling all over the place. So, how can one find a single value that represents the conversion rate of the variation? By finding out the sample mean estimate.
The data collected during the experiment can usually be characterized by an underlying distribution. For example, Gaussian / normal distribution, if you are measuring the height of students in a class, or Bernoulli, if you are mapping a binary variable like a coin toss.
In our case, the clickthrough rate most likely follows a Gaussian distribution. Gaussian distribution is defined by two parameters: mean (μ) and variance ( ² ).
The wider the distribution is, the more far off the sample data is from each other. As you can see, there can be many Gaussian distributions with different mean values that can possibly represent the underlying data.
But there is only one Gaussian distribution which maximizes the probability of seeing the observations – the Maximum Likelihood. The mean value at which this ‘maximum likelihood estimate’ occurs is the ‘sample mean estimate’ we are after. This sample mean estimate value will be the closest to the ‘ideal conversion rate’ given the data collected for a variation.
Deriving the Sample Mean Estimate:
Let’s go back to school and try to derive the sample mean estimate. If you are not interested in the derivation, you can scroll to the bottom of this section for the concluding formula.
I am working with a Gaussian distribution first, here. Later, we will do the same for a Bernoulli.
The Probability Density Function of a Gaussian is:
If you have a Gaussian distribution with a mean and variance and want to calculate the probability that a particular value ‘X’ is going to occur, then you just fill the equation with the value of X and get the probability of that value.
Extending this logic, if one wants to find the joint probability of all the samples collected in an experiment, then because the samples are independent and identically distributed (IID), we can just multiply the probabilities of individual samples as follows:
This likelihood function answers the main question: what’s the probability of seeing these data points given the mean value μ?
Maximum Likelihood maximizes the value of the likelihood function, answering the most critical question: what value of μ maximizes the probability of the data in the samples I am seeing?
In calculus, to maximize a function with respect to a variable, we find the derivative:
dp/dμ = 0, solving for μ
Since the Gaussian is an exponential function, the derivative would, again, be an exponential. To avoid this, we take Log on both sides, reducing the exponential to linear values (the Log is a monotonically increasing function, just like the exponential):
Now, finding the derivative of the loglikelihood with respect to μ, we get:
Therefore, the sample mean estimate is:
The same logic can be applied for a Bernoulli.
The likelihood function of a Bernoulli distribution is:
Following a similar process, taking Log on both sides and performing a derivative w.r.t p, we get the sample mean estimate of a Bernoulli as:
Where x_{i} represents the number of times the events occurred / did not occur. So, the sample mean estimate is similar for both Gaussian and Bernoulli distribution, and both have a very glaring problem.
Problems with Sample Mean Estimate:
As seen before, the sample mean estimate boils down to:
Let’s say you collect two clickthrough rates. For example, 10.5 % & 9.5 %, and apply the formula to calculate sample mean estimate:
(10.5 + 9.5) / 2
At 10%, does it answer how confident you can be in your estimate? Does it quantify the uncertainty? Does it tell you how close or far you are to understanding the ‘true’ clickthrough rate?
Clearly not, which is why we work with Confidence Intervals and PValues – so that we can quantify the uncertainty around the data collected by incorporating the number of samples into the equation.
TestStatistics, Confidence Intervals:
Having calculated the sample mean estimate ( ∧μ ), we can conclude with 95% confidence that the ‘true’ clickthrough rate would lie between:
Where t_{α} = the teststatistic at a 95% confidence level and N1 degrees of freedom (this is a readily available precalculated number).
δ = the sample standard deviation
N = the sample size
Calculating the test statistic:
Now that we have established confidence intervals around our observed data in a test, how do we extend that to our Hypothesis Testing and conclude if our hypothesis is true?
To do that:

We calculate the test statistic as under:

Map the teststatistic to a pvalue

Reject the hypothesis if the value is above the threshold P_{critical} value
Let’s break down each step to understand better.
Step 1: The higher the difference between the sample mean estimates of the variations, the higher the teststatistic. Intuitively, the higher the difference, the clearer the winner is, so a rule of thumb is that the Absolute Value of the teststatistic should always be higher.
Step 2: The teststatistic is a point value on a normal distribution, which shows how far off (in no. of standard deviations) from the mean the teststatistic is. To convert the point estimate into a probability, we need the area under the curve, or mathematically speaking, the cumulative density function. We can do the integral of the tdistribution or the Gaussian distribution and find out how much area under the curve is covered by the teststatistic – this value is called the pvalue. Thankfully these numbers are precalculated and publicly available, so we do not have to go over the nittygritty details
Step 3: Once pvalue is known, interpreting it is pretty straight forward. If the calculated pvalue is less than the P_{critical} value, you set is as the threshold for rejecting the null hypothesis, and if it is more than the P_{critical} value, then you fail to reject the null hypothesis. This does not mean you accept that the null hypothesis is true. More on this later.
pvalue:
A very convoluted metric that can lead to a lot of confusion, the official definition of pvalue is as follows:
“The probability that, when the null hypothesis is true, the statistical summary (such as the sample mean difference between two groups) would be equal to, or more extreme than, the actual observed results.”
To explain, let’s say you start with a hypothesis that both women and men are of the same height. You collect some data, calculate sample mean heights and find 1.9 meters for men and 1.7 meters for women, with a pvalue of 0.03. This pvalue tells you IF the null hypothesis is true, what the probability is you would see the difference (or greater) as measured in your samples.
In other words, IF, in fact, women and men were of the same height, then the probability of us seeing a difference of 0.2 in the mean values between them is 3%. We’d, therefore, reject the hypothesis, as this is extremely rare, and say the alternate is true. The value at which you say the probability is rare enough is the critical value of p. This is usually 0.05 or 0.01, so we are saying you can live with making an incorrect conclusion 5% of the time.
Now, time to talk about errors.
Dealing With Errors in A/B testing:
Attempting to logically interpret an A/B test situation, there are two types of errors that are theoretically possible:

Falsely reject a true null hypothesis (type I or alpha) or

Fail to reject a false null hypothesis (type II or beta)
The table below gives a nice overview:
Decision > Reality  Reject H_{o}  Fail to Reject H_{o} 
H_{o} is True  False Positive (Type I)  True Negative 
H_{1} is True  True Positive  False Negative (Type II) 
It’s easier to think of errors in terms of False Positives and False Negatives. A false positive is when you’ve identified a nonexisting relationship.
How does such a thing happen?
Let’s say you start an A/B test with a pvalue of 0.01. A typical A/B test has a progression of pvalue like below:
The blue line shows pvalue and the green, significance level. As you can imagine, concluding the test before the sample size is large enough would be the wrong decision, resulting in one trying to find evidence when there is none. This is the reason why it’s advised not to ‘peek’ at a running A/B test, for it may lead to incorrect decisionmaking based on inappropriate data.
Incremental Significance:
Achieving significance is not ‘incremental’ by nature in traditional A/B testing. You do not see the results gradually improving over time; you calculate a sample size, refrain from peeking until it’s reached, and then try to draw a conclusion once the estimated sample threshold has been hit.
Failure to do so could be extremely problematic. Whether you are running a flash sale with limited time for iterations or you’re a medical company trying to solve a lifethreatening disease – you would not be able to stop your traditional A/B test or treatment halfway through and start curing people. You’d have to wait until the test was properly completed out of fear of producing a false positive.
Possibility To ExploreExploit:
When running a traditional A/B test where one variation is clearly outperforming another, the losing variation is still shown to a large audience until the statistical significance is reached. While allowing for confident winnerdeclaration, the time spent waiting can result in a loss of potential business.
In an ideal world, you should be able to drop traffic to the losing variation and maximize returns with the winning variation. To provide the losing variation an opportunity to take the lead, we assign a bare minimum percentage of the traffic to it (called Epsilon) and exploit the winning variation. Thus the term ‘Explore vs. Exploit’ (also called ‘epsilongreedy’ in some circles) has come to be, which addresses how much time is “wasted” on learning as opposed to taking advantage of what is already learned. This feature is nearly impossible in a traditional A/B test, unless you are hellbent on producing false positives.
Frequentist Wrap Up:
Recapping everything that has been laid out so far:

Traditional A/B testing can quickly get out of hand with observations including a high standard deviation due to imbalances in the scales tilting the sample mean estimate.

A marginally high standard deviation in the sample population would require a larger number of samples to reach acceptable significance levels.

Achieving significance is not ‘incremental’ by nature in traditional A/B testing. A ‘no peeking’ approach should be followed until the completion of a test.

There is always the possibility of a ‘false positive’ or type1 error, especially if you hurry the test or peek at the results too early.

An ExploreExploit strategy is not possible, and therefore you leave money on the table by continuing to serve a losing variation.

Pvalue is highly convoluted and not intuitive.
For all of these reasons, we at Dynamic Yield prefer the Bayesian approach to A/B testing. Click to read more on exactly how going Bayesian overcomes some of the issues we’ve outlined above presented in traditional A/B testing.