Multi-variant AB testing vs. Multi-armed bandit

Intro

I’ve read a lot of discussions lately about different approaches to experimenting and testing. People seem to be very religious about the subject, and two cardinal questions separate the groups: the frequentist vs. Bayesian approach, and AB testing vs. Multi-armed bandit solutions. Right now I’m mostly interested in the second question, because the first is even more dogmatic. My personal preference is Bayesian, since the question it answers and the result it produces are closer to human thinking. I’m not going into more detail, because if you type ‘frequentist vs bayesian’ into Google you will see a plethora of results arguing for one or the other. I would highlight only two of those:

AB testing or Multi-armed bandit

I’ve read theoretical proofs of why one is superior to the other, and also some artificial examples claiming the opposite. As with many other things, I like to take the empirical path and see from real-world examples how one compares to the other.

Disclaimer: I’m an engineer, not a statistician, so I’m going to limit my conclusions to the tangible results without going into the complex math of the different models.

Prerequisites

Before coding anything, I came up with some scenarios describing my most likely use cases (I want to emphasize that these are my scenarios, from projects I’ve been involved in and situations I’ve encountered). This keeps me objective during the testing phase and prevents me from introducing even unconscious bias towards one solution or another.

I didn’t come up with the testing algorithms. I used solutions from different posts on different websites.

For the reasons described above, I will use a Bayesian statistical model to determine confidence in the results. I know that much traditional AB testing uses frequentist statistics with a z-score and p-value, but to have a fair, apples-to-apples comparison I will stick with the same Bayesian calculation for AB testing as well.

Criteria

I don’t care too much about theoretical proofs of a hypothesis being true or false. I’m much more pragmatic than that: I care about maximizing revenue, i.e. serving the best-performing version to as many users as possible, as fast as possible, and I’m willing to spend some conversions on finding the best option (this is often referred to as regret: the times when an inferior version was shown to a user in order to gain trust in the results).
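
To make the regret idea concrete, here is a minimal sketch. It assumes regret is counted as the expected number of conversions lost by showing a version other than the best one. The function name and the two traffic splits are made up for illustration; they are not results from the tests.

```python
def expected_regret(true_rates, serve_counts):
    """Expected conversions lost versus always serving the best version."""
    best = max(true_rates)
    return sum((best - rate) * count
               for rate, count in zip(true_rates, serve_counts))


# Hypothetical splits of 25000 visitors between a 1.6% and a 2% version.
even_split = expected_regret([0.016, 0.02], [12500, 12500])    # about 50
skewed_split = expected_regret([0.016, 0.02], [2500, 22500])   # about 10
```

The fewer visitors the inferior version gets, the lower the regret, which is the whole appeal of shifting traffic early.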

I’m going to measure the confidence level of each case with every model using Bayesian statistics [1], as well as the overall payoff of each model, with one conversion yielding a value of 1. To visualize the run of the models, I also plot the serving frequency of each version over time (visitors).
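
One common way to get a Bayesian confidence number for conversion rates is sketched below. It is not necessarily the calculation from [1]: it assumes a uniform Beta(1, 1) prior on each conversion rate and estimates the probability that each version is the best by sampling from the Beta posteriors. The function name and the example counts are illustrative only.

```python
import random


def prob_each_is_best(conversions, visitors, draws=20000, seed=0):
    """Estimate P(version is best) by sampling Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on every conversion rate.
    """
    rng = random.Random(seed)
    wins = [0] * len(visitors)
    for _ in range(draws):
        samples = [
            rng.betavariate(1 + c, 1 + n - c)
            for c, n in zip(conversions, visitors)
        ]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical counts: 200/12500 (1.6%) vs 250/12500 (2%).
confidence = prob_each_is_best([200, 250], [12500, 12500])
```

The returned list sums to 1, so each entry can be read directly as "how sure am I that this version is the winner".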

Test cases

I’m going to serve my versions to 25000 visitors using traditional AB testing, Multi-armed bandit with epsilon greedy and an epsilon decay of 2000 [2], and Multi-armed bandit with the UCB1 algorithm [3].
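
For readers who haven’t met UCB1, here is a minimal sketch of the textbook rule: serve every version once, then always serve the version with the highest observed conversion rate plus an exploration bonus of sqrt(2 ln t / n). The implementation in [3] may differ in its details, and the simulation helper and its names are assumptions for illustration.

```python
import math
import random


def ucb1_select(pulls, rewards):
    """Return the index of the version UCB1 would serve next."""
    # Serve every version once before trusting any estimate.
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    total = sum(pulls)
    scores = [
        rewards[arm] / pulls[arm] + math.sqrt(2 * math.log(total) / pulls[arm])
        for arm in range(len(pulls))
    ]
    return scores.index(max(scores))


def simulate_ucb1(true_rates, visitors=25000, seed=0):
    """Simulate Bernoulli conversions; returns (serve counts, conversions)."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    rewards = [0] * len(true_rates)
    for _ in range(visitors):
        arm = ucb1_select(pulls, rewards)
        pulls[arm] += 1
        rewards[arm] += 1 if rng.random() < true_rates[arm] else 0
    return pulls, rewards


# Simulated run of the 0.5% vs 5% scenario.
pulls, rewards = simulate_ucb1([0.005, 0.05])
```

The bonus shrinks as a version gets more traffic, so a rarely served version keeps getting occasional visitors until the data clearly rules it out.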

The serving frequency for epsilon greedy is going to be the same in every test case regardless of the conversion rates, because it only depends on the epsilon decay parameter. The same is true for AB testing, which always splits the traffic evenly (50% / 50% with two versions). Nevertheless, I will include the plots to make the comparison easy.
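
A rough sketch of epsilon greedy with a decaying epsilon follows. The exact schedule used in [2] is not reproduced here; this assumes epsilon = decay / (decay + visitor), which starts at 1 and drops to 0.5 after `decay` visitors. All names are hypothetical.

```python
import random


def epsilon_at(visitor, decay=2000):
    """Exploration probability for a given visitor number.

    Assumed schedule: starts at 1.0 and reaches 0.5 when visitor == decay.
    """
    return decay / (decay + visitor)


def epsilon_greedy_select(pulls, rewards, visitor, rng, decay=2000):
    """Explore with probability epsilon, otherwise serve the best so far."""
    if rng.random() < epsilon_at(visitor, decay):
        return rng.randrange(len(pulls))
    means = [r / n if n else 0.0 for r, n in zip(rewards, pulls)]
    return means.index(max(means))


# The schedule depends only on the visitor count, not on conversions.
schedule = [epsilon_at(v) for v in (0, 2000, 25000)]
```

Under this assumption the amount of random exploration is fixed in advance by the decay parameter, whatever the conversion rates turn out to be.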

1.6% vs 2%

The most basic case: a small improvement to the website that results in a small gain in conversion rate.

0.5% vs 5%

Another typical case: you have introduced a new feature resulting in a significant increase in conversion rate (or a decrease in cart abandonment, for example).

0.5% vs 5% vs 5.2%

This is a bit of a twist on the previous example, but far from artificial. If you’re in a continuous improvement loop and rolling out features frequently, you might have more than one idea or design for a certain feature. Let’s say you want to put an urgency message on some product, and you can place the message box either next to the cart button or next to the availability. Both will have a significant effect, but a slightly different one.

3% vs 5% vs 7%

This is when you truly have 3 different competing versions. It is sort of an extension of the first examples with different numbers.

The data points will be grouped per 100 visitors.
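
The grouping could look something like the sketch below, assuming the serving log is a plain list of version indices in visitor order. The function name and the tiny example log are made up for illustration.

```python
def serving_share_per_bucket(served, n_versions, bucket_size=100):
    """Share of each version within consecutive buckets of visitors.

    `served` is the list of version indices shown, in visitor order.
    """
    shares = []
    for start in range(0, len(served), bucket_size):
        bucket = served[start:start + bucket_size]
        shares.append([bucket.count(v) / len(bucket) for v in range(n_versions)])
    return shares


# Hypothetical log: 150 visitors on version 0, then 50 on version 1.
shares = serving_share_per_bucket([0] * 150 + [1] * 50, n_versions=2)
# shares == [[1.0, 0.0], [0.5, 0.5]]
```

Each inner list is one point on the serving frequency plot.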