Multi-variant AB testing vs multi-armed bandit

0.5% vs 5%

Payoff

Payoff from variation 0.5% vs 5%

This comparison is a bit tricky: because confidence climbs so quickly, all of the tests could have been stopped much earlier, and it wasn't necessary to run through all 25,000 visitors. The results at the bottom show what happens if we stop the experiment once confidence reaches 95%.

Certainty

Due to the large difference between the two versions, each model converges very quickly to 100% confidence. Surprisingly, UCB1 converges even faster than AB testing. Epsilon greedy is the slowest, but even there fewer than ~500 visitors are enough for a statistically significant result.
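The confidence calculation itself is not spelled out here. As a rough sketch, assuming a pooled two-proportion z-test (an assumption, not necessarily the method behind the plots), the confidence that the better-looking version really is better could be computed as follows. The function name `confidence` and its arguments are illustrative.

```python
import math


def confidence(conv_a, n_a, conv_b, n_b):
    """One-sided confidence that the variation with the higher observed
    conversion rate is truly better (pooled two-proportion z-test).

    conv_*: number of conversions, n_*: number of visitors shown the variation.
    Assumption: this is one common way to define "confidence", chosen for
    illustration only.
    """
    if n_a == 0 or n_b == 0:
        return 0.5  # no data yet: no evidence either way
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.5  # all or no visitors converted: the test is undefined
    z = abs(p_a - p_b) / se
    # Standard normal CDF evaluated at z.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Example: roughly 0.5% vs 5% after 250 visitors per variation.
print(confidence(1, 250, 12, 250) >= 0.95)
```

With a tenfold gap in conversion rates the z-score grows quickly, which fits the observation that a few hundred visitors are enough to cross the 95% line.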

Confidence for 0.5% vs 5%

Run behaviour

If we let the experiment run through all 25,000 visitors, each variation is shown with the following frequency.

Variations displayed in AB model

Variations displayed in Epsilon greedy model

Variations displayed in UCB1 model

The Epsilon greedy run plot shows clearly why it yields the best payoff and the least regret compared with AB testing. When the difference is tenfold, Epsilon greedy's behaviour of spending a short time exploring and more time exploiting pays off very well, while the AB test keeps showing the inferior version, which doesn't convert users to revenue, for longer (unless it is stopped earlier).
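To make the explore/exploit split concrete, here is a minimal sketch of an epsilon greedy run. The multiplicative decay schedule, the default `decay` value and the function names are assumptions made for illustration; the parameters used in the experiment are not given here.

```python
import random


def epsilon_greedy_choice(shows, conversions, epsilon, rng=random):
    """Pick a variation index: explore with probability epsilon,
    otherwise exploit the best observed conversion rate."""
    if rng.random() < epsilon:
        return rng.randrange(len(shows))  # explore: uniform random variation
    rates = [c / s if s else 0.0 for c, s in zip(conversions, shows)]
    return max(range(len(rates)), key=rates.__getitem__)  # exploit


def simulate(true_rates, visitors, epsilon=1.0, decay=0.999, seed=0):
    """Simulate visitors one by one, shrinking epsilon after each visitor.

    Assumption: epsilon decays multiplicatively (epsilon *= decay).
    """
    rng = random.Random(seed)
    shows = [0] * len(true_rates)
    conversions = [0] * len(true_rates)
    for _ in range(visitors):
        arm = epsilon_greedy_choice(shows, conversions, epsilon, rng)
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
        epsilon *= decay
    return shows, conversions


shows, conversions = simulate([0.005, 0.05], visitors=25000)
print(shows, conversions)
```

While epsilon is still large the choice ignores the observed conversion rates entirely, so the length of the exploration phase is set by the decay parameter alone.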

UCB1 correctly identifies the superior version and shows it more and more often, with some further probing along the way.
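The occasional probing comes from the exploration bonus in the UCB1 score. Below is a minimal sketch using the textbook form, mean conversion rate plus sqrt(2 ln N / n_i); the function name and data layout are illustrative, and the implementation used for the plots may differ in detail.

```python
import math


def ucb1_choice(shows, conversions):
    """Return the index of the variation to show next under UCB1.

    shows[i]: how often variation i has been displayed,
    conversions[i]: how many of those displays converted.
    """
    # Each variation must be shown once before its bound is defined.
    for arm, n in enumerate(shows):
        if n == 0:
            return arm
    total = sum(shows)
    scores = [
        c / n + math.sqrt(2 * math.log(total) / n)  # mean + exploration bonus
        for c, n in zip(conversions, shows)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# A rarely shown variation gets a large bonus, so it is probed again
# even though its observed conversion rate is lower.
print(ucb1_choice([1000, 10], [50, 0]))
```

The bonus shrinks as a variation accumulates displays, so the inferior version is revisited less and less often without ever being dropped entirely.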

Stopping when confidence level reached

Payoff from variation 0.5% vs 5% if stopped at confidence level

Here Epsilon greedy becomes the most expensive, precisely because of its behaviour of serving variations regardless of the conversion rate, depending only on the epsilon decay parameter. In such extreme cases AB testing and UCB1 are clearly superior, with a slight edge for AB testing. However, it's important to note that the difference is almost negligible: under 5% between the worst and the best case.