Summary and conclusion
| 1.6% vs 2% | 0.5% vs 5% | 0.5% vs 5% vs 5.2% | 3% vs 5% vs 7% |
There’s no silver bullet for every case. You need to work out your priorities and your most common use cases.
If you always compare only a control and a trial version, then traditional A/B testing will most likely give you the best results.
If you want to test multiple versions, then go for a multi-armed bandit model. The regret of running an inferior version longer, just to distinguish between the two better-performing variations, is too high in those cases. I personally would go for UCB1 for the higher confidence, even though it may have higher regret than epsilon-greedy, which, as we saw, can be very sensitive to anomalies. And anomalies will happen, especially because we are talking about real users in the real world, not random probabilities.
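To make the comparison concrete, here is a minimal sketch of the two selection rules run against simulated Bernoulli arms. It is an illustration under assumptions, not the code used for the experiments in this post: the function names (`epsilon_greedy`, `ucb1`, `simulate`), the epsilon of 0.1, the number of rounds, and the seed are all made up for the example, and the conversion rates are borrowed from the 3% vs 5% vs 7% scenario.

```python
import math
import random


def epsilon_greedy(counts, rewards, rng, epsilon=0.1):
    """Explore a random arm with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    # Observed conversion rate per arm; unplayed arms count as 0.0.
    means = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
    return max(range(len(counts)), key=means.__getitem__)


def ucb1(counts, rewards, rng=None):
    """Pick the arm with the highest upper confidence bound (UCB1)."""
    # Play every arm once before the bound can be computed.
    for arm, c in enumerate(counts):
        if c == 0:
            return arm
    total = sum(counts)

    def bound(arm):
        mean = rewards[arm] / counts[arm]
        return mean + math.sqrt(2 * math.log(total) / counts[arm])

    return max(range(len(counts)), key=bound)


def simulate(policy, rates, rounds=20000, seed=0):
    """Run one policy against simulated arms; return plays per arm."""
    rng = random.Random(seed)
    counts = [0] * len(rates)
    rewards = [0] * len(rates)
    for _ in range(rounds):
        arm = policy(counts, rewards, rng)
        counts[arm] += 1
        # A "conversion" happens with the arm's true (hidden) rate.
        rewards[arm] += 1 if rng.random() < rates[arm] else 0
    return counts


if __name__ == "__main__":
    rates = [0.03, 0.05, 0.07]  # assumed true conversion rates
    for policy in (epsilon_greedy, ucb1):
        print(policy.__name__, simulate(policy, rates))
```

The printed lists show how many times each policy played each arm. The exact numbers depend on the seed and the number of rounds, so it is worth rerunning with several seeds and with your own rates before drawing conclusions.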
It is also important to decide how much you want to supervise your experiments. If you cannot keep a close eye on them, or just want to run them unsupervised for an indefinite time, multi-armed bandits are the way to go.
In the end, it comes down to this: never simply trust other people’s results. Test your own scenarios with the