Controlling for repeated significance testing errors and false positive rates in commercial A/B testing is often perceived as a bit of a pedantic, academic exercise.
In academic research, a lax attitude to these issues is regarded as borderline unethical, or even as academic misconduct. By contrast, many online A/B testing frameworks let you automatically stop or conclude an experiment the moment it reaches significance, and there is blessed little discussion of false positive rates. For anyone running A/B tests in a commercial setting, there is little incentive to control your false positives. Why make it harder to show successful changes just to meet a standard no one cares about anyway?
It’s not that easy. It actually matters, and matters a lot if you care about your A/B experiments and what you learn from them.
Evan Miller has written a thorough article on the subject, How Not To Run An A/B Test, but it is somewhat too advanced to illustrate the effect very well on its own. To show how much it matters, I’ve run a simulation of the impact you should expect repeated testing errors to have on your success rate.
Here’s how the simulation works:
- It runs 1,000 experiments, each with 200,000 (fake) participants divided randomly between two experiment variants.
- The conversion rate is 3% in both variants.
- Each participant is randomly assigned to a variant, and then to the “hit” or “miss” group based on the conversion rate.
- After each participant, a g-test of independence is run, testing whether the conversion distribution differs between the two variants.
- It records every experiment that hit significance at the 90% and 95% levels at any point during its run.
- As the g-test is unreliable with very low counts, I didn’t check significance during the first 1,000 participants of each experiment.
You can download the script and alter the variables to fit your metrics.
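If you’d rather read than download, here is a minimal sketch of the same idea in Python. It is not the original script: the g-test on the 2x2 table is hand-rolled (for one degree of freedom the chi-square survival function reduces to erfc(sqrt(G/2))), the function names are my own, and the default numbers are scaled down so it finishes in seconds. Raise them back to 1,000 experiments of 200,000 participants to reproduce the figures below.

```python
import math
import random


def g_test_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided G-test of independence on a 2x2 table (1 degree of freedom)."""
    observed = [hits_a, n_a - hits_a, hits_b, n_b - hits_b]
    total = n_a + n_b
    total_hits = hits_a + hits_b
    # Expected counts under the null hypothesis of identical conversion rates.
    expected = [
        n_a * total_hits / total, n_a * (total - total_hits) / total,
        n_b * total_hits / total, n_b * (total - total_hits) / total,
    ]
    g = max(0.0, 2.0 * sum(o * math.log(o / e)
                           for o, e in zip(observed, expected) if o > 0))
    # For one degree of freedom, the chi-square survival function is erfc(sqrt(g / 2)).
    return math.erfc(math.sqrt(g / 2.0))


def run_peeking_experiment(participants, conversion_rate, burn_in, rng):
    """One A/A experiment, checking significance after every participant."""
    hits, n = [0, 0], [0, 0]
    hit_90 = hit_95 = False
    for i in range(participants):
        variant = rng.randint(0, 1)           # random assignment to variant A or B
        n[variant] += 1
        if rng.random() < conversion_rate:    # same 3% conversion rate in both variants
            hits[variant] += 1
        if i + 1 < burn_in:                   # skip early tests: low counts upset the g-test
            continue
        p = g_test_p_value(hits[0], n[0], hits[1], n[1])
        hit_90 = hit_90 or p < 0.10
        hit_95 = hit_95 or p < 0.05
    return hit_90, hit_95


if __name__ == "__main__":
    rng = random.Random(42)
    # Scaled down from 1,000 experiments of 200,000 participants so the sketch
    # runs in seconds; raise these to reproduce the figures in the post.
    experiments, participants = 100, 20_000
    count_90 = count_95 = 0
    for _ in range(experiments):
        hit_90, hit_95 = run_peeking_experiment(participants, 0.03, 1_000, rng)
        count_90 += hit_90
        count_95 += hit_95
    print(f"{count_90}/{experiments} reached 90% significance at some point")
    print(f"{count_95}/{experiments} reached 95% significance at some point")
```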
So what’s the outcome? Keep in mind that these are 1,000 controlled experiments where it’s known that there is no difference between the variants.
- 771 experiments out of 1,000 reached 90% significance at some point
- 531 experiments out of 1,000 reached 95% significance at some point
This means that if you run experiments without controlling for repeated testing error in any way, you’ll see a temporarily significant effect in around half of them. And since roughly half of those spurious detections will point in the positive direction, up to about 25% of your experiments can look like significant wins purely by chance. Or, to put it differently, if you use an A/B testing package that automatically concludes experiments and rolls out the “winner” based on significance alone, you’ll probably see your success rate soar regardless of the quality of your changes.
Fortunately, there’s an easy fix: select your sample size or decision point in advance, and only make your decision then (a sketch of this fixed-horizon version follows the results below). Here are the false positive rates when the decision is made only at the end of each experiment:
- 100 experiments out of 1,000 were significant at 90%.
- 51 experiments out of 1,000 were significant at 95%.
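For comparison, here is a sketch of the fixed-horizon rule, assuming the g_test_p_value helper from the sketch above is available in the same file: each experiment runs to its predetermined end, and the test is run exactly once.

```python
import random


def run_fixed_horizon_experiment(participants, conversion_rate, rng):
    """One A/A experiment where the g-test is run once, at the planned end."""
    hits, n = [0, 0], [0, 0]
    for _ in range(participants):
        variant = rng.randint(0, 1)
        n[variant] += 1
        if rng.random() < conversion_rate:
            hits[variant] += 1
    # A single test at the predetermined decision point: no peeking along the way.
    p = g_test_p_value(hits[0], n[0], hits[1], n[1])
    return p < 0.10, p < 0.05


rng = random.Random(7)
experiments, participants = 1_000, 10_000   # again scaled down from 200,000 participants
count_90 = count_95 = 0
for _ in range(experiments):
    hit_90, hit_95 = run_fixed_horizon_experiment(participants, 0.03, rng)
    count_90 += hit_90
    count_95 += hit_95
print(f"{count_90}/{experiments} significant at 90% at the decision point")
print(f"{count_95}/{experiments} significant at 95% at the decision point")
```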
You still get a false positive rate you shouldn’t ignore, but nothing like what you get without proper control. And this is exactly what you should expect at these significance levels: roughly 10% and 5% false positives. At this point you can talk about real hypothesis testing.
This article has previously been posted on my blog at www.einarsen.no.