Controlling for repeated significance testing errors and false positive rates in commercial A/B testing is often perceived as a bit of a pedantic, academic exercise.
In academic research, a lax attitude to these issues is regarded as borderline unethical, or even as academic misconduct. By contrast, many online A/B testing frameworks let you automatically stop or conclude an experiment the moment it reaches significance, and there is blessed little discussion of false positive rates. For anyone running A/B tests in a commercial setting, there is little incentive to control your false positives. Why make it harder to show successful changes just to meet a standard no one cares about anyway?
It’s not that easy. It actually matters, and matters a lot if you care about your A/B experiments and what you learn from them.
Evan Miller has written a thorough article on the subject, How Not To Run An A/B Test, but it is somewhat too advanced to illustrate the effect very well on its own. To show how much it matters, I’ve run a simulation of the impact you should expect repeated testing errors to have on your success rate.
Here’s how the simulation works:
- It runs 1,000 experiments, each with 200,000 (fake) participants divided randomly between two experiment variants.
- The conversion rate is 3% in both variants.
- Each participant is randomly assigned to a variant, and then to the “hit” or “miss” group based on the conversion rate.
- After each participant, a g-test of independence is run, testing whether the conversion distribution differs between the two variants.
- It records every experiment that hit significance at the 90% and 95% levels at any point during its run.
- As the g-test is unreliable with very low counts, I didn’t check significance during the first 1,000 participants of each experiment.
You can download the script and alter the variables to fit your metrics.
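If you’d rather read than download, here is a minimal sketch of the same idea in Python. It is not the original script: the g-test on the 2x2 table is hand-rolled (for one degree of freedom the chi-square survival function reduces to erfc(sqrt(G/2))), the function names are my own, and the default numbers are scaled down so it finishes in seconds. Raise them back to 1,000 experiments of 200,000 participants to reproduce the figures below.

```python
import math
import random


def g_test_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided G-test of independence on a 2x2 table (1 degree of freedom)."""
    observed = [hits_a, n_a - hits_a, hits_b, n_b - hits_b]
    total = n_a + n_b
    total_hits = hits_a + hits_b
    # Expected counts under the null hypothesis of identical conversion rates.
    expected = [
        n_a * total_hits / total, n_a * (total - total_hits) / total,
        n_b * total_hits / total, n_b * (total - total_hits) / total,
    ]
    g = max(0.0, 2.0 * sum(o * math.log(o / e)
                           for o, e in zip(observed, expected) if o > 0))
    # For one degree of freedom, the chi-square survival function is erfc(sqrt(g / 2)).
    return math.erfc(math.sqrt(g / 2.0))


def run_peeking_experiment(participants, conversion_rate, burn_in, rng):
    """One A/A experiment, checking significance after every participant."""
    hits, n = [0, 0], [0, 0]
    hit_90 = hit_95 = False
    for i in range(participants):
        variant = rng.randint(0, 1)           # random assignment to variant A or B
        n[variant] += 1
        if rng.random() < conversion_rate:    # same 3% conversion rate in both variants
            hits[variant] += 1
        if i + 1 < burn_in:                   # skip early tests: low counts upset the g-test
            continue
        p = g_test_p_value(hits[0], n[0], hits[1], n[1])
        hit_90 = hit_90 or p < 0.10
        hit_95 = hit_95 or p < 0.05
    return hit_90, hit_95


if __name__ == "__main__":
    rng = random.Random(42)
    # Scaled down from 1,000 experiments of 200,000 participants so the sketch
    # runs in seconds; raise these to reproduce the figures in the post.
    experiments, participants = 100, 20_000
    count_90 = count_95 = 0
    for _ in range(experiments):
        hit_90, hit_95 = run_peeking_experiment(participants, 0.03, 1_000, rng)
        count_90 += hit_90
        count_95 += hit_95
    print(f"{count_90}/{experiments} reached 90% significance at some point")
    print(f"{count_95}/{experiments} reached 95% significance at some point")
```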
So what’s the outcome? Keep in mind that these are 1,000 controlled experiments where it’s known that there is no difference between the variants.
- 771 experiments out of 1,000 reached 90% significance at some point
- 531 experiments out of 1,000 reached 95% significance at some point
This means that if you run experiments without controlling for repeated testing error in any way, you’ll see a temporarily significant effect in around half of them. And since roughly half of those spurious detections will point in the positive direction, up to about 25% of your experiments can look like significant wins purely by chance. Or, to put it differently, if you use an A/B testing package that automatically concludes experiments and rolls out the “winner” based on significance alone, you’ll probably see your success rate soar regardless of the quality of your changes.
Fortunately, there’s an easy fix: select your sample size or decision point in advance, and only make your decision then (a sketch of this fixed-horizon version follows the results below). Here are the false positive rates when the decision is made only at the end of each experiment:
- 100 experiments out of 1,000 were significant at 90%.
- 51 experiments out of 1,000 were significant at 95%.
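For comparison, here is a sketch of the fixed-horizon rule, assuming the g_test_p_value helper from the sketch above is available in the same file: each experiment runs to its predetermined end, and the test is run exactly once.

```python
import random


def run_fixed_horizon_experiment(participants, conversion_rate, rng):
    """One A/A experiment where the g-test is run once, at the planned end."""
    hits, n = [0, 0], [0, 0]
    for _ in range(participants):
        variant = rng.randint(0, 1)
        n[variant] += 1
        if rng.random() < conversion_rate:
            hits[variant] += 1
    # A single test at the predetermined decision point: no peeking along the way.
    p = g_test_p_value(hits[0], n[0], hits[1], n[1])
    return p < 0.10, p < 0.05


rng = random.Random(7)
experiments, participants = 1_000, 10_000   # again scaled down from 200,000 participants
count_90 = count_95 = 0
for _ in range(experiments):
    hit_90, hit_95 = run_fixed_horizon_experiment(participants, 0.03, rng)
    count_90 += hit_90
    count_95 += hit_95
print(f"{count_90}/{experiments} significant at 90% at the decision point")
print(f"{count_95}/{experiments} significant at 95% at the decision point")
```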
You still get a false positive rate you shouldn’t ignore, but nothing like what you get without proper control. And this is exactly what you should expect at these significance levels: roughly 10% and 5% false positives. At this point you can talk about real hypothesis testing.
This article has previously been posted on my blog at www.einarsen.no.