Thursday, July 05, 2012

Puzzling outcomes in A/B testing

A fun upcoming KDD 2012 paper out of Microsoft, "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained" (PDF), has a lot of great insights into A/B testing and the real issues you hit when running experiments. It's a light and easy read, and definitely worthwhile.

Selected excerpts:
We present ... puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain ... [requiring] months to properly analyze and get to the often surprising root cause ... It [was] not uncommon to see experiments that impact annual revenue by millions of dollars ... Reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts.

When Bing had a bug in an experiment, which resulted in very poor results being shown to users, two key organizational metrics improved significantly: distinct queries per user went up over 10%, and revenue per user went up over 30%! .... Degrading algorithmic results shown on a search engine result page gives users an obviously worse search experience but causes users to click more on ads, whose relative relevance increases, which increases short-term revenue ... [This shows] it's critical to understand that long-term goals do not always align with short-term metrics.

A piece of code was added, such that when a user clicked on a search result, additional JavaScript was executed ... This slowed down the user experience slightly, yet the experiment showed that users were clicking more! Why would that be? .... The "success" of getting users to click more was not real, but rather an instrumentation difference. Chrome, Firefox, and Safari are aggressive about terminating requests on navigation away from the current page and a non-negligible percentage of clickbeacons never make it to the server. This is especially true for the Safari browser, where losses are sometimes over 50%.
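
To make that instrumentation effect concrete, here is a minimal sketch of the arithmetic. It assumes, as one reading of the excerpt, that the extra JavaScript delay gives click beacons more time to reach the server before the browser navigates away, so the Treatment loses fewer beacons than the Control. The function names and the loss rates (10% and 5%) are made up for illustration and are not numbers from the paper.

```python
def observed_click_rate(true_click_rate: float, beacon_loss_rate: float) -> float:
    """Click rate the server logs when a fraction of click beacons never arrives."""
    return true_click_rate * (1.0 - beacon_loss_rate)


def apparent_lift(true_click_rate: float, control_loss: float, treatment_loss: float) -> float:
    """Relative lift seen in the logs when users behave identically in both arms."""
    control = observed_click_rate(true_click_rate, control_loss)
    treatment = observed_click_rate(true_click_rate, treatment_loss)
    return treatment / control - 1.0


# Hypothetical numbers: same true click rate in both arms,
# but the slower Treatment page loses fewer beacons.
lift = apparent_lift(0.40, control_loss=0.10, treatment_loss=0.05)
print(f"apparent lift with no change in user behavior: {lift:.1%}")
```

With identical user behavior in both arms, the logs would still show a lift of several percent, which is why a difference in beacon loss can look like a win.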

Primacy effect occurs when you change the navigation on a web site, and experienced users may be less efficient until they get used to the new navigation, thus giving an inherent advantage to the Control. Conversely, when a new design or feature is introduced, some users will investigate the new feature, click everywhere, and thus introduce a "novelty" bias that dies quickly if the feature is not truly useful.
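
One rough way to look for a novelty or primacy effect is to compare the Treatment's lift in the first few days against the lift afterwards. The sketch below is only an illustration under simple assumptions: the helper names are hypothetical, the daily metric values are invented, and a real analysis would also need confidence intervals on each day's lift.

```python
from statistics import mean


def daily_lift(control: list[float], treatment: list[float]) -> list[float]:
    """Relative lift of Treatment over Control for each day."""
    return [t / c - 1.0 for c, t in zip(control, treatment)]


def early_vs_late_lift(control: list[float], treatment: list[float],
                       early_days: int) -> tuple[float, float]:
    """Average lift over the first `early_days` days and over the remaining days."""
    lifts = daily_lift(control, treatment)
    if not 0 < early_days < len(lifts):
        raise ValueError("early_days must leave at least one day in each window")
    return mean(lifts[:early_days]), mean(lifts[early_days:])


# Invented daily clicks-per-user values: the Treatment starts high and fades.
control = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
treatment = [11.5, 11.0, 10.5, 10.1, 10.0, 10.0]

early, late = early_vs_late_lift(control, treatment, early_days=3)
print(f"early lift: {early:.1%}, late lift: {late:.1%}")
```

A lift that is large early and shrinks toward zero is consistent with a novelty effect. A lift that starts negative and recovers is what a primacy effect would look like.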

For some metrics like Sessions/user, the confidence interval width does not change much over time. When looking for effects on such metrics, we must run the experiments with more users per day in the Treatment and Control.
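
The reason more users helps can be seen from the textbook confidence interval for a difference of two means. The sketch below assumes a normal approximation, equal standard deviations, and equal-sized Treatment and Control groups; the standard deviation and user counts are made-up values, not figures from the paper.

```python
import math


def ci_half_width(std_dev: float, users_per_arm: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% CI for the difference of two means.

    Assumes both arms have the same standard deviation and the same size.
    """
    return z * std_dev * math.sqrt(2.0 / users_per_arm)


# Hypothetical Sessions/user standard deviation of 3.0.
base = ci_half_width(std_dev=3.0, users_per_arm=100_000)
quadrupled = ci_half_width(std_dev=3.0, users_per_arm=400_000)
print(f"100k users per arm: +/-{base:.4f}")
print(f"400k users per arm: +/-{quadrupled:.4f}")
```

Under these assumptions the interval narrows with the square root of the number of users, so halving its width takes four times as many users. If running longer does not narrow the interval, as the excerpt says for Sessions/user, sending more users per day to each arm is the lever that is left.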

The statistical theory of controlled experiments is well understood, but the devil is in the details and the difference between theory and practice is greater in practice than in theory ... It's easy to generate p-values and beautiful 3D graphs of trends over time. But the real challenge is in understanding when the results are invalid, not at the sixth decimal place, but before the decimal point, or even at the plus/minus for the percent effect ... Generating numbers is easy; generating numbers you should trust is hard!

Love the example of short-term metrics improving when they accidentally hurt search result quality (which caused people to click on ads rather than search results). That reminds me of a problem we had at Amazon, where pop-up ads won A/B tests. Sadly, the pop-up ads stayed up for months until we could eventually show that they were hurting long-term customer happiness (and revenue), even though they showed higher revenue in the very short term. Only then were we able to take them down.

The whole paper is a great read. The authors have a lot of hands-on experience with A/B testing and with the problems you run into in practice. Definitely good to learn from their experience.

4 comments:

Anonymous said...

Thanks for posting! Would you be able to share any insight into the strategy and methodology used at Amazon for the long-term value measurement of users who saw pop-up ads? In my experience, it can be quite difficult to run long-term a/b experiments on a rapidly evolving web product.

Jeff Clites said...

It seems almost hopeless; even if your testing methodology is unbiased, your subsequent analysis will almost certainly be. (Unexpected results will be analyzed until they make sense, and expected results will be taken almost at face value--it's hard to dig for a deeper explanation when a straightforward one is at hand.) Bummer.

jeremy said...

One question I've long had about A/B testing is how it holds up, not under different versioning, but under true system evolution. Not because you can't compare two different systems using the same metric (KPI/OEC). Rather, because what the system is doing or trying to do itself is changing, as it evolves. In other words, the metrics themselves need to change along with the system. But if your metric isn't constant, then there is no hope of ever doing a valid scientific comparison, right?

So what I'm asking is whether the "puzzling outcomes" questions that these authors are asking are really just manifestations of an underlying drift in the metric itself. A drift that the authors might not even realize they subconsciously want the metric to make?

This is a philosophical question rather than a technical one.

Stephanos Ballmerfeld said...

You deserve a raise!