Saturday, June 30, 2007

A/B testing at Amazon and Microsoft

Ron Kohavi, Randal Henne, and Dan Sommerfield from Microsoft have a paper on A/B testing, "Practical Guide to Controlled Experiments on the Web" (PDF), at the upcoming KDD 2007 conference.

Ronny Kohavi was at Amazon as Director of Personalization and Data Mining for about two years (Sept 2003 - June 2005). The paper contains some mentions of Amazon's A/B testing framework (which was developed in the 1990s, but has been continuously refined since then) and other useful information on running experiments on a live website.

Some excerpts from the paper:
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called ... A/B tests.

The authors of this paper were involved in many experiments at Amazon, Microsoft, Dupont, and NASA. The culture of experimentation at Amazon, where data trumps intuition, and a system that made running experiments easy, allowed Amazon to innovate quickly and effectively.

Controlled experiments provide a methodology to reliably evaluate ideas ... Most organizations have many ideas, but the return-on-investment (ROI) for many may be unclear ... A live experiment goes a long way in providing guidance as to the value of the idea.

Many theoretical techniques seem well suited for practical use and yet require significant ingenuity to apply them to messy real world environments. Controlled experiments are no exception. Having run a large number of online experiments, we now share several practical lessons:

A Treatment might provide a worse user experience because of its performance ... because it is slower ... Compute the minimum sample size needed for the experiment ... We recommend that 50% of users see each of the variants in an A/B test ... A small [win] ... may not outweigh the cost of maintaining the feature ... Running frequent experiments and using experimental results as major input to company decisions and product planning can have a dramatic impact on company culture.
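The paper's advice to compute the minimum sample size up front can be sketched with the standard rule of thumb n ≈ 16σ²/Δ² per variant (roughly 95% confidence and 80% power, with σ² = p(1-p) for a conversion rate). The function name and example numbers below are mine, not from the paper:

```python
import math

def min_sample_size(p_baseline, min_detectable_delta):
    """Rough minimum users per variant needed to detect an absolute
    change of `min_detectable_delta` in a conversion rate that is
    currently `p_baseline`, using the n = 16 * sigma^2 / delta^2
    rule of thumb (~95% confidence, ~80% power)."""
    variance = p_baseline * (1 - p_baseline)  # Bernoulli variance
    return math.ceil(16 * variance / min_detectable_delta ** 2)

# e.g. detecting a half-point lift on a 5% conversion rate
print(min_sample_size(0.05, 0.005))  # -> 30400 users per variant
```

Numbers like these are why the paper recommends a 50/50 split: any other split only increases the total traffic needed to reach significance.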
Amusingly, some details from a couple of posts on this weblog are quoted at a couple of points in the paper.

Ronny also gave a talk (PDF) at eBay Research Labs earlier this month that covered similar material.

See also Dare Obasanjo's post on Ronny's paper and his eBay Labs talk.

See also "Front Line Internet Analytics at Amazon.com" (PDF), a 2004 talk by Ronny Kohavi and former Amazon Personalization Director Matt Round that has more details on Amazon's A/B testing framework.

Update: It appears that all three of the authors of the paper -- Ron Kohavi, Randal Henne, and Dan Sommerfield -- were at Amazon. Randal Henne was at Amazon Apr 2003 - May 2006. Dan Sommerfield was there Dec 2003 - Jul 2006.

Update: Four months later, Jeff Bezos discusses the value of experimentation at Amazon in an HBR interview [Found via Werner Vogels].


Anonymous said...

The culture of experimentation at Amazon, where data trumps intuition, and a system that made running experiments easy, allowed Amazon to innovate quickly and effectively.

I like to think of myself as an adherent to the "data trumps intuition" philosophy.

But I wonder sometimes.. what does one do when the property that one wants to evaluate is not (easily? at all?) measurable via a web interface?

Over the web, you can measure things like what is clicked, how much time is spent on a page, etc. But what if the change you are trying to make is less tangible than that? For example, suppose you want to give the user some additional source of information about an item on Amazon. That extra information is not necessarily clickable, so you can't really measure clicks. That information also might help the user decide -not- to purchase the item, rather than purchase the item -- something that is actually very useful to the user even though it does not convert to additional sales income for Amazon. So you can't really measure a new feature by how often something is purchased.

Take, for example, a recent feature (or at least I just noticed it recently -- it might have been around for longer) on Amazon: a user star-rating histogram. At the top of the reviews, Amazon has started showing a small histogram with the relative frequencies of users giving a product a 5, 4, 3, 2, or 1 star rating.

I have long (for maybe 7+ years?) wanted some sort of information like this, some sort of information beyond just the average star rating. Well, originally I wanted Amazon to show the standard deviation of the ratings, in addition to the mean. But a histogram is equally useful, and probably more interpretable to the masses.

So suppose you are making the decision to show or not show this histogram. Version A has the histogram, Version B does not. How do you actually go about measuring whether or not showing the histogram is a better approach? Especially when no one is actually clicking the histogram, and/or seeing the histogram might provide enough information to the user so that they decide -not- to purchase something?

Apologies if my question is naive. This type of user testing is not really my primary area of expertise. I'm just curious.

Anonymous said...

Jeremy, we feel the star system hides too much information and have built a system to really examine that space. Check out, and let us know what you think.


Anonymous said...

Abdur, that's great. So if I understand it correctly, on your site you are showing not only the distribution over the various star levels, but also some sense of the absolute frequency / quantity of judgements available for each item.

That is so much better, IMHO, than a system that just shows you the average star rating, without giving you any sense of the variance, skew, and sample size.

But that still doesn't answer my question. My question is: How do you do A/B testing on your interface? This might be painfully obvious to all of you, and I am just the odd duck here. But what is it that you're actually measuring, that will tell you that putting those range bars next to a result is "better" than just putting the average star rating?

Intuitively, I agree with you that your approach is better. But the point of Greg's post, above, is to point out how Amazon goes beyond intuition, and gathers empirical results.

So how do you actually test the benefits of your change? Does more clicks mean that the change is better? Does fewer clicks mean that the change is better? Does longer time spent viewing a page mean that the change is better? Shorter time spent viewing a page? More items purchased after fewer items looked at? Fewer items purchased after more items looked at? What is better?

Do you see what I am asking here? How do you actually test an information visualization tool, when that tool never gets directly utilized (i.e. you don't click the visualization to make a purchase)?

Andrew Hitchcock said...

Jeremy, wouldn't the information just be the number of purchases with the new histogram? If adding the histogram notably decreases how likely someone is to purchase an item, then it probably isn't a good feature to add (from a revenue standpoint).
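Comparing purchase rates between the two variants is typically done with something like a two-proportion z-test on the conversion counts. A minimal sketch (function name and all numbers are hypothetical, purely for illustration):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates,
    using the pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts: A = no histogram, B = histogram shown
z = two_proportion_z(conv_a=500, n_a=10000, conv_b=540, n_b=10000)
print(abs(z) > 1.96)  # significant at ~95%? -> False here
```

With these made-up numbers, a 0.4-point lift on 10,000 users per variant is not yet distinguishable from noise, which is exactly the sample-size point the paper makes.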

Anonymous said...

Andrew, whatever the form of the new information, my point or my question is still "how do you A/B test it?" Or, "by what criteria do you say that A is better than B?"

You write "If adding the histogram notably decreases how likely someone is to purchase an item, then it probably isn't a good feature to add (from a revenue stand point)."

In a way, that answers my question. You are evaluating A and B not by end user satisfaction, but by bottom line revenue to the company.

And if that's the metric, I can see why someone following that metric might not place much value on information visualization and exploration. Such features do not necessarily yield "immediate" payoff.

For example, suppose I had $300 burning a hole in my pocket, and I knew that, no matter what, I was going to spend it on a consumer point-and-shoot digital camera. The only question is: which camera?

So I start using Amazon to explore. I look at features. I read reviews. Etc.

Now, it is well known that consumers only have a finite amount of attention or effort they can give toward a particular task, and that after consumer patience has worn out, consumers usually just satisfice, and pick a camera that might not be the absolute best along all dimensions, but that satisfies some or most of them. When it starts to take 15% more effort to find that one camera that is 5% better, users give up the hunt and just buy what they have already found.

What if, on the other hand, users were presented with a new interface such as the new histogram from Amazon, or the interface from Abdur. Then users might indeed spend less time on any one product, but instead be able to sift through many more products, much quicker. Because the information visualization helps them more quickly assess how worthwhile (or not!) it is to read through all of the reviews, they can spend less time reading the reviews of something that they wouldn't have purchased anyway, and instead spend more of their finite attention/effort on reading reviews of a product that they might not have found, if they had spent that attention/effort evaluating products that they would not have bought, anyway!

Thus, with the same amount of total user effort, the user can find a camera that is a much better fit for that user's life.

So how do you A/B test that? If you're just measuring how much money a consumer spends as a whole, then both interfaces A and B are equal, because in both cases the consumer spent the exact same amount: $300. Interface B might actually be better, but your metric does not actually tell you this.

On the other hand, if you are measuring on a per product basis (i.e. percentage of users who looked at a product, who then ended up buying it), the interface with the information visualization is actually going to appear to be much worse! Because the interface allows people to more quickly explore more products with less effort, and because people still only end up buying a single product (camera in our example), the per-product purchase rates will, of sheer algebraic necessity, decline.

And yet, this latter interface actually improves user satisfaction, without lessening the overall revenue of the company as a whole! At the same time, you can't just change your metric to something like "the better interface is the one in which users start looking at more products", can you? Because based on clicks alone, it seems to me very difficult to tell the difference between a user who is successfully exploring, and becoming more satisfied because he can churn through more products, quicker, and the user who is frustrated and thrashing, because he is not finding what he is looking for.

Am I at all on-base, painting the dilemma in this manner? My question remains: How do you actually measure some of these things? How do you actually measure whether A or B is better?

Greg Linden said...

Hi, Jeremy. I think the assumption is that improvements in user satisfaction are correlated with increases in revenue, even short-term revenue.

There are ways to violate this assumption -- for example, pop-up ads appear to yield a short-term lift in revenue but reduce user satisfaction -- but I think the assumption holds on average in most cases.

Unfortunately, we often have to make do with imperfect measures. It is too expensive to measure user satisfaction or long-term gains in operating margins, so we have to use what we can get easily.

Anonymous said...

Thanks for the update, Greg.

What about in other contexts, where you do not have this purchase/profit information?

For example, I once got into a discussion with a Google engineer/product manager about the result count information that Google displays in its searches.. you know.. the "Results 1 - 10 of about 15,000,000 for [your query]" information.

Robert Scoble showed, about a year ago, how inaccurate that information really is. He made up a word ("brrreeeport") that had zero hits on Google WEB search (because it did not exist) and asked bloggers to write this word in their blogs. Google blog search then showed after 2 or 3 days that there were approximately 600-700 blogs that contained this word. At the same exact time (after this same 2-3 days), Google WEB search showed 173,000 web results for this same heretofore non-existent term.

In other words, that 173,000 figure was completely useless / inaccurate.

And yet, Google still decides to both compute and show that information.

I asked the Google engineer why Google shows that information when it is so inaccurate. His response was "well, it really doesn't matter, because people don't really use that information anyway".

My question at that point became, "why would the Google UI team decide to show that information, then, if (1) the information is inaccurate, and (2) no one uses it anyway?" He didn't really have a response.

But I have to believe that at some point, Google did A/B testing on that feature, and decided whether or not to show it. It takes compute power to calculate those numbers, and time and bandwidth to transmit those extra bytes. So if users really are not using that information, as this Google engineer claimed, that is wasted bandwidth and compute cycles. And it slows down response time, which is a cardinal Google sin. And yet, Google still shows that information.

So how, in an A/B testing world, does one actually evaluate whether or not showing that information is useful? Especially when users never interact directly with that information, i.e. users cannot click or otherwise perform any useful action on the "Results 1-10..." statement, directly.

So, if users click more links when that information is shown, and fewer links when it is not shown, does that mean showing the information is better? What if the opposite is true.. and users click fewer links when that information is shown? Does that mean showing that information is better?

And does showing that information actually have any sort of effect on how many links are actually clicked? If it doesn't, how does one empirically (rather than intuitively) decide whether or not to show that info?

I personally, intuitively believe that information is generally useful, even despite its inaccuracies. I'm a strong believer in data visualization and exposing more internal information to the user, because I believe it allows the users to make better choices. But because this information is a peripheral visualization, and not interacted with directly, I am hard pressed to explain exactly how you would A/B test that intuition.

I open this question up not just to you, Greg, but to your readers as well.

Abdur, where didja go? :-) What is your feeling about this?

Chuck Lam said...

Jeremy, I think you're right that A/B testing often misses things that are difficult to measure or that have long-term effects. However, it may still be the case that data trumps intuition. It just means that you need data from user studies, focus groups, etc. to complement your A/B testing data.