Thursday, January 06, 2011

The experiment infrastructure at Google

Papers that reveal details of Google's internal systems are always fun. At KDD 2010 a few months ago, four Googlers presented "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation" (PDF).

The paper describes Google's tools for handling the challenging task of running many experiments simultaneously and includes tidbits on how they launch new features. Some excerpts:
We want to be able to experiment with as many ideas as possible .... It should be easy and quick to set up an experiment ... Metrics should be available quickly so that experiments can be evaluated quickly. Simple iterations should be quick ... The system should ... support ... gradually ramping up a change to all traffic in a controlled way.

[Our] solution is a multi-factorial system where ... a request would be in N simultaneous experiments ... [and] each experiment would modify a different parameter. Our main idea is to partition parameters into N subsets. Each subset is associated with a layer of experiments. Each request would be in at most N experiments simultaneously (one experiment per layer). Each experiment can only modify parameters associated with its layer (i.e., in that subset), and the same parameter cannot be associated with multiple layers ... [We] partition the parameters ... [by] different binaries ... [and] within a binary either by examination (i.e., understanding which parameters cannot be varied independently of one another) or by examining past experiments (i.e., empirically seeing which parameters were modified together in previous experiments).
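
To make the layering idea concrete, here is a minimal sketch in Python of how a request might be diverted into at most one experiment per layer, with each experiment only allowed to override parameters owned by its layer. The layer names, parameters, and hashing scheme are my own invention for illustration, not details from the paper.

```python
import hashlib

# Hypothetical layers, invented for illustration: each layer owns a disjoint
# subset of parameters, and each experiment may only override parameters in
# its own layer.
LAYERS = {
    "ui_layer": {
        "params": {"background_color", "font_size"},
        "experiments": [
            {"id": "blue_bg", "traffic": 0.10, "overrides": {"background_color": "blue"}},
            {"id": "big_font", "traffic": 0.10, "overrides": {"font_size": 14}},
        ],
    },
    "ranking_layer": {
        "params": {"ranking_boost"},
        "experiments": [
            {"id": "boost_up", "traffic": 0.20, "overrides": {"ranking_boost": 1.5}},
        ],
    },
}

DEFAULTS = {"background_color": "white", "font_size": 12, "ranking_boost": 1.0}


def bucket(cookie_id, layer_name):
    """Hash (cookie, layer) to a stable point in [0, 1); hashing per layer
    keeps assignments in different layers independent of each other."""
    digest = hashlib.sha256(f"{cookie_id}/{layer_name}".encode()).hexdigest()
    return int(digest[:8], 16) / 2**32


def assign_parameters(cookie_id):
    """Start from the defaults, then apply at most one experiment per layer."""
    params = dict(DEFAULTS)
    for layer_name, layer in LAYERS.items():
        point, start = bucket(cookie_id, layer_name), 0.0
        for exp in layer["experiments"]:
            if start <= point < start + exp["traffic"]:
                # An experiment can only modify parameters owned by its layer.
                assert set(exp["overrides"]) <= layer["params"]
                params.update(exp["overrides"])
                break
            start += exp["traffic"]
    return params


print(assign_parameters("cookie-12345"))
```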

Given this infrastructure, the process of evaluating and launching a typical feature might be something like: Implement the new feature in the appropriate binary (including code review, binary push, setting the default values, etc) ... Create a canary experiment (pushed via a data push) to ensure that the feature is working properly ... Create an experiment or set of experiments (pushed via a data push) to evaluate the feature ... Evaluate the metrics from the experiment. Depending on the results, additional iteration may be required, either by modifying or creating new experiments, or even potentially by adding new code to change the feature more fundamentally ... If the feature is deemed launchable, go through the launch process: create a new launch layer and launch layer experiment, gradually ramp up the launch layer experiment, and then finally delete the launch layer and change the default values of the relevant parameters to the values set in the launch layer experiment.
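
That launch process lends itself to a small sketch as well. The following is a hypothetical illustration, not Google's actual tooling: a launch-layer experiment is ramped up to full traffic via a series of data pushes and, once it reaches 100%, its values become the binary defaults and the layer is deleted. All names and the ramp schedule are made up.

```python
# Hypothetical ramp schedule: the fraction of traffic the launch-layer
# experiment gets after each data push.
RAMP_SCHEDULE = [0.01, 0.05, 0.20, 0.50, 1.00]

DEFAULTS = {"snippet_length": 160}

launch_experiment = {
    "layer": "launch_layer_new_snippets",   # made-up launch layer name
    "overrides": {"snippet_length": 180},   # the values being launched
    "traffic": 0.0,
}


def metrics_look_healthy(experiment):
    """Stand-in for checking the experiment's metrics after each ramp step."""
    return True


def ramp_up(experiment, schedule):
    """Walk the experiment through the ramp schedule; in practice each step
    would be a separate data push, gated on the metrics looking healthy."""
    for fraction in schedule:
        experiment["traffic"] = fraction
        if not metrics_look_healthy(experiment):
            experiment["traffic"] = 0.0   # roll the launch back
            return False
    return True


if ramp_up(launch_experiment, RAMP_SCHEDULE):
    # Final step from the paper: make the launched values the binary defaults,
    # then delete the launch layer and its experiment.
    DEFAULTS.update(launch_experiment["overrides"])
```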

We use real-time monitoring to capture basic metrics (e.g., CTR) as quickly as possible in order to determine if there is something unexpected happening. Experimenters can set the expected range of values for the monitored metrics (there are default ranges as well), and if the metrics are outside the expected range, then an automated alert is fired. Experimenters can then adjust the expected ranges, turn off their experiment, or adjust the parameter values for their experiment. While real-time monitoring does not replace careful testing and reviewing, it does allow experimenters to be aggressive about testing potential changes, since mistakes and unexpected impacts are caught quickly.
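
As a rough illustration of that expected-range alerting, here is my own sketch, with made-up metric names and thresholds, of checking a monitored metric like CTR against the range the experimenter set and firing an alert when it falls outside.

```python
# Made-up default expected ranges for monitored metrics.
DEFAULT_RANGES = {"ctr": (0.01, 0.20)}


def check_metrics(experiment_id, observed, expected_ranges=None, alert=print):
    """Compare observed metric values against their expected ranges and fire
    an alert for any metric that falls outside its range."""
    ranges = {**DEFAULT_RANGES, **(expected_ranges or {})}
    out_of_range = {}
    for metric, value in observed.items():
        low, high = ranges.get(metric, (float("-inf"), float("inf")))
        if not low <= value <= high:
            out_of_range[metric] = value
            alert(f"ALERT {experiment_id}: {metric}={value:.4f} outside [{low}, {high}]")
    return out_of_range


# Example: an experiment whose click-through rate has fallen far below the
# range the experimenter expected.
check_metrics("blue_bg", {"ctr": 0.002}, expected_ranges={"ctr": (0.05, 0.15)})
```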

One thing I like about the system they describe is that the process of launching is the same as the process for experimentation. That's a great way to set things up, treating everything to be launched as an experiment. It creates a culture where every change to be launched needs to be tested online, and where experiments are treated not so much as tests to be taken down when done, but as candidates to be sent out live as soon as they prove themselves.

Another thing I like is the real-time availability of metrics and the ability to change experiment configurations very quickly. Not only does that allow experiments that are having a surprisingly bad impact to be shut down quickly, which lowers the cost of errors, but it also speeds the ability to learn from the data and iterate on the experiment.

Finally, the use of standardized metrics across experiments and an "experiment council" of experts who can be consulted to help interpret experimental results is insightful. Results of experiments are often subject to some interpretation, unfortunately enough interpretation that overly eager folks at a company can torture the data until it says what they want (even when they are trying to be honest), so an effort to help people keep decisions objective is a good idea.

One minor but surprising tidbit in the paper is that binary pushes are infrequent ("weekly"); only configuration files can be pushed frequently. I would have thought they would push binaries daily. In fact, reading between the lines a bit, it sounds like developers might have to do extra work to cope with infrequent binary pushes, anticipating what they will want during the experiments and writing extra code that can be enabled or disabled later by configuration file, which might interfere with their ability to rapidly learn and iterate on the experimental results. It also may cause the configuration files to become very complex and bug-prone, which the paper alludes to in the section on the need for data file checks. In general, very frequent pushes are desirable, even for binaries.
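
For illustration, this is the kind of config-guarded code I have in mind, a hypothetical sketch rather than anything from the paper: the new code path ships dark in the weekly binary, and a later configuration push flips it on.

```python
import json

# A hypothetical configuration file, pushed separately from the binary.
CONFIG = json.loads('{"enable_new_snippets": false, "snippet_length": 180}')


def render_snippet(text):
    """The new code path ships disabled in the weekly binary; a later config
    push can enable it without waiting for the next binary push."""
    if CONFIG.get("enable_new_snippets"):
        return text[: CONFIG.get("snippet_length", 180)] + "..."
    return text[:100] + "..."


print(render_snippet("Some long search result snippet text " * 10))
```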

1 comment:

Greg Linden said...

Hi, Barry. Thanks, that's a good point about pushing binaries in a distributed environment. It might also have something to do with the size of the binary. Amazon had a problem several years ago with a very large binary that took a long time to push (which it solved mostly by splitting into many smaller binaries).

But, even if there are reasons that frequent pushes of binaries can be difficult, the point still stands that infrequent pushes impose a lot of hidden costs on developers and the company: They slow the pace of iteration. They force developers to write more code in case it might be needed. They transfer complexity from code into configuration files, where it can be harder to test and debug. They complicate scheduling and slow launches. And they create more errors overall by bundling large numbers of changes into each push (aka a "big bang launch"), which adds complexity and makes each push harder to test.

Because of those hidden costs, I think it is better to push binaries more often than weekly.