Wednesday, January 26, 2011

Latest reading

A few things that have caught my attention lately:
  • Google generates $24 per user, Yahoo $8, and Facebook $4. ([1])

  • Paul Graham's list of the most important traits in founders. I'd say they are even in priority order. ([1])

  • Groupon competitors suddenly are everywhere. Not surprising; there's no technology barrier here, and it is easy to attract people if you offer 50% off at a popular store. But I wonder if this booming fad is sustainable. Particularly concerning is that only 22% of people who use a Groupon offer return to the merchant, which, combined with the steep discounts, makes this look expensive. ([1] [2] [3] [4])

  • Google is accused of copying a bunch of someone else's code. And this isn't the first time. Back in 2005, Google was accused of copying much of the code for Orkut. ([1] [2])

  • Great, updated, and free book chapters on how people spam search engines. I particularly like the new discussion of spamming click data and social sites. ([1])

  • A very dirty little secret: most AOL subscribers don't realize they don't need to be paying, and the revenue from hoodwinking those people is a big part of AOL's profits. ([1])

  • Best article I've seen on why Google CEO Eric Schmidt is stepping down. ([1])

  • With this latest move, Amazon closes the gap between their low level cloud services (microleasing of servers and storage) and the higher level offerings of Google App Engine and Microsoft Azure. ([1])

  • The trends of abandoning TV and embracing smartphones have been vastly overstated. Most Americans watch more TV than ever and use dumbphones. ([1] [2])

  • A claim that Facebook has resorted to scammy malware-type advertisements in an effort to drive revenue. ([1])

Thursday, January 06, 2011

The experiment infrastructure at Google

Papers that reveal details of Google's internal systems are always fun. At KDD 2010 a few months ago, four Googlers presented "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation" (PDF).

The paper describes Google's tools for handling the challenging task of running many experiments simultaneously and includes tidbits on how they launch new features. Some excerpts:
We want to be able to experiment with as many ideas as possible .... It should be easy and quick to set up an experiment ... Metrics should be available quickly so that experiments can be evaluated quickly. Simple iterations should be quick ... The system should ... support ... gradually ramping up a change to all traffic in a controlled way.

[Our] solution is a multi-factorial system where ... a request would be in N simultaneous experiments ... [and] each experiment would modify a different parameter. Our main idea is to partition parameters into N subsets. Each subset is associated with a layer of experiments. Each request would be in at most N experiments simultaneously (one experiment per layer). Each experiment can only modify parameters associated with its layer (i.e., in that subset), and the same parameter cannot be associated with multiple layers ... [We] partition the parameters ... [by] different binaries ... [and] within a binary either by examination (i.e., understanding which parameters cannot be varied independently of one another) or by examining past experiments (i.e., empirically seeing which parameters were modified together in previous experiments).

Given this infrastructure, the process of evaluating and launching a typical feature might be something like: Implement the new feature in the appropriate binary (including code review, binary push, setting the default values, etc) ... Create a canary experiment (pushed via a data push) to ensure that the feature is working properly ... Create an experiment or set of experiments (pushed via a data push) to evaluate the feature ... Evaluate the metrics from the experiment. Depending on the results, additional iteration may be required, either by modifying or creating new experiments, or even potentially by adding new code to change the feature more fundamentally ... If the feature is deemed launchable, go through the launch process: create a new launch layer and launch layer experiment, gradually ramp up the launch layer experiment, and then finally delete the launch layer and change the default values of the relevant parameters to the values set in the launch layer experiment.

We use real-time monitoring to capture basic metrics (e.g., CTR) as quickly as possible in order to determine if there is something unexpected happening. Experimenters can set the expected range of values for the monitored metrics (there are default ranges as well), and if the metrics are outside the expected range, then an automated alert is fired. Experimenters can then adjust the expected ranges, turn off their experiment, or adjust the parameter values for their experiment. While real-time monitoring does not replace careful testing and reviewing, it does allow experimenters to be aggressive about testing potential changes, since mistakes and unexpected impacts are caught quickly.
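The diversion logic in the excerpts above can be sketched as a per-layer hash: each layer independently maps a request into buckets, each experiment in a layer claims a disjoint bucket range and overrides only that layer's parameters, so a request lands in at most one experiment per layer. Here is a minimal sketch of that idea; the layer names, parameters, and bucketing scheme are my own illustrations, not details from the paper:

```python
import hashlib

NUM_BUCKETS = 1000  # assumed bucket count per layer

def bucket(cookie_id: str, layer_name: str) -> int:
    """Hash the request's cookie together with the layer name so that
    bucket assignments are independent across layers."""
    digest = hashlib.md5(f"{layer_name}:{cookie_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# Each layer owns a disjoint subset of parameters; each experiment in a
# layer claims a disjoint bucket range and overrides only its own layer's
# parameters. All names and values here are illustrative.
LAYERS = {
    "ui_layer": [
        {"name": "blue_links", "buckets": range(0, 50),
         "overrides": {"link_color": "#0000cc"}},
    ],
    "ranking_layer": [
        {"name": "new_scorer", "buckets": range(0, 100),
         "overrides": {"scoring_fn": "v2"}},
    ],
}

def assign(cookie_id: str) -> dict:
    """Return parameter overrides for a request: at most one
    experiment per layer."""
    overrides = {}
    for layer_name, experiments in LAYERS.items():
        b = bucket(cookie_id, layer_name)
        for exp in experiments:
            if b in exp["buckets"]:
                overrides.update(exp["overrides"])
                break  # at most one experiment in this layer
    return overrides
```

Because the hash is keyed on the layer name, whether a request is in a ranking experiment says nothing about whether it is in a UI experiment, which is what lets the layers overlap safely.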
One thing I like about the system they describe is that the process of launching is the same as the process for experimentation. That's a great way to set things up, treating everything to be launched as an experiment. It creates a culture where every change needs to be tested online before launch, and where experiments are treated not so much as tests to be taken down when done, but as candidates to be sent live as soon as they prove themselves.
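The gradual ramp-up step in the paper's launch process could be sketched as widening the launch experiment's share of traffic over time until it covers everyone, at which point the parameter defaults are changed and the layer is deleted. The schedule and bucketing here are my own illustration:

```python
RAMP_SCHEDULE = [1, 5, 20, 50, 100]  # percent of traffic; illustrative

def ramp_stage(day: int) -> int:
    """Traffic percentage for the launch-layer experiment on a given day,
    holding at 100% once the schedule is exhausted."""
    idx = min(day, len(RAMP_SCHEDULE) - 1)
    return RAMP_SCHEDULE[idx]

def in_launch(cookie_hash: int, day: int) -> bool:
    """Divert a request into the launch experiment if its bucket (0-99)
    falls under the current ramp percentage."""
    return (cookie_hash % 100) < ramp_stage(day)
```

The nice property is that early buckets stay in the launch as the percentage grows, so no user flips back and forth between old and new behavior during the ramp.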

Another thing I like is the real-time availability of metrics and the ability to very quickly change experiment configurations. Not only does that allow experiments to be shut down quickly if they are having a surprisingly bad impact, which lowers the cost of errors, but it also speeds the ability to learn from the data and iterate on the experiment.
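The monitoring the paper describes, expected ranges per metric with automated alerts when a metric falls outside its range, amounts to a simple check like the following; the metric names and thresholds are made up for illustration:

```python
# Default expected range per monitored metric; experimenters can
# override these (all numbers here are illustrative).
DEFAULT_RANGES = {"ctr": (0.01, 0.10), "latency_ms": (0.0, 250.0)}

def check_metrics(observed: dict, expected: dict = DEFAULT_RANGES) -> list:
    """Return the metrics outside their expected range; in a real
    system each entry would fire an automated alert to the
    experimenter."""
    alerts = []
    for metric, value in observed.items():
        lo, hi = expected.get(metric, (float("-inf"), float("inf")))
        if not lo <= value <= hi:
            alerts.append(metric)
    return alerts
```

On an alert, the experimenter's options match the paper's: widen the expected range, turn the experiment off, or adjust its parameter values.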

Finally, the use of standardized metrics across experiments, along with an "experiment council" of experts who can be consulted to help interpret experimental results, is insightful. Results of experiments are often subject to some interpretation, unfortunately enough that overly eager folks at a company can torture the data until it says what they want (even when they are trying to be honest), so an effort to help people keep decisions objective is a good idea.

One minor but surprising tidbit in the paper is that binary launches are infrequent ("weekly"); only configuration files can be pushed frequently. I would have thought they would push binaries daily. Reading between the lines a bit, it sounds like developers might have to do extra work to cope with infrequent binary pushes: anticipating what they will want during the experiments and writing extra code that can be enabled or disabled later by configuration file. That could interfere with their ability to rapidly learn and iterate on the experimental results. It also may cause the configuration files to become very complex and bug-prone, which the paper alludes to in its section on the need for data file checks. In general, very frequent pushes are desirable, even for binaries.
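The workflow this implies, shipping dormant code in the weekly binary and flipping it on later via a data push, is basically a feature flag merged over compiled-in defaults. A rough sketch, with a hypothetical flag name and a data-file check of the kind the paper mentions:

```python
import json

# Defaults are compiled into the binary; the dormant code path ships
# disabled and a later config push enables it without a new binary
# release. The flag name is hypothetical.
DEFAULTS = {"enable_new_snippets": False}

def load_flags(config_json: str) -> dict:
    """Merge a pushed config file over the compiled-in defaults."""
    flags = dict(DEFAULTS)
    flags.update(json.loads(config_json))
    # Data-file check: reject unknown flags, a guard against the
    # config complexity and bugs worried about above.
    unknown = set(flags) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return flags

def render_snippet(flags: dict) -> str:
    """Both code paths live in the binary; config picks one."""
    if flags["enable_new_snippets"]:
        return "new-style snippet"
    return "old-style snippet"
```

The catch, as noted above, is that every behavior you might want to try during the experiment has to be anticipated and written behind a flag before the weekly binary goes out.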