Friday, May 30, 2008

Machines versus humans at Google

A curious revelation from Googler Peter Norvig appears in a recent post by Anand Rajaraman:
[To execute a web search] a subset of documents is identified based on the presence of the user's keywords. Then, these documents are ranked by a very fast algorithm that combines ... 200 [pre-computed] signals in-memory using a proprietary formula.

[This] appears to be made-to-order for machine learning algorithms. Tons of training data (both from usage and from the armies of "raters" employed by Google), and a manageable number of signals (200) -- these fit the supervised learning paradigm well, bringing into play an array of ML algorithms from simple regression methods to Support Vector Machines.

And indeed, Google has tried methods such as these. Peter tells me that their best machine-learned model is now as good as, and sometimes better than, the hand-tuned formula on the results quality metrics that Google uses.

The big surprise is that Google still uses the manually-crafted formula for its search results. They haven't cut over to the machine-learned model yet.

Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.
Update: Anand writes a follow-up post, "How Google Measures Search Quality".

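To make the quoted description concrete, here is a minimal sketch of the two steps it describes: filter documents by keyword, then rank the survivors by combining pre-computed signals. Everything specific in it is an assumption for illustration. The real formula is proprietary and uses around 200 signals, while this toy uses three made-up signals, a plain weighted sum, a hypothetical three-document index, and invented rater labels. The `fit_weights` function stands in for the "simple regression methods" end of the spectrum Anand mentions and is not Google's method.

```python
from typing import Dict, List, Tuple

# Hypothetical toy index: doc id -> (text, pre-computed signals).
DOCS: Dict[str, Tuple[str, List[float]]] = {
    "a": ("machine learning for search ranking", [0.9, 0.2, 0.5]),
    "b": ("cooking with cast iron", [0.8, 0.9, 0.1]),
    "c": ("search ranking signals explained", [0.4, 0.7, 0.9]),
}

# Hypothetical hand-tuned weights, one per signal.
HAND_TUNED: List[float] = [0.5, 0.3, 0.2]


def candidates(query: str) -> List[str]:
    """Step 1: keep documents that contain every query keyword."""
    terms = query.lower().split()
    return [
        doc_id
        for doc_id, (text, _) in DOCS.items()
        if all(term in text.split() for term in terms)
    ]


def score(signals: List[float], weights: List[float]) -> float:
    """Combine signals with a weighted sum (an assumed formula)."""
    return sum(w * s for w, s in zip(weights, signals))


def rank(query: str, weights: List[float]) -> List[str]:
    """Step 2: order the candidates by score, best first."""
    return sorted(
        candidates(query),
        key=lambda doc_id: score(DOCS[doc_id][1], weights),
        reverse=True,
    )


def fit_weights(
    examples: List[Tuple[List[float], float]],
    steps: int = 2000,
    lr: float = 0.1,
) -> List[float]:
    """Learn weights from (signals, rater label) pairs by least squares,
    minimized with plain gradient descent."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for signals, label in examples:
            err = score(signals, weights) - label
            for i, s in enumerate(signals):
                grad[i] += err * s
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights


# Invented rater labels: raters liked "c" most and "b" least.
RATED = [(DOCS["a"][1], 0.2), (DOCS["b"][1], 0.0), (DOCS["c"][1], 1.0)]

hand_order = rank("search ranking", HAND_TUNED)
learned_order = rank("search ranking", fit_weights(RATED))
print(hand_order, learned_order)
```

In this toy the hand-tuned weights and the learned weights order the two matching documents differently, which is the kind of disagreement a quality metric would have to adjudicate. The worry about unforeseen query types corresponds to signal vectors unlike anything in `RATED`: the fitted weights are only constrained by the examples they were trained on.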

Arthur said...

The Google model *is* a machine-learned model. It's a form of ML called supervised learning. You teach the computer how to do something, and then the computer does the rest. There's no real way to make the computer do it from scratch. The computer doesn't know anything about how humans perceive information. We have to teach it that. The human component will always be there, no matter how advanced ML gets.

Anonymous said...

This is really strange. Google's execs have always claimed that the way to intelligence (the Google way) is via huge amounts of data and learning. Interesting that they are not applying this to ranking.