Sunday, June 03, 2007

The perils of tweaking Google by hand

An article in the NYT business section today by Saul Hansell, "Google Keeps Tweaking Its Search Engine", has intriguing details on Google's ranking algorithm from discussions with Googlers Amit Singhal and Udi Manber.

Some excerpts:
When it comes to the search engine — which has many thousands of interlocking equations — [Google] has to double-check the engineers' independent work with objective, quantitative rigor to ensure that new formulas don't do more harm than good.

Recently, a search for "French Revolution" returned too many sites about the recent French presidential election campaign -- in which candidates opined on various policy revolutions -- rather than the ouster of King Louis XVI. A search-engine tweak gave more weight to pages with phrases like "French Revolution" rather than pages that simply had both words.

Typing the phrase "teak patio Palo Alto" didn't return a local store called the Teak Patio .... Mr. Singhal's group [wrote] a new mathematical formula to handle queries for hometown shops.

Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them ... [Singhal's] team's solution: a mathematical model that tries to determine when users want new information and when they don't. (And yes, like all Google initiatives, it had a name: QDF, for "query deserves freshness.")
I found this surprising. Google manually comes up with tweaks to its search engine that only apply to a small percentage of queries, tests the tweaks, and then tosses them into the relevance rank?

The problem with these manual tweaks is that they rapidly become unwieldy. As you add hundreds or thousands of these hand-coded rules, they start to interact in unpredictable ways. When evaluating a new rule, it becomes unclear if performance of that rule might be improved by tweaks to the rule, tweaks to other rules, or removing other rules that have now been subsumed.

It appears Google has hit this problem head on. The "many thousands of interlocking equations" require a "closely guarded internal [program] called Debug" that attempts to explain whether the rules are doing "more harm than good."

Frankly, I thought Google was beyond this. Rather than piling hack upon hack, I thought Google's relevance rank was a giant, self-optimizing system, constantly learning and testing to determine automatically what works best.

How would this work? With a search engine the size of Google's, every search query can be treated as an experiment, every interaction as an opportunity to learn and adapt.

In each query, a few of the results would be different each time. Each time, the search engine is making a prediction on the impact (usually an anticipated slight negative impact) of making this change. Wrong predictions are surprises, opportunities to learn, and are grouped with other wrong predictions until the engine can generalize and attempt a broader tweak to the algorithm. Those broader tweaks are automatically tested, integrated if they work, and the cycle repeats.
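The loop described above can be sketched in a few lines. To be clear, everything here -- the names, the threshold, the numbers -- is invented for illustration, not anything Google has described:

```python
# Illustrative sketch of treating each query as an experiment: predict the
# impact of a small ranking change, compare with the observed outcome, and
# pool the surprises for later generalization.

SURPRISE_THRESHOLD = 0.1  # observed-vs-predicted CTR gap that counts as a surprise

def collect_surprises(experiments):
    """experiments: iterable of (query, predicted_ctr, observed_ctr).
    Returns (query, error) pairs whose outcome defied the prediction."""
    surprises = []
    for query, predicted, observed in experiments:
        error = observed - predicted
        if abs(error) > SURPRISE_THRESHOLD:
            surprises.append((query, error))
    return surprises

# Invented example: the engine expected swapping in a different result to
# have little effect, but for two queries it unexpectedly helped a lot.
log = [
    ("teak patio palo alto", 0.02, 0.31),  # large positive surprise
    ("garden stone seattle", 0.03, 0.28),  # large positive surprise
    ("french revolution",    0.05, 0.06),  # matched prediction, ignored
]
print(collect_surprises(log))
```

The surprises, not the successes, are what feed the next step: they get clustered with other wrong predictions until some generalization suggests itself.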

For example, on the query [teak patio palo alto], experiments may show the Teak Patio store in Palo Alto is getting unexpectedly high clickthrough. Another query, [garden stone seattle], shows similar surprises in experiments. In clustering, both queries are classified as local, and both show modest purchase intent. The clickthrough URLs have been classified as local businesses. On the known data, the most general rule consistent with these results -- boosting local businesses when a query is classified as local with purchase intent -- appears to give a lift. The rule is tested live on a percentage of users, results match predictions, and the system adds the new rule for all users. The process repeats.
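A toy version of this generalization step, with made-up query features: the proposed rule's trigger is simply the set of features shared by every query in the surprise cluster, and the rule is promoted only if a live test on a slice of traffic matches predictions.

```python
# Hypothetical sketch of promoting a generalized rule from a cluster of
# surprising queries. Feature names and the boost action are invented.

def propose_rule(surprise_cluster):
    """The features shared by every query in the cluster become the
    rule's trigger condition."""
    shared = set.intersection(*(set(q["features"]) for q in surprise_cluster))
    return {"trigger": shared, "action": "boost_local_businesses"}

def promote_if_validated(rule, live_test):
    """live_test returns (predicted_lift, observed_lift) from a traffic slice;
    promote only when the live results meet the prediction."""
    predicted, observed = live_test(rule)
    return observed >= predicted

cluster = [
    {"query": "teak patio palo alto",
     "features": ["local", "purchase_intent", "furniture"]},
    {"query": "garden stone seattle",
     "features": ["local", "purchase_intent", "garden"]},
]
rule = propose_rule(cluster)
print(rule["trigger"])  # the features common to the cluster
```

The key property is that the rule is the *most general* statement the data supports: "furniture" and "garden" drop out of the trigger because only one query in the cluster has each.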

Similarly, a group of queries that match known names (e.g. celebrities) may show that links classified as sites are appearing too low in rankings. In automated testing, a general version of this rule performs poorly unless another rule with strong overlap is removed; a more specific rule performs well but applies less frequently. The older rule is removed, the newer general rule put in place.
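This kind of rule-interaction problem can be framed as a search over subsets of candidate rules: score each configuration on held-out queries and keep the best. The scores below are invented to mirror the scenario in the paragraph, where the new general rule only wins once the overlapping older rule is removed.

```python
# Sketch of an automated rule-interaction test. evaluate() is assumed to
# score a set of active rules on held-out queries; all numbers are made up.
from itertools import combinations

def best_configuration(rules, evaluate):
    """Try every subset of candidate rules and return the highest-scoring one.
    (Exhaustive search; fine for a handful of interacting rules.)"""
    best, best_score = frozenset(), evaluate(frozenset())
    for r in range(1, len(rules) + 1):
        for subset in combinations(rules, r):
            score = evaluate(frozenset(subset))
            if score > best_score:
                best, best_score = frozenset(subset), score
    return best

# Invented scores: the new rule performs poorly alongside the old one,
# but well on its own -- so the old rule should be removed.
scores = {
    frozenset(): 0.70,
    frozenset({"old"}): 0.72,
    frozenset({"new"}): 0.78,
    frozenset({"old", "new"}): 0.71,
}
print(best_configuration(["old", "new"], scores.__getitem__))  # frozenset({'new'})
```

Exhaustive subset search obviously does not scale to thousands of rules; in practice one would only test subsets of rules that actually overlap on the same queries.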

While this process does result in specific tweaks to the engine like the manual approach, it does not rely on someone manually finding the rule. Unexpected tweaks to relevance rank may arise from the data. Moreover, a self-optimizing relevance rank does not rely on someone manually coming back to rules to maintain or debug them over time.

This approach would require massive computational power -- a huge infrastructure classifying and clustering queries and urls into hierarchies, a framework for testing billions of changes and tweaks simultaneously and generalizing from the results -- but I thought Google had that power already.

Perhaps this merely shows how much further there is to go in search. As Larry Page recently said, "We're probably only 5 per cent of the way there."

Update: Five weeks later, there are a few more tidbits about Google's experiments and its tweaking of its relevance rank in an interview of Udi Manber by Eric Enge. Some excerpts:
We run literally thousands of experiments a year and pick the ones that score well.

We have projects that their sole purpose is to reduce complexity. A team may go and work for two months on a new simpler sub-algorithm. If it performs the same as the previous algorithm, but it's simpler, that will be a big win.

Overall, we have to be very careful that the complexity of the algorithm does not exceed what we can maintain.

5 comments:

or said...

Except that I would not take a few simple quotes and generalize from them to understand how Google works with this stuff. Obviously, they did not reveal the real "secret sauce" to the reporters. There is much they did not say. It's possible that they have something similar to what you are describing. Manual tweaks will be necessary at times.

Anonymous said...

.... and it is not worth getting too over-awed either. There is a certain Wizard of Oz quality to all this.

After all it is in their interest to project an aura of magic and wizardry, if only to persuade (prospective) competitors that the game is hopeless. A lot of their "talent" hiring smacks of that, when you think about it.

Matt said...

Just to agree with the previous comment, I wouldn't draw quite the same inferences as you did, Greg.

I would say that while machine learning is important at Google, it doesn't completely remove the aspect of engineers looking to improve the system. Rather than tweaks, I'd view the efforts more as tuning the existing algorithms and looking for additional algorithms that can improve search as well. In the same way that you probably wouldn't be content to leave Findory's personalization algorithm(s) unchanged forever, folks at Google are looking for ways to improve their algorithms, even as the algorithms run and do many aspects of ranking/personalization automatically.

jeremy said...

Greg, two reactions:

In each query, a few of the results would be different each time. Each time, the search engine is making a prediction on the impact (usually an anticipated slight negative impact) of making this change. Wrong predictions are surprises, opportunities to learn, and are grouped with other wrong predictions until the engine can generalize and attempt a broader tweak to the algorithm.

First, I am impressed by the timeliness of your suggestions. Donna Harman (one of the principal motivating forces behind the original TREC in the early 90s, which arguably enabled the biggest improvements in IR systems, pre-Google) gave a talk at RIAO last week. She addressed exactly this issue and gave exactly this solution. She basically took everyone to task for not doing more of this failure analysis: noticing and extracting the queries that fail, and then generalizing solutions. It is not just Google that is failing to do more of this; the rest of academia is fairly poor in this respect, too.

Frankly, I thought Google was beyond this. Rather than piling hack upon hack, I thought Google's relevance rank was a giant, self-optimizing system, constantly learning and testing to determine automatically what works best.

Second reaction: I am not as surprised as you are. In talking with folks from Google over the years, I have fairly consistently received the message that Google is more interested in engineering than science. In fact, at SIGIR 2004, the very same Amit above said that if you are interested in getting hired by Google, you should become a "hacker" (his word, in quotes). Of course, he meant this in typical geek fashion: hacker as a tinkerer and engineer, rather than hacker as cracker. But with the emphasis on "hacking" rather than methodical science, it is not as surprising that improvements to the engine are "hacks" and "tweaks" rather than principled, scientifically generalized solutions. This is not meant as a criticism, as I certainly am not always as rigorous as I should be in my own work. I am only making the observation that what they report is not so out of character as it seems, given their emphasis on engineering over science.

clam said...

The real difficulty in determining search quality is that it's not well defined and not easily evaluated from logs. You can notice egregious failures (user reformulates query too many times), but it's hard to detect "minor" failures or if the user has settled on a "good enough but not great" result.

Even if you can learn from failure, you'd still not want to rely on it too much for usability reasons. The feedback you get is necessarily noisy and you need a sufficient number of failures to generalize them to a pattern. Ideally your judgement should notice early warning signs *before* failures have occurred to users. (However, having said that, I do agree with jeremy/Donna Harman that automatic failure analysis is a big opportunity because so little of it is done right now, and it's more scalable.)

As for manual tuning, I'm not surprised at all. In fact, I believe Google has an army of low-paid drones that use different versions of their search engine and report them to the engineering "priesthood" to decide which hacks to put into production.

You know how infomercials always have a "doctor" testifying to the effectiveness of the product and some computer graphic showing a "scientific simulation" of how it works? Same thing here, and the NYT took the bait.