Thursday, May 29, 2008

Udi on Google search quality

Google VP Udi Manber offers a high-level description of what goes into Google's relevance ranking in his recent post, "Introduction to Google Search Quality".

Some excerpts:
Ranking is hard ... We need to be able to understand all web pages, written by anyone, for any reason ... We also need to understand the queries people pose [and their needs], which are on average fewer than three words, and map them to our understanding of all documents ... And we have to do all of that in a few milliseconds.

PageRank is still in use today, but it is now a part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it's not just the language, it's how people use it today), time models (some queries are best answered with a 30-minutes old page, and some are better answered with a page that stood the test of time), and personalized models (not all people want the same thing).

In 2007, we launched more than 450 new improvements, about 9 per week on the average. Some of these improvements are simple and obvious -- for example, we fixed the way Hebrew acronym queries are handled (in Hebrew an acronym is denoted by a (") next to the last character, so IBM will be IB"M), and some are very complicated -- for example, we made significant changes to the PageRank algorithm in January.

Please see also Barry Schwartz's post, "A Deeper Look At Google's Search Quality Efforts", which provides some additional commentary on Udi's post.

Please see also my earlier post, "The perils of tweaking Google by hand", which asks whether these thousands of twiddles to the search engine, and variations of them, should be tested continuously rather than evaluated only once, in a single version, at the time they are created.

1 comment:

Anonymous said...

But IBM is an initialism, so it gets the Geresh (׳) instead of the Gershayim (״). Laser and NASA are acronyms.