Friday, February 17, 2006

Google and The Happy Searcher

I recently happened across a paper by four Googlers called "The Happy Searcher: Challenges in Web Information Retrieval" (PDF).

It includes discussions of some of the challenges in web, image, audio, and video search, dealing with spammers, evaluating relevance rank quality, and search, reputation, and recommendations in Google Groups.

It's a great read and an easy, light read. Even if you don't normally read technical papers, this one is definitely worth a peek.

Here are some selected extended excerpts:
One particularly intriguing problem in web IR arises from the attempt by some commercial interests to unduly heighten the ranking of their web pages by engaging in various forms of spamming .... Classification schemes must work in an adversarial context as spammers will continually seek ways of thwarting automatic filters. Adversarial classification is an area in which precious little work has been done.

Proper evaluation of [relevance rank] improvements is a non-trivial task ... Recent efforts in this area have examined interleaving the results of two different ranking schemes and using statistical tests based on the results users clicked on to determine which ranking scheme is "better". There has also been work along the lines of using decision theoretic analysis (i.e., maximizing users' utility when searching, considering the relevance of the results found as well as the time taken to find those results) as a means for determining the "goodness" of a ranking scheme.

At the article or posting level, one can similarly rank not just by content relevance, but also take into account aspects of articles that not normally associated with web pages, such as temporal information (when a posting was made), thread information, the author of the article, whether the article quotes another post, whether the proportion of quoted content is much more than the proportion of original content, etc .... Furthermore, one can also attempt to compute the inherent quality or credibility level of an author independent of the query.

Content Similarity Assessment ... [attempts] to find images (audio tracks) that are similar to the query items. For example, the user may provide an image (audio snippet) of what the types of results that they are interested in finding, and based on low-level similarity measures, such as (spatial) color histograms, audio frequency histograms, etc, similar objects are returned. Systems such as these have often been used to find images of sunsets, blue skies, etc. and have also been applied to the task of finding similar music genres.

The Google spelling corrector takes a Machine Learning approach that leverages an enormous volume of text to build a very fine grained probabilistic context sensitive model for spelling correction ... By employing a context sensitive model, the system will correct the text "Mehran Salhami" to "Mehran Sahami" even though "Salami" is a common English word and is the same edit distance from "Salhami" as "Sahami." Such fine grained context sensitivity can only be achieved through analyzing very large quantities of text.

Web information retrieval presents a wonderfully rich and varied set of problems where AI techniques can make critical advances ... We hope to stimulate still more research in this area that will make use of the vast amount of information on the web.
If you like this paper, there's another older paper that might also be of interest, "Web Information Retrieval - an Algorithmic Perspective" by Monika Henzinger from Google. It is a nice, short, light overview of the components of a web search engine and some of the challenges of web search.


Venkatesh said...

Thanks for the link. Interesting read.

Nicholas Ochiel said...

As always, thanks for the reading recommendations.