Saturday, September 01, 2007

HITS, PageRank, and keeping it simple

A SIGIR 2007 paper out of Microsoft Research, "HITS on the Web: How does it Compare?" by Marc Najork, Hugo Zaragoza, and Michael Taylor is a large-scale study of several ranking algorithms using a substantial web crawl and data from the MSN query logs.

The authors appear to have expected the HITS algorithm to outperform the others in their tests, but found instead that a combination of BM25F and simple in-degree link analysis outperformed everything else. From the paper:
We were quite surprised to find that HITS, a query-dependent feature, is about as effective as web page in-degree, the most simpleminded query-independent link-based feature.

As expected, BM25F outperforms all link-based features by a large margin. The link-based features are divided into two groups, with a noticeable performance drop between the groups. The better-performing group consists of the features that are based on the number and/or quality of incoming links (in-degree, PageRank, and HITS authority scores); and the worse-performing group consists of the features that are based on the number and/or quality of outgoing links (outdegree and HITS hub scores).

The combination of BM25F with ... id in-degree consistently outperforms the combination of BM25F with PageRank or HITS authority scores, and can be computed much easier and faster.
PageRank performed poorly in their tests. However, their explanation of why struck me as unconvincing. From the paper:
The fact that in-degree features outperform PageRank under all measures is quite surprising. A possible explanation is that link-spammers have been targeting the published PageRank algorithm for many years, and that this has led to anomalies in the web graph that affect PageRank.
This begs the question of whether they picked the right PageRank algorithm. In particular, there are variants of PageRank that they could have used that appear less sensitive to spam and may have performed much better. Unfortunately, without results for those variants, it is hard to know whether the criticisms in this paper of naive PageRank are applicable to the algorithms evolved from PageRank used by search engines today.

Even so, the results of the study are interesting, both for the overview of several relevance ranking algorithms and the conclusions about their effectiveness. Particularly intriguing is the evidence that computationally expensive algorithms such as the query-dependent HITS algorithm seem to hold no advantage over much simpler techniques.

Update: Marc Najork, one of the authors of the paper, expands on the PageRank algorithm issue and the performance of HITS in the comments for this post.

6 comments:

Anonymous said...

It is commented that Google's use of PageRank is much, much smaller than expected. PageRank is a nice marketing tool for both the consumers and the researchers, but Google tends to rely much more on low level features like anchor text and HTML tags.

Sérgio Nunes said...

I still haven't read the full paper but it seems relevant to this post:

Decoding the structure of the WWW: A comparative analysis of Web crawls
http://doi.acm.org/10.1145/1255438.1255442

From the abstract: "We find that, despite the very large size of the samples, the statistical measures characterizing these graphs differ quantitatively, and in some cases qualitatively, depending on the domain analyzed and the crawl used for gathering the data."

jeremy said...

Jeremy Zawodny made some interesting comments about this in 2003:

http://jeremy.zawodny.com/blog/archives/000751.html

"PageRank stopped working really well when people began to understand how PageRank worked. The act of Google trying to 'understand' the web caused the web itself to change."

Marc said...

Greg,

We experimented with various ways to bias the PageRank "teleportation" vector to make it more spam-resistant, and that improves the effectiveness somewhat, but it's still a long way off from BM25F.

Regarding computationally expensive techniques such as HITS being a waste of time: HITS is indeed not doing all that great, but as it turns out, close relatives of HITS are doing much better. There is a CIKM paper forthcoming that measures the performance of Lempel & Moran's SALSA algorithm on the same data sets, and it substantially outperforms HITS. And we found other algorithms in the HITS/SALSA family that get very close to BM25F.

-- Marc

François Schiettecatte said...

The BM25F ranking algorithm is described here:

http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf

ak102 said...

Isn't the reason Google's PageRank seems more relevant than others is because they've seen the web graph before everyone else? and before the spammers spoiled it? so instead of recomputing pagerank all the time, they just tweak it slightly?