Wednesday, August 06, 2008

BrowseRank: Ranking pages by how people use them

Liu et al. from Microsoft Research Asia had the best student paper at SIGIR 2008, "BrowseRank: Letting Web Users Vote for Page Importance" (PDF), that builds a "user browsing graph" of web pages where "edges represent real transitions between web pages by users."

An excerpt:
The user browsing graph can more precisely represent the web surfer's random walk process, and thus is more useful for calculating page importance. The more visits of the page made by users and the longer time periods spent by the users on the page, the more likely the page is important ... We can leverage hundreds of millions of users' implicit voting on page importance.

Some websites like adobe.com are ranked very high by PageRank ... [because] Adobe.com has millions of inlinks for Acrobat Reader and Flash Player downloads. However, web users do not really visit such websites very frequently and they should not be regarded [as] more important than the websites on which users spend much more time (like myspace.com and facebook.com).

BrowseRank can successfully push many spam websites into the tail buckets, and the number of spam websites in the top buckets in BrowseRank is smaller than PageRank or TrustRank.

Experimental results show that BrowseRank indeed outperforms the baseline methods such as PageRank and TrustRank in several tasks.
One issue that came up, in discussions afterward about the paper, is whether BrowseRank gets something much different than a smoothed version of visit counts.

Let's say all the transition probabilities between web pages are set by how people actually move across the web, all the starting points on the web are set by where people actually start, and then you simulate random walks. If all this is done correctly, your random walkers should move like people do on the web and visit where they visit on the web. So, after all that simulating, it seems like what you get should be quite close to visit counts.

This is a bit of an oversimplification. BrowseRank uses time spent on a page in addition to visit counts and its Markov model of user behavior ends up smoothing the raw visit count data. Nevertheless, the paper does not compare BrowseRank to a simple ranking based on visit counts, so the question still appears to be open.

Please see also my earlier post, "Google Toolbar data and the actual surfer model".

Please see also my earlier post, "Ranking using Indiana University's user traffic", which discusses a WSDM 2008 paper that looked at supplementing PageRank with traffic data.

3 comments:

Anonymous said...

Very interesting, although this has been tried before. DirectHit had a search engine built entirely on clickstream data (Acquired by Ask.com in 2000). They got the data from ISPs in those days. The end-result is really not that much better than Page-Rank.

Me.dium on the other hand (http://me.dium.com/search) is processing user's clickstream data in real-time to create a different lens based on what's going on now. e.g. do a search for John Edwards on Google or Live, and you get johnedwards.com and wiki/johnedwards. Do the same search on Me.dium and you learn that today people care about his love child, pictures of his mistress, etc.

The difference is real-time (what people are browsing now) vs. historical (what they browsed in the past). Social vs. Old School. Check it out. http://me.dium.com/search.

Anonymous said...

Hi Greg,

I got this link from SEOMOZ. I would like to know more about BrowseRank is there any way you can help me know more about it. I would highly appreciate it as the pdf link provided does not work.

Hoping to hear form you

Mohammed T

Greg Linden said...

Hi, Mohammed. Looks like the location of the PDF changed. I corrected the link. You should be able to get to the paper now.