Thursday, December 08, 2005

Survey paper "Deeper Inside PageRank"

Ho John Lee pointed to a long but truly excellent survey paper on PageRank, "Deeper Inside PageRank" (PDF) by Langville and Meyer.

The 46-page paper not only describes PageRank and twiddles of PageRank in detail, but also discusses research on optimizing the PageRank computation and generating personalized versions of PageRank. It's a thick, dense paper, a lot of work to plow through, but I found a lot of juicy food for thought in there.

I ended the paper buzzing with questions, primarily around the probabilities of link transitions and the personalization (aka teleportation) vector. If you've got a good understanding of PageRank, I'd appreciate it if you could comment on my thoughts below and let me know if I've gone astray.

On the probabilities of transitioning across a link in the link graph, the paper's example on p. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that "any suitable probability distribution" can be used instead, including one derived from "web usage logs".

Similarly, section 6.2 describes the personalization vector -- the probabilities of jumping to an unconnected page in the graph rather than following a link -- and briefly suggests that this personalization vector could be determined from actual usage data.

In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these -- the probability of following a link and the personalization vector's probability of jumping to a page -- to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
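To make the two knobs concrete, here's a minimal power-iteration sketch of PageRank. The numbers are toy values I made up, not from the paper: each row of P gives the probability of following each outlink of a page (uniform over outlinks in the classic formulation, but any distribution, e.g. one estimated from usage logs, can be plugged in), and v is the personalization (teleportation) vector.

```python
def pagerank(P, v, alpha=0.85, tol=1e-10):
    # P[i][j]: probability of following the link from page i to page j.
    # v: personalization vector, entries summing to 1.
    # alpha: probability of following a link rather than teleporting.
    n = len(v)
    x = [1.0 / n] * n
    while True:
        x_new = [alpha * sum(x[i] * P[i][j] for i in range(n))
                 + (1 - alpha) * v[j]
                 for j in range(n)]
        if sum(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new

# Toy 3-page graph with usage-weighted (non-uniform) link probabilities
P = [[0.0, 0.7, 0.3],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
# Teleportation biased toward page 0, as if usage data favored it
v = [0.6, 0.2, 0.2]
ranks = pagerank(P, v)
```

Both the row distributions in P and the vector v are exactly the places where the paper says usage data could be substituted for the uniform defaults.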

But, if I have enough usage data to do this, can't I calculate the equivalent PageRank directly? Let's say I have a toolbar (like Alexa, Yahoo, or Google) or ISP logs (like MSN or AOL) that gives me data on everything people visit. Instead of weighting the links in the link graph using the usage data, can I ignore the link graph and rank pages by their likelihood of being visited?

What's the difference between these two calculations? In one, I'm summing over the probability that surfers arrive over inbound links to find the probability that people will visit the page. In the other, I'm computing that probability directly from who actually visited the page. The link graph would seem to be useful only when you don't have the usage data: an indirect estimate of the relevance of a page that you could calculate directly given enough data.
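The direct calculation I have in mind is nothing more than a normalized visit count. A toy sketch, with a made-up log format and page names (a real toolbar or ISP log would obviously be messier):

```python
from collections import Counter

# Hypothetical usage log: one entry per page visit, e.g. from a
# toolbar or ISP proxy. Page names are made up for illustration.
log = ["a", "b", "a", "c", "a", "b", "a", "a", "c", "b"]

counts = Counter(log)
total = sum(counts.values())
# Each page's observed probability of being visited, no link graph used
visit_rank = {page: n / total for page, n in counts.items()}

ranking = sorted(visit_rank, key=visit_rank.get, reverse=True)
print(ranking)  # → ['a', 'b', 'c']
```

Given enough data, this empirical distribution is what the usage-weighted PageRank is trying to estimate in the first place.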

Now, I am assuming here that the value of a link is entirely determined by how much that link is used. If an unused link on a page does have meaning and should influence the relevance of the linked page, then this all falls apart.

But is this otherwise accurate? With enough data on what pages people visit, could the calculation over the link graph be eliminated or at least reduced?

By the way, don't miss the interesting discussion in the paper on using the personalization vector for personalized search by using a different vector for different groups of people. There are severe scaling issues with this method of search personalization, which can be partially addressed by some of the ideas from the Google Kaltix folks. For more on that, see my previous post, "More on Google personalized search".

[Ho John Lee post via Brian Dennis]

Update: Ho John Lee responded with some comments. Definitely worth reading.

6 comments:

Ho John Lee said...

Hi Greg,

I think applying the actual usage data is valuable but probably wouldn't want to use it alone for most applications.

More comments in my post here.

Tejaswi said...

The importance of linking is not about people actually traversing the link, and thus making the destination popular (which could be measured just by monitoring the logs of the destination, as you said). The links are more about the intent of the creator of the source page. By linking to your blog from my webpage, I am telling the world that you are good, and if I am, say, Google's home page (one of the popular pages), then you must be very good. It's like: I am good, I know you, and so you are good as well. Or, you are good if many, many people know you (link to you).

On a related historical note, PageRank was inspired by a 1930s social sciences paper which used the above people-knowing-people model to figure out who is the most popular student in a university or some such.

The personalization vector could be found out by taking user preferences, or could be found out by user behavior tracking, or by using query logs. There is some work by Haveliwala on this.

The probability transition vector is the KEY (repeat, KEY) to the success of PageRank. It prevents spamming (by jumping back to "good sites" every so often), controls the "topic influence" if you are using personalization, etc. Read this somewhat involved paper on the jump probability parameter by Boldi and co. http://www2005.org/cdrom/docs/p557.pdf

The Langville and Meyer survey is very good. I read it a while back and also put together a seminar on PageRank, which is available here:
http://www.it.iitb.ac.in/~tejaswi/seminar.html

Tejaswi said...

Update: You might be interested in reading Section 7 of my seminar for references to how each link from a source to destination can be given different priorities as well. Other sections related to introduction, HITS, pagerank, etc. can be skipped.

Greg Linden said...

Thanks, Tejaswi. Great comments and I appreciate the references.

However, I'm still not sure about the idea that a link has value by itself. It seems to me that the value of a link is in the traffic that it sends.

A way to boil this down might be to ask ourselves a couple questions:

Does a link that is never followed have no value?

Should links be valued differently depending on their prominence (position on the page, etc.) or other estimates of the traffic on the link?

If the answer to these is yes and yes, then it seems to me that the link analysis of PageRank is attempting to estimate the traffic to the site. If we actually have large samples of traffic data for all sites on the Web, we might be able to come up with a better estimate or, at a minimum, supplement and simplify the PageRank computation.

Do you disagree? If so, how would you weight the link transition probabilities as part of computing PageRank? Would you count all links on the page equally?

Tejaswi said...

To reiterate that again (as they say in the department of redundancy department), links mean a judgement of quality, not just traffic. Of course, quality links are traversed, and incidentally, that also generates traffic. But that's incidental. PageRank's principle is to use the page creator's editorial judgement regarding the destination page.

On an aside, we also do not want to get into a vicious loop here by ranking pages on traffic, thereby giving them a higher rank, and thereby generating more traffic to them. Btw, this self-serving, rich-get-richer paradigm also claims link-based PageRank as a victim. A new page that enters the Web has no way of getting ranked on prominent search engines unless it already has traffic. Search engines drive most of the traffic, not just random clicking on links by people (unless it's Wikipedia, of course). You can read a survey of that here:
http://www.www2004.org/proceedings/docs/1p20.pdf

As for counting links on a page equally, I am pretty sure that most search engines do not do that (in their pagerankish computation). For example, a link pointing to "download Acrobat Reader here" wouldn't be given as much importance as a link that points to some other place. But you also see that PageRank is query independent, so how do you know that a link is "irrelevant"? Every link is valuable to some query. So, while doing query-independent offline ranking, each link needs to be evaluated the same. But while personalizing ranking itself, or if you want to use some topic- or user-conditioned ranking, each link might get a different weight based on its anchor text, surrounding text, the DOM tree of the page, etc.

Tejaswi said...

Update to correct bad typo in previous comment:

As for counting links on a page equally, I am pretty sure that most search engines do that already (in their pagerankish computation).