Monday, May 02, 2005

Web spam and TrustRank

I finally managed to get a good look at the "Combating Web Spam with TrustRank" paper by Gyongyi et al. this weekend.

TrustRank takes a manually designated set of good or bad pages and propagates that information across the link graph. It's an interesting modification to PageRank. Definitely worth a read.
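To make the propagation idea concrete, here is a rough sketch of how I read it: a PageRank-style iteration where the random jump lands only on the trusted seeds rather than on any page. This is a simplification under my own assumptions, not the paper's exact algorithm. The function name `trustrank`, the damping value of 0.85, the iteration count, and the toy graph are all made up for illustration, and pages with no outlinks simply leak their trust.

```python
def trustrank(links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages across a link graph.

    links: dict mapping each page to the list of pages it links to.
    trusted_seeds: pages a human has judged to be good (assumed input).
    Returns a dict mapping each page to a trust score.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    seeds = set(trusted_seeds) & pages
    if not seeds:
        raise ValueError("need at least one trusted seed in the graph")

    # Seed distribution: uniform over trusted seeds, zero elsewhere.
    seed_dist = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}

    trust = dict(seed_dist)
    for _ in range(iterations):
        # The "jump" term goes only to seeds (biased PageRank).
        nxt = {p: (1.0 - damping) * seed_dist[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                continue  # simplification: dangling pages leak their trust
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Toy graph (hypothetical): one trusted chain and an isolated spam pair.
links = {
    "good": ["a"],
    "a": ["b"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
scores = trustrank(links, trusted_seeds={"good"})
```

In this toy graph the two spam pages link only to each other, so no trust ever reaches them, while trust fades with each hop away from the seed.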

The paper describes a manual process for determining the seed set of trusted sites. I'm curious what we'd find by instead analyzing user behavior. For example, we could consider websites used over the past month by trusted people to be trusted. That is, trusted sites would be the sites the community uses and trusts.

Noisier data, to be sure, but there's a sea of data here, enough that we should be robust to the noise and able to discover the wisdom hidden within. Ah... So many interesting possibilities with this kind of juicy data.
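As a rough sketch of what picking seeds from user behavior might look like, assume we have a log of (user, site, date) visits and some set of users we already trust. Everything here is hypothetical: the function name `behavioral_seeds`, the log format, the 30-day window, and the `min_users` threshold, which is just one crude way to damp the noise by requiring several trusted users to agree.

```python
from collections import defaultdict
from datetime import date, timedelta


def behavioral_seeds(visits, trusted_users, today, window_days=30, min_users=2):
    """Pick seed sites from recent visits by trusted users.

    visits: iterable of (user, site, day) tuples, where day is a datetime.date.
    trusted_users: set of users assumed to be trustworthy.
    Returns the set of sites visited within the window by at least
    min_users distinct trusted users.
    """
    cutoff = today - timedelta(days=window_days)
    visitors = defaultdict(set)
    for user, site, day in visits:
        # Only count recent visits made by users we already trust.
        if user in trusted_users and cutoff <= day <= today:
            visitors[site].add(user)
    return {site for site, users in visitors.items() if len(users) >= min_users}


# Made-up visit log for illustration.
visits = [
    ("alice", "example.org", date(2005, 4, 20)),
    ("bob", "example.org", date(2005, 4, 25)),
    ("alice", "old.example", date(2005, 1, 1)),      # too old
    ("bob", "old.example", date(2005, 1, 2)),        # too old
    ("mallory", "spam.example", date(2005, 4, 30)),  # not a trusted user
    ("alice", "solo.example", date(2005, 4, 28)),    # only one trusted visitor
]
seeds = behavioral_seeds(visits, {"alice", "bob"}, today=date(2005, 5, 2))
```

The resulting set could then stand in for the manually chosen seeds, with the threshold trading coverage against noise.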

By the way, if you're interested in web spam, don't miss "Web Spam Taxonomy", also by Zoltan Gyongyi and Hector Garcia-Molina. It's a light paper that describes many of the devious techniques used by web spammers.

Update: There appears to be a recent March 2006 technical report on this TrustRank work, "Link Spam Detection Based on Mass Estimation" (PDF).
