I am giving a very short talk at AIRWeb (a workshop on web spam) at SIGIR.
I thought some readers of this weblog might be interested in web spam but unable to make it to this workshop. It might be fun to discuss the topic a bit here.
I wanted to use my short talk to bring up three topics for discussion:
First, I wanted to talk about the scope of weblog junk and spam, especially if "junk and spam" weblogs are loosely defined as "any weblog not of general interest."
The primary data points on which I will focus are that Technorati reported 19.6M weblogs in Oct 2005, but the dominant feed reader, Bloglines, reported that only 1.4M of those weblogs have any subscribers on Bloglines and a mere 37k have twenty or more subscribers.
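By these numbers, about 93% of the weblogs Technorati tracks have no Bloglines subscribers at all, and about 99.8% have fewer than twenty. This seems to suggest that over 95% of weblogs, possibly over 99%, are not of general interest. The quality of the long tail of weblogs may be much worse than previously described.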
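For anyone who wants to check the arithmetic, here is a quick sketch in Python. It uses only the three figures quoted above and assumes, as the Bloglines report implies, that the weblogs with subscribers are a subset of the ones Technorati counts. The variable names are mine.

```python
# Figures quoted in the post; the variable names are just for illustration.
total_weblogs = 19_600_000          # Technorati count, Oct 2005
with_any_subscribers = 1_400_000    # weblogs with at least one Bloglines subscriber
with_20_plus_subscribers = 37_000   # weblogs with twenty or more subscribers

# Share of weblogs that fall below each readership threshold.
share_no_subscribers = 1 - with_any_subscribers / total_weblogs
share_under_20 = 1 - with_20_plus_subscribers / total_weblogs

print(f"No Bloglines subscribers: {share_no_subscribers:.1%}")   # 92.9%
print(f"Fewer than 20 subscribers: {share_under_20:.1%}")        # 99.8%
```

Which threshold counts as "of general interest" is a judgment call, so the real figure depends on where you draw the line between those two numbers.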
Second, I wanted to bring up the profit motive behind spam. Specifically, I will mention that scale attracts spam -- that the tipping point for attracting spam seems to be when the mainstream pours in -- and that this has implications for many community-driven sites that currently only have an early adopter audience.
Third, I wanted to discuss how "winner takes all" encourages spam. When spam succeeds in getting the top slot, everyone sees the spam. It is like winning the jackpot.
If different people saw different search results -- perhaps using personalization based on history to generate individualized relevance ranks -- this "winner takes all" effect should fade and the incentive to spam should decline.
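To make the intuition concrete, here is a toy sketch in Python. Everything in it is hypothetical: the page names, the scores, the four users, and the idea that personalization is just a per-user boost added to a global score. It is not how any real search engine ranks results; it only illustrates why a single shared ranking gives a spammer the whole audience while individualized rankings give them a fraction of it.

```python
from typing import Dict, List

# Hypothetical global relevance scores; "spam" has gamed its way to the top.
GLOBAL_SCORE: Dict[str, float] = {"spam": 1.0, "page_a": 0.9, "page_b": 0.8}

def top_result(history_boost: Dict[str, float]) -> str:
    """Return the page in the top slot for one user.

    history_boost is an assumed per-user bonus derived from that user's
    history; an empty dict means no personalization for that user.
    """
    scores = {page: score + history_boost.get(page, 0.0)
              for page, score in GLOBAL_SCORE.items()}
    return max(scores, key=scores.get)

def spam_reach(users: List[Dict[str, float]]) -> float:
    """Fraction of users who see the spam page in their top slot."""
    hits = sum(1 for boosts in users if top_result(boosts) == "spam")
    return hits / len(users)

# Four made-up users: first with one shared ranking, then with history boosts.
no_personalization = [{}, {}, {}, {}]
personalized = [{"page_a": 0.3}, {"page_b": 0.5}, {}, {"page_a": 0.2}]

print(spam_reach(no_personalization))  # 1.0: everyone sees the spam
print(spam_reach(personalized))        # 0.25: only the user with no history
```

In this toy setup the spammer's payoff for winning the global top slot drops from every user to one in four. How large the drop would be in practice depends on how much individual rankings really differ.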
What do you think? I would enjoy getting a discussion on web spam going in the comments!
Update: The talk went well. Let me briefly summarize some of the comments I received at the workshop on what I said above.
On the Technorati vs. Bloglines numbers, a few people correctly pointed out that this could be seen as recall vs. precision and that, for some applications, it may be important to list every single weblog, even if that weblog has no readers. At least a couple of others disputed whether weblogs without readers were important at all. One mentioned that readership could be faked, which might give spammers a way to attack this type of filter.
On "winner takes all" and personalization as a potential solution, some seemed skeptical that there was enough variation in individual perceptions of relevance to make a big enough impact. Others seemed intrigued by the possibility of using user behavior and recommender systems to filter out spam.
I enjoyed talking about web spam with such a prestigious group! Great fun!
Update: See also the paper, "Adversarial Information Retrieval on the Web" (PDF), which gives a good summary of the discussions at the workshop.