Wednesday, February 21, 2007

Combating tag spam paper

Gary Price points to a paper out of Stanford CS titled "Combating Spam in Tagging Systems" (PDF).

From the abstract:
As tagging systems are gaining in popularity, they become more susceptible to tag spam: misleading tags that are generated in order to increase the visibility of some resources or simply to confuse users.

We are interested in answers to questions such as: How many malicious users can a tagging system tolerate before results significantly degrade? What types of tagging systems are more vulnerable to malicious attacks? What would be the effort and the impact of employing a trusted moderator to find bad postings? Can a system automatically protect itself from spam, for instance, by exploiting user tag patterns?
The paper is heavy on theory, spending most of its time laying down frameworks for analyzing problems with spam in tagging systems. I particularly liked the idea of determining tagger reputation and reliability that is discussed in Section 4.2.

Though the paper is a good read, my personal view on tagging is that, rather than being a solution to web spam, it does little more than shuffle around the problem. Instead of sorting out spam from web pages and keywords extracted from those pages, now we have to figure out how sort out the spam from web pages and the tags associated with those pages.

In fact, I suspect tag spam is going to be an even harder problem than web spam. With web spam, spammers can only modify the pages on the Web they have access to. With tag spam, spammers can label any item anywhere.

Right now, tagging is not mainstream. With small audiences, the potential profit is too tiny to attract much attention from spammers. That is unlikely to remain true if tagging systems continue to grow.

See also the discussion on tagging and tag spam in my previous posts, "Questioning tags" and "Chris Sherman on social search".

Update: This paper will be presented at WWW 2007 AIRWeb in a couple days. The latest version (PDF), appears to have some changes.

I reread the paper recently and wanted to highlight a few other parts of the paper that may be of interest:
We propose an approach to tag search that takes into account not only the number of postings that associate a document with a tag but also the "reliability" of taggers that made these postings. In order to measure the reliability of a user, we define the coincidence factor c(u) of a user u ... [which] shows how often u's postings coincide with other users' postings.

Our hypotheses is that the coincidence factor is an indication of how "reliable" a tagger is. A high factor signifies that a user agrees with other taggers to a great extent; thus, the user's postings are more "reliable".

Overall, the existence of popular tags provides many opportunities for malicious users to misuse tags and spam searches. We hope that the model we have proposed here, and the results it yields, can provide useful insights on how to wage these ongoing "spam wars."
I emphasized an analysis of tagger reputation in this excerpt, but the paper also contains analyses of the impact of different types of attacks and a discussion of the value of human moderators.

On human moderators, the verdict of the paper appears mixed. At one point, it says that, in their simulations, even perfect moderators who only reject bad tags would have to review the tags on 5% of all documents to get a factor of 2 drop in spam. That's quite a lot of work for a relatively small gain. However, later, the authors say that moderators may be the best countermeasure for "focused spam attacks" that may defeat the coincidence analysis they use for tagger reputation.

No comments: