Wednesday, November 19, 2008

Detecting spam just from HTTP headers

You have to love research work that takes some absurdly simple idea and shows that it works much better than anyone would have guessed.

Steve Webb, James Caverlee, and Calton Pu had one of these papers at CIKM 2008, "Predicting Web Spam with HTTP Session Information" (PDF). They said, everyone else seems to think we need the content of a web page to see if it is spam. I wonder how far we can get just from the HTTP headers?

Turns out surprisingly far. From the paper:
In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.

We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
It appears that web spammers tend to use specific IP ranges and put unusual gunk into their headers (e.g. "X-Powered-By" and "Link" fields), which makes it fairly easy to pick them out just from their headers. As one person suggested during the Q&A for the talk, spammers probably would quickly correct these oversights if it became important, but you still have to give credit to the authors for this cute and remarkably effective idea.

If you might enjoy another example of making remarkable progress using a simple idea, please see my earlier post, "Clever method of near duplicate detection". It summarizes a paper about uniquely identifying the most important content in a page by looking around the least important words on the page, the stop words.

1 comment:

Justin Mason said...

88% is about right for that kind of thing; we found that in SpamAssassin, too, with email spam, looking at just email headers. The hard part is getting through those remaining 11.8%, which requires exponentially more work as you progress ;)