Monday, July 07, 2008

Black, white, and gray spam

Scott Yih, Robert McCann, and Alek Kolcz had a paper at CEAS 2007 on "Improving Spam Filtering by Detecting Gray Mail" (PDF).

The paper focuses on training a classifier to detect what they call gray mail, messages that could reasonably be considered either spam or good, but what excites me more is the broader idea of shades of gray in what we think of as spam. The ideas behind the paper seem to suggest that spam should be treated as a continuum rather than a binary label. Documents vary from annoying to interesting, with much falling in the middle.

Part of my motivation here is that I have long been bothered by what seems to me to be a weak distinction between spammy pages and useless pages.

If a web page exists but no one ever reads it, does it matter?

If a blog exists but has no subscribers, does it matter that it is not actually spam?

In general, whether some document is uninteresting because it is manipulative or uninteresting for another reason, isn't it still uninteresting?

If we do treat spam and uninteresting as similar, it seems like it could make the spam problem somewhat easier. A false positive on spam is much less costly if the penalized document was uninteresting anyway.
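To make the cost argument concrete, here is a minimal sketch. It is my own illustration, not anything from the paper: the `Doc` class, the `interest` score, and the example documents are all hypothetical. It compares counting every false positive equally against weighting each one by how interesting the penalized document was.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    is_spam: bool    # hypothetical ground-truth label
    interest: float  # hypothetical score: 0.0 = nobody reads it, 1.0 = widely read


def false_positive_cost(docs, flagged, interest_weighted):
    """Sum the cost of non-spam documents that were penalized anyway."""
    total = 0.0
    for doc in docs:
        if doc.name in flagged and not doc.is_spam:
            # Binary view: every mistake costs 1.
            # Continuum view: a mistake costs only as much as the document's interest.
            total += doc.interest if interest_weighted else 1.0
    return total


# Made-up documents for illustration only.
docs = [
    Doc("popular-blog", is_spam=False, interest=0.9),
    Doc("abandoned-blog", is_spam=False, interest=0.05),
    Doc("link-farm", is_spam=True, interest=0.0),
]
flagged = {"abandoned-blog", "link-farm"}  # what a hypothetical filter penalized

binary_cost = false_positive_cost(docs, flagged, interest_weighted=False)
weighted_cost = false_positive_cost(docs, flagged, interest_weighted=True)
print(binary_cost, weighted_cost)
```

Under these assumptions, wrongly penalizing the abandoned blog counts as a full error in the binary view but as only a small cost in the interest-weighted view. That is the sense in which the problem might get easier.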

Please see also my August 2006 post, "Web spam, AIRWeb, and SIGIR".

Please see also Googler Matt Cutts's post, "Using data to fight webspam".
