Friday, January 26, 2007

Sender reputation in GMail

There are some interesting tidbits on the spam filtering in GMail in the paper "Sender Reputation in a Large Webmail Service" (PDF) by Googler Bradley Taylor.

In particular, it sounds like one of several spam filters in GMail uses aggregate data on which messages are marked as spam (by other automatic classifiers or manually by users) to determine the spam reputation of a domain.

If e-mails sent from a particular domain are consistently marked as spam, the reputation of that domain will deteriorate, and then eventually the entire domain will be blacklisted. Simple but good idea.
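
To make the idea concrete, here is a minimal sketch of a per-domain reputation counter. It is not the formula from the paper: the class name `DomainReputation`, the score (the fraction of a domain's mail not marked as spam), and both thresholds are assumptions chosen only for illustration.

```python
from collections import defaultdict


class DomainReputation:
    """Toy per-domain reputation tracker (illustrative, not GMail's method)."""

    def __init__(self, blacklist_below=0.1, min_messages=100):
        # Both thresholds are made-up values for this sketch.
        self.blacklist_below = blacklist_below
        self.min_messages = min_messages
        self.total = defaultdict(int)
        self.spam = defaultdict(int)

    def record(self, domain, is_spam):
        """Count one message from `domain` and whether it was marked as spam."""
        self.total[domain] += 1
        if is_spam:
            self.spam[domain] += 1

    def score(self, domain):
        """Fraction of mail NOT marked as spam, or None if the domain is unseen."""
        total = self.total.get(domain, 0)
        if total == 0:
            return None
        return 1.0 - self.spam.get(domain, 0) / total

    def is_blacklisted(self, domain):
        """Blacklist only once there is enough mail to judge the domain."""
        if self.total.get(domain, 0) < self.min_messages:
            return False
        return self.score(domain) < self.blacklist_below
```

The minimum message count is there so that a domain is not blacklisted on the strength of one or two reports.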

I'd really like to see an extension of this where the reputation of individual senders is monitored, not to blacklist, but to whitelist.

For example, Findory has a feature where people can get a daily e-mail of their Findory front page, essentially a mailing list. As the Google paper says, "Some users are lazy and find that reporting spam on a mailing list is easier than unsubscribing ... For the sender it means their reputation is hurt worse than it needs to be."

However, mail from glinden@findory.com presumably is never marked as spam. Whitelisting individual senders with good reputation seems like it would be a good refinement to blacklisting domains with bad reputation.
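
A rough sketch of what sender-level whitelisting might look like follows. The function name `whitelisted_senders`, the input format (pairs of sender address and a spam flag), and the two thresholds are all assumptions for illustration, and the `daily@findory.com` address in the usage example is hypothetical.

```python
from collections import Counter


def whitelisted_senders(reports, min_messages=50, max_spam_rate=0.01):
    """Return the senders whose mail is almost never marked as spam.

    `reports` is an iterable of (sender_address, marked_spam) pairs.
    The thresholds are illustrative guesses, not values from the paper.
    """
    total = Counter()
    spam = Counter()
    for sender, marked_spam in reports:
        sender = sender.lower()  # treat addresses case-insensitively
        total[sender] += 1
        if marked_spam:
            spam[sender] += 1
    return {
        sender
        for sender, count in total.items()
        if count >= min_messages and spam[sender] / count <= max_spam_rate
    }


# Usage: a personal address that is never reported, and a hypothetical
# mailing-list address that lazy subscribers report one time in five.
reports = [("glinden@findory.com", False)] * 60
reports += [("daily@findory.com", i % 5 == 0) for i in range(100)]
whitelist = whitelisted_senders(reports)
```

With this data only the personal address clears both thresholds, even though both senders share a domain.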

6 comments:

Nate Olson said...

Hi, Greg,

Just a quick pointer to Simon Willison's recent post on "social whitelisting" with OpenID credentials. Not the same automation/datamining angle as you seem to be touching on here, but the two approaches certainly could complement one another.

Cheers,

Nate

Anonymous said...

Interesting. This is also one of the techniques I use at Simpy to fight spam. I love seeing the system zap 'em!

Anonymous said...

What is to stop me now sending a million spams from glinden@findory.com, since your reputation is so good that a higher proportion of them will get through?

If you can solve that problem then you've solved the "sender identity" problem and thus eliminated most spam anyway.

Greg Linden said...

Great point, Anonymous. If the system I proposed became widespread, spammers would just forge e-mail addresses to get their spam through. Preventing that would require solving the sender id problem.

You are right, probably not a great solution then. Thanks for pointing out the issue here.

Greg Linden said...

Thinking about this more, I think it might be a little better than you say, Anonymous.

The Google paper does not solve the sender id problem, but they are able to identify the domain of the sender (a much easier problem).

Forgeries from people outside of the domain are already detected in the Google system. The only major concern is forgeries from people in the domain.

So, I think the original idea is still valid. It might be useful to whitelist people with high reputation in a domain with a medium reputation.
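
As a rough sketch of that rule, assuming the domain check and both reputation scores are already available as inputs: the function name `classify`, the 0-to-1 score scale, and the thresholds below are hypothetical and are not taken from the Google paper.

```python
def classify(domain_authenticated, domain_score, sender_score,
             domain_low=0.1, sender_high=0.99):
    """Combine domain and sender reputation into one decision.

    Scores run from 0.0 (all spam) to 1.0 (no spam); `sender_score` may be
    None when the sender has no history. All thresholds are illustrative.
    """
    if not domain_authenticated:
        # The address may be forged from outside the domain, so neither
        # reputation can be trusted; fall back to the other spam filters.
        return "filter"
    if domain_score < domain_low:
        return "blacklist"
    if sender_score is not None and sender_score >= sender_high:
        # A well-behaved sender inside a domain that is not itself bad.
        return "whitelist"
    return "filter"
```

Forgeries from inside the domain are not addressed here; they would still inherit the forged sender's good score.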

Anonymous said...

Funny you should point this out. I've been working for the last month or 2 on a spam filter that uses some of the techniques you mention.

If anybody's interested in checking it out when it gets fully baked, pls feel free to drop an email to darose at darose dot net.