Friday, August 11, 2006

Blog spam at AIRWeb

There was quite a group of experts on hand at the SIGIR AIRWeb workshop, researchers from Google, Yahoo, Microsoft, AOL, Technorati, Bloglines, Snap, and more.

I wanted to follow up on one particularly interesting tidbit to come out of the group discussions. The group was asked, "What is your biggest concern in weblog spam?"

By far, the biggest concern was the proliferation of spam weblogs with fake content that looks real.

These appear to take a few forms. One is to manually steal content from other places and post it to your blog without any reference to the original creator of the content.

Another technique is to post everything from an RSS feed or a combination of feeds to a blog, again without any credit to the original sources.

Another is to create a weblog that is nonsense, but not obviously so at first glance. This can be done by stitching together sentences from feeds or, more cleverly, by building a good random text generator (like this example or this one).

Once the splog is created, it is typically used for ad revenue (using AdSense or whatever), link spam (to improve PageRank), or both.

It is a hard problem to solve. For the stolen content, crawlers need to do a good job of identifying the original, authoritative source of the content. For random content, crawlers may be able to recognize statistical outliers or grammatical errors in the content, but, in the worst case, may have to try to parse and minimally understand the content.
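As a rough illustration of the stolen-content half of the problem, here is a minimal sketch of near-duplicate detection using word shingles and Jaccard similarity. This is a common textbook technique swapped in for illustration, not something the workshop group proposed; the function names, the shingle size, and the 0.5 threshold are all assumptions. Note that it only flags that two pages share text. Deciding which one is the original would still need other signals, such as which copy the crawler saw first.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_copied(candidate, known_text, threshold=0.5):
    """Flag candidate as a likely copy of known_text (threshold is an assumption)."""
    return jaccard(shingles(candidate), shingles(known_text)) >= threshold


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    scraped = "the quick brown fox jumps over the lazy dog today"
    unrelated = "spam spam spam lovely spam wonderful spam"
    print(looks_copied(scraped, original))
    print(looks_copied(unrelated, original))
```

A real crawler would need something far more scalable (and more robust to light rewording) than pairwise set comparison, so treat this only as a picture of the idea.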

Spam spam spam spam. Lovely spam! Wonderful spam!

2 comments:

Anonymous said...

"... may have to parse and minimally understand..."

This is a really interesting problem. I have given a little thought (I really mean just a little; some people have made a lifetime study of this) to the complexities of natural language parsing, and at first, I thought it would be fairly easy to do. Yes, you can laugh and throw things now if you want.

I started to think of a simple parser. What if it analysed a sentence at a time, and categorised each word by type, for instance noun, preposition, etc. To determine if a sentence is valid, you just see if the word types fall into recognisable sentence patterns.

Then I realised how easy it would be to fake this out: one, by doing what they are already doing, namely stealing bits from others' blogs, or two, by using a very simplistic Madlibs-style generator that just makes nonsense that is grammatically correct.
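To make the weakness concrete, here is a minimal sketch of such a word-type pattern checker. The tiny lexicon, the tag names, and the two accepted sentence patterns are all made up for illustration; a real parser would need a proper part-of-speech tagger and a far richer grammar.

```python
# Hypothetical toy lexicon mapping each word to a word type.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "printer": "NOUN", "idea": "NOUN",
    "green": "ADJ", "lazy": "ADJ",
    "eats": "VERB", "chases": "VERB",
}

# Hypothetical set of "recognisable sentence patterns".
VALID_PATTERNS = {
    ("DET", "NOUN", "VERB", "DET", "NOUN"),
    ("DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"),
}


def tag(sentence):
    """Categorise each word by type; unknown words are tagged UNK."""
    words = sentence.lower().rstrip(".!?").split()
    return tuple(LEXICON.get(word, "UNK") for word in words)


def is_valid(sentence):
    """Accept a sentence if its sequence of word types matches a known pattern."""
    return tag(sentence) in VALID_PATTERNS


if __name__ == "__main__":
    print(is_valid("The dog chases a printer."))        # sensible sentence
    print(is_valid("The green idea eats the printer"))  # Madlibs-style nonsense
    print(is_valid("Dog the eats"))                     # scrambled words
```

The second sentence is meaningless but fits a pattern perfectly, so the checker accepts it just as readily as the first. Only the scrambled one is rejected.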

I started to think about language, and how we learn it. I came to realise that language is how we express everything we understand about the world. It is not just nouns and adjectives, it is emotions, style, subtle shades of meaning. One word can have several different meanings depending on its context.

Simply trying to put all this knowledge into a huge database will not help either, because people add facts to their knowledge base differently depending on pre-existing worldviews, and sometimes a convincing argument can sway their worldview. A computer cannot do this, and (to my limited thinking) will battle to simulate it.

My theory on preventing spam is that, instead of pushing the extremely difficult task of natural language processing onto our own computers, which are very ill-suited for the task, we should push it onto the blog spamming software.

CAPTCHA images and simple validation schemes are problematic because of that - they are simple for computers to parse. They don't require intelligence and experience, merely crude pattern recognition. The only way to foil auto readers is to add noise (distortion, background noise) which also makes life more difficult for the humans.

What about this instead: a simple database with a pool of words, a pool of questions, and a linking table which puts words in certain categories. You pose a question like "Type in the colour from the list above" and list words like "aerosol, blue, printer". Screen readers can read it. People can solve this easily, not so the spam bots. Yes, they can just try and post with every word in the list, but you can blacklist or tarpit IPs with frequent failed tries.
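A minimal sketch of how that might look, with plain Python dictionaries standing in for the three database tables. Every name here, the extra words in the pool, and the limit of three failed tries are assumptions made for illustration.

```python
import random

# Stand-ins for the three tables described above (contents are hypothetical).
WORDS = ["aerosol", "blue", "printer", "red", "spoon", "green", "hammer"]
QUESTIONS = {"colour": "Type in the colour from the list above"}
WORD_CATEGORIES = {"blue": "colour", "red": "colour", "green": "colour"}  # linking table

MAX_FAILURES = 3   # assumed limit before an IP is blocked
failures = {}      # failed tries per IP address


def make_challenge(rng=random, decoys=2):
    """Pick a question, one correct word, and a few words from outside the category."""
    category = rng.choice(sorted(QUESTIONS))
    answers = [w for w in WORDS if WORD_CATEGORIES.get(w) == category]
    others = [w for w in WORDS if WORD_CATEGORIES.get(w) != category]
    answer = rng.choice(answers)
    shown = rng.sample(others, decoys) + [answer]
    rng.shuffle(shown)
    # In a real form the answer would stay on the server, not go to the client.
    return {"question": QUESTIONS[category], "words": shown, "answer": answer}


def check(challenge, reply, ip):
    """Return True for a correct reply; count failures and block repeat offenders."""
    if failures.get(ip, 0) >= MAX_FAILURES:
        return False  # blacklisted: reject even a correct answer
    if reply.strip().lower() == challenge["answer"]:
        return True
    failures[ip] = failures.get(ip, 0) + 1
    return False


if __name__ == "__main__":
    challenge = make_challenge()
    print(", ".join(challenge["words"]))
    print(challenge["question"])
    print(check(challenge, challenge["answer"], "192.0.2.1"))
```

With three words shown, a bot guessing at random gets through about one time in three, which is why the failure counter matters as much as the question itself.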

What's more, the system is easily upgradeable. Spammers starting to write programs to find colours? Just update your database. Now start asking which words are metal objects. Ask for the third-to-last word in the original blog post. The possibilities are endless, and so darned easy to code.

This type of test is very difficult for computers, but simple for humans. I say let's use computers' weaknesses against the spammers, not put them in a position to use computers' strengths against us.

Anonymous said...

I like my ideas so much I am going to post my comments on my blog at TechRepublic.

Yes, I am very humble. ;-)