There was quite a group of experts on hand at the SIGIR AIRWeb workshop, including researchers from Google, Yahoo, Microsoft, AOL, Technorati, Bloglines, Snap, and more.
I wanted to follow up on one particularly interesting tidbit to come out of the group discussions. The group was asked, "What is your biggest concern in weblog spam?"
By far, the biggest concern was the proliferation of spam weblogs with fake content that looks real.
These appear to take a few forms. One is to manually steal content from other places and post it to your blog without any reference to the original creator of the content.
Another technique is to post everything from an RSS feed or a combination of feeds to a blog, again without any credit to the original sources.
Another is to create a weblog that is just nonsense, but not immediately obvious nonsense. This can be done by stitching together sentences from feeds or, more cleverly, by building a good random text generator (like this example or this one).
Once the splog is created, it is typically used for ad revenue (using AdSense or whatever), link spam (to improve PageRank), or both.
It is a hard problem to solve. For the stolen content, crawlers need to do a good job of identifying the original, authoritative source of the content. For random content, crawlers may be able to recognize statistical outliers or grammatical errors in the content, but, in the worst case, may have to try to parse and minimally understand the content.
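To make the stolen-content case a bit more concrete, here is a minimal sketch in Python of one common approach to spotting copies: compare posts by their overlapping word shingles (Jaccard similarity) and treat the earliest-seen post as the presumed original. This is not something presented at the workshop. The function names, the `(url, first_seen, text)` tuple layout, the shingle size, and the 0.5 threshold are all assumptions for illustration, and "first seen by the crawler" is only a rough proxy for "original, authoritative source".

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_copies(posts, threshold=0.5, k=3):
    """Flag posts that closely match an earlier-seen post.

    posts: list of (url, first_seen, text) tuples (hypothetical layout).
    Returns a list of (copy_url, presumed_original_url, score).
    Assumption: the earliest first_seen value marks the original,
    which a real crawler could not take for granted.
    """
    ordered = sorted(posts, key=lambda p: p[1])
    signatures = [(url, shingles(text, k)) for url, _, text in ordered]
    flagged = []
    for i, (url, sig) in enumerate(signatures):
        # Compare only against posts seen earlier than this one.
        for earlier_url, earlier_sig in signatures[:i]:
            score = jaccard(sig, earlier_sig)
            if score >= threshold:
                flagged.append((url, earlier_url, score))
                break
    return flagged
```

The pairwise comparison is quadratic in the number of posts, so this only works as a toy. It also does nothing for the random-text case, where there is no original to match against.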
Spam spam spam spam. Lovely spam! Wonderful spam!