Monday, February 27, 2006

Manual vs. automated tagging

Rich Skrenta of Topix.net posted a good critique of manual tagging of documents:
Tags aren't a panacea, since they're excessively vulnerable to spam, and the items which should belong to the same categories will get different tags from different users. Which is it, "topixnet"? or "topix"?

They're uniquely valuable in a system like Flickr since photos don't have any text of their own to keyword search, so getting the user to add any searchable text at all is a big win. You can ask users to caption their photos but often putting just a word or two is easier so the participation level is higher.

But if you have the full text of the web, or blogosphere, or whatever, the marginal utility of the "keywords" tag on the document seems to be rather low. To deal with spam and relevance issues, the search interface for a large collection needs to be appropriately skeptical about what documents are claiming to be about.
This reminds me of what Danny Sullivan said about manually tagging documents:
All the interest (dare I say hype) is largely ignoring the fact that we've had tagging on the web for going on 10 years, and the experience on the search side is that it can't be trusted.

The meta keywords tag has been around for nearly a decade. The idea behind it in part was that people could use the tag to classify what their pages are about.

The data is largely useless ... Thinking that tagging would lead to top rankings, some people misused the tag. Other people didn't misuse the tag intentionally, but they might poorly describe their pages.
He later went on to say:
Wide-open tagging, where anyone can get their pages to the top of a list just by labeling it so, is going to be a giant spam magnet.
Stephen Green at Sun summarizes it well:
[Tagging is] not really a new way of indexing documents, it's actually an old way that didn't work very well.
The real test of manual document tagging will come when these tagging tools become large enough to drive substantial traffic to websites.

Right now, these tools are used mostly by early adopters. This small, dedicated, loyal audience tends to behave well because there is little incentive to do otherwise.

If the tools become more popular, the incentive to manipulate them will increase. The community-generated tags will become less and less reliable as spammers enter seeking traffic and profit.

This will be the real challenge to manual document tagging. It remains to be seen whether the wisdom of the crowd can prevail over the deceptions of the scammers.

2 comments:

Seth Russell said...

Well, manual and automatic tagging will be equally targeted by spammers, but manual tagging has the advantage of policing by the community. You do not shit on your own doorstep.

Our site already gets substantial traffic from tagging. If you are inside the community, then you can find things there far more rapidly than you can find them at Google.

I blogged about this here. An interesting confluence of events happened all within this same hour: (1) guesst-7779 visited my article, (2) I saw that and reviewed the article myself, (3) your post came into my aggregator. Kewl, don't you think?

Eloy Jetson said...

Although I agree that manual tagging is prone to spamming, isn't the better solution fast and furious enforcement?

In a social context, I think that the wisdom of the crowd wins out when it comes to catching a new angle on an old problem. If I am looking for information that I have tagged x but you have tagged x and y, it might spark the idea of applying it to y, something I had never thought of before.

The problem with automated tagging is that an algorithm will classify everything to a perfect fit every time. This doesn't take into account the creativity of a real person.