Thursday, May 01, 2008

Size matters? Or simplicity?

An amusing WWW 2008 poster by Joshua Blumenstock, "Size Matters: Word Count as a Measure of Quality on Wikipedia" (PDF), found that a very simple measure of the "quality" of a Wikipedia article, the number of words in the article, performed nearly as well as much more complicated classifiers.

Word count with a simple threshold achieved 96.5% accuracy in the classification task. The other, more complicated techniques ranged from 81% to 98% accuracy.
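To make the idea concrete, here is a minimal sketch of a word-count threshold classifier. It is not the paper's code: the function names, the whitespace tokenizer, and the default cutoff of 2,000 words (the figure a commenter mentions below) are all assumptions for illustration.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, very rough tokenizer)."""
    return len(text.split())


def is_high_quality(text: str, threshold: int = 2000) -> bool:
    """Label an article 'high quality' when it has more than `threshold` words.

    The 2000-word default is an assumption, not a value confirmed by the post.
    """
    return word_count(text) > threshold


def accuracy(articles, labels, threshold: int = 2000) -> float:
    """Fraction of articles whose thresholded label matches the true label."""
    correct = sum(
        is_high_quality(article, threshold) == label
        for article, label in zip(articles, labels)
    )
    return correct / len(labels)
```

Given a list of article texts and matching True/False quality labels, `accuracy(articles, labels)` gives the kind of number being compared above. Trying a few values of `threshold` on held-out data would be the obvious next step.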

Joshua is careful in the paper to not "exaggerate the importance of this metric", but I believe there is a lesson here: Try the simple things first. It can often be surprising how well something simple works.

4 comments:

Dave said...

Reminds me of "Pivoted Document Length Normalization", http://citeseer.ist.psu.edu/singhal96pivoted.html and I guess the avg length of the featured articles could be a parameter into this technique.

I never noticed the Wikipedia featured articles, but it's great to have them, as it's a fun task to try to train a classifier to recognize other "good" articles, and from the references it looks like this has already been explored.

-- Dave

Eric Goldman said...

We found similar correlations between word count and quality at Epinions. Eric.

Anonymous said...

This doesn't mean the other metrics are useless.
I suppose other metrics make different classification errors than the 2000-word predictor, and I expect a combination of several methods would reach a higher score.

Anonymous said...

To amend my last comment: I missed the last paragraph; the authors tried mixing techniques to reach 98% accuracy.