An amusing WWW 2008 poster by Joshua Blumenstock, "Size Matters: Word Count as a Measure of Quality on Wikipedia" (PDF), found that a very simple measure of the "quality" of a Wikipedia article, the number of words in it, performed nearly as well as much more complicated classifiers.
Word count with a simple threshold achieved 96.5% accuracy on the task of distinguishing featured articles from others. The other, more complicated techniques ranged from 81-98% accuracy.
Joshua is careful in the paper not to "exaggerate the importance of this metric", but I believe there is a lesson here: try the simple things first. It can often be surprising how well something simple works.
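To make concrete how simple this baseline is, here is a rough Python sketch; the corpus format is an assumption for illustration, and the roughly 2000-word cutoff is the one a commenter mentions below, not a detail I have verified against the poster.

def word_count(text):
    return len(text.split())

def predict_featured(text, threshold=2000):
    # Classify an article as featured-quality purely by its length.
    return word_count(text) >= threshold

def accuracy(corpus, threshold=2000):
    # corpus: iterable of (article_text, is_featured) pairs (assumed format)
    correct = sum(predict_featured(text, threshold) == is_featured
                  for text, is_featured in corpus)
    return correct / len(corpus)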
This reminds me of "Pivoted Document Length Normalization" (http://citeseer.ist.psu.edu/singhal96pivoted.html), and I suppose the average length of the featured articles could serve as a parameter in that technique.
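For readers unfamiliar with the technique, pivoted normalization replaces a document's raw length with a value pivoted around an average length, so long documents are penalized less steeply. A minimal sketch, with an illustrative slope, using the commenter's idea of the average featured-article length as the pivot:

def pivoted_norm(doc_length, avg_length, slope=0.25):
    # (1 - slope) * pivot + slope * doc_length, with pivot = average length
    return (1.0 - slope) * avg_length + slope * doc_length

def normalized_score(raw_score, doc_length, avg_length):
    # Divide a retrieval score by the pivoted length instead of the raw length.
    return raw_score / pivoted_norm(doc_length, avg_length)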
I had never noticed Wikipedia's featured articles, but it's great to have them; it's a fun task to try to train a classifier to recognize other "good" articles, and it looks from the references like this has already been explored.
-- Dave
We found similar correlations between word count and quality at Epinions. Eric.
This doesn't mean the other metrics are useless.
I suppose the other metrics make different classification errors than the 2000-word predictor, and I expect a combination of several methods would reach a higher score.
To amend my last comment: I missed the last paragraph; the authors tried mixing techniques to reach 98% accuracy.
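A minimal sketch of that point, with hypothetical binary predictors standing in for the paper's actual metrics; if their errors are not strongly correlated, even a simple majority vote can beat any single one:

def long_enough(article):          # the word-count baseline
    return article["word_count"] >= 2000

def well_linked(article):          # illustrative second signal
    return article["num_links"] >= 50

def well_referenced(article):      # illustrative third signal
    return article["num_refs"] >= 10

def majority_vote(article, predictors=(long_enough, well_linked, well_referenced)):
    # Featured if most predictors agree; uncorrelated errors get outvoted.
    return sum(p(article) for p in predictors) > len(predictors) // 2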