Tuesday, March 14, 2006

Popularity, entrenchment, and randomization

The paper "Shuffling a Stacked Deck" by Pandey et al. has an interesting discussion of the "entrenchment problem" of ranking based on popularity and suggests partial randomization as a solution.

From the paper:
Most Web search engines assume that popularity is closely correlated with quality, and rank results according to popularity.

Unfortunately, the correlation between popularity and quality is very weak for newly-created pages that have few visits and/or in-links.

Worse, the process by which new, high-quality pages accumulate popularity is actually inhibited by search engines. Since search engines ... always [list] highly popular pages at the top, and because users usually focus their attention on the top few results, newly-created but high-quality pages are "shut out."
The paper goes on to claim that, based on their experimental results, search engines should fill about 10% of the result slots with randomly promoted pages that would normally appear much lower in the ranking. That, they say, provides enough exploration to eliminate the entrenchment problem and improves overall search quality.
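As a rough illustration of the idea (a sketch of partial randomization in general, not the paper's actual algorithm), a ranker could reserve roughly 10% of the top slots for pages sampled from further down the ranked list. The function name, the candidate pool size, and the 10% default below are illustrative assumptions:

import random

def partially_randomized_results(ranked_pages, k=10, explore_fraction=0.1,
                                 candidate_pool=1000):
    """Return k results where roughly explore_fraction of the slots go to
    pages sampled from lower in the ranking.

    ranked_pages: page ids already sorted by the usual popularity-based score.
    This is a sketch of the general idea, not the paper's exact method.
    """
    num_random = max(1, round(k * explore_fraction))
    num_top = k - num_random

    results = list(ranked_pages[:num_top])

    # Sample exploration candidates from a window of lower-ranked pages,
    # giving new, not-yet-popular pages a chance to be seen and evaluated.
    tail = ranked_pages[num_top:num_top + candidate_pool]
    promoted = random.sample(tail, min(num_random, len(tail)))

    # Slot each promoted page into a random position among the top results.
    for page in promoted:
        results.insert(random.randrange(len(results) + 1), page)

    return results[:k]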

However, while the results make sense for small document collections, I'm not sure they easily apply to the real world of web search. Web search engines index tens of billions of documents. The simulations in this paper used a data set six orders of magnitude smaller, just 10k pages.

That's a big problem. With a more realistically sized collection, you would expect randomization to be much more difficult because there would be many, many more pages from which to randomly select, and most of those pages would be irrelevant to any given query. Big data makes this a much harder problem.

Nevertheless, I think the general idea of testing new items to determine their quality and value is a good one, particularly for systems ranking based on popularity or trying to make personalized recommendations.

On that note, I thought this tidbit in the paper was interesting:
The entrenchment problem may not be unique to the Web search engine context ... consider recommendation systems ...

Many users decide which items to view based on recommendations, but these systems make recommendations based on user evaluations of items they view. This circularity leads to the well-known cold-start problem, and is also likely to lead to entrenchment.

Indeed, Web search engines can be thought of as recommendation systems that recommend Web pages.
Exactly right. In fact, dealing with the cold start and entrenchment problems was one of the trickiest parts of the recommendation engine behind Findory. With news articles, there is a constant stream of new documents coming into the system. Determining their quality, popularity, and relevance quickly is a challenging problem.
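One common way to soften that kind of cold start (a generic sketch, not a description of Findory's actual system) is to smooth each new article's click-through rate with a prior, so a brand-new item starts at a reasonable default rather than zero and the estimate shifts toward observed behavior as feedback accumulates. The prior values below are illustrative assumptions:

def smoothed_ctr(clicks, impressions, prior_clicks=5, prior_impressions=100):
    """Estimate an article's click-through rate with a prior.

    A brand-new article (0 impressions) starts at the prior rate (here 5%)
    instead of 0, and the estimate moves toward the observed rate as
    impressions accumulate. The prior values are made-up parameters.
    """
    return (clicks + prior_clicks) / (impressions + prior_impressions)

# A brand-new article starts at the prior rate:
smoothed_ctr(0, 0)        # -> 0.05
# After feedback accumulates, the estimate follows the observed data:
smoothed_ctr(40, 1000)    # -> ~0.041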

[Pandey paper via Paul Kedrosky]
