Wednesday, October 08, 2008

Netflix Prize at KDD 2008

The Large Scale Recommender Systems and the Netflix Prize workshop was recently held at KDD 2008. I was not able to attend, but I still wanted to highlight a few of the papers from and related to the workshop.

Gavin Potter, the famous guy in a garage, had a short paper in the workshop, "Putting the collaborator back into collaborative filtering" (PDF). The paper has a fascinating discussion of how, rather than assuming people are rational and consistent when they rate movies, looking for patterns in people's biases can yield remarkable gains in accuracy. Some excerpts:
When [rating movies] ... a user is being asked to perform two separate tasks.

First, they are being asked to estimate their preferences for a particular item. Second, they are being asked to translate that preference into a score.

There is a significant issue ... that the scoring system, therefore, only produces an indirect estimate of the true preference of the user .... Different users are translating their preferences into scores using different scoring functions.

[For example, people] use the rating system in different ways -- some reserving the highest score only for films that they regard as truly exceptional, others using the score for films they simply enjoy .... Some users [have] only small differences in preferences of the films they have rated, and others [have] large differences .... Incorporation of a scoring function calibrated for an individual user can lead to an improvement in results.

[Another] powerful [model] we found was to include the impact of the date of the rating. It seems intuitively plausible that a user would allocate different scores depending on the mood they were in on the date of the rating.
Gavin has done quite well in the Netflix Prize; at the time of writing, he was in eighth place with an impressive score of .8684.

Gavin's paper is a light and easy read. Definitely worthwhile. Gavin's work forces us to challenge our common assumption that people are objective when providing ratings, instead suggesting that it is quite important to detect biases and moods when people rate on a 1..5 scale.
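To make the idea of a per-user scoring function concrete, here is a minimal sketch in Python of the general approach (my own toy illustration, not Gavin's method; the z-score style calibration and all names are assumptions): estimate each user's personal mean and spread, model preferences in those normalized units, and translate predictions back to the 1..5 scale.

import numpy as np

# ratings: iterable of (user_id, item_id, score) with scores on a 1..5 scale
def fit_user_calibration(ratings):
    """Estimate a crude per-user scoring function: the mean and spread of their scores."""
    by_user = {}
    for user, _, score in ratings:
        by_user.setdefault(user, []).append(score)
    return {user: (np.mean(s), np.std(s) or 1.0)   # guard against zero spread
            for user, s in by_user.items()}

def to_preference(user, score, calib):
    """Translate a raw score into the user's own preference units."""
    mean, spread = calib[user]
    return (score - mean) / spread

def to_score(user, preference, calib):
    """Translate a model's preference estimate back to the user's 1..5 scale."""
    mean, spread = calib[user]
    return float(np.clip(preference * spread + mean, 1.0, 5.0))

A model trained on the normalized preferences compares users on a common footing, even if one user reserves 5s for masterpieces and another hands them out to anything enjoyable.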

Another paper given in the workshop that I found interesting was Takacs et al, "Investigation of Various Matrix Factorization Methods for Large Recommender Systems" (PDF). In addition to the very nice summary of matrix factorization (MF) methods, the paper at least begins to address the practical issue of handling online updates to ratings, offering "an incremental variant of MF that efficiently handles new users/ratings, which is crucial in a real-life recommender system."
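The paper's incremental variant is more involved than this, but as a rough sketch of the general idea (my own simplification, not the authors' algorithm), a new user or new rating can be handled by holding the already trained item factors fixed and taking a few stochastic gradient steps on just that user's factor vector:

import numpy as np

def incremental_user_update(user_vec, item_vecs, new_ratings,
                            lr=0.01, reg=0.05, epochs=10):
    """Refresh a single user's factor vector from fresh ratings,
    leaving the (already trained) item factors untouched."""
    for _ in range(epochs):
        for item_id, rating in new_ratings:
            q = item_vecs[item_id]
            err = rating - user_vec @ q                    # prediction error on this rating
            user_vec += lr * (err * q - reg * user_vec)    # regularized SGD step
    return user_vec

# Example: a brand-new user who just rated two items, with 20-dimensional factors
rng = np.random.default_rng(0)
item_vecs = {42: 0.1 * rng.normal(size=20), 7: 0.1 * rng.normal(size=20)}
user_vec = incremental_user_update(np.zeros(20), item_vecs, [(42, 5.0), (7, 2.0)])

Since only one user's vector changes, an update like this is cheap enough to run online, which is exactly the property a production recommender needs.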

Finally, a third paper that was presented in the main KDD session, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model" (PDF), by Yehuda Koren from the leading BellKor team, is an excellent discussion of combining different popular approaches to the Netflix contest. Some extended excerpts:
The two more successful approaches to CF are latent factor models, which directly profile both users and products, and neighborhood models, which analyze similarities between products or users.

Neighborhood models ... [compute] the relationship between items or ... users. An item-oriented approach evaluates the preferences of a user to an item based on ratings of similar items by the same user. In a sense, these methods transform users to the item space by viewing them as baskets of rated items .... Neighborhood models are most effective at detecting very localized relationships.

Latent factor models such as ... SVD ... [transform] both items and users to the same latent factor space ... [and try] to explain ratings by characterizing both products and users on factors automatically inferred from user feedback. For example, when products are movies, factors might measure obvious dimensions such as comedy vs. drama, amount of action, or orientation toward children ... [as well as] less well defined dimensions such as depth of character development or "quirkiness" .... Latent factor models are generally effective at estimating overall structure that relates simultaneously to most or all users. However, these models are poor at detecting strong associations among a small set of closely related items, precisely where neighborhood models do best.

In this work, we suggest a combined model that improves prediction accuracy by capitalizing on the advantages of both neighborhood and latent factor approaches.
Like Gavin, Yehuda describes how to compensate for "systematic tendencies for some users to give higher ratings than others". Yehuda also discusses how implicit data on what users chose to rate and did not rate can be used for improving accuracy. And, Yehuda even addresses some of the differences between what works well in the Netflix Prize and what is necessary in a practical recommender system, talking about top-K recommendations, handling new users and new ratings without a full re-train of the model, using implicit feedback such as purchases and page views, and explaining the recommendations. Definitely worth the read.
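To give a feel for what "combining" means here, the following is a heavily simplified sketch of the shape of such a prediction rule (a toy rendering under my own assumptions, not Yehuda's actual integrated model, which also folds in implicit feedback and learns the neighborhood weights jointly): a baseline of global mean plus user and item biases, a latent factor term, and an item-neighborhood correction built from the residuals of the user's other ratings.

import numpy as np

def predict(user, item, mu, b_user, b_item, p, q, w, rated_by_user):
    """Toy combined prediction: baseline biases + latent factors + neighborhood."""
    baseline = mu + b_user[user] + b_item[item]    # absorbs rating tendencies
    factor_term = p[user] @ q[item]                # latent factor (SVD-style) term

    # Neighborhood correction: residuals of the user's other ratings,
    # weighted by learned item-item weights w[item][j]
    neighbors = rated_by_user[user]                # list of (item j, rating r_uj)
    if not neighbors:
        return baseline + factor_term
    resid = sum(w[item][j] * (r_uj - (mu + b_user[user] + b_item[j]))
                for j, r_uj in neighbors)
    return baseline + factor_term + resid / np.sqrt(len(neighbors))

The bias terms are where the "systematic tendencies for some users to give higher ratings than others" get absorbed, while the factor and neighborhood terms pick up the global and local structure respectively.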

Finally, let me mention that the Netflix Prize is about to award its second progress prize. Over the last year, the current leader narrowed the gap between the last progress prize and the grand prize by more than half. In conversations at the conference, people alternated between hope that the remaining gap could be closed with a few new clever insights and despair over the recent slow progress, wondering whether the grand prize can be won at all.

1 comment:

Dr. Mark J. Boyd said...

The surprisal value of a rating is to some degree based on the previous ratings of the rating user, and to some degree on the ratings of other users. A rating with no comparison or trending index is an unattached scalar. Coming up with the correct *relative* metric for a rating is definitely the best way to extract information from noise. I liked the idea that ratings might have the baseline index of "the user's mood" on a particular day. Very interesting.