Wednesday, June 20, 2007

Latest on the Netflix prize

Eight months into the first year, there has been impressive progress on the Netflix prize.

The current leader has a 7.7% improvement. While each step in closing the remaining 2.3% gap to win the $1M prize will be harder and harder, it is remarkable how far the entrants have come.

I have to admit that I have spent a fair amount of time playing with the contest and the Netflix data. It is quite a bit of fun.

Other than some initial attempts at very simple things like predicting averages or modified averages to get a baseline, most of my experiments used either variants of traditional collaborative filtering (finding similar users) or item-to-item collaborative filtering (finding similar items). I did get modest improvements, but nothing like the performance of the current leaders. I also played with some simple clustering; my results on that were quite poor.
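For anyone who has not played with these techniques, here is a minimal sketch of the item-to-item idea, assuming a dense user x movie matrix with zeros for unrated movies and cosine similarity between movie columns. The details are illustrative, not the exact code I ran.

    import numpy as np

    def item_similarities(ratings):
        # Cosine similarity between movie columns of a user x movie matrix
        # (rows = users, columns = movies, 0 = unrated).
        norms = np.linalg.norm(ratings, axis=0)
        norms[norms == 0] = 1.0              # avoid divide-by-zero for unrated movies
        normalized = ratings / norms
        return normalized.T @ normalized     # movie x movie similarity matrix

    def predict_rating(ratings, sims, user, movie, k=20):
        # Predict a rating from the user's k rated movies most similar to this one.
        rated = np.nonzero(ratings[user])[0]
        if len(rated) == 0:
            return ratings[ratings > 0].mean()   # fall back to the global average
        neighbors = rated[np.argsort(sims[movie, rated])[::-1][:k]]
        weights = sims[movie, neighbors]
        if weights.sum() == 0:
            return ratings[ratings > 0].mean()
        return float(weights @ ratings[user, neighbors] / weights.sum())

Swapping the roles of users and movies in the similarity computation gives the traditional user-to-user variant.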

As it turns out, Netflix currently uses a type of item-to-item collaborative filtering for their recommendations, as Netflix's VP of Recommendation Systems Jim Bennett described in a Sept 2006 talk (PDF).

I suspect substantial improvements, then, will require something quite a bit different than what Netflix is doing. This likely rules out item-to-item collaborative filtering. Experimenting with other techniques, I suspect, will be more likely to bear fruit.

I also suspect that additional data may be useful on this problem. While the creators of the Netflix contest apparently did not expect this much progress this fast, it may be the case that it simply is impossible to get the full 10% improvement using the ratings data alone. In my analyses, data simply seemed too sparse in some areas to make any predictions, and supplementing with another data set seemed like the most promising way to fill in the gaps.
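To make the sparsity concrete, even a quick count of ratings per movie and per user shows how thin some regions of the data are. This is just an illustrative sketch with toy tuples; the real data ships as per-movie files of customer, rating, date lines.

    from collections import Counter

    # Toy (movie_id, user_id, stars) tuples standing in for the ~100M real ratings.
    ratings = [(1, 7, 4), (1, 8, 3), (2, 7, 1)]

    movie_counts = Counter(movie_id for movie_id, _, _ in ratings)
    user_counts = Counter(user_id for _, user_id, _ in ratings)

    print(sum(1 for c in movie_counts.values() if c < 20), "movies with under 20 ratings")
    print(sum(1 for c in user_counts.values() if c < 5), "users with under 5 ratings")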

It was very fun working on the contest. As much as I would like to keep plugging away, I am starting to feel constrained by my available hardware (4+ year old Linux boxes), and dropping cash on servers just to keep playing seems excessive. Moreover, if I am going to spend more time and money on this kind of thing, it really should be on Findory. Too bad, it is a cool data set.

By the way, I do want to mention one thing about the structure of the contest. I think the problem statement is a little off given the business needs of Netflix. In particular, the contest requires a recommender system which can predict movies you will hate or be lukewarm to, not just the movies you will like.

A system that predicts only what you will like, often referred to as TopN recommendations, is what most businesses want from recommendations. They want to surface interesting products to customers, helping customers discover products they have never seen. In Netflix's case, they mostly should want to help you discover movies to add to your rental queue.

You might think that a system that is the best at the Netflix contest problem would be the best at the TopN problem, but that is not the case. To see that, take a system that perfectly predicts the TopN next items you will want, but makes mistakes when you are lukewarm to a product. The RMSE of that recommender would be poor in the Netflix contest (because of inaccuracies in predicting ~3 star items), despite the obvious value of the recommender.
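A toy example of the mismatch, with made-up numbers:

    import math

    def rmse(predicted, actual):
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    actual  = [5, 4, 4, 3, 3, 2]    # one user's true ratings
    top_n   = [5, 4, 4, 1, 5, 4]    # nails the items the user loves, wild on the rest
    average = [3.5] * 6             # always predicts the average, recommends nothing

    print(round(rmse(top_n, actual), 2))    # 1.41 -- poor score despite perfect top picks
    print(round(rmse(average, actual), 2))  # 0.96 -- better score, useless for discovery

The contest metric rewards the second predictor, but only the first would help fill a rental queue.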

See also Joe Weisenthal's recent post on Techdirt, "Netflix Experiment In Outsourced Innovation Showing Good Results".

See also my previous posts ([1] [2] [3] [4]) on the Netflix recommendations contest.

4 comments:

MakkhiChoose said...

I agree with your suggestion that the dataset is sparse in some areas. I am also guessing you must have gotten an RMSE of around 2.00 when considering only the movies that were rated a single star. Reducing that error by 1, which translates into doing better than simply predicting close to the average on about 75,000 ratings, should qualify one for the Grand Prize.
Playing around with the data, I could not tease out any intuitions or statistical regularities in users or movies with common vertices and 1-star ratings (vertex as in defining a movie-user pair as an edge in a bipartite graph). IMDB, however, has a great dataset of tags that seems to correlate revealed dislike (through ratings) with quirks that would not be considered among the defining aspects of a movie (example: http://imdb.com/title/tt0374900/keywords).

Also, predicting those one-star ratings is probably the most vital aspect, because many of them are for movies that Cinematch has suggested. Couple that with the risk aversion involved when people decide on movies, given the opportunity cost of forgoing other movies in the queue, and you can see why there is a low-ish variance and high mean (about 3.6) across the ratings. Netflix will also need to come up with better incentives, like a free extra DVD rental from a selected group of movies, to encourage user experimentation and get data with more variance.

Anonymous said...

In case you haven't seen this paper before, I believe this is the approach that the WXYZConsulting team uses.

Greg Linden said...

Thanks, Siddharth. I had not yet seen that paper.

Anonymous said...

Greg,

Could you post something about the algorithms you used in Findory? What kind of collaborative filtering algorithms did you use for the Netflix contest? A write-up on some of the algorithms you have used, along with what worked and what didn't, would be interesting.