Friday, June 05, 2009

On the front lines of the Netflix Prize

Robert Bell, Jim Bennett, Yehuda Koren, and Chris Volinsky have an article in the May 2009 IEEE Spectrum, "The Million Dollar Programming Prize", with fun tales from their work on the Netflix Prize along with a summary of the techniques that are performing best.

Some excerpts:
The nearest neighbor method works on the principle that a person tends to give similar ratings to similar movies. [For example, if] Joe likes three movies ... to make a prediction for him, [we] find users who also liked those movies and see what other movies they liked.
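To make the idea concrete, here is a minimal sketch of user-based nearest neighbors in Python. The ratings data, user names, and choice of cosine similarity are all illustrative, not taken from the article or the actual Netflix systems:

```python
import math

# Toy ratings: user -> {movie: rating}.  All names and numbers are made up.
ratings = {
    "joe":   {"A": 5, "B": 4, "C": 5},
    "ann":   {"A": 5, "B": 5, "C": 4, "D": 5},
    "bob":   {"A": 4, "C": 5, "D": 2, "E": 4},
    "carol": {"B": 1, "D": 5, "E": 5},
}

def cosine_sim(u, v):
    """Cosine similarity over the movies two users have both rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    nu = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    nv = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (nu * nv)

def predict(user, movie, k=2):
    """Predict a rating as the similarity-weighted mean of the ratings
    given by the k most similar users who rated the movie."""
    neighbors = [(cosine_sim(user, v), v) for v in ratings
                 if v != user and movie in ratings[v]]
    neighbors.sort(reverse=True)
    top = neighbors[:k]
    denom = sum(s for s, _ in top)
    if denom == 0:
        return None
    return sum(s * ratings[v][movie] for s, v in top) / denom

print(round(predict("joe", "D"), 2))
```

Real systems use better-behaved similarity measures and shrinkage toward the mean, but the shape of the computation is the same: find similar raters, then average their opinions.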

A second, complementary method scores both a given movie and viewer according to latent factors, themselves inferred from the ratings given to all the movies by all the viewers .... Factors for movies may measure comedy versus drama, action versus romance, and orientation to children versus orientation toward adults. Because the factors are determined automatically by algorithms, they may correspond to hard-to-describe concepts such as quirkiness, or they may not be interpretable by humans at all .... The model may use 20 to 40 such factors to locate each movie and viewer in a multidimensional space ... then [we predict] a viewer's rating of a movie according to that movie's score on the dimensions that person cares about most.
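The latent factors are typically learned by matrix factorization. Below is a sketch, assuming tiny made-up data and only 2 factors (the article mentions 20 to 40): each user and movie gets a factor vector, and stochastic gradient descent adjusts the vectors so their inner product approximates the observed ratings. The hyperparameters and data are illustrative only:

```python
import random

random.seed(0)

# (user, movie, rating) triples; names and numbers are made up.
data = [
    ("joe", "A", 5), ("joe", "B", 4), ("ann", "A", 4),
    ("ann", "C", 2), ("bob", "B", 5), ("bob", "C", 1),
]

F = 2        # number of latent factors (the article describes 20 to 40)
lr = 0.05    # learning rate
reg = 0.02   # L2 regularization strength

# Small random factor vectors for every user and movie.
p = {u: [random.uniform(-0.1, 0.1) for _ in range(F)] for u, _, _ in data}
q = {m: [random.uniform(-0.1, 0.1) for _ in range(F)] for _, m, _ in data}

def predict(u, m):
    """Predicted rating = inner product of user and movie factors."""
    return sum(pu * qm for pu, qm in zip(p[u], q[m]))

# Stochastic gradient descent on regularized squared error.
for epoch in range(200):
    for u, m, r in data:
        err = r - predict(u, m)
        for f in range(F):
            pu, qm = p[u][f], q[m][f]
            p[u][f] += lr * (err * qm - reg * pu)
            q[m][f] += lr * (err * pu - reg * qm)

print(round(predict("joe", "A"), 1))
```

After training, each coordinate of a movie's vector plays the role of one inferred "factor" (comedy versus drama, and so on), though as the article notes, nothing forces the coordinates to be human-interpretable.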

Neither approach is a panacea. We found that most nearest-neighbor techniques work best on 50 or fewer neighbors, which means these methods can't exploit all the information a viewer's ratings may contain. Latent-factor models have the opposite weakness: They are bad at detecting strong associations among a few closely related films, such as The Lord of the Rings trilogy.

Because these two methods are complementary, we combined them, using many versions of each in what machine-learning experts call an ensemble approach. This allowed us to build systems that were simple ... easy to code and fast to run.
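The article does not spell out how the blend is computed, but one common approach is a linear combination of the component predictors, with weights fit by least squares on a held-out probe set. A sketch with two hypothetical predictors and made-up numbers:

```python
# Held-out probe ratings and two component models' predictions for them.
# All numbers are made up for illustration.
truth = [4.0, 3.0, 5.0, 2.0, 4.0]
knn   = [3.8, 3.4, 4.6, 2.5, 3.9]   # nearest-neighbor predictions
mf    = [4.2, 2.7, 4.8, 2.2, 4.3]   # latent-factor predictions

# Solve the 2x2 normal equations for blend weights w1, w2
# minimizing sum((w1*knn + w2*mf - truth)^2).
a11 = sum(x * x for x in knn)
a12 = sum(x * y for x, y in zip(knn, mf))
a22 = sum(y * y for y in mf)
b1 = sum(x * t for x, t in zip(knn, truth))
b2 = sum(y * t for y, t in zip(mf, truth))
det = a11 * a22 - a12 * a12
w1 = (b1 * a22 - b2 * a12) / det
w2 = (a11 * b2 - a12 * b1) / det

def blend(x, y):
    """Linear blend of the two component predictions."""
    return w1 * x + w2 * y

def rmse(preds):
    return (sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)) ** 0.5

print(rmse(knn), rmse(mf), rmse([blend(x, y) for x, y in zip(knn, mf)]))
```

By construction, the least-squares blend can do no worse on the probe set than either component alone, which is why ensembles of many complementary models kept pushing the leaderboard scores down.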

Another critical innovation involved focusing on which movies a viewer rated, regardless of the scores. The idea is that someone who has rated a lot of fantasy movies will probably like the Lord of the Rings, even if that person has rated the other movies in the category somewhat low ... This approach nicely complemented our other methods.
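One simple way to use only the who-rated-what signal, sketched here with made-up data (the article does not give their exact formulation): treat each movie as the set of people who rated it, score movie-to-movie similarity by set overlap, and score a candidate movie for a user by summing its similarity to the movies that user has rated, ignoring the ratings themselves.

```python
# Who rated what, ignoring the scores themselves.  Data is illustrative.
rated = {
    "joe": {"LOTR1", "LOTR2", "Willow"},
    "ann": {"LOTR1", "LOTR2", "LOTR3"},
    "bob": {"LOTR3", "Willow", "Alien"},
    "eve": {"Alien", "Aliens"},
}

def jaccard(m1, m2):
    """Similarity between two movies based only on who rated them."""
    fans1 = {u for u, ms in rated.items() if m1 in ms}
    fans2 = {u for u, ms in rated.items() if m2 in ms}
    return len(fans1 & fans2) / len(fans1 | fans2)

def score(user, movie):
    """Affinity of a user for an unseen movie: total co-rating
    similarity to the movies the user has already rated."""
    return sum(jaccard(movie, m) for m in rated[user])

# Joe has rated a lot of fantasy, so LOTR3 should outscore Alien for him,
# regardless of what scores he actually gave those fantasy movies.
print(score("joe", "LOTR3"), score("joe", "Alien"))
```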
Please see also my earlier post, "Netflix Prize at KDD 2008", which points at papers with more details than the IEEE article, including another recent paper by Yehuda Koren.

1 comment:

tim said...

Hi Greg,

I've been leveraging a real-time "automated ontology", for want of a better word, to derive just the sort of "factors" or "concepts" referred to here. We literally "read" the titles/descriptions, comments on them, and chat sessions about them as structured data, then run factor analyses and some subsequent clustering to do the same thing proposed by the subject of this blog.

Our approach generates its own "hidden" metadata instead of depending upon the largely static and overtly finite deterministic nature of metadata.

Drop me an email at or call me; you have my number, last time I checked. I think we're linked @facebook and LinkedIn, as well.