Thursday, July 12, 2007

Netflix Prize enabling recommender research

Yi Zhang and Jonathan Koren from UC Santa Cruz have a paper, "Efficient Bayesian Hierarchical User Modeling for Recommender Systems" (PDF), at the upcoming SIGIR 2007 conference.

The paper focuses on "improving the efficiency and effectiveness of Bayesian hierarchical linear models, which have a strong theoretical basis and good empirical performance on recommendation tasks." Specifically, they significantly improve the speed of running expectation-maximization by ignoring the "many dimensions that are not 'related' to a particular user", focusing on "related user-feature pairs".
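To make the idea concrete, here is a small sketch (my own illustration, not the authors' code) of why skipping unrelated user-feature pairs helps: in a hierarchical model where each user's weight vector is drawn from a shared Gaussian prior, the E-step posterior for any feature a user's items never touch is just the prior, so the per-user computation only needs to run over the small set of related features. All names and the diagonal-prior simplification here are assumptions for illustration.

```python
import numpy as np

# Illustrative sketch, not the paper's implementation.
# Model: rating y = w_u . x_item + noise, with per-user weights
# w_u ~ N(mu, diag(tau2)) shared across users.
# Key observation: for features none of a user's items touch, the
# posterior mean equals the prior mean, so the E-step only needs to
# solve a linear system over that user's "related" features.

def e_step_sparse(X_u, y_u, mu, tau2, noise_var=1.0):
    """Posterior mean of w_u, computed only over features present in X_u.

    X_u: (n_items, d) feature matrix for the items this user rated.
    y_u: (n_items,) ratings. mu, tau2: (d,) prior mean and variances.
    """
    related = np.flatnonzero(np.any(X_u != 0.0, axis=0))  # user's related features
    Xr = X_u[:, related]                                  # work in the small subspace
    prior_prec = np.diag(1.0 / tau2[related])
    A = prior_prec + Xr.T @ Xr / noise_var                # posterior precision
    b = prior_prec @ mu[related] + Xr.T @ y_u / noise_var
    w = mu.copy()                                         # unrelated dims keep the prior mean
    w[related] = np.linalg.solve(A, b)
    return w

# Toy usage: 1000 features, but this user's items touch only 3 of them,
# so we solve a 3x3 system instead of a 1000x1000 one.
rng = np.random.default_rng(0)
d = 1000
mu, tau2 = np.zeros(d), np.ones(d)
X_u = np.zeros((5, d))
X_u[:, [2, 7, 11]] = rng.normal(size=(5, 3))
y_u = X_u[:, [2, 7, 11]] @ np.array([1.0, -0.5, 2.0])
w = e_step_sparse(X_u, y_u, mu, tau2)
print(np.flatnonzero(w != 0.0))  # only the related features move off the prior mean
```

With millions of users but only dozens of ratings per user, this kind of sparsity is what makes the expectation-maximization iterations tractable at Netflix scale.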

That is interesting, especially since Yi Zhang is part of the team that is currently in 10th place on the Netflix Prize Leaderboard.

Even so, what primarily excited me about this paper is that it is an excellent example of research on scaling recommender systems that simply would not be possible without the Netflix data set. The Netflix data set is two orders of magnitude bigger than the MovieLens and EachMovie data sets that were previously available.

Time and time again, I have been frustrated to read recommender system papers that are twiddling around on toy problems. They often achieve minor improvements in quality, but at the cost of efficiency, usually presenting techniques that will not scale to catalogs or numbers of customers of even modest size.

One might argue that these improvements are of theoretical interest, but there is some doubt about even that. As Googler Peter Norvig said:
Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm ... is performing better than the best algorithm on less training data.

Worry about the data first before you worry about the algorithm.
The problem, as Michele Banko and Eric Brill noted (PDF) in 2001, is that algorithmic twiddles that yield improvements at small scale may no longer yield improvements at large scale, and that training with more data often produces bigger improvements than changing algorithms while staying on small data.

Now that Netflix has made big data available to researchers, we will see a lot more work on large scale, practical recommender systems like this Zhang and Koren paper. And we all will benefit.

See also my Jan 2007 post, "The Netflix Prize and big data".

[Thanks, Siddharth, for pointing out the Zhang and Koren paper]

Update: Two months later, the KDD Cup 2007 is done, and it focused on the Netflix Prize. Don't miss the great collection of papers. Very impressive how much work on recommender systems at large scale is coming out of this contest.

Update: An ECML 2007 paper, "Principal Component Analysis for Large Scale Problems with Lots of Missing Values" (PDF), has a nice discussion of using PCA for recommendations at large scale, particularly around the issue of overfitting. [Found via Simon Funk]
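For readers who have not seen it, the basic recipe behind this line of work (as popularized by Simon Funk on the Netflix data) looks something like the sketch below: factor the rating matrix by gradient descent over observed entries only, with an L2 penalty to rein in the overfitting the paper discusses. This is my own illustrative sketch with made-up hyperparameters, not the paper's algorithm.

```python
import numpy as np

# Illustrative sketch of PCA-style matrix factorization with missing
# values: fit R ~= U @ V.T using only the observed entries, with L2
# regularization to limit overfitting. Hyperparameters are arbitrary.

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500):
    """ratings: list of (user, item, value) for observed entries only."""
    rng = np.random.default_rng(42)
    U = 0.1 * rng.normal(size=(n_users, k))
    V = 0.1 * rng.normal(size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            U[u] += lr * (err * V[i] - reg * U[u])  # penalty shrinks the factors
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

# Toy usage: 3 users x 3 items; missing cells are simply not listed.
obs = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
       (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
U, V = factorize(obs, 3, 3)
pred = U @ V.T
print(pred.round(1))  # predictions for all 9 cells, including the unobserved ones
```

Note that nothing here requires materializing the full matrix, which is what lets the approach run on 100 million ratings where classical SVD on an imputed dense matrix would not.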

1 comment:

burtonator said...

Yeah... Agreed.

I notice the same problem with the open source community. For example, there are a lot of open source crawlers, but they don't scale.

They never have to worry about HUGE amounts of data.

Also, if you know of any researchers who need LOTS of blog data we'd be more than willing to give them access to our crawler in Spinn3r.

We have a few researchers playing with the data now and loving it.

Of course it's about 24GB per day so I think we qualify for the 'big data' argument :)