Wednesday, January 31, 2007

The Netflix Prize and big data

David Leonhardt at the NYT writes about the Netflix Prize in his article today "You Want Innovation? Offer a Prize".

Some excerpts:

When Netflix announced its prize in October, [CEO Reed] Hastings said he didn't necessarily expect contestants to make a lot of quick progress.

Computer scientists say that Cinematch, along with Amazon's recommendation system, was already one of the most sophisticated. "We thought we built the best darn thing ever," Mr. Hastings said.

But Mr. Hastings underestimated the power of an open competition. Within days, many of the top people in a field known as machine learning were downloading the 100 million movie ratings Netflix had made public.

The experts have since been locked in a Darwinian competition to build a better Cinematch, with the latest results posted on a leader board at Netflix's Web site.

With four and a half years to go in the contest, [the lead team] is already 6.75 percent better than Cinematch. And Netflix hasn't had to pay for their time.

In effect, the company "has recruited a large fraction of the machine learning community for almost no money," as [Geoffrey] Hinton, [a University of] Toronto [Computer Science] professor, put it.

While the prize money adds excitement, I think most of the enthusiasm in the research community comes simply from having access to such a massive data set.

Until Netflix released its movie ratings data for this contest, the largest data sets available for experimenting with and evaluating recommender systems were the MovieLens and EachMovie data sets. Those data sets are roughly two orders of magnitude smaller.

Netflix made 100M ratings by 480k customers over roughly 18k titles available to researchers. No public data set of that size existed until now.

This opens up new opportunities for research on recommender algorithms. Not only are there considerable challenges in scaling recommender algorithms to big data, but also, as Googler Peter Norvig points out, we may have more to learn from improving our ability to work with massive training data than we do from twiddling algorithms running over small data.

Yes, the money and visibility of the Netflix Prize are motivators, I am sure. But there is also excitement from getting access to big data that previously was available only inside companies like Amazon, Netflix, Yahoo, or Google.
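
To put the scale gap mentioned above in rough numbers, here is a minimal sketch. The Netflix figures are the ones quoted in this post; the EachMovie and MovieLens rating counts are assumed round numbers for illustration, and the function names are made up for this example.

```python
import math

# Approximate rating counts. The Netflix figures come from the post; the
# EachMovie and MovieLens counts are assumed round numbers for illustration.
NETFLIX_RATINGS = 100_000_000
NETFLIX_CUSTOMERS = 480_000
ASSUMED_RATINGS = {
    "EachMovie": 2_800_000,
    "MovieLens": 1_000_000,
}


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Return log10 of the size ratio between two data sets."""
    return math.log10(larger / smaller)


def ratings_per_customer(ratings: int, customers: int) -> float:
    """Return the mean number of ratings contributed by each customer."""
    return ratings / customers


if __name__ == "__main__":
    for name, count in ASSUMED_RATINGS.items():
        gap = orders_of_magnitude(NETFLIX_RATINGS, count)
        print(f"Netflix vs {name}: about {gap:.1f} orders of magnitude")
    mean = ratings_per_customer(NETFLIX_RATINGS, NETFLIX_CUSTOMERS)
    print(f"Mean ratings per Netflix customer: about {mean:.0f}")
```

The exact gap depends on which release of the older data sets you compare against, so treat the output as a ballpark rather than a measurement.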

See also my original post on the Netflix contest, "Netflix offers $1M prize for improved recs".

See also a Sept 2006 talk (PDF) by Netflix's VP of Recommendation Systems Jim Bennett that has details on their Cinematch recommender system.

4 comments:

jeremy said...

as Googler Peter Norvig points out, we may have more to learn from improving our ability to work with massive training data than we do from twiddling algorithms running over small data

But now that all these machine learning research groups have access to this data, aren't they now just "twiddling algorithms", also? If it were just a matter of having big data, there would be no reason for this contest, right?

The one thing that still doesn't sit right with me about the "big data only" solution is that, no matter how big your dataset, I can pretty much guarantee you that there is going to be a long tail of content, to which only small data is attached. For example, if Star Wars has tens of thousands of reviews, there are going to be thousands of indie and obscure scifi films with only a single review each, or maybe even zero reviews.

Even if the review set continues growing, so will the movie set. There will always be a long tail of sparsely tagged or reviewed movies. Which means, I think, that there will always be a need for "twiddling algorithms", non?

Greg Linden said...

Yep, but I think the twiddling should happen once the algorithms are running on big data sets, not when they are working against toy problems. Do you disagree?

jeremy said...

I'm really of two minds about it, Greg. On the one hand, you're absolutely right; one should twiddle on as much data as you have the processing power to handle.

On the other hand, I think there are a good number of real problems out there that will, by definition, never have big data associated with them. Take search personalization. If you really want true personalization, then you really should train only on your own data. But this data is never going to be big!

I asked Norvig about this a few months ago, and he says that Google is taking the same solution that you probably took at Amazon: aggregation of profiles and formation of broad (albeit hidden from the searcher) user classes. Turn small data into big data, for more robust solutions.

But to me, that aggregation is in itself a form of small-data twiddling. When you are using machine learning to fit a model to data, using some objective function, the choice of objective function is as much a part of the algorithm as the parametric form. By aggregating small data into big data, you are simply twiddling your small data objective function. Aggregate well, and you'll get a good solution. Aggregate poorly, and you won't. But there is no separate "big data" that tells you the best way to aggregate; that is a decision you have to make based on small-data only.

What I am trying to say is that there is still tons of value in optimizing on small data alone.

So again, I'm of two minds about it.

leafar said...

Great post! Excellent discussion!
I think jeremy has a good point.
One thing with data that are not yours is the way they are collected.

I think both approaches are interesting.

One point with the Netflix prize is that it does not allow disruption. You have to compete on RMSE (who cares if you are slightly better for average predictions) and have no way to innovate in the harvest process, which still matters a lot.
Taste & Entertainment are very specific