Wednesday, March 26, 2008

Using IMDb data for Netflix Prize

In a post titled "More data usually beats better algorithms", Anand Rajaraman talks about the success a team from his data mining class at Stanford found in the Netflix recommendation contest by using IMDb data to supplement the Netflix ratings training set.

Some excerpts:
Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?

Team B got much better results, close to the best results on the Netflix leaderboard! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize.

But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.
I do suspect that adding more data from IMDb, Amazon, or some other source will be necessary for anyone to win the Netflix Prize. As I said after working on the data a bit last year, "In my analyses, data simply seemed too sparse in some areas to make any predictions, and supplementing with another data set seemed like the most promising way to fill in the gaps."

Please see also my previous posts, "The advantages of big data and big clusters", "Better understanding through big data", "Netflix Prize enabling recommender research", and "The Netflix Prize and big data".

Update: Yehuda Koren from the currently top-ranked BellKor team swung by and commented:
Our experience with the Netflix data is different.

IMDB data and the like give more information only about the movies, not about the users ... The test set is dominated by heavily rated movies (but by sparsely rating users), so we don't really need this extra information about the movies.

Our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.
Update: Anand has a follow-up post. Some choice excerpts:
Why not have more data and better algorithms?

Scalable algorithms involve only a fixed number of sequential scans and sorts of data (since large data sets must necessarily reside on disk and not RAM). Most algorithms that require random access to data or take time greater than O(N log N) are not scalable to large data sets. For example, they cannot easily be implemented using methods such as MapReduce.

Thus, choosing a more complex algorithm can close the door to using large data sets, at least at reasonable budgets that don't involve terabytes of RAM.

In the case of web search, Google made a big leap by adding links and anchor text, which are independent data sets from the text of web pages. In the case of AdWords, the CTR data was an independent data set from the bid data. And Google ... became a domain registrar so they could add even more data about domain ownership and transfers into their ranking scheme. Google consistently has believed and bet on more data, while trumpeting the power of their algorithms.

You might think that it won't help much to add more of the same data, because diminishing returns would set in ... In many important cases, adding more of the same data makes a bigger difference than you'd think ... [such as when] the application sees the data embedded in a very-high-dimensional space.
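
To make the point about a fixed number of sequential scans concrete, here is a minimal sketch (assuming a ratings file with one movie_id,user_id,rating triple per line) of the kind of single-pass aggregation that works fine when the data lives on disk and maps cleanly onto MapReduce-style processing:

from collections import defaultdict

def movie_averages(path):
    # One sequential O(N) scan over the ratings file; no random access needed.
    totals = defaultdict(float)   # movie_id -> sum of ratings seen so far
    counts = defaultdict(int)     # movie_id -> number of ratings seen so far
    with open(path) as f:
        for line in f:
            movie_id, _user_id, rating = line.strip().split(",")
            totals[movie_id] += float(rating)
            counts[movie_id] += 1
    return {m: totals[m] / counts[m] for m in totals}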

6 comments:

Anonymous said...

Our experience with the Netflix data is different.
While extra information could be very useful, it must be orthogonal to what we already have. It is important to realize that IMDB data and the like give more information only about the movies, not about the users.
However, the test set is dominated by heavily rated movies (but by sparsely rating users), so we don't really need this extra information about the movies.

Of course, weak collaborative-filtering (CF) models can benefit from such IMDB information, but our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.

Yehuda Koren
The BellKor team

Anonymous said...

But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.

How do both approaches compare to more data combined with ever-better algorithms? If you combine better algorithmic intelligence with more data, does that beat more data with weak intelligence?

Anonymous said...

I'm very new to CF, but from what you said, Yehuda, the Netflix CF is user to user, right? Like similar movie choices group similar users? I wish Netflix would add some actor, director, and screenwriter ratings to the mix, as some movies are good but John Travolta is always bad.

Podunk said...

Something that still seems to be an open question is whether one approach is better than another in a general sense. For the Netflix test set, Yehuda implies that Netflix has (probably intentionally) biased it to cases where we don't know much about the user. They've made it a bit easier on us by focusing on movies with high support. In this case, and with the extensive data conditioning BellKor is doing, additional movie information may be just gravy that doesn't add much information (or even detracts from the less subjective information derived from the historical data).

But there's still an open question of the general case, where the movie may not have good support.

Either way, the Stanford example is too limited to draw a firm conclusion. Who is to say if the "more complex" algorithm was a good comparison to the IMDB algorithm? Were they doing PCA, which in at least some sense should make up for any advantage of the IMDB information, or was the more complex algorithm purely CF-based?

If anybody has a good cite on the relative merits of well-conditioned machine-learned recommendations vs. recommendations augmented with external data, though, I'd love to see it.

Josh

Anonymous said...

A simple question: how would using IMDb data even be possible unless a data fee were paid? And IMDb already has a direct relationship with Amazon.

Just wondering aloud.

Anonymous said...

I think Yehuda is right. The latent semantic indexing approaches, SVD, and so on all factor the ratings into features, which represent abstract concepts of the kind typically captured by metadata.
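
As a toy illustration (with numpy, and a tiny fully observed ratings matrix rather than the sparse Netflix one), the factorization pulls out latent "concept" features without needing any metadata at all:

import numpy as np

# rows = users, columns = movies; two obvious taste clusters
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [1.0, 2.0, 4.0, 5.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                             # keep the top-2 latent features
R_hat = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])

print(np.round(R_hat, 2))                         # ratings reconstructed from 2 features
print(np.round(Vt[:k, :], 2))                     # movie loadings on each latent feature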

I ended up gathering metadata from the Netflix BOB data URLs. For example, this is the kind of data for "Dinosaur Planet":

{"id":"60034316",
"title":"Dinosaur Planet",
"mpaa":"NR",
"year":2003,
"synopsis":"Explore the past and relive the sights, sounds and sensations of the dinosaurs' world, as cutting-edge filming techniques and hard science combine to create four unique narratives that illustrate the struggle to survive in the violent Cretaceous period -- all from the dinosaur's point of view. The dinosaurs learn to hunt, dodge natural disasters and simply live to see another day amid predators. Actor Christian Slater narrates each adventure.",
"genre":"Documentary",
"starring":[{"id":20055255,"name":"Scott Sampson"},{"id":86462,"name":"Christian Slater"}],
"starClarityData":{"average":"3.8"}}

I used a naive Bayesian approach, dumping all metadata from all of a user's movies rated 1, 2, 3, 4, or 5 into 5 'bins'. Then, to make a new prediction, I took the metadata for the movie and used Bayesian matching to see which of the 5 groups of metadata tags was the most likely. A weighted average over all 5 groups was used to produce a rating for that movie.
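
Roughly, the idea was something along these lines (simplified; treating the bins as per-user and each movie's metadata as a flat list of tags like genre, cast, and MPAA rating; the names here are illustrative, not the real code):

import math
from collections import Counter

def train_user_bins(rated_movies):
    # rated_movies: list of (rating 1..5, [tags]) pairs for one user
    bins = {r: Counter() for r in range(1, 6)}
    for rating, tags in rated_movies:
        bins[rating].update(tags)        # dump the movie's metadata into that rating's bin
    return bins

def predict(bins, tags, alpha=1.0):
    # Naive Bayes score of the movie's tags under each rating bin (with
    # Laplace smoothing), then a weighted average over the ratings 1..5.
    vocab = set(tags).union(*(c.keys() for c in bins.values()))
    logs = {}
    for r, counts in bins.items():
        total = sum(counts.values()) + alpha * max(len(vocab), 1)
        logs[r] = sum(math.log((counts[t] + alpha) / total) for t in tags)
    m = max(logs.values())
    scores = {r: math.exp(lp - m) for r, lp in logs.items()}   # avoid underflow
    norm = sum(scores.values())
    return sum(r * s / norm for r, s in scores.items())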

What I found was that it only achieved a 1.12 RMSE on the qualifying set, although it had only a 0.5 RMSE on the probe data set. Presumably this was because the probe movies were already included in the 5 Bayesian 'bins', so those predictions were almost perfect.

In order to try to get the predictions closer, I used results from a prior SVD run based on the Timely Development code (which had a 0.9261 RMSE) to help 'guess' what bin the qualifying set movies would fall into. If the movie was spot on (within 0.1 point of an integer rating), I dumped the metadata for the movie into that bin. Otherwise, I tried putting the metadata into the Left or Right bin (if the SVD predicted rating was 3.4, I would put it into the Left side [bin 3], compute a prediction, then the Right side [bin 4], make another prediction, then combine the two). This approach only produced an RMSE of 0.9459 on the qualifying set.
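
The blending step, again roughly (reusing the bins and predict sketch above; the equal-weight combine of the Left and Right predictions is just one way to do the final step, since I haven't spelled out the exact weighting):

def blended_prediction(bins, tags, svd_pred, alpha=1.0):
    def predict_with_tags_in(r):
        bins[r].update(tags)                 # temporarily dump the metadata into bin r
        p = predict(bins, tags, alpha)
        bins[r].subtract(tags)               # undo the temporary update
        return p

    nearest = int(round(svd_pred))
    if 1 <= nearest <= 5 and abs(svd_pred - nearest) <= 0.1:
        return predict_with_tags_in(nearest) # "spot on": use that bin directly
    left = max(1, min(4, int(svd_pred)))     # e.g. 3.4 -> Left bin 3 ...
    right = left + 1                         # ... and Right bin 4
    return 0.5 * (predict_with_tags_in(left) + predict_with_tags_in(right))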

So, barring bugs in my implementation, I suspect that there is little to gain from using more metadata, since any 'feature'-based approach should in theory be achieving the same factorizations.

Maybe someone can figure out how to turn this additional metadata into a lower RMSE?