I probably started drooling when I first noticed this paper, "Google News Personalization: Scalable Online Collaborative Filtering" (PDF), written by four Googlers and being presented at the WWW 2007 Conference this weekend.
The paper does not disappoint. It is an awesome example of what can be done at Google scale with their data, traffic, and massive computational cluster.
The paper tested three methods of making news recommendations on the Google News front page. From the abstract:
We describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts.
MinHash and PLSI are both clustering methods: a user is matched to a cluster of similar users, and recommendations are drawn from the aggregate behavior of the users in that cluster. Covisitation is an item-based method that computes which articles people tend to look at if they looked at a given article (i.e. "Customers who visited X also visited...").
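The MinHash trick is worth a quick illustration. The probability that two users produce the same minimum hash value over their click histories equals the Jaccard similarity of those histories, so similar users tend to collide into the same bucket. Here is a toy Python sketch of that idea -- just my illustration of the technique, not the paper's implementation, which is far more elaborate and runs distributed over their cluster:

```python
import hashlib
from collections import defaultdict

def minhash(item_set, seed):
    """Minimum hash value over the items in a set, for one seeded hash function."""
    return min(
        int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
        for item in item_set
    )

def cluster_key(click_history, num_hashes=3):
    """Concatenate several MinHash values; users with identical keys land in the
    same cluster. Two users share one MinHash value with probability equal to the
    Jaccard similarity of their click histories, so similar users tend to collide."""
    return tuple(minhash(click_history, seed) for seed in range(num_hashes))

# Toy usage: group users into clusters by their click histories.
histories = {
    "alice": {"story1", "story2", "story3"},
    "bob":   {"story1", "story2", "story4"},
    "carol": {"story7", "story8"},
}
clusters = defaultdict(list)
for user, clicks in histories.items():
    clusters[cluster_key(clicks)].append(user)
```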
The paper does a nice job motivating the use of recommendations for news:
The Internet has no dearth of content. The challenge is in finding the right content for yourself: something that will answer your current information needs or something that you would love to read, listen or watch.
Search engines help solve the former problem; particularly if you are looking for something specific that can be formulated as a keyword query.
However, in many cases, a user may not even know what to look for ... Users ... end up ... looking around ... with the attitude: Show me something interesting.
In such cases, we would like to present recommendations to a user based on her interests as demonstrated by her past activity on the relevant site .... [and] the click history of the community.
The authors then explain that what makes this problem so difficult is doing it at scale in real-time over rapidly changing data.
Google News (http://news.google.com) is visited by several million unique visitors ... [and] the number of ... news stories ... is also of the order of several million.
[On] Google News, the underlying item-set undergoes churn (insertions and deletions) every few minutes and at any given time the stories of interest are the ones that appeared in last couple of hours. Therefore any model older than a few hours may no longer be of interest.
The Google News website strives to maintain a strict response time requirement for any page views ... [Within] a few hundred milliseconds ... the recommendation engine ... [must] generate recommendations.
The authors note that, while Amazon.com may be doing recommendations at a similar scale and speed, the item churn rate of news is much faster than that of Amazon's product catalog, making the problem more difficult and requiring different methods.
The Googlers did several evaluations of their system, but the key result is that two versions that combined all three of the approaches in different ways generated 38% more clickthroughs than just showing the most popular news articles. That seems surprisingly low compared to our results for product recommendations at Amazon.com -- we found recommendations generated a couple orders of magnitude more sales than just showing top sellers -- but Google's results are still a good lift.
As for the reason for the lower lift, I ended the paper with a couple of questions about parts of their work.
First, it appears that the cluster membership is not updated in real-time. When discussing PLSI, the authors say they "update the counts associated with that story for all the clusters to which the user belongs" but that this is an approximation (since cluster membership cannot change in this step) and does not work for new users (who have no cluster memberships yet).
This is a typical problem with clustering approaches -- building the clusters usually is an expensive offline computation -- but it seems like a cause for concern if their goal is to offer "instant gratification" to users by changing the recommendations as soon as they click on new articles.
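To make that concrete, here is a toy Python sketch of how I read their approximate online update: per-cluster story counts get bumped on every click, but the user-to-cluster memberships stay frozen until the next offline rebuild. All of the names and weights here are invented for illustration, not taken from the paper.

```python
from collections import defaultdict

# Fractional cluster memberships computed offline (e.g., by PLSI); fixed until the next rebuild.
user_cluster_weights = {
    "alice": {"c1": 0.8, "c2": 0.2},
    "bob":   {"c1": 0.5, "c3": 0.5},
}

# Online counts of clicks on each story, per cluster.
cluster_story_counts = defaultdict(lambda: defaultdict(float))

def record_click(user, story):
    """Approximate online update: bump the story's count in each of the user's
    clusters, weighted by membership. Memberships themselves never change here,
    which is why a brand-new user gets nothing until the next offline rebuild."""
    for cluster, weight in user_cluster_weights.get(user, {}).items():
        cluster_story_counts[cluster][story] += weight

def recommend(user, k=5):
    """Score candidate stories by summing cluster-level counts weighted by membership."""
    scores = defaultdict(float)
    for cluster, weight in user_cluster_weights.get(user, {}).items():
        for story, count in cluster_story_counts[cluster].items():
            scores[story] += weight * count
    return sorted(scores, key=scores.get, reverse=True)[:k]
```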
Second, it appears to me that the item-based covisitation technique will be biased toward popular items. In describing that algorithm, the authors say, "Given an item s, its near neighbors are effectively the set of items that have been covisited with it, weighted by the age discounted count of how often they were visited."
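As I read it, the covisitation table looks roughly like the following toy Python sketch. The exponential decay, the half-life constant, and the pairing window are my guesses at what "age discounted" means; the paper does not spell out the exact discounting.

```python
import math
import time
from collections import defaultdict

HALF_LIFE_SECONDS = 3600.0  # assumed half-life for the age discount; not a number from the paper

covisit = defaultdict(lambda: defaultdict(float))      # covisit[a][b] -> decayed covisitation count
last_update = defaultdict(lambda: defaultdict(float))  # when covisit[a][b] was last touched

def decay(count, elapsed):
    """Exponentially discount a count by how long ago it was last updated."""
    return count * math.exp(-math.log(2) * elapsed / HALF_LIFE_SECONDS)

def record_covisit(story_a, story_b, now=None):
    """Called when a user views story_b within a short window of viewing story_a.
    Counts are only decayed when touched, a simplification for this sketch."""
    now = now or time.time()
    for a, b in ((story_a, story_b), (story_b, story_a)):
        elapsed = now - last_update[a][b] if last_update[a][b] else 0.0
        covisit[a][b] = decay(covisit[a][b], elapsed) + 1.0
        last_update[a][b] = now

def near_neighbors(story, k=10):
    """Stories most often covisited with the given story, by decayed count."""
    return sorted(covisit[story], key=covisit[story].get, reverse=True)[:k]
```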
If that is right, this calculation would seem to suffer from what we used to call the "Harry Potter problem", so-called because everyone who buys any book, even books like Applied Cryptography, probably also has bought Harry Potter. Not compensating for that issue almost certainly would reduce the effectiveness of the recommendations, especially since the recommendations from the two clustering methods likely also would have a tendency toward popular items.
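A common way to compensate is to normalize the raw covisitation count by each item's overall popularity -- a cosine-style adjustment -- so that an item everyone reads stops dominating every neighbor list. A toy sketch of that normalization (my suggestion, not something the paper describes doing):

```python
import math

def normalized_score(covisit_count, views_a, views_b):
    """Divide the raw covisitation count by the geometric mean of each item's
    total view count, so an item that everyone views no longer dominates
    every neighbor list."""
    if views_a == 0 or views_b == 0:
        return 0.0
    return covisit_count / math.sqrt(views_a * views_b)

# Example: a wildly popular item is covisited with everything simply because it is popular.
print(normalized_score(covisit_count=900, views_a=1000, views_b=1_000_000))  # ~0.03
print(normalized_score(covisit_count=90,  views_a=1000, views_b=2_000))      # ~0.06, niche pair wins
```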
I have to say, having worked on this problem with Findory for the last four years, I feel like I know it well. Findory needs to generate recommendations for hundreds of thousands of users (not yet millions) in real-time. New articles enter and leave the system rapidly; newer stories tend to be the stories of interest. The recommendations must change immediately when someone reads something new.
This work at Google certainly is impressive. I would have loved to have access to the kind of resources -- the cluster, MapReduce, BigTable, the news crawl, the traffic and user behavior data, and hordes of smart people everywhere around -- that these Googlers had. Good fun, and an excellent example of the power of Google scale for trying to solve these kinds of problems.