Tuesday, June 21, 2005

MSN Search and Learning to Rank

In a post on MSN Search's Weblog, Ken Moss (GM, MSN Search) says:
In collaboration with Chris Burges and other friends from Microsoft Research, we now have a brand new ranker. The new ranker has improved our relevance and perhaps most importantly gives us a platform we think we can move forward on quicker than before. This new ranker also is based on technology with an awesome name -- it's a "Neural Net."
Microsoft Researcher Chris Burges was the lead author on a 2005 paper called "Learning to Rank using Gradient Descent" that does appear to try to use neural networks for relevance rank.

While applying machine learning techniques to relevance rank for web search is common, using neural networks is not. I am surprised to see neural networks used as part of the relevance rank in a system of this size and scope.

On an interesting side note, one of the co-authors on the paper, Tal Shaked, is now at Google. Tal appears to have been a PhD student of Oren Etzioni.

Update: A couple people ([1] [2]) have asked for someone to take this rather cryptic paper and translate it into something resembling English. I'm a geek, so I speak Geekish, but I'll do my best to translate this into English.

At a high level, the idea is to learn what search result documents are relevant to specific search queries. We're doing this so we can reorder the search results and put the most interesting ones up at the top.

This paper is using a neural network for the learning. Neural networks are pretty simple, no magic here. We take a bunch of data, "propagate" it through the network (basically, take a bunch of weighted sums of the inputs and munch them together), and get values out of the network.

This is supervised learning, so we start with a bunch of data that says things like, "For a search for 'personalized news', the most relevant search result is Findory.com." We then take that and a bunch of other data, run it through our network, and try to teach our system to do the right thing, to always say "Findory.com" when we ask what the most relevant search result is for a query for "personalized news".

The tricky part of this is "training" the network, which means learning what all the weights should be on all the links in the network. The cryptic parts of the paper (sections 3-5) are discussing the details of how they train the network. It's not really important for our purposes. What we care about is whether they were successful in finding weights that allow them to predict how relevant a document is to a given search query.

Some of the details of what they did are pretty interesting. For example, in Section 6.1, they say they "use 569 features" of documents as part of the input to their network. What that means is that they summarize each document by a list of 569 generalized properties of the document and then predict the relevance of the document using those properties. At least one of Chris Burges' other papers is on dimensionality reduction, so I assume these 569 features are automatically summarized from the documents using a preprocessing step, not simple features like the size of the document.

In plain English, they're not trying to predict how relevant each individual document is to each search query. They're trying to predict how certain features of documents determine the relevance of those documents to various search queries.

In Section 6.1, they describe their training set, just 17k queries with 1..5 labels for the relevance of just some of the documents. That's not a lot, especially because they say (in Section 6.2) that "only approximately 1%" of the documents are labeled. However, in a production system, you might be able to supplement this data using data from the logs. For example, if a lot of people click on "Geeking with Greg" on a search for "geeking", that document probably is relevant.

Also in Section 6.1, they say, "We chose to compute the NDCG at rank 15, a little beyond the set of documents initially viewed by most users." I think what this means is that the system implemented for this paper is a post-processing step over the normal search results. So, if you do a query at search.msn.com for "geeking", you get back 32981 results. Before you get to see the first page of results, this system would look at the first 15 of them and rerank them, possibly moving a more relevant result up to the #1 slot. Looking at only the first 15 results helps explain how the system is scalable, but it does limit the power of the system to surface relevant documents. However, it might be possible to integrate this neural network into the build of the search indexes, allowing it to look at many more documents.

In Section 6.2, they have a couple tables showing the predictive accuracy of the system, which appears to be under 50%. That accuracy seems pretty mediocre to me, but it's hard to tell without understanding the accuracy of other machine learning approaches. It is unfortunate that the paper doesn't spend more time on this. Offhand, it is not clear to me that a neural network is the best tool for this task, and the paper does little to address that question.


Greg Linden said...

Hi, Mark. To my knowledge, MSN hasn't talked about this anywhere else. I'm sure that Google and Yahoo improve their relevance rank using feedback from users -- self-learning, at least in some sense -- but I've never heard of anyone else using neural networks for web search results relevance rank at anything like this scale.

Anonymous said...

I cannot understand why it's surprising "to see neural networks used as part of the relevance rank in a system of this size and scope", for the concerns of the efficiency, or anything else?

I hope it's not too late to discuss about such old a post.

Greg Linden said...

Hi, Anonymous. I love NNets and worked with them a bit in the past, but I tend to agree with the criticism of them, that they can be inefficient, slow to learn, lack explainability (are opaque black boxes), and, generally, have the appearance of being a mediocre way (and often a lazier way) to do just about anything. That being said, I'm hardly an expert, especially on more recent work, and you shouldn't take my flip statement as too much more than me being surprised.