Thursday, May 01, 2008

Random walks of the click graph

I recently have been coming across several interesting efforts that use random walks of a click graph for relevance or keyword suggestions in search and advertising.

The basic idea is to build a bipartite graph with queries on one side, documents or ads on the other, and weighted links representing clicks after making the query on the documents or ads. Then, we start at one point of the graph, randomly choose a link to walk across, and keep doing that for several steps, seeing where we end up most frequently.

One nice example of this is the WWW 2008 short paper "Simrank++: Query Rewriting through Link Analysis of the Click Graph" (PDF + tech report PDF) by Ioannis Antonellis, Hector Garcia-Molina, and Chi-Chao Chang.

An excerpt from the longer tech report:
We focus on query rewrites based on the recent history of ads displayed and clicked on.

The back-end generates a historical click graph that records the clicks that were generated by ads when a user inputs a given query. The click graph is a weighted bi-partite graph, with queries on one side and ads on the other.

Our techniques identify not only queries that are directly connected by an ad (e.g., users that submit either "mp3" or "i-tunes" click on ad an for "iPod") but also queries that are more indirectly related.

Notice that our query rewriting problem is a type of collaborative filtering (CF) problem. We can view the queries as "users" who are recommending "ads" by clicking on them. When we identify similar queries, we are finding queries that have similar recommendations, just like in CF, where one finds users that have similar tastes.
Another short paper at WWW 2008, this one by Martin Szummer and Nick Craswell, "Behavioral Classification on the Click Graph" (PDF), looked at using random walks of the click graph for classifying adult websites.

An excerpt:
We present a method for exploiting user browsing activity to
classify web pages ... as adult (inappropriate for minors) or not.

When a user types a query and then clicks a search result, they create a query-URL association. By logging a large number of click events, the search engine can amass a large number of query-URL pairs. These can be viewed as a bipartite graph, where each query is adjacent to one or more URLs and each URL is adjacent to one or more queries.

Our method exploits the structure of this graph. We implicitly look for clusters of nodes (representing queries or URLs) that share clicks to/from each other. We assume that nodes that cluster well are likely to belong to the same underlying category.

The implicit clustering is done via a Markov random walk on the graph. The walk captures the transitivity of class similarity on the graph: if A is co-clicked with B and B is co-clicked with C, then A is also likely to be related to C.
To me, this had the feel of a click graph-based version of TrustRank (PDF), propagating reputation across a click graph instead of a link graph.

Please see also Nick and Martin's SIGIR 2007 paper, "Random Walks on the Click Graph" (PDF), which looks using click graph random walks for relevance rank in image search.

Please see also yet another WWW 2008 paper, "Using the Wisdom of the Crowds for Keyword Generation" (PDF), by Ariel Fuxman, Panayiotis Tsaparas, Kannan Achan, and Rakesh Agrawal, which also looks at using random walks of the click graph in order to suggest keywords for advertising.

No comments: