Friday, October 24, 2008

Evaluating search result pages

Yahoo Chief Scientist Jan Pedersen recently wrote a short position paper, "Making Sense of Search Result Pages" (PDF), that has some interesting tidbits in it. Specifically, it advocates for click-based methods for evaluating search result quality and mentions using toolbar data to see what people are doing after leaving the search result page.

Some extended excerpts:
Search engine result pages are presented hundreds of millions of times a day, yet it is not well understood what makes a particular page better from a consumer's perspective. For example, search engines spend large amounts of capital to make search-page loading latencies low, but how fast is fast enough or why fast is better is largely a subject of anecdote.

Much of the contradiction comes from imposing an optimization criterion ... such as discounted cumulative gain (DCG) ... that does not account for perceptual phenomena. Users rapidly scan search result pages ... and presentations optimized for easy consumption and efficient scanning will be perceived as more relevant.

The process Yahoo! search uses to design, validate, and optimize a new search feature includes ... an online test of the feature ... [using] proxy measures for the desired behaviors that can be measured in the user feedback logs.

Search engine query logs only reflect a small slice of user behavior -- actions taken on the search results page. A more complete picture would include the entire click stream; search result page clicks as well as offsite follow-on actions.

This sort of data is available from a subset of toolbar users -- those that opt into having their click stream tracked. Yahoo! has just begun to collect this sort of data, although competing search engines have collected it for some time.

We expect to derive much better indicators of user satisfaction by considering the actions post-click. For example, if the user exits the clicked-through page rapidly then one can infer that the information need was not satisfied by that page.
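That last point, treating a rapid exit from a clicked-through page as a signal of dissatisfaction, is easy to sketch. Here is a minimal illustration of the idea; the function name and the 30-second threshold are my own placeholders, not anything from the paper:

```python
def infer_satisfaction(dwell_seconds, threshold=30.0):
    """Crude post-click proxy for satisfaction.

    A rapid exit (a "short click") suggests the page did not satisfy
    the information need; a longer dwell suggests it did. The
    30-second threshold is purely illustrative.
    """
    return "satisfied" if dwell_seconds >= threshold else "unsatisfied"
```

In practice a search engine would presumably use something richer than a single cutoff -- dwell time distributions, whether the user returned to the result page and clicked another result, and so on -- but this is the core inference the excerpt describes.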
For more on how search engines may be using toolbar data, please see my previous post, "Google Toolbar data and the actual surfer model".

For more on using click-based methods to evaluate search results, please see a post by Googler Ben Gomes, "Search experiments, large and small" as well as my previous posts, "Actively learning to rank" and "The perils of tweaking Google by hand".
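For contrast with these online, click-based approaches, the DCG measure Pedersen mentions is the classic offline metric: human judges assign graded relevance labels to results, and each result's contribution is discounted by its position on the page. A minimal sketch of the standard formula:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of graded
    relevance labels. The result at rank i (1-indexed) contributes
    rel_i / log2(i + 1), so relevant results placed lower on the
    page count for less."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))
```

Note how swapping a highly relevant result down even one position lowers the score -- which is exactly the sense in which DCG rewards ranking quality while, as the excerpt argues, saying nothing about how the page is perceived or scanned.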

By the way, rumor has it that Jan Pedersen left Yahoo and is now at A9. Surprising if true.

Update: Rumor confirmed. Jan Pedersen is now at A9.


Anonymous said...

But what if I am doing a search, just to check a simple fact. I click the link, verify the fact, and then close/exit the page almost immediately. The page could be very relevant, but because I am in "fact validation" mode, my clickstream behavior is different.

Similarly, what if I am in exploratory search mode? I.e., what if I am trying to learn more about a topic? Then I will be clicking *lots* of links, and spending lots of time on those pages, and maybe even going further from some of those pages by following their links.

But in the end, only some of those pages that I've spent a lot of time on are going to be relevant, and many are going to be non-relevant. But my click-behavior on both will be the same, because I am in exploratory mode.

Does this mean that people doing click-based analysis and evaluation completely ignore entire swaths of information seeking behaviors?

Daniel Tunkelang said...

I had a nice conversation with Marti Hearst about the question of defining "user satisfaction" at the HCIR workshop, after Raman Chandrasekar's talk about search experience satisfaction.

Chandra suggested that users are satisfied when they use a system because they want to, rather than because they have to. I'm inclined to agree, and suspect that the best way to measure satisfaction with a tool or application is based on whether users voluntarily continue using it.

The only catch is that, in order to measure satisfaction this way, you have to mitigate status quo bias, making the adoption vs. switching costs as even as possible. For example, require someone to use a new tool for a week, and then ask them if they want to continue using it or go back to their more familiar tool.