Specifically, they found that showing all searchers an interleaved combination of the two possible search result orderings, rather than testing the two rankers in a normal A/B test (e.g. 50% of users see ranker A, 50% see ranker B), makes it much easier to see which ranker people prefer. The primary explanation the authors give is that interleaving the results gives searchers the easier task of expressing a relative preference between the two rankers.
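To make the idea concrete, here is a minimal sketch of a team-draft style interleaving scheme, one of the interleaving methods discussed in this line of work. This is my own simplified illustration, not the authors' code; the function name and the document lists in the example are made up:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length):
    """Team-draft interleaving sketch: the two rankers take turns (in a
    random order each round) contributing their highest-ranked document
    not already shown. Returns the interleaved list plus a record of
    which ranker contributed each position (used later to credit clicks)."""
    interleaved, team, seen = [], [], set()
    while len(interleaved) < length:
        order = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(order)  # randomize who drafts first this round
        progressed = False
        for label, ranking in order:
            for doc in ranking:
                if doc not in seen:
                    interleaved.append(doc)
                    team.append(label)
                    seen.add(doc)
                    progressed = True
                    break
            if len(interleaved) >= length:
                break
        if not progressed:  # both rankings exhausted
            break
    return interleaved, team

# Example: two hypothetical rankings for the same query.
docs, team = team_draft_interleave(["d1", "d2", "d3", "d4"],
                                   ["d3", "d1", "d5", "d6"], length=4)
```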
Some excerpts from the paper:
Unlike expert judgments, usage data ... such as clicks, query reformulations, and response times ... can be collected at essentially zero cost, is available in real time, and reflects the value of the users, not those of judges far removed from the users' context. The key problem with retrieval evaluation based on usage data is its proper interpretation.
We explored and contrasted two possible approaches to retrieval evaluation based on implicit feedback, namely absolute metrics and paired comparison tests ... None of the absolute metrics gave reliable results for the sample size collected in our study. In contrast, both paired comparison algorithms ... gave consistent and mostly significant results.
Paired comparison tests are one of the central experiment designs used in sensory analysis. When testing a perceptual quality of an item (e.g. taste, sound) ... absolute (Likert scale) evaluations are difficult to make. Instead, subjects are presented with two or more alternatives and asked ... which of the two they prefer.
This work proposes a method for presenting the results from two [rankers] so that clicks indicate a user's preference between the two. [Unlike] absolute metrics ... paired comparison tests do not assume that observable user behavior changes with retrieval quality on some absolute scale, but merely that users can identify the preferred alternative in direct comparison.
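Continuing the sketch above, clicks on the interleaved list can be credited to whichever ranker contributed the clicked result, and per-query winners tallied across many queries. Again, this is a hedged illustration of the general idea with made-up helper names; the paper itself applies proper statistical significance tests:

```python
from collections import Counter

def query_preference(team, clicked_positions):
    """Credit each click to the ranker that contributed that position;
    the ranker with more credited clicks wins this query."""
    credit = Counter(team[pos] for pos in clicked_positions)
    if credit["A"] > credit["B"]:
        return "A"
    if credit["B"] > credit["A"]:
        return "B"
    return None  # tie or no clicks: this query is uninformative

def tally(per_query_winners):
    """Aggregate wins across many queries; this sketch just counts,
    where the paper would test the difference for significance."""
    wins = Counter(w for w in per_query_winners if w is not None)
    return wins["A"], wins["B"]
```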
Please see also my older post, "Actively learning to rank", which summarizes some earlier very interesting work by Filip and Thorsten.

Update: Filip has a nice follow-up to this work in a SIGIR 2010 paper, "Comparing the Sensitivity of Information Retrieval Metrics". Particularly notable is that only about 10x as many clicks as explicit judgments are needed to detect small changes in relevance. Since click data is much easier and cheaper to acquire than explicit relevance judgments, this is another point in favor of using online measures of relevance rather than the older technique of asking judges (often a lot of judges) to compare the results.
2 comments:
There seems to be a trend moving away from "rate how much you like something" to "pick which one you prefer." Machine translation evaluation has been moving towards this too. I wonder if we'll see a variant of Netflix where we don't rate movies 1-5 but instead do "hot or not" between pairs of movies. Is someone already doing this (aside from Hot or Not of course)?
Behavioral measures, in my opinion, almost always beat less specific "opinion" measures. A good deal of the psychological literature suggests that reported intentions are only moderately correlated with future behavior, and that social performance pressures can affect how you rate something. Additionally, from a statistical power point of view, a within-subjects design (the same user sees both versions) is superior to a between-subjects design. When the difference between the A and B versions is not overtly disruptive to the user experience (as with ordering results), it makes sense to leverage the power advantage you get. It's a more elegant experimental design.
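The power argument in this comment is easy to see in a toy simulation (my own illustration, not from the paper or the comment): when user-to-user variation is large relative to the true difference between rankers, a paired (within-subjects) comparison detects the difference far more often than splitting users into two independent groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power_comparison(n_users=200, effect=0.1, user_sd=1.0, noise_sd=0.5, trials=1000):
    """Toy simulation: each user has a baseline level; version B shifts it
    by `effect`. Within-subjects, the baseline cancels in the paired
    difference; between-subjects, it remains in the comparison as noise."""
    within = between = 0
    for _ in range(trials):
        baseline = rng.normal(0, user_sd, n_users)
        a = baseline + rng.normal(0, noise_sd, n_users)
        b = baseline + effect + rng.normal(0, noise_sd, n_users)
        within += stats.ttest_rel(b, a).pvalue < 0.05
        # Between-subjects: two independent halves of the users.
        half = n_users // 2
        a2 = rng.normal(0, user_sd, half) + rng.normal(0, noise_sd, half)
        b2 = rng.normal(0, user_sd, half) + effect + rng.normal(0, noise_sd, half)
        between += stats.ttest_ind(b2, a2).pvalue < 0.05
    return within / trials, between / trials

print(power_comparison())  # the paired design detects the effect far more often
```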