Comments on Geeking with Greg: Testing rankers by interleaving search results
Greg Linden (http://www.blogger.com/profile/09216403000599463072)

Anonymous — 2008-11-12, 10:03 AM:
Behavioral measures, in my opinion, almost always beat less specific "opinion" measures. A good deal of psychological literature suggests that reported intentions are only moderately correlated with future behavior, and social performance pressures can affect how you rate something. Additionally, from a statistical-power point of view, a within-subjects design (the same user sees both versions) is superior to a between-subjects design. When the difference between the A and B versions is not overtly disruptive to the user experience (as with reordering results), it makes sense to exploit that power advantage. It's a more elegant experimental design.

Anonymous — 2008-11-12, 6:14 AM:
There seems to be a trend moving away from "<I>rate how much you like something</I>" toward "<I>pick which one you prefer</I>." Machine translation evaluation has been moving in this direction too. I wonder if we'll see a variant of Netflix where we don't rate movies 1-5 but instead do "hot or not" between pairs of movies. Is someone already doing this (aside from <I>Hot or Not</I>, of course)?
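The power advantage of a within-subjects design claimed in the first comment can be sketched with a small simulation. This is an illustrative toy model, not anything from the post: each user is assumed to have a per-user baseline engagement level shared by both rankers, and the variable names (`baseline`, `effect`, `se_between`, `se_within`) are made up for this example. Measuring both rankers on the same user lets the shared baseline cancel out, shrinking the standard error of the comparison.

```python
import random
import statistics

random.seed(0)

n = 2000
# Hypothetical per-user baseline engagement; within-subjects, both
# rankers are measured on the same user, so this term is shared.
baseline = [random.gauss(0, 1.0) for _ in range(n)]
effect = 0.1  # assumed true advantage of ranker B, for illustration

a_scores = [b + random.gauss(0, 0.3) for b in baseline]
b_scores = [b + effect + random.gauss(0, 0.3) for b in baseline]

# Between-subjects comparison: user-to-user variance stays in the noise.
se_between = (statistics.variance(a_scores) / n
              + statistics.variance(b_scores) / n) ** 0.5

# Within-subjects comparison: per-user differencing cancels the baseline.
diffs = [y - x for x, y in zip(a_scores, b_scores)]
se_within = (statistics.variance(diffs) / n) ** 0.5

print(se_between, se_within)  # the paired standard error is much smaller
```

With these (assumed) noise levels, the paired standard error is several times smaller than the between-subjects one, which is exactly why interleaving — a within-subjects comparison at the level of a single results page — needs far less traffic to detect the same effect.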