Thursday, November 20, 2008

Finding task boundaries in search logs

There has been more talk lately, it seems to me, about moving away from stateless search, where each search is independent, and toward a search engine that pays attention to your previous searches when it tries to help you find the information you seek.

That makes a paper by Rosie Jones and Kristina Klinkner from Yahoo Research at CIKM 2008, "Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs" (PDF), all the more relevant.

Rosie and Kristina looked at how to accurately determine when a searcher stops working on one task and starts looking for something new. The standard technique people have used in the past for finding task boundaries is to simply assume that all searches within a fixed period of time are part of the same task. But, in their experiments, they find that "timeouts, whatever their length, are of limited utility in identifying task boundaries, achieving a maximum precision of only 70%."

Looking at the Yahoo query logs more closely to explain this low accuracy, they find some surprises, such as the high number of searchers who work on multiple tasks simultaneously, even interleaving the searches for one task with the searches for another.

So, when the simple stuff fails, what do most people do? Think up a bunch of features and train a classifier. And, there you go, that's what Rosie and Kristina did. They trained a classifier using a set of features that combined characteristics of searcher behavior (e.g. people searching for [tribeca salon] after [new york hairdresser]) with characteristics of the queries (e.g. whether two queries are lexically similar or return similar results from a search engine), eventually achieving much higher accuracy on finding task boundaries.
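As a toy illustration of why behavioral and query features beat a timeout alone, here is a sketch of classifying whether two consecutive searches continue the same task. It is my own illustration, not the paper's classifier; the features, thresholds, and names are all invented.

```python
from collections import namedtuple
from datetime import datetime, timedelta

Search = namedtuple("Search", ["query", "time"])

def jaccard(q1, q2):
    """Lexical similarity: fraction of shared words between two queries."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def same_task(prev, cur, timeout=timedelta(minutes=30), min_sim=0.2):
    """Stand-in for a learned classifier: a search continues the previous
    task only if it is both close in time and lexically related."""
    close_in_time = (cur.time - prev.time) <= timeout
    return close_in_time and jaccard(prev.query, cur.query) >= min_sim

t0 = datetime(2008, 11, 20, 9, 0)
a = Search("new york hairdresser", t0)
b = Search("new york salon", t0 + timedelta(minutes=5))
c = Search("flu symptoms", t0 + timedelta(minutes=6))
# A pure timeout lumps all three searches together; the similarity
# feature separates the unrelated [flu symptoms] search into its own task.
```

A real classifier would combine many more features, including whether the two queries return similar results from a search engine.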

As the authors say, being able to accurately segment tasks could improve our ability to evaluate search engines. In particular, we could seek to minimize the amount of time needed by searchers "to satisfy an information need or fulfill a more complex objective" rather than just looking at click and usage data for one query at a time. Judging search engines by how well they help people get things done is something that, in my opinion, is long overdue.

Please see also my earlier post, "Tasks, not search, at DEMOfall08", where Head of Yahoo Research Prabhakar Raghavan said that people really don't want to search; what they really want is to fulfill their tasks and get things done.

Wednesday, November 19, 2008

Detecting spam just from HTTP headers

You have to love research work that takes some absurdly simple idea and shows that it works much better than anyone would have guessed.

Steve Webb, James Caverlee, and Calton Pu had one of these papers at CIKM 2008, "Predicting Web Spam with HTTP Session Information" (PDF). They said, everyone else seems to think we need the content of a web page to see if it is spam. I wonder how far we can get just from the HTTP headers?

Turns out surprisingly far. From the paper:
In our proposed approach, the [crawler] only reads the response line and HTTP session headers ... then ... employs a classifier to evaluate the headers ... If the headers are classified as spam, the [crawler] closes the connection ... [and] ignores the [content] ... saving valuable bandwidth and storage.

We were able to detect 88.2% of the Web spam pages with a false positive rate of only 0.4% ... while only adding an average of 101 [microseconds] to each HTTP retrieval operation .... [and saving] an average of 15.4K of bandwidth and storage.
It appears that web spammers tend to use specific IP ranges and put unusual gunk into their headers (e.g. "X-Powered-By" and "Link" fields), which makes it fairly easy to pick them out just from their headers. As one person suggested during the Q&A for the talk, spammers probably would quickly correct these oversights if it became important, but you still have to give credit to the authors for this cute and remarkably effective idea.
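To make the idea concrete, here is a hedged sketch of what classifying on header features might look like. The feature set and weights are invented for illustration; only the "X-Powered-By" and "Link" signals come from the discussion above, and the paper's actual classifier is learned from many more features, including IP ranges.

```python
def header_features(headers):
    """A few illustrative features drawn from HTTP response headers."""
    h = {k.lower() for k in headers}
    return {
        "has_x_powered_by": "x-powered-by" in h,
        "has_link_field": "link" in h,
        "num_headers": len(h),
    }

def looks_spammy(headers):
    """Hand-tuned stand-in for the paper's learned classifier."""
    f = header_features(headers)
    score = 2 * f["has_x_powered_by"] + 2 * f["has_link_field"]
    return score >= 2
```

In the paper's setting, a crawler would run something like this on the session headers alone and close the connection before ever fetching the page body.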

If you might enjoy another example of making remarkable progress using a simple idea, please see my earlier post, "Clever method of near duplicate detection". It summarizes a paper about uniquely identifying the most important content in a page by looking at the least important words on the page, the stop words.

Monday, November 17, 2008

Measuring offline ads by their online impact

Googlers Diane Lambert and Daryl Pregibon had a paper at AdKDD 2008, "Online Effects of Offline Ads" (PDF), with a fun look at how far we can get measuring the impact of offline advertising by increases in search queries or website visits.

Some excerpts:
One measure of offline ad effectiveness is an increase in brand related online activity .... As people spend more time on the web, [the] steps toward purchase increasingly include searching for the advertiser's brand or visiting the advertiser's websites, even if the ad campaign was in an offline medium such as print, radio, or TV.

There are two obvious strategies for estimating the [gain] ... We can assume that the past is like the present and use daily outcomes before the campaign ran ... [but] the "before" number of visits ... is not necessarily a good estimate ... if interest in the product is expected to change over time even if no ad campaign is run. For example, if an ad is more likely to be run when product interest is high, then comparing counts-after to counts-before overstates the effect of the campaign.

Alternatively, we could estimate the [gain] ... by the outcome in control DMAs, which are markets in which the ad did not appear ... One problem, though, is that the advertiser may be more likely to advertise in DMAs where the interest in the product is likely to be high.
The paper goes on to detail the technique used and the ability of the technique to detect changes in very noisy traffic data.
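The control-DMA idea can be sketched as a simple counterfactual estimate. This is my own simplification, not the paper's actual technique, and the numbers are made up: scale the test markets' pre-campaign traffic by the growth seen in the control markets, and attribute only the remainder to the campaign.

```python
def lift_estimate(test_before, test_after, ctrl_before, ctrl_after):
    """Estimate extra brand-related queries per day in the test DMAs.
    Scale the test markets' pre-campaign traffic by the growth observed
    in control DMAs, then subtract that counterfactual from what
    actually happened after the campaign ran."""
    counterfactual = test_before * (ctrl_after / ctrl_before)
    return test_after - counterfactual

# Made-up numbers: test DMAs went 1000 -> 1300 queries/day, while
# controls grew 10% (800 -> 880) from rising product interest alone.
# lift_estimate(1000, 1300, 800, 880) attributes 200 queries/day to the ad.
```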

This paper is part of a larger group of fun and remarkably successful work that tries to predict offline trends from online behavior. One example that recently received much press attention is Google Flu Trends, which uses searches for flu-related terms to predict real flu outbreaks. Another example is recent work out of Yahoo Research and Cornell to find the physical location of objects of interest from a combination of search terms and geolocation data.

Friday, November 14, 2008

Learning not to advertise

Andrei Broder and a large crew from Yahoo Research had a paper at CIKM 2008, "To Swing or not to Swing: Learning when (not) to Advertise" (PDF), that is a joy to see for those of us who are hoping to make advertising more useful and less annoying.

The paper starts by motivating the idea of sometimes not showing ads:
In Web advertising it is acceptable, and occasionally even desirable, not to show any [ads] if no "good" [ads] are available.

If no ads are relevant to the user's interests, then showing irrelevant ads should be avoided since they impair the user experience [and] ... may drive users away or "train" them to ignore ads.
The paper looks at a couple of approaches to deciding when to show ads, one based on a simple threshold over the relevance score produced by Yahoo's ad ranking system, the other training a more specialized classifier on a long list of features.
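The first of those approaches is easy to sketch. The threshold value and the ad representation below are illustrative, not Yahoo's:

```python
def select_ads(candidates, threshold=0.35, max_ads=3):
    """Global-threshold approach: drop any ad whose relevance score falls
    below the cutoff, and show nothing at all if no ad clears it."""
    relevant = [ad for ad in candidates if ad["score"] >= threshold]
    relevant.sort(key=lambda ad: ad["score"], reverse=True)
    return relevant[:max_ads]  # an empty list means: show no ads
```

The paper's second approach replaces the single relevance score with a classifier trained over many features of the query and the candidate ads.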

An unfortunate flaw in the paper is that the system was evaluated using a manually labeled set of relevant and irrelevant ads. As the paper itself says, it would have been better to consider expected revenue and user utility, preferably using data from actual Yahoo users. But, with the exception of a brief mention of "preliminary experiments ... using click-through data" that they "are unable to include ... due to space constraints", they leave the question of the revenue and user satisfaction impact of not showing ads to future work.

Please see also the related work out of Google on limiting the damage from irrelevant ads that I describe in my previous posts, "Advertising auctions and modeling externalities" and "Hal Varian on advertising auctions".

Please see also my summary of Andrei Broder's industry day talk at CIKM 2008 in my earlier post, "More on computational advertising".

Thursday, November 13, 2008

Baidu on next generation search

Baidu Chief Scientist William Chang gave an industry day talk at CIKM 2008 on the next generation of search where he predicted an industry-wide move toward personalized web search.

William first described two earlier generations of search, then said the upcoming third generation of search would be the "internet as a matching network". He expected to see an integration of search and recommendations, where recommendations are used to find related concepts, entities, phrases, documents, and sources.

As part of this, he expected to see a renewed interest in the diversity of search results -- he described it as relevance versus user satisfaction -- that appeared to be going down the path of information exploration and search as a dialogue to better help searchers with the task behind the search phrase.

William also briefly mentioned personalized spidering. While he did not elaborate, I would guess he meant software agents gathering, synthesizing, and summarizing information from several deep web sources to satisfy a complicated task.

This talk was one of the ones recorded at the conference and should be available online in a week or so.

Please see also my earlier post, "Rakesh Agrawal at CIKM 2008", that summarizes Rakesh's predictions about a new interest in diversity and discovery in web search.

Please see also my earlier post, "More on computational advertising", which summarizes Andrei Broder's CIKM talk on integrating advertising and recommendations.

Update: William Chang asked me to add that "this talk reflects my personal views on cutting-edge search, not necessarily the company's."

Wednesday, November 12, 2008

As the Internet grows, so does Google

In two insightful posts, "The Omnigoogle" and "The cloud's Chrome lining", Nick Carr cleanly summarizes how Google benefits from the growth of the Web and why expanding Web use makes sense as a large part of their business strategy.

Some excerpts:
[Google] knows that its future ... hinges on the continued rapid expansion of the usefulness of the Internet, which in turn hinges on the continued rapid expansion of the capabilities of web apps, which in turn hinges on rapid improvements in the workings of web browsers .... [Chrome's] real goal ... is to upgrade the capabilities of all browsers.

The way Google makes money is straightforward: It brokers and publishes advertisements through digital media. More than 99 percent of its sales have come from the fees it charges advertisers for using its network to get their messages out on the Internet.

For Google, literally everything that happens on the Internet is a complement to its main business. The more things that people and companies do online, the more ads they see and the more money Google makes.

Just as Google controls the central money-making engine of the Internet economy (the search engine), Microsoft controlled the central money-making engine of the personal computer economy (the PC operating system).

In the PC world, Microsoft had nearly as many complements as Google now has in the Internet world, and Microsoft, too, expanded into a vast number of software and other PC-related businesses - not necessarily to make money directly but to expand PC usage. Microsoft didn't take a cut of every dollar spent in the PC economy, but it took a cut of a lot of them.

In the same way, Google takes a cut of many of the dollars that flow through the Net economy. The goal, then, is to keep expanding the economy.

Tuesday, November 11, 2008

Testing rankers by interleaving search results

Filip Radlinski, Madhu Kurup, and Thorsten Joachims had a paper at CIKM 2008, "How Does Clickthrough Data Reflect Retrieval Quality?" (PDF), with a surprising result on learning to rank using click data.

Specifically, they found that, instead of testing two search rankers in a normal A/B test (e.g. 50% of users see ranker A, 50% see ranker B), showing all searchers an interleaved combination of the two possible search result orderings makes it much easier to see which ranker people prefer. The primary explanation the authors give for this is that interleaving the results gives searchers the easier task of expressing a relative preference between the two rankers.
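To make interleaving concrete, here is a rough sketch of a balanced-interleaving merge. The paper also evaluates a team-draft variant; this simplified version is mine, not the authors' exact algorithm. Clicks on the merged list are then credited back to the ranker that contributed each result.

```python
def interleave(a, b, first_a=True):
    """Merge two rankings by alternating between them, skipping results
    already taken, so every searcher sees a single combined list."""
    merged, seen = [], set()
    ia = ib = 0
    take_a = first_a
    while ia < len(a) or ib < len(b):
        src, i = (a, ia) if take_a else (b, ib)
        # Advance past anything already placed in the merged list.
        while i < len(src) and src[i] in seen:
            i += 1
        if i < len(src):
            merged.append(src[i])
            seen.add(src[i])
        if take_a:
            ia = i + 1
        else:
            ib = i + 1
        take_a = not take_a
    return merged
```

In practice the ranker that goes first is chosen randomly per query so that neither ranker gets a systematic position advantage.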

Some excerpts from the paper:
Unlike expert judgments, usage data ... such as clicks, query reformulations, and response times ... can be collected at essentially zero cost, is available in real time, and reflects the values of the users, not those of judges far removed from the users' context. The key problem with retrieval evaluation based on usage data is its proper interpretation.

We explored and contrasted two possible approaches to retrieval evaluation based on implicit feedback, namely absolute metrics and paired comparison tests ... None of the absolute metrics gave reliable results for the sample size collected in our study. In contrast, both paired comparison algorithms ... gave consistent and mostly significant results.

Paired comparison tests are one of the central experiment designs used in sensory analysis. When testing a perceptual quality of an item (e.g. taste, sound) ... absolute (Likert scale) evaluations are difficult to make. Instead, subjects are presented with two or more alternatives and asked ... which of the two they prefer.

This work proposes a method for presenting the results from two [rankers] so that clicks indicate a user's preference between the two. [Unlike] absolute metrics ... paired comparison tests do not assume that observable user behavior changes with retrieval quality on some absolute scale, but merely that users can identify the preferred alternative in direct comparison.
Please see also my older post, "Actively learning to rank", which summarizes some earlier very interesting work by Filip and Thorsten.

Update: Filip had a nice update to this work in a SIGIR 2010 paper, "Comparing the Sensitivity of Information Retrieval Metrics". Particularly notable is that only about 10x as many clicks are required as explicit judgments to detect small changes in relevance. Since click data is much easier and cheaper to acquire than explicit relevance judgments, this is another point in favor of using online measures of relevance rather than the older technique of asking judges (often a lot of judges) to compare the results.

Monday, November 10, 2008

Tasks, not search, at DEMOfall08

Philipp Lenssen points to a video of a DEMOfall08 panel on "Where the Web is Going" that included Peter Norvig from Google and Prabhakar Raghavan from Yahoo.

What is notable in the talk is that (starting at 12:32) Prabhakar and Peter agree that, rather than supporting only one search at a time, search engines will soon focus on helping people get a bigger task done.

Prabhakar says:
I think the next step that faces us ... is divining the intent of what people are doing [and] fulfilling their tasks, getting what they need to get done.

People intrinsically don't want to search. People don't come to work every day saying I need to search ... They want to run their lives.

The notion of a [mere] retrieval engine as the ultimate [tool] is incredibly limiting. We have to get much further along to task completion and fulfillment.

The current world of search engines [are] stateless ... In the act of completing a task -- booking a vacation, finding a job -- you spend hours and days start to end. In the process, you make repeated invocations of search engines ... And all the while the search engine has no state about you. It doesn't recognize that you are in the midst of a task.

What does it take to recognize an intent and synthesize an experience that satisfies an intent? So, if it is a vacation you are planning, it should say ... here is a package I recommend that is based on your budget, the fact that you have two kids, that you don't want to go to too many museums. That's the future we have to get to.
Peter says:
We have to get a lot farther in saying what is it that the user means both in ... tasks where there is a clear intent ... and [even] more so in cases where it is exploratory.

We see this all the time that the user has some area he wants to figure out -- let's say a medical problem -- and the user starts out by kind of floundering around not sure what to talk about and then he reads some document and then he starts to learn the lingo. And, now they say, I don't say funny red splotch, I use this medical term and now I'm on the right track.

We have to accelerate that process ... not make the user do all the work.
There is an interesting shift here from a model where each search is independent to one where a search engine may expect searchers to do multiple searches when trying to accomplish their tasks.

That new model could take the form of search as a dialogue (a back-and-forth with the search engine focused on helping you understand what information is out there), personalized search (re-ranking your results based on your past actions, interests, and goals), or recommender systems (helping you discover interesting things you might not know exist using what people like you found interesting). Most likely, I would expect, it would require a combination of all three.

Thursday, November 06, 2008

More on computational advertising

Yahoo Fellow, VP, and computational advertising guru Andrei Broder gave a talk at CIKM 2008 on "The Evolving Computational Advertising Landscape" with some notable details that were missing from his previous talks on this topic.

Specifically, Andrei described "the recommender system connection" with advertising where we want "continuous CTR feedback" for each (query, ad) pair to allow us to learn the "best match between a given user in a given context and a suitable advertisement". He said this was "closest to a recommender system" because, to overcome sparse data and get the best match, we need to find similar ads, pages, and users.
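As a toy example of the sparse data problem with per-(query, ad) CTR estimates, a common trick, offered here for illustration and not necessarily what Yahoo does, is to smooth each pair's observed rate toward a prior, such as the average CTR of similar ads:

```python
def smoothed_ctr(clicks, impressions, prior_ctr=0.02, prior_strength=50):
    """Shrink an ad's observed click-through rate toward a prior (say, the
    average CTR of similar ads). With little feedback the estimate stays
    near the prior; as impressions accumulate, it converges to the
    observed rate. All numbers here are illustrative."""
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)

# With no data at all, the estimate is just the prior (0.02); after
# 10,000 impressions and 500 clicks, it is close to the observed 5%.
```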

At this point, Andrei offered a bit of a tease that an upcoming (and not yet available) paper that has Yehuda Koren as a co-author will talk more on this topic of treating advertising as a recommender problem. Yehuda Koren recently joined Yahoo Research and is one of the members of the top-ranked team in the Netflix recommender contest.

Andrei continued on the theme of personalization and recommendations for advertising, talking briefly about personalized ad targeting and saying that he thought short-term history (i.e. the last few queries) likely would be more useful than long-term history (i.e. a subject-interest profile).

Andrei also talked about several other topics that, while covered in his older talks, are quite interesting. He contrasted computational advertising with classical advertising, saying that the former uses billions of ads and venues, has a liquid market, and is personalizable, measurable, and low cost. He described the four actors in an advertising market -- advertisers, ad agencies, publishers, and users -- and said that advertising engines have the difficult task of optimizing the four separate and possibly conflicting utility functions of these actors. He talked about the ideal of "advertising as a form of information" rather than as an annoyance, the key there being making it as useful and relevant as possible. And, he spent some time on mobile advertising, talking about the very interesting but slightly scary possibilities of using precise locations of individuals and groups over time: doing "instant couponing" for nearby stores (where what counts as nearby depends on your current speed and whether that makes it clear you are in a car), recognizing which stores are open and which are popular, predicting the lifestyle choices and interests of individuals and groups, and making recommendations.

This talk was one of the ones recorded at the conference and should be available online in a week or so.

Please see also my previous posts on related topics including "Google describes perfect advertising", "Recommending advertisements", and "What to advertise when there is no commercial intent?"

Wednesday, November 05, 2008

Data is good, code is a liability

Googler Peter Norvig gave a talk at industry day at CIKM 2008 that, despite my fascination with all things Peter Norvig, almost frightened me off by including the phrase "the Ultimate Agile Development Tool" in its title.

The talk redeemed itself in the first couple minutes, citing Steve Yegge's "Good Agile, Bad Agile" and making it clear that Peter more meant being agile than Agile.

His core point was that "code is a liability". Relying on data over code as much as possible allows simpler code that is more flexible, adaptive, and robust.

In one of several examples, Peter put up a slide showing an excerpt from a rule-based spelling corrector. The snippet of code, just part of a much larger program, contained a nearly impossible to understand, let alone verify, set of case and if statements representing rules for spelling correction in English. He then put up a slide containing a few-line Python program for statistical spelling correction that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also can easily be used in different languages.
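Peter's short essay "How to Write a Spelling Corrector" describes the same idea at length. Here is a condensed sketch in that spirit (my compression, not his exact program), with a tiny inline corpus standing in for the large data file:

```python
import re
from collections import Counter

# Word frequencies would be learned from a large document collection;
# a toy corpus stands in here for illustration.
CORPUS = "the quick brown fox jumps over the lazy dog the fox ran"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word."""
    candidates = (WORDS.keys() & {word}) or (WORDS.keys() & edits1(word)) or {word}
    return max(candidates, key=lambda w: WORDS[w])
```

Swap in a data file of French text and the same dozen lines correct French, which is exactly the flexibility Peter was pointing at.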

For another example, Peter pulled from Jing et al, "Canonical Image Selection from the Web" (ACM), which uses a clever representation of the features of an image, a huge image database, and clustering of images with similar features to find the most representative image of, for example, the Mona Lisa on a search for [mona lisa].

Peter went on to say that more data seems to help on many problems more than complicated algorithms. More data can hit diminishing returns at some point, but that point seems to be fairly far out for many problems, so keeping it simple while processing as much data as possible often seems to work best. Google's work in statistical machine translation works this way, he said, primarily using the correlations discovered between words in different languages in a training set of 18B documents.

The talk was one of the ones recorded at the conference and should be available online in a week. If you cannot wait, the CIKM talk was similar to Peter's startup school talk from a few months ago, so you could use that as a substitute.