A paper out of Microsoft Research, "A Data Structure for Sponsored Search" (PDF), has a couple of simple but effective optimizations in it that are fun and worth thinking about.
First, a bit of explanation. When trying to match advertisers to search queries, search engines often (90%+ of the time) use broad match, which only requires the advertiser's keywords to match a subset of the query terms. For example, if an advertiser bids for their ads to show on searches for "used books", the ad might also show on searches for [cheap used books], [gently used books], and [used and slightly traumatized books].
You can do this match using a normal inverted index, but it is expensive. For example, on the search for [cheap used books], you need to find all ads that want any of the terms (cheap OR used OR books), then filter out ads that wanted other terms (e.g. an ad that bid on "new books" needs to be filtered out at this step). For very popular keywords such as "cheap", many ads can end up being retrieved in the first step only to be filtered out in the second.
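To make that retrieve-then-filter cost concrete, here is a tiny sketch (in Python, with made-up ads and terms, not anything from the paper) of broad match over an ordinary inverted index; notice how the common word "cheap" drags in an unrelated ad that then has to be filtered out:

    from collections import defaultdict

    # Toy data: each ad id maps to the set of terms in its bid phrase (made up).
    ads = {
        1: {"used", "books"},     # bid: "used books"
        2: {"new", "books"},      # bid: "new books"
        3: {"cheap", "flights"},  # bid: "cheap flights"
    }

    # Ordinary inverted index: term -> set of ad ids bidding on that term.
    index = defaultdict(set)
    for ad_id, terms in ads.items():
        for term in terms:
            index[term].add(ad_id)

    def broad_match_naive(query_terms):
        query = set(query_terms)
        # Step 1: retrieve every ad that bids on ANY query term (the expensive part).
        candidates = set()
        for term in query:
            candidates |= index.get(term, set())
        # Step 2: keep only ads whose entire bid phrase is contained in the query.
        return sorted(a for a in candidates if ads[a] <= query)

    # Ads 2 and 3 are retrieved via "books" and "cheap" only to be filtered out.
    print(broad_match_naive(["cheap", "used", "books"]))  # -> [1]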
The clever and nicely simple idea in this paper is to use past data on the frequency of keywords to drop popular terms from the index whenever we possibly can. For example, if an advertiser bids for "the great wall of china", we can index the ad only under the lowest frequency word (here, "china") and not index it under any of the other words. Then, on a search for [the awesomeness], we no longer have to retrieve and then filter that ad.
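Here is a rough sketch of that idea (again in Python, with made-up frequency numbers): each bid is indexed only under its least frequent term, and the filter step still checks the full bid, so common words stop pulling the ad in without changing which queries match:

    from collections import defaultdict

    # Made-up historical query frequencies for each term.
    freq = {"the": 1_000_000, "of": 900_000, "great": 50_000, "wall": 20_000, "china": 5_000}

    bids = {10: {"the", "great", "wall", "of", "china"}}

    # Index each ad only under the least frequent term in its bid.
    index = defaultdict(set)
    for ad_id, terms in bids.items():
        rarest = min(terms, key=lambda t: freq.get(t, 0))
        index[rarest].add(ad_id)  # ad 10 is indexed only under "china"

    def broad_match(query_terms):
        query = set(query_terms)
        candidates = set()
        for term in query:
            candidates |= index.get(term, set())
        # The filter still checks the entire bid, so pruning never causes false matches.
        return sorted(a for a in candidates if bids[a] <= query)

    # A search for [the awesomeness] never touches ad 10, even though it contains "the".
    print(broad_match(["the", "awesomeness"]))  # -> []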
The authors then extend this to deal with bids where all the keywords are popular (e.g. "the thing"). In the paper, they discuss a model that attempts to estimate the cost of throwing multi-word combinations into the index (e.g. indexing the pair "the thing") versus doing the query for each word separately.
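The paper has its own cost model; the sketch below is only meant to show the shape of the trade-off, with invented cost constants and frequencies rather than anything taken from the paper:

    # Assumed unit costs (not from the paper): the query-time cost of retrieving
    # and filtering one candidate ad, and the amortized cost of one extra posting.
    RETRIEVE_COST = 1.0
    INDEX_COST = 50.0

    def should_index_pair(freq_rarest_word, freq_pair):
        # Option A: index under the single rarest word; the ad is retrieved (and
        # usually filtered out) on every query containing that word.
        cost_single = freq_rarest_word * RETRIEVE_COST
        # Option B: add a composite key for the word pair; the ad is retrieved only
        # when both words appear together, but the index grows by one posting.
        cost_pair = freq_pair * RETRIEVE_COST + INDEX_COST
        return cost_pair < cost_single

    # Made-up daily frequencies: "thing" in ~100k queries, "the thing" together in ~2k.
    print(should_index_pair(freq_rarest_word=100_000, freq_pair=2_000))  # -> True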
The end result is an order of magnitude improvement in performance. Indexing everything wastefully pulls in many irrelevant ad candidates that then have to be filtered out. In this case as well as others, it very much pays off to be careful about what you index.
Please see also my earlier post, "Caching, index pruning, and the query stream", that discusses a SIGIR 2008 paper out of Yahoo Research that explores some vaguely related ideas on index pruning for core search.
1 comment:
The record linkage literature is full of blocking strategies like the first one you described (must match on a low DF word).
If you need to match each word, then search engines will often do intersections starting with lowest DF terms so searches fail fast.
Google's actually slow if you give it half a dozen medium to low frequency words that show up in only a few documents together. For instance, I just tried [frappe moka banquette pomerol], which had 42 total hits and took 0.6s.
Nutch, for instance, has a built-in option to do phrase search for common stop words for just the reasons this post mentions (so it's no surprise Yahoo! is thinking about it, because Doug Cutting, lead developer for Nutch, works there).