Tuesday, August 05, 2008

Caching, index pruning, and the query stream

A SIGIR 2008 paper out of Yahoo Research, "ResIn: A Combination of Results Caching and Index Pruning for High-performance Web Search Engines" (ACM page), looks at how performance optimizations in a search engine can impact each other. In particular, it examines how caching the results of search queries changes the query load that must be served from the search index, and therefore the effectiveness of index pruning, which attempts to serve some queries out of a reduced index.

An extended excerpt:
Typically, search engines use caching to store previously computed query results, or to speed up the access to posting lists of popular terms.

[A] results cache is a fixed-capacity temporary storage of previously computed top-k query results. It returns the stored top-k results if a query is a hit or reports a miss otherwise.

Results caching is an attractive technique because there are efficient implementations, it enables good hit rates, and it can process queries in constant time. ... One important issue with caches of query results is that their hit rates are bounded ... [Due to] the large fraction of infrequent and singleton queries, even very large caches cannot achieve hit rates beyond 50-70%, independent of the cache size.

To overcome this limitation, a system can make use of posting list caching and/or employ a pruned version of the index, which is typically much smaller than the full index and therefore requires fewer resources to be implemented. Without affecting the quality of query results, such a static pruned index is capable of processing a certain fraction of queries, thus further decreasing the query rate that reaches the main index.

[A] pruned index is a smaller version of the main index, which is stored on a separate set of servers. Such a static pruned index resolves a query and returns the response that is equivalent to what the full index would produce or reports a miss otherwise. We consider pruned index organizations that contain either shorter lists (document pruning), fewer lists (term pruning), or combine both techniques. Thus, the pruned index is typically much smaller than the main index.

The original stream of queries contains a large fraction of non-discriminative queries that usually consist of few frequent terms (e.g., navigational queries). Since frequent terms tend to be popular, those queries are likely to repeat in the query log and therefore are typical hits in the results cache. At the same time, these queries would be good candidates to be processed with the document pruned index due to their large result set size. Therefore, document pruning does not perform well anymore when the results cache is included in the architecture.

We observed that the results cache significantly changes the characteristics of the query stream: the queries that are misses in the results cache match approximately two orders of magnitude fewer documents on average than the original queries. However, the results cache has little effect on the distribution of query terms.

These observations have important implications for implementations of index pruning: the results cache only slightly affects the performance of term pruning, while document pruning becomes less effective, because it targets the same queries that are already handled by the results cache.
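
Pulling the excerpt's pieces together, here is a minimal sketch (in Python, and not the paper's implementation) of the tiered serving path it describes: a fixed-capacity results cache in front of a pruned index, with the full index as the fallback. The class, the LRU eviction policy, and the try_answer/answer method names are illustrative assumptions of mine, not the ResIn interfaces.

from collections import OrderedDict

class ResultsCache:
    """Fixed-capacity store of previously computed top-k results (LRU eviction assumed)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # query string -> top-k result list

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)  # refresh recency on a hit
            return self.entries[query]
        return None  # miss

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

def serve(query, cache, pruned_index, full_index):
    # 1. Results cache: a repeated query is answered in constant time.
    cached = cache.get(query)
    if cached is not None:
        return cached
    # 2. Pruned index: answers only when it can guarantee the same top-k results
    #    as the full index (via document and/or term pruning); otherwise it misses.
    results = pruned_index.try_answer(query)  # hypothetical interface
    if results is None:
        # 3. Full index: the fallback for everything the first two tiers miss.
        results = full_index.answer(query)
    cache.put(query, results)  # so repeats of this query become cache hits
    return results

The property worth noting is that both upper tiers are transparent: the cache only answers exact repeats, and the pruned index only answers when its result would match the full index, so result quality is unchanged and only the load reaching the main index shifts.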

It is an excellent point that one optimization can impact another, so analyzing performance and configuration twiddles in isolation from the other optimizations is likely to give incorrect results. The results in this paper could be problematic for past work on index pruning, since they indicate that the query streams used in those evaluations may not be representative of the stream a pruned index actually sees once a results cache sits in front of it.
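
To make that concrete, here is a small sketch of how such an evaluation might be set up: replay the query log through a results cache first, and measure any pruning strategy against the misses rather than the raw log. The log format, the cache size, and the can_answer predicate standing in for a pruning strategy are all illustrative assumptions, not the paper's experimental setup.

from collections import OrderedDict

def miss_stream(query_log, cache_capacity):
    """Replay a query log through an LRU results cache and yield only the misses."""
    cache = OrderedDict()  # query -> placeholder for its cached top-k results
    for query in query_log:
        if query in cache:
            cache.move_to_end(query)       # hit: served from the results cache
            continue
        yield query                        # miss: this is what reaches the index tier
        cache[query] = None                # cache the (placeholder) results
        if len(cache) > cache_capacity:
            cache.popitem(last=False)      # evict the least recently used query

def pruning_coverage(queries, can_answer):
    """Fraction of queries a pruned index could answer, per some pruning strategy."""
    queries = list(queries)
    if not queries:
        return 0.0
    return sum(1 for q in queries if can_answer(q)) / len(queries)

# Comparing pruning_coverage(query_log, can_answer) against
# pruning_coverage(miss_stream(query_log, cache_capacity), can_answer)
# shows whether a pruning strategy still helps once the cache has absorbed
# the frequent, easy queries: the gap the paper reports for document pruning.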
