Sunday, July 15, 2007

Caching and search engine performance

Ricardo Baeza-Yates and others from Yahoo Research have a paper at the upcoming SIGIR 2007 conference, "The Impact of Caching on Search Engines" (PDF).

Some excerpts:
[Caching] is either off-line (static) or online (dynamic). A static cache is based on historical information and is periodically updated. A dynamic cache replaces entries according to the sequence of requests.

There are [also] two possible ways to use a cache memory: caching answers ... [and] caching terms ... Returning an answer to a query ... in the cache is more efficient than computing the answer using cached posting lists ... [but] previously unseen queries occur more often than previously unseen terms, implying a higher miss rate for cached answers.

Our results show ... caching terms is more effective with respect to miss rate, achieving values as low as 12%. We also propose a new algorithm for static caching of posting lists that outperforms previous static caching algorithms as well as dynamic algorithms such as LRU and LFU, obtaining hit rate values that are over 10% higher compared to these strategies.
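To make the static versus dynamic distinction concrete, here is a minimal sketch (this is not the paper's algorithm; the query stream, cache sizes, and frequency-based fill policy are invented for illustration) of a static cache pre-filled from a historical query log next to a dynamic LRU cache:

```python
import random
from collections import Counter, OrderedDict

class LRUCache:
    """Dynamic cache: evicts the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        return False

    def insert(self, key):
        if key not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[key] = True

class StaticCache:
    """Static cache: pre-filled with the most frequent keys from a
    historical log, never updated at query time."""
    def __init__(self, capacity, historical_log):
        top = Counter(historical_log).most_common(capacity)
        self.entries = {key for key, _ in top}

    def lookup(self, key):
        return key in self.entries

def miss_rate(cache, stream, dynamic=False):
    misses = 0
    for q in stream:
        if not cache.lookup(q):
            misses += 1
            if dynamic:
                cache.insert(q)
    return misses / len(stream)

if __name__ == "__main__":
    # Heavy-tailed synthetic query stream; real query logs differ.
    random.seed(0)
    log = [f"q{int(random.paretovariate(1.0))}" for _ in range(20000)]
    train, live = log[:10000], log[10000:]
    static = StaticCache(100, train)
    lru = LRUCache(100)
    print("static miss rate:", miss_rate(static, live))
    print("LRU miss rate:   ", miss_rate(lru, live, dynamic=True))
```

Which approach wins in a sketch like this depends mostly on how stable the query distribution is between the training log and the live stream, which is exactly the bet static caching makes.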
For me, the most interesting result in the paper is the apparent viability of static caching. Offhand, I would not have expected static caching -- merely pre-caching items based on old query log data -- to be as effective as they claim it would be.

Frankly, if I were given this particular caching problem, my inclination would be to immediately leap to an implementation using dynamic caching. This paper argues that would be foolish. If you are like me, perhaps there is a general lesson in this, that we should not be so quick to dismiss static caching.

I think there is also another question to ponder that is not directly addressed in this paper. What should the size of a caching layer be?

In the paper, very large cache sizes are considered, as much as 100% of the index size. However, the paper seems to treat the decision about the resources (the number of machines) allocated to the caching layer as independent of the number of machines allocated to the index layer.

What is more realistic is a situation where I have N machines and X of them can be allocated to the index layer and (N-X) of them can be allocated to the caching layer.

Why does this matter? The more machines I allocate to the index layer, the faster the index layer becomes, because more index shards fit in memory. The number of machines allocated to caching cannot be considered in isolation of the number of machines allocated to the index shards. Shifting machines from one to the other changes the performance characteristics of both.
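As a back-of-the-envelope illustration of that tradeoff, here is a toy model (all the constants -- hit rates, lookup costs, shard and memory sizes -- are made up, not measured from any real system) that picks the best split of N machines between the caching layer and the index layer:

```python
def avg_query_time(n_total, n_index, total_shard_gb, ram_per_machine_gb):
    """Estimated average query time (ms) when n_index of n_total machines
    serve the index and the rest serve the cache. All numbers invented."""
    n_cache = n_total - n_index
    # More cache machines -> higher hit rate, capped (assumed curve).
    hit_rate = min(0.7, 0.1 * n_cache)
    # More index machines -> larger fraction of shards fit in RAM.
    in_memory = min(1.0, n_index * ram_per_machine_gb / total_shard_gb)
    # Assumed costs: 5 ms for an in-memory shard probe, 50 ms for disk.
    index_time_ms = 5 * in_memory + 50 * (1 - in_memory)
    cache_time_ms = 1  # assumed cache lookup cost
    # A hit pays only the cache lookup; a miss pays cache plus index.
    return (hit_rate * cache_time_ms
            + (1 - hit_rate) * (cache_time_ms + index_time_ms))

if __name__ == "__main__":
    # 16 machines, a 256 GB index, 32 GB of RAM per machine.
    best = min(range(1, 16), key=lambda x: avg_query_time(16, x, 256, 32))
    print("best index-machine count out of 16:", best)
```

Even in a crude model like this, the optimum shifts sharply once enough index machines are added that the shards fit entirely in memory, which is the point of considering the two layers jointly rather than in isolation.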

Expanding this out from search engines, the same question applies to database layers. Often, the reaction to a slow database is to add layers of caching. In some cases, these caching layers grow to where the number of machines devoted to caching exceeds the number of machines in the database layer.

At that point, the question has to be asked, would those machines be better in the database layer? How much faster could the database be if each of the database shards could be kept entirely in memory? Would most or even all of that increasingly convoluted caching layer suddenly become unnecessary?

Going back to the paper, I think it is a worthwhile read, not only for the research work reported, but also for the thoughts it provokes. Caching may seem simple, but getting it right is not. There are more questions to ponder here than it might at first appear.

Update: Chad Walters (former architect on Yahoo web search, now at Powerset) posts some interesting thoughts on this paper and some of the practical issues that he has seen come up when considering static versus dynamic caching.


Anonymous said...

Interesting post. This was particularly relevant and timely for me as we are currently working on some performance improvements for the opportunity search caching system used on PartnerUp.

Our opportunity search is somewhat different from a standard search because it is based on more of a dimensional search methodology. So, we have been working on ways that we can cache a "base" layer of results, which can then be pulled into the complete dimensional results specific to each user's search.

The article has given me a few fresh ideas that I'll be thinking about.

Avi Rappoport said...

This fits the long tail scenario -- if you can cache your short head, you'll see a significant improvement in responsiveness. Who wants to go query for Britney Spears every single time?

Thanks for this post, very useful (I'm linking to you).