Thursday, December 09, 2010

Papers on specialized databases at Google

Googlers have published two papers recently at academic conferences detailing new specialized databases that are heavily used within Google.

The first paper is a USENIX OSDI 2010 paper describing Percolator. Percolator is the database powering Caffeine, Google's new system for providing fresher search results by adding new documents and document updates to its search index in near real-time.

The Percolator paper is titled "Large-scale Incremental Processing Using Distributed Transactions and Notifications" (PDF). An excerpt:
We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

Percolator is built on top of the Bigtable distributed storage system .... Percolator was built specifically for incremental processing and is not intended to supplant existing solutions for most data processing tasks. Computations where the result can't be broken down into small updates (sorting a file, for example) are better handled by MapReduce. Also, the computation should have strong consistency requirements; otherwise, Bigtable is sufficient. Finally, the computation should be very large in some dimension (total data size, CPU required for transformation, etc.); smaller computations not suited to MapReduce or Bigtable can be handled by traditional DBMSs.
Percolator is a specialized database that adds new consistency guarantees (as well as triggers, which they call "observers") to Bigtable. One thing that is interesting is how specialized this system is. Percolator, for example, is in no way intended for online operations and can, in some cases, delay a transaction for tens of seconds due to stray locks. But, that is fine for the near real-time search index update task for which it is designed.
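To make the observer idea concrete, here is a minimal sketch in Python of how a write can leave behind a notification that a registered observer later picks up. The in-memory table, column names, and functions are hypothetical stand-ins for illustration, not Percolator's or Bigtable's actual API, and the real system runs each observer inside a distributed transaction.

```python
# Toy sketch of Percolator-style observers: a write leaves a "notify"
# marker on the row, and a background pass later runs any registered
# observer over the notified rows, then clears the marker.
# All names here are hypothetical; this is not the real API.

table = {}        # row key -> {column: value}
observers = []    # list of (observed column, callback)

def observe(column, callback):
    """Register a callback to run when `column` of any row changes."""
    observers.append((column, callback))

def write(row, column, value):
    """Write a cell and leave a notification for observers."""
    cell = table.setdefault(row, {})
    cell[column] = value
    cell["_notify:" + column] = True   # pending notification

def run_observers():
    """Find pending notifications, run observers, and clear the markers."""
    for row, cell in table.items():
        for column, callback in observers:
            if cell.pop("_notify:" + column, None):
                callback(row, cell)    # Percolator would wrap this in a transaction

# Example: re-index a document whenever its contents change.
def reindex(row, cell):
    print("re-indexing", row, "->", cell["contents"][:25])

observe("contents", reindex)
write("http://example.com/", "contents", "<html>new page body</html>")
run_observers()
```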

The second paper is a VLDB 2010 paper on Dremel. Dremel is a column-store database designed to be orders of magnitude faster than MapReduce for some interactive queries.

This paper is titled "Dremel: Interactive Analysis of Web-Scale Datasets" (PDF). An excerpt:
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing.

Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce (MR) jobs, but at a fraction of the execution time. Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations .... Dremel provides a high-level, SQL-like language to express ad hoc queries. In contrast to layers such as Pig and Hive, it executes queries natively without translating them into MR jobs.
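To see why the columnar layout pays off, here is a minimal sketch in plain Python (made-up data and field names, and ignoring Dremel's handling of nested and repeated fields) contrasting a row-at-a-time aggregation with one that touches only the single column it needs:

```python
# Toy illustration of row storage vs. column storage for an aggregation.
# Data and field names are invented; Dremel's real format also handles
# nested/repeated fields, which this sketch ignores.

rows = [
    {"url": "a.com", "clicks": 10, "country": "US", "title": "..."},
    {"url": "b.com", "clicks": 3,  "country": "DE", "title": "..."},
    {"url": "c.com", "clicks": 7,  "country": "US", "title": "..."},
]

# Row store: an aggregate over one field still walks every whole record.
total_row_store = sum(r["clicks"] for r in rows)

# Column store: the same data laid out one column per list, so the
# aggregate reads only the column it actually uses.
columns = {
    "url":     [r["url"] for r in rows],
    "clicks":  [r["clicks"] for r in rows],
    "country": [r["country"] for r in rows],
    "title":   [r["title"] for r in rows],
}
total_column_store = sum(columns["clicks"])

assert total_row_store == total_column_store == 20
```

The row-store version has to walk every full record even though it only uses one field; the column-store version reads just the clicks column, which is the main reason a query touching only a few fields of wide records can be so much cheaper.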
The paper includes some fun motivational examples describing how people use Dremel for rapid prototyping of new ideas. There is a huge advantage in spending just seconds rather than hours to examine the potential of a new feature for a classifier or a new signal for relevance rank. Dremel lets Googlers twiddle a feature several times in just a few minutes to optimize it, then kick off a big, multi-hour MapReduce job to produce the final data, while rivals might take days to do the same investigation.

Dremel, like all column stores, works best when selecting just a few columns from the data, but that is a very common case and well worth optimizing for. One fun aside briefly mentioned in the paper is that they see another order of magnitude or two of speedup if they can stop after looking at only 98% or so of the data, because waiting for straggler chunks causes big slowdowns. So, if you are willing to accept (usually only slightly) inaccurate results, you can get huge boosts in speed by stopping queries early. That is also not new, but again a very common case and worth thinking about.
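As a rough illustration of the early-termination idea, here is a toy simulation in Python; the chunk timings, the 98% threshold, and the aggregation itself are all made up for illustration and say nothing about how Dremel actually schedules work:

```python
# Toy simulation of stopping an aggregation early instead of waiting
# for straggler chunks. All numbers are invented for illustration.
import random

random.seed(0)

# 100 chunks, each with a partial sum and a latency; the last two are
# slow stragglers.
chunks = [{"partial_sum": random.randint(90, 110),
           "latency": random.uniform(0.1, 0.5)} for _ in range(100)]
chunks[-2:] = [{"partial_sum": 100, "latency": 30.0},
               {"partial_sum": 100, "latency": 45.0}]

by_arrival = sorted(chunks, key=lambda c: c["latency"])

def aggregate(fraction):
    """Sum the first `fraction` of chunks to arrive; return (total, wait time)."""
    taken = by_arrival[:int(len(by_arrival) * fraction)]
    return sum(c["partial_sum"] for c in taken), max(c["latency"] for c in taken)

full_total, full_time = aggregate(1.00)
early_total, early_time = aggregate(0.98)
print(f"full scan: total={full_total}, waited {full_time:.1f}s")
print(f"98% scan : total={early_total}, waited {early_time:.1f}s "
      f"({100.0 * early_total / full_total:.1f}% of the full total)")
```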

In both of these papers, what I find so remarkable is how willing Google is to build specialized databases to speed up tasks. Percolator can only be used for tasks that have huge data, need strong consistency, and can tolerate occasional latencies of tens of seconds, but that is perfect for near real-time search index updates. Dremel is only useful when selecting a few columns of data, but it is extremely common for a MapReduce job to ignore almost all of the data in each record. It reminds me of the specialized search index structures and kernel twiddles Google does, which are other interesting examples of the lengths to which Google will go to maximize the usefulness of its internal tools and the performance of its systems.

Wednesday, December 01, 2010

Groupon is not Googly

It is being widely reported that Google is bidding as much as $6B for Groupon. From an article in the New York Times:
Google has offered Groupon $5.3 billion, with the promise of $700 million in performance bonuses for management.

It could ... give Google access to a large sales force ... Groupon has 3,100 employees ... [and] 35 million [Groupon] subscribers worldwide, with 17 million in North America.

11 months [ago] ... Groupon ... [had only] 200 employees ... [Now] about 1,000 people work in the Chicago office and some 2,000 more are spread across its sprawling worldwide network, which includes the employees of its recent international acquisitions, ClanDescuento and Citydeal.de — group-buying sites in Chile and Germany. According to Groupon, the company is adding more than 200 employees a month.
It is unclear to me why Google is so interested in this company, so much so that it is willing to pay nearly twice what it paid for DoubleClick.

Google helps people find information. Its mission is "to organize the world's information and make it universally accessible and useful." In advertising, Google helps people find millions of interesting small businesses and sort through billions of products to get what they need.

Groupon is very different. Groupon runs one deal per day per market. There is no technology at Groupon, no information finding, no long tail, no massive data stream. What Groupon does have is a large (but very recently hired) sales force, significant revenue, and a decent subscriber base. Those are the kinds of things that look good to MBAs building spreadsheets and empires, but they are not deeply aligned with Google's mission of organizing information and making it useful.

It looks like this is a done deal at this point. But Groupon is not Googly. I fear these cultures are not going to blend well when mashed together. I wish these two luck, but this looks to be more like the kind of union where the bride walks away with the house and the dog in a few years than one where both sides thrive.

Update: It appears it was not a done deal after all.