Thursday, December 06, 2012

Google and the right database for the job

I finally got a chance to read "Processing a Trillion Cells per Mouse Click", a paper out of Google presented at the recent VLDB 2012 conference.

It describes the rather cool PowerDrill column-oriented database at Google that is optimized for speed, 10-100x faster than other column-oriented databases and several orders of magnitude faster than MapReduce/Hadoop. But, of course, there are tradeoffs to get those speed gains, and the tradeoff PowerDrill makes is that it keeps as much of the data as possible in memory, so it can only hold a fraction of the data the other systems can.

What is so interesting about this, and what other companies need to learn from this, is the way Google builds so many different databases to analyze its massive log data. The goal is to let people find things in the logs as fast as possible. That means you need many tools, each the right tool for its job.

Hadoop and similar systems allow you to scan massive amounts of log data but, c'mon, all of us know that the vast majority of Hadoop jobs ignore almost all of the data. Every one of these jobs starts by selecting out a couple of the columns, the same columns almost everyone else wants, and dropping everything else. Fire up your job, waste hours waiting for almost all the data from a full table scan to be thrown away, and finally you get the result.
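
To make that concrete, here is a toy sketch in Python of what the first step of almost every one of these jobs looks like. The tab-separated log format and column positions are invented, and a real job would run as a Hadoop mapper over HDFS rather than reading stdin:

    import sys

    # Toy sketch of the first step of a typical log-processing job.
    # The tab-separated format and column positions are invented.
    def map_log_line(line):
        fields = line.rstrip('\n').split('\t')
        # Keep, say, 2 of 50 columns; everything else is discarded.
        return fields[0], fields[3]

    if __name__ == '__main__':
        # Still a full scan: every byte of every row gets read off disk
        # just to throw almost all of it away.
        for line in sys.stdin:
            timestamp, query = map_log_line(line)
            print(timestamp + '\t' + query)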

Dremel and other column-oriented databases help a lot with this. Almost all log processing jobs only want a couple of columns, and a column-oriented database is designed to read just those columns quickly without touching the rest, so it's going to be a lot faster.
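
The difference is easy to picture with a toy example (invented schema, nothing like a real column-store's on-disk format):

    # Row-oriented: one record per row; pulling out a single column still
    # means deserializing every full row.
    rows = [
        {'time': 1, 'query': 'vldb', 'country': 'US', 'latency_ms': 120},
        {'time': 2, 'query': 'paper', 'country': 'DE', 'latency_ms': 95},
    ]
    latencies = [r['latency_ms'] for r in rows]

    # Column-oriented: one contiguous array per column; the same query
    # reads only the column it needs (and similar values stored together
    # also compress much better).
    columns = {
        'time': [1, 2],
        'query': ['vldb', 'paper'],
        'country': ['US', 'DE'],
        'latency_ms': [120, 95],
    }
    latencies = columns['latency_ms']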

PowerDrill goes a step further. If almost all log processing jobs only want the most recent logs and only a few of the columns, just create a database with only the most recent logs and a few of the columns. Add in a lot of carefully designed compression, sharding across a medium-sized cluster, and the ability to skip over much of the data when it isn't needed (instead of doing full table scans all the time), and you've got yourself the ability to answer most questions people ask of the logs in seconds, not hours.
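
That data skipping is worth a sketch. Per the paper, PowerDrill splits each table into chunks and keeps small per-chunk dictionaries of the distinct values in each column, so a query's filter can rule out whole chunks without ever scanning their rows. A stripped-down illustration (one column, invented data):

    # Simplified sketch of PowerDrill-style chunk skipping. Each chunk
    # carries a tiny dictionary of the distinct values it contains.
    chunks = [
        {'dict': {'US', 'CA'}, 'rows': [('US', 120), ('CA', 95)]},
        {'dict': {'DE', 'FR'}, 'rows': [('DE', 80), ('FR', 110)]},
    ]

    def count_matching(chunks, country):
        total = 0
        for chunk in chunks:
            # Consult the dictionary before touching any rows; chunks
            # that can't possibly match are skipped entirely.
            if country not in chunk['dict']:
                continue
            total += sum(1 for c, _ in chunk['rows'] if c == country)
        return total

    print(count_matching(chunks, 'DE'))  # only the second chunk is scanned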

And that's the point. Build a system that can answer 90% of the questions people ask of the logs in seconds. Build another that can answer 90% of the remaining, harder questions people ask of the logs in minutes. Then have a system that primarily archives all the logs, but also can answer, given enough time and power, the much more complicated questions people very rarely ask.
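
The paper doesn't describe how queries get routed across systems, but a toy dispatcher for that kind of tiering might look like this (all names, rules, and time budgets invented):

    # Each tier handles the queries it can; anything it can't handle
    # falls through to a slower but more complete tier.
    class Tier:
        def __init__(self, name, can_answer):
            self.name = name
            self.can_answer = can_answer

        def run(self, query):
            return f'{self.name} answered {query!r}'

    tiers = [
        Tier('in-memory store, seconds',     # recent logs, common columns
             lambda q: q['days_back'] <= 30 and q['columns'] <= {'time', 'query'}),
        Tier('column store, minutes',        # full history, more columns
             lambda q: q['columns'] <= {'time', 'query', 'country', 'latency_ms'}),
        Tier('batch archive, hours',         # everything, eventually
             lambda q: True),
    ]

    def answer(query):
        for tier in tiers:
            if tier.can_answer(query):
                return tier.run(query)

    # A common query over recent logs hits the fast tier:
    print(answer({'days_back': 7, 'columns': {'time', 'query'}}))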

Those Google guys have many databases for asking questions of their logs. Maybe you should too.

Some excerpts from the PowerDrill paper:
The column-store developed as part of PowerDrill is tailored to support a few selected datasets and tuned for speed ... Our column-store relies on having as much data in memory as possible ... PowerDrill can run interactive single queries over more rows than Dremel, however the total amount of data it can serve is much smaller.

Consider a typical use case such as triggering 20 SQL queries with a single click in the UI. In our production system on average these queries process 782 billion cells in 30-40 seconds (under 2 seconds per query) .... Each month it is used by more than 800 users sending out about 4 million SQL queries ... scanning [the equivalent of] 525 trillion cells .... One of our top users ... [in] 6 hours ... [executed about] 12 thousand queries .... Our production system is running on well over 1000 machines, the distributed servers altogether using over 4T of main memory.

[PowerDrill] pushes the "interactivity limit" out significantly ... The majority of queries are fairly discriminative, similar, and uniform ... The store has only a few but often explored tables (as opposed to many tables that are not used very often) ... [For many common queries] our techniques push the limit of interactivity out by one or two orders of magnitude.

1 comment:

Greg Linden said...

If you like this, I recently did a short Google+ post about another Google database that might be of interest. The database is called Spanner, is used instead of sharded MySQL for Google's advertising backend, and has some unusual and interesting work on very high quality time synchronization across the servers (and what having that lets them do that others can't).