Thursday, August 31, 2006

Google Bigtable paper

Google has just posted a paper they are presenting at the upcoming OSDI 2006 conference, "Bigtable: A Distributed Storage System for Structured Data".

Bigtable is a massive, clustered, robust, distributed database system that is custom built to support many products at Google. From the paper:
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
The paper is quite detailed in its description of the system, APIs, performance, and challenges.

On the challenges, I found this description of some of the real world issues faced particularly interesting:
One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols.

For example, we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance.
Make sure also to read the related work section that compares Bigtable to other distributed database systems.

See also my previous posts, "Google's BigTable", "C-Store and Google BigTable", and "I want a big, virtual database".


Anonymous said...

Dang, and just last night I was reverse engineering out the slides from the BigTable video via screen snapshots.

Ah well, probably still some value in the visuals alongside the paper (which looks pretty detailed).


Greg Linden said...

Hey, Nick. Chubby is described in the paper. It is a distributed lock manager. From the paper:

Bigtable relies on a highly-available and persistent distributed lock service called Chubby [8]. A Chubby service consists of five active replicas, one of which is selected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm [9, 23] to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent
caching of Chubby files. Each Chubby client maintains a session with a Chubby service. A client's session expires if it is unable to renew its session lease within the lease expiration time. When a client's session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.

Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists.

Anonymous said...

I tried to carefully read this paper, there are a couple of things i still find hard to figure out:

- The difference between 'family column' and 'locality group'.
- What is a SSTable block ? a full SSTable ? or maybe a row ?
- At which level(s) is organized the replication ? at Tablets level (through GFS behavior) ? at BigTable cluster level ? both ? Then how is it managed ? GFS ? other ?
- It seems that Chubby overlaps a bit with GFS, is it distinct of GFS or is implemented on top of it ? I personnaly think that Chubby doesn't rely in anyway with GFS, because it deals with very small files and the distribution scheme seems to be different.

BTW one thing easy to understand through this paper are the use cases of MapReduce, which is mostly used to refine data from 'raw' BigTables to 'refined' Bigtables. The former are mostly internally used as repository without extensive loads, while the latter are serving external queries. One drawback is that the MapReduce jobs are batched and don't reflect the full true current state.

Greg Linden said...

Hi, Sebastien. Those are good questions.

Based on my interpretation of the paper, would say that:

A column family is a semantic grouping of related data. A locality group is a hint to improve performance, an indication of application data access patterns.

An SSTable block appears to be a 64k disk block stored in GFS.

I am a little confused on the replication as well. On the one hand, I thought the replication was automatically handled by GFS. On the other hand, the paper refers repeatedly to load balancing of the tablets by the tablet servers, which suggests that replication is also done at the tablet level. Perhaps both are true.

I think Chubby is independent of GFS. It is described as a "highly-available and persistant distributed lock service" that "consists of five replicas." The BigTable paper does reference an upcoming OSDI 2006 paper on Chubby, but it does not appear to be available yet.

Hope that is helpful!

Anonymous said...

if you haven't done so and if you are on a very huge DB project, you might want to investigate IBM INFORMIX XPS database (or even IDS)
This is VERY fast, not too much dependent on CPU cycles (i.e. scales good on slowisch CPUs like those found on UNIX hosts having upto 128+ CPUs)
Maybe XPS would be sufficient for google but considering license cost, IMHO it is cheaper by a magnitude to develop a thingy like gfs

Forgive my bad English

using INFORMIX technology now for
18+ yrs

darose said...

Re: the question of how they do replication, apparently Jeff Dean provided some additional details in a comment here:

Dragon said...

Thanks for your quick response.

Are GFS cohosted w/ BigTable, or GFS are separated host behind bigtable? Which layer is responsible for tablet replication? It will be lovely to understand their different responsibility.
11:39 AM
Jeff Dean said...

Are GFS cohosted w/ BigTable, or GFS are separated host behind bigtable? Which layer is responsible for tablet replication? It will be lovely to understand their different responsibility.

We don't have a requirement one way or the other, but our typical configuration is that tabletservers run on the same machines as GFS chunkserver processes for the underlying GFS cell. In some cases this allows us to avoid one network transfer for reads and writes (if one of the chunk replicas for an underlying SSTable is stored on the local chunkserver).

We don't allow replication of a tablet in our system, so at any given time, a tablet is being served by a single tabletserver. The master and tabletservers cooperate to make tablet migration decisions to handoff responsibility for a tablet from one tabletserver to another, and also to assign a new tabletserver to serve a tablet when a tabletserver fails.
11:46 AM