Tuesday, September 19, 2006

Chubby: The Google distributed lock manager

Google has released a new OSDI 2006 paper, "The Chubby Lock Service for Loosely-Coupled Distributed Systems".

From the paper:
Chubby is a distributed lock service intended for coarse-grained synchronization of activities within Google's distributed systems.

Chubby has become Google's primary internal name service; it is a common rendezvous mechanism for systems such as MapReduce; the storage systems GFS and Bigtable use Chubby to elect a primary from redundant replicas; and it is a standard repository for files that require high availability, such as access control lists.
Chubby is a relatively heavy-weight system intended for coarse-grained locks, locks held for "hours or days", not "seconds or less."

The paper talks about many of the practical issues they encountered building this large-scale system. It is a good read.

There is one rather silly thing I cannot resist commenting on. I thought the somewhat negative tone the paper takes toward Google developers was amusing. For example, the paper says at various points:
Our developers sometimes do not plan for high availability in the way one would wish. Often their systems start as prototypes with little load and loose availability guarantees; invariably the code has not been specially structured for use with a consensus protocol. As the service matures and gains clients, availability becomes more important; replication and primary election are then added to an existing design.

Developers are often unable to predict how their services will be used in the future, and how use will grow.

A module written by one team may be reused a year later by another team with disastrous results ... Other developers may be less aware of the cost of an RPC.

Our developers are confused by non-intuitive caching semantics.

Despite attempts at education, our developers regularly write loops that retry indefinitely when a file is not present, or poll a file by opening it and closing it repeatedly when one might expect they would open the file just once.

Developers rarely consider availability. We find that our developers rarely think about failure probabilities.

Developers also fail to appreciate the difference between a service being up, and that service being available to their applications.

Unfortunately, many developers chose to crash their applications on receiving [a failover] event, thus decreasing the availability of their systems substantially.
Well, okay, on the one hand, I am sympathetic with the designers of Chubby when they say that all developers working on at Google need to be aware of the complex reliability issues in distributed systems.

On the other hand, I think developers just want to create a reliable distributed lock for their applications and don't care how it is done. They want the process to be transparent. Give me a friggin' lock already.

From the tone of these criticisms, I suspect Chubby is not doing enough to make that happen for Google developers. I would not be surprised if there were other competing distributed lock systems at Google created by people who find Chubby does not quite meet their needs.

But, in the end, these are the problems any company with many developers working on complex distributed systems will encounter. It is not surprising the geniuses at Google hit them too. But, I have to say, I did find the tone in this paper toward Google developers a little amusing.

See also my previous post, "Google Bigtable paper".

[Found via Dan Creswell]


Anonymous said...

Interesting, I just have a table on one database that I use for doing these types of operations.


Kevin said...

He's not chubby.. he's my cluster!

roberto.cr said...

clash of generations. pedantry vs immediacy. humans.