Wednesday, October 10, 2007

Highly available distributed hash storage from Amazon

Amazon CTO Werner Vogels announces an awesome SOSP 2007 paper he co-authored with nine other people at Amazon, "Dynamo: Amazon's Highly Available Key-value Store" (PDF).

Like the Google File System and Google Bigtable papers, this Amazon paper describes a critical part of Amazon's infrastructure, a reliable, fast, and scalable storage system.

Some excerpts from the paper:
There are many services on Amazon's platform that only need primary-key access to a data store ... [and] using a relational database would lead to inefficiencies and limit scale and availability.

Dynamo uses a synthesis of well known techniques to achieve scalability and availability: Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. Dynamo employs a gossip based distributed failure detection and membership protocol ... Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution.

In the past year, Dynamo has been the underlying storage technology for a number of the core services in Amazon's ecommerce platform.
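
The consistent hashing the excerpt mentions can be sketched in a few lines of Python. This is an illustrative toy under my own assumptions (MD5 as the ring hash, made-up node names and virtual-node count), not Amazon's implementation:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a string to a position on a 128-bit ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes, Dynamo-style."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node gets several positions ("virtual nodes")
        # on the ring to smooth out the load distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def preference_list(self, key, n=3):
        # Walk clockwise from the key's position, collecting the
        # first n distinct nodes: the key's replica set.
        start = bisect.bisect(self._positions, _hash(key)) % len(self._ring)
        result = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result

ring = ConsistentHashRing(["A", "B", "C", "D"])
print(ring.preference_list("cart:12345"))  # three distinct replica holders
```

The payoff of the ring layout is the last property the excerpt names: adding or removing a node only remaps the keys adjacent to its ring positions, with no manual repartitioning.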

The paper overall is fascinating, not only for its discussion of Amazon's needs and how the authors thought about them when building Dynamo, but also for its critical discussion of other distributed databases. The related work section (Section 3) is worth its weight in gold.

A paper on an Amazon distributed database immediately invites comparison with the Google File System and Bigtable. Unlike Google FS (and unlike distributed databases like C-Store), Amazon's Dynamo is, oddly, optimized for writes. From the paper:
Many traditional data stores execute conflict resolution during writes and keep the read complexity simple.

Dynamo targets the design space of ... a data store that is highly available for writes ... We push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected.
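
Pushing conflict resolution to reads is done with the object versioning mentioned in the earlier excerpt: each version carries a vector clock, a read discards versions that are causally dominated, and any remaining concurrent siblings are handed to the application to reconcile. A minimal sketch in Python (the union-merge for cart items is my illustrative assumption, not code from the paper):

```python
def descends(a: dict, b: dict) -> bool:
    # True if vector clock a causally descends from (or equals) b.
    return all(a.get(node, 0) >= count for node, count in b.items())

def read_resolve(versions):
    """Keep only versions not dominated by another version; the
    survivors are concurrent siblings the client must reconcile."""
    return [
        (clock, value) for clock, value in versions
        if not any(other != clock and descends(other, clock)
                   for other, _ in versions)
    ]

# Three stored versions of one shopping cart:
versions = [
    ({"A": 2}, {"book"}),          # newest write seen through node A
    ({"A": 1}, {"book"}),          # stale ancestor, will be dropped
    ({"B": 1}, {"dvd"}),           # concurrent write through node B
]
siblings = read_resolve(versions)  # two concurrent siblings remain
cart = set().union(*(items for _, items in siblings))
print(cart)                        # union of the sibling carts
```

The write path never has to reject anything; the cost is that reads can surface multiple siblings, which is exactly the complexity the quote refers to.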

This choice surprises me and does appear to lead to some problems later. The 15ms average read latency seems high if a service needs to make more than a dozen database queries when serving a web page; latencies like that make this a fairly heavyweight data store service.

As described in the paper, at least a few groups at Amazon needed a much lighter weight service that could be hit thousands of times, so they had to use rather extreme parameters (requiring all replicas to be available for writes, but only one for reads) to force Dynamo to work for them. With those parameters, they effectively turned off the high availability for writes and pushed the complexity of conflict resolution back away from reads, which makes me wonder whether write optimization really should have been part of Dynamo's design goals in the first place.
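
For reference, the knobs being tuned here are Dynamo's (N, R, W) quorum parameters: N replicas per key, W acknowledgements per write, R per read, with R + W > N guaranteeing that read and write quorums overlap. A quick sketch of the arithmetic (the helper function is my own illustration, not from the paper):

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    # With R + W > N, every read quorum intersects every write
    # quorum, so a read contacts at least one up-to-date replica.
    assert 1 <= r <= n and 1 <= w <= n
    return r + w > n

# Common configuration reported in the paper: (N, R, W) = (3, 2, 2).
print(quorums_overlap(3, 2, 2))  # True

# The "extreme" setting described above: W = N means every replica
# must acknowledge a write, R = 1 gives fast reads -- but a single
# failed node now blocks all writes to its keys.
print(quorums_overlap(3, 1, 3))  # True
```

That W = N, R = 1 corner is exactly the configuration that trades away Dynamo's headline feature, write availability, to get read performance back.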

Another difference from the Google File System and Bigtable is that Google's systems organize data as a map, supporting high performance range scans over the data. On the one hand, Google may have more need for that, given its work building search indexes and analyzing massive text and log data. On the other hand, Amazon has massive text and log data too, and Dynamo seems unlikely to be able to help with large scale data indexing and analysis tasks.

Both the lack of range scans and the optimization for writes over reads appear to stem from the authors' focus on the needs of the shopping cart; they repeatedly return to it as a motivating example. It is not clear to me why they chose to focus on that task over their other needs.

I had a couple of other surprises. Dynamo relies on random, uniform distribution of the keys for load balancing -- something that seems likely to run into problems with highly skewed access patterns -- rather than supporting additional replication of frequently accessed data. More seriously, Dynamo is limited to a few hundred nodes because they punted on some of the hard problems of ensuring consistency in metadata (like their routing table) at larger scale.

Overall, a very interesting paper and system from Amazon. I love how Amazon has adapted the motivation of P2P distributed hash tables like Chord and Pastry to an environment with all trusted machines like an Amazon data center, taking advantage of that to reduce latency and improve reliability. I also am impressed by how remarkably configurable Dynamo is -- from the underlying storage to the number of replicas to the means of conflict resolution -- so that it can adapt to the widely varying needs of different groups at Amazon.

By the way, unlike Google, Yahoo, and Microsoft, Amazon publishes academic papers only rarely. They deserve kudos for doing so here. With this paper, Amazon is revealing some of the remarkable challenges in large scale computing they face. As people are attracted to those challenges, perhaps this will be the start of more openness from Amazon.

6 comments:

Anonymous said...

I hope that someone does an open source version of this (I'm not knowledgeable enough, or I'd join).

It would be even better if said project let you tune the params to trade off write avail vs read perf.

Anonymous said...

Regarding publishing papers, you have Microsoft way ahead, then Google and Yahoo, and then Amazon. Microsoft is the only one that has a real research center whose goal is purely to advance computer science.

Google and Yahoo, although they have large research labs, are mostly concerned with applied science and have big disclosure issues.

Your blog is great.

Kevin Merritt said...

Thanks for the summary of Dynamo, Greg. Solving the big, distributed database problem is interesting, challenging and fun. I know you've summarized in the past, but your readers might not be aware that there is an open source version of GFS and BigTable in the form of Hadoop's HDFS and the HBase project. Like BigTable and Vertica, HBase is a column store which optimizes for reads, not writes.

I thought initially that Amazon's design choices were odd. Optimizing for writes especially seems odd for a website that must have a 99% read-to-write ratio, but then it occurred to me that they probably have a tiered architecture that allows them to create a virtually infinite number of read-only nodes with their catalog cached therein. When someone initiates a transaction (starts to buy something or starts to enter a comment or review), they dynamically context switch the session to a DB node that's open for writes. Makes sense.

Finally, it's interesting to note how most of the really large distributed stores rely on the simple key-value pair model. Google proved with MapReduce that you can get a lot of meaningful work done if you can describe your problems in terms of keys and values. The main point for other readers is that key-value pair structures are great if you know that you'll be accessing data through the key, but they dramatically curtail ad hoc querying when you want to explore data through the value side of the pair.

Keep up the great analysis.

Anonymous said...

Re the p2p distributed hash tables aspect, here's a recent talk at Google:

http://video.google.com/videoplay?docid=-7096662377647111009

LH*RSP2P: A Scalable Distributed Data Structure for P2P Environment

xasima said...

2 kevin merritt:

Definitely, they seem to optimize the write operation because of the ability to front any disk-based hash / DB with a number of caches.

I have some thoughts on the cache hierarchy idea, so I think that optimizing your DB for a number of scattered small writes, while providing mature distributed / local caches, is a good solution for any project that lets users continuously change (upload) their data.

I'm afraid that write once / read many (big files in GFS / Bigtable) is a good solution for search, not for evolving data.

Siculars said...

Greg,

I think your analysis is too harsh on this technology. The paper states clearly, in many places, that their approach is what was needed for their environment, and as far as that need was concerned, they addressed it smashingly well.

A major difference between the Dynamo and GFS/Hadoop implementations is the uniformity of Dynamo's servers versus the namenode/datanode split of the latter. The uniformity/symmetry of the servers in the Dynamo approach has a certain elegance in my book. Again, six of one, half dozen of the other; as always, it depends on your application needs. This symmetry is accompanied by a gossip protocol that keeps nodes in sync with their peers. Very cool.

Additionally, the versioning scheme for written data, with its vector clocks and subsequent conflict resolution, is also interesting and worth further analysis.

Overall, the non-blocking write architecture is noteworthy. I would compare this approach with the whitepaper recently released by Vertica, the column oriented datastore from the creator of Postgres. They are also optimising for non-blocking writes.

In the end, what's to stop an organization from employing multiple solutions (both non-blocking write and optimized read systems) for different parts of the overall business? Not much, really.

Although we disagree on some points, I thank you for your analysis.

~Alexander Sicular