Tuesday, February 03, 2009

Details on Yahoo's distributed database

The good folks at Yahoo Research published a paper at VLDB 2008, "PNUTS: Yahoo!'s Hosted Data Serving Platform" (PDF), that has juicy details on the massively distributed database behind most of Yahoo's web applications.

An excerpt:
We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications.

The foremost requirements of web applications are scalability, consistently good response time for geographically dispersed users, and high availability. At the same time, web applications can frequently tolerate relaxed consistency guarantees.

For example, if a user changes an avatar ... little harm is done if the new avatar is not initially visible to one friend .... It is often acceptable to read (slightly) stale data, but occasionally stronger guarantees are required by applications.

PNUTS provides a consistency model that is between the two extremes of general serializability and eventual consistency ... We provide per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order .... The application [can] indicate cases where it can do with some relaxed consistency for higher performance .... [such as reading] a possibly stale version of the record.
Do not miss the related work section of the paper where the authors compare PNUTS to Google Bigtable and Amazon Dynamo, among other things.

When reading the paper, a couple things about PNUTS struck me as surprising:

First, the system is layered on top of the guarantees of a reliable pub-sub message broker which acts "both as our replacement for a redo log and our replication mechanism." I have to wonder if the choice to not build these pieces of the database themselves could lead to missed opportunities for improving performance and efficiency.

Second, as figures 3 and 4 show, the average latency of requests to their database seems quite high, roughly 100 ms. This is high enough that web applications probably would incur too much total latency if they made a few requests serially (e.g. ask for some data, then, depending on what the data looks like, ask for some other data). That seems like a problem.

Please see also my August 2006 post, "Google Bigtable paper", which discusses the distributed database behind many products at Google.

Please see also my earlier post, "Highly available distributed hash store at Amazon", on the distributed database behind some features at Amazon.com.

Please see also my earlier posts, "Cassandra data store at Facebook" and "HBase: A Google Bigtable clone".

Update: One of the developers of PNUTS commented on this post, pointing out that PNUTS performance is much better in practice (1-10ms/request) when caching layers are in place and making a few comparisons to Bigtable.


Unknown said...

Hi Greg -

You might be interested in a fairly long post I wrote on PNUTS a couple of months ago

The paper is interesting, but I'd rather have seen details about how the underlying message broker works to ensure ordered atomic broadcast - it seems that once you have this, the rest of the system is just engineering :)

Anonymous said...

The performance is surprisingly low compared with the published Bigtable numbers (with less powerful machines). Bigtable actually supports row level ACID transactions.

Looks like this is another Cassandra: looks good on paper, not up to snuff in practice.

Utkarsh said...

Disclosure: I am one of the developers on PNUTs.

Thanks for the post. While writing these research papers, we feel our duty to point out the absolute worst case. So the experiments were done with no locality in data access, causing a very poor cache hit rate, and hence higher latencies. If you had such a workload in production, you would provision more machines to spread the load, and hence bring the latency down. We were just trying to show what the limits are.

The system performance is mainly dictated by the quality of disks and the cache hit ratio. In reality cache hit ratios are much higher (of the order of 80%), and the disks are much better with battery-backed cache, RAID, and the works. Under those conditions, we break the 10ms/request barrier no sweat, and even get close to 1ms for fully cached workloads. So yes, it is suitable for web apps (otherwise I would be out of a job :-))

One of the main wins of PNUTS over BigTable is native support for multiple data centers. BigTable stores 3 copies of data in every data center. Yahoo sometimes needs a presence in as many as 10+ data centers. You do the math. Does one really need so many copies?

@Henry Thats a good suggestion, we should publish details about our message broker design. However, I would like to point out that Pnuts only requires that the messaging system deliver total order on messages published on *the same* topic from *the same* data center. This is a substantially easier problem to solve than building a totally-ordered atomic broadcast across data centers (which I agree, is close to impossible).

@Anonymous PNUTs supports row-level transactions too-- they are done through a get followed by test-and-set. This is nothing new: it is called optimistic concurrency control and has been around in database literature for ages, and is also used by BigTable.

Greg Linden said...

Thanks, Utkarsh, for coming by! Very interesting to hear those details.

And congrats for working on both PNUTS and Pig! Very exciting projects!

Unknown said...

@Utkarsh For the latency metrics in the evaluation, I have a simple question. Can you help me to clarify it?
Even if the parameter of the zipf distribution of requested records is 1, which are highly skewed, the average latency is 50ms for ordered table, which is not very impressive considering the good locality here. What's the root cause here? Is it affected by the 10% write requests which may be across data centers?

Utkarsh said...

@Schumi, Yup the average latency is skewed by the 20% of writes going cross-colo, and the lack of a battery-backed cache which forces us to sync to disk on writes.

Anonymous said...

@usriv: in the case of replication, IMO, the simplicity of Bigtable is a major win over PNUT (and Cassandra) -- it let the underlying DFS to handle appends to multiple replicas (e.g., the DFS can have two replicas in different racks in a data center and a third replica in another data center.) Decoupling DFS and DB layer simplifies development/debugging tremendously.

Also, the extra replicas are not always for redundancy/availability purpose only, but higher read throughput. Reread the original GoogleFS paper for details.

There is absolutely no excuse to cite cross data center replication for poor median latency. Using a Paxos compliant replication scheme, you can have local replicas be the majority to improve median latency. The median latency should be under 1ms.

By row level transaction, I meant setting *multiple* columns in a row atomically, which is explicitly not supported by PNUT, according the paper.

The horrible scanning performance also means that there is no chance PNUT can host Y's crawl DB without a major rewrite. It's also not amenable to any kind of analytics work that requires to use map-reduce for aggregation/sorting etc.

It's very disappointing to see later entries in scalable DBs to be much inferior (especially considering wide range of workloads, including scanning) to the current benchmark: Bigtable.

Utkarsh said...

@Anonymous Building over a DFS definitely has a simplicity advantage. But exposing the replication directly to the DB layer does open up optimization opportunities, that would otherwise be hidden if building on top of a DFS.

The latencies reported in the paper were averages and not medians. Hence they were skewed by the fraction of writes in our workload that originated from the non-master data center (non-paxos-majority data center in your terminology).

PNUTs *does* support atomically updating multiple columns in the same row. Sorry, if that was not clear in the paper.

Anonymous said...

My thoughts on PNUTS where I try to interpret their consistency model:


Brian Cooper said...

Hi folks,

I'm another member of the PNUTS group at Yahoo! This has been a very interesting discussion; I'm glad you folks are as interested in this stuff as we are.

Just to reiterate a few points that Utkarsh brought up:

- There's no free lunch for performance, and if you want consistency some writes will have to go cross-datacenter, increasing the average latency. Cross-datacenter communication is required because of our mastership protocol. Consider a user who's record is mastered in California. If she flies to Europe, or a network problem causes her request to be redirected to a datacenter on the East coast, then her write will originate in a non-master datacenter and be forwarded back to the master for that record. In practice this happens 10-20% of the time, just because of the way web requests happen.

Even if you had Paxos, and managed to put the "local" participants in the same datacenter, occasionally a write would originate in the non-"local" datacenter and pay the cross-datacenter latency to find enough members of the quorum. So this cost is really unavoidable.

- You could weaken the consistency, to something like "eventual consistency" (write anywhere and resolve conflicts later) or even "best effort" (write anywhere and don't worry about conflicts) and avoid ever paying the cross-datacenter cost. And in fact it is possible to turn off mastership in PNUTS. But then you need a resolution protocol, and until conflicts are resolved inconsistent copies of the data are visible to readers. So again there is no free lunch.

- Anonymous is write that this system is not as optimized for scans a la MapReduce. In fact, you make different architectural decisions if you are optimizing for low-latency updates to a geographic database than if you are optimizing for scanning for MapReduce within a single datacenter. We have been able to run Hadoop jobs over PNUTS and get performance that is pretty good, just not as good as a native store (HDFS) optimized for MapReduce. So if you want to transactionally update data with very low latency and occasionally run MapReduce, you can use PNUTS; if you want to always run MapReduce but don't need a lot of high performance updates, use HDFS.

- As Utkarsh has said, the hardware used for the paper is not as good as production hardware (e.g. no battery-backed write cache, and other limitations). We hope to publish some numbers from live workloads on the production system soon.