Friday, August 15, 2008

Cassandra data store at Facebook

Avinash Lakshman, Prashant Malik, and Karthik Ranganathan presented a talk at SIGMOD 2008, "Cassandra: Structured Storage System over a P2P Network", that describes a Google Bigtable-like data store used actively by Facebook.

What strikes me as remarkable about the system is its rather casual treatment of writes. As far as I can tell, a write is only sent to memory on one box, not written to disk, not even written to multiple replicas in memory. That seems fine for log or click data, but, for the kind of data Facebook deals with, it seems a little surprising to not see a requirement for multiple replicas to get the write in memory before the app is told that the write succeeded.

The code for Cassandra is open source and there is a wiki that adds a few tidbits to the SIGMOD slides. Note that HBase is also open source and also is modeled after Google's Bigtable; HBase is layered on top of Hadoop and sponsored heavily by Yahoo.

Please see also James Hamilton, who posts a summary of the slides and brief commentary, and Dare Obasanjo, who offers more detailed commentary on Cassandra.

Please see also my post, "Highly available distributed hash storage from Amazon", about Amazon's Dynamo (PDF). There are similarities in the design and references in the wiki that suggest that Facebook's Cassandra was influenced by Amazon's Dynamo.

If you are interested in all things Bigtable, you might also enjoy Phil Bernstein's post, "Google Megastore", that summarizes a Google talk at SIGMOD 2008 on a storage system they built on top of Bigtable that adds transitions and additional indexes, among other things.

Update: Avinash Lakshman swings by in the comments and clarifies that Cassandra does have a commit log, so writes do go to disk on at least one machine immediately. A write first updates the commit log, then updates the memory tables, and, finally, in batch some time later, goes out to all the tables on disk.

11 comments:

Sriram Krishnan said...

My first impression of Cassandra is that it is the love-child of Bigtable and Dynamo. You have Dynamo's gossip protocols and consistent hashing instead of BigTable's table servers. However, unlike Dynamo, where you can tune the number of writers who need to ack a write , Cassandra does seem a bit cavalier about writes.

krow said...

Hi!

Have you looked at http://hypertable.org/ ?

Are you still working from home? Ping me if you want to go get some tea.

Cheers,
-Brian

jeff said...

hey greg,

actually, cassandra has a configurable consistency model for writes. if throughput is a priority, you can dispatch writes asynchronously to the n nodes managing the n replicas. failures will be caught on the next read. you can also execute a blocking write and wait for a majority of the n nodes to return successfully. either way, all writes persist to a redo log in addition to being inserted into the in-memory table structure.

also, if any of the nodes that manage the key or its replicas are down at the time of a write, the system will find another node to take the write, and take note of the downed node. if that node comes up, the data will be passed back to its appropriate owner. this is the "hinted handoff" algorithm detailed in the dynamo paper.

via the redo log and hinted handoff, the system is designed to never lose a write.

in addition, we're working to add support for more stringent consistency models, similar to the work done with megastore.

Greg Linden said...

Hi, Jeff. Thanks for coming by to chat about this!

The part I'm confused about is what happens if you make a write, then the machine goes down before the commit log is written to disk. Is there a window where data loss could occur?

On waiting for a majority of N nodes to return successfully, the wiki (which it appears you wrote) says, "The comments indicate that we wait for a reply from 'X <= N' endpoints, but I don't see this in the code." I took that to mean that the wait for the majority of N nodes was not implemented. Is that incorrect?

Thanks, Jeff!

Jonte said...

Hey Greg - Not sure if you take "requests", but I'd love to hear your opinions and insights on triple stores (in the semantic web sense). You have any experience? Systems like Cassandra, BigTable, or Dynamo seem like the early throes of triple stores - they're only missing one more attribute.

Anonymous said...

How does this compare with Hadoop?

Avinash Lakshman said...

Writes are not cavalier. Every write is logged into a Commit Log at each replica. Only if this write is successful does the local replica update the in-memory copy. This makes sure that in case a server/replica crashes then it can be recovered from the Commit Log.

Greg Linden said...

Hi, Avinash. Thanks for coming by!

Ah, yes, I see in the slides a note that writes do go out to a transaction log immediately before going to memory. Then, later, the disk-based tables are updated in batch, right.

Sorry about the confusion, I'll update the post to clarify.

While you're here, can you answer another question? When are updates to replicas done? That doesn't seem to be clear from the slides or wiki. I'm curious how much of a window there might be for data loss if a machine croaks and doesn't come back up?

Prashant said...

The writes to the replicas are done immediately in an asynchronous manner.
The client has the option of using an API to do blocking inserts which wait for the data to be written to a quorum or a majority of the replicas before returning.

In case of machine failures the data is repaired using techniques like read repair and hinting once the machine comes back up.

Greg Linden said...

Hi, Prashant! Thanks for the clarification! That all sounds like the right way to go.

On the option of waiting for a majority of N replicas to return successfully, the wiki says, "The comments indicate that we wait for a reply from 'X <= N' endpoints, but I don't see this in the code." I took that to mean that the wait for the majority of N nodes was not implemented. Is that incorrect?

Thanks again, Prashant!

Prashant said...

Hi Greg ,

This is implemented and is exposed via the JAVA API , but not all of the thrift API's , for example if you look at batch_insert_blocking , it implements these semantics.

- Prashant