Comments on Geeking with Greg: Cassandra data store at Facebook

2008-08-20T10:52:00.000-07:00

Hi Greg,

This is implemented and is exposed via the Java API, but not through all of the Thrift APIs. For example, if you look at batch_insert_blocking, it implements these semantics.

- Prashant

2008-08-20T07:15:00.000-07:00

Hi, Prashant! Thanks for the clarification! That all sounds like the right way to go.

On the option of waiting for a majority of N replicas to return successfully, the wiki says, "The comments indicate that we wait for a reply from 'X <= N' endpoints, but I don't see this in the code." I took that to mean that the wait for the majority of N nodes was not implemented. Is that incorrect?

Thanks again, Prashant!

2008-08-19T23:50:00.000-07:00

The writes to the replicas are done immediately in an asynchronous manner.
The client has the option of using an API to do blocking inserts which wait for the data to be written to a quorum or a majority of the replicas before returning.
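
As a rough illustration of the blocking-insert idea, here is a minimal Python sketch, not Cassandra's actual code. `ReplicaStub` and `blocking_insert` are hypothetical names invented for this example; the only behavior taken from the comment is "send the write to the replicas and wait for a majority to acknowledge."

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class ReplicaStub:
    """Hypothetical stand-in for a remote replica."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def store(self, key, value):
        # A replica that is down fails the write instead of acknowledging it.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


def blocking_insert(replicas, key, value):
    """Send the write to every replica and report whether a majority acked."""
    quorum = len(replicas) // 2 + 1
    acks = 0
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r.store, key, value) for r in replicas]
        for future in as_completed(futures):
            if future.exception() is None:
                acks += 1
    # This toy waits for every reply; a real coordinator could return
    # as soon as `quorum` acks have arrived.
    return acks >= quorum
```

With three replicas and one of them down, `blocking_insert` still reports success because two acknowledgements form a majority; with two down it reports failure.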

In case of machine failures the data is repaired using techniques like read repair and hinting once the machine comes back up.
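
Read repair can be sketched in a few lines of Python. This is only an illustration under assumptions that are not stated in the thread: each replica is modeled as a plain dict, each value carries a timestamp, and the newest timestamp wins. `read_with_repair` is a hypothetical name.

```python
def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, fix stale copies.

    Each replica is a dict mapping key -> (timestamp, value). Resolving
    conflicts by newest timestamp is an assumption of this sketch.
    """
    answers = [replica.get(key) for replica in replicas]
    known = [answer for answer in answers if answer is not None]
    if not known:
        return None
    newest = max(known, key=lambda pair: pair[0])
    # Push the newest version back to any replica that is missing it
    # or holds an older one.
    for replica, answer in zip(replicas, answers):
        if answer != newest:
            replica[key] = newest
    return newest[1]
```

After one such read, every replica that answered holds the same version, so a replica that missed a write while it was down catches up the next time the key is read.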

2008-08-18T18:24:00.000-07:00

Hi, Avinash. Thanks for coming by!

Ah, yes, I see in the slides a note that writes go out to a transaction log immediately, before going to memory. Then, later, the disk-based tables are updated in batch, right?

Sorry about the confusion, I'll update the post to clarify.

While you're here, can you answer another question? When are updates to replicas done? That doesn't seem to be clear from the slides or wiki. I'm curious how much of a window there might be for data loss if a machine croaks and doesn't come back up?

2008-08-18T15:53:00.000-07:00

Writes are not cavalier. Every write is logged into a Commit Log at each replica. Only if this write is successful does the local replica update the in-memory copy. This ensures that if a server or replica crashes, it can be recovered from the Commit Log.
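
The ordering described here (log first, memory second, replay on restart) can be sketched as follows. This is a toy in Python, not Cassandra's implementation: the `Replica` class, the JSON-lines log format, and the `demo` helper are all assumptions made for illustration.

```python
import json
import os
import tempfile


class Replica:
    """Toy replica: append to a commit log first, then update memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}

    def write(self, key, value):
        # 1. Append the mutation to the commit log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only after the log append succeeded, update the in-memory copy.
        self.memtable[key] = value

    def recover(self):
        # Rebuild the in-memory table by replaying the commit log in order.
        self.memtable = {}
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    entry = json.loads(line)
                    self.memtable[entry["key"]] = entry["value"]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "commit.log")
        first = Replica(path)
        first.write("user:1", "alice")
        first.write("user:1", "alicia")
        # Simulate a crash: a fresh process starts with an empty memtable.
        restarted = Replica(path)
        restarted.recover()
        return restarted.memtable
```

Calling `demo()` writes two versions of the same key, throws away the in-memory state, and rebuilds it from the log, so the restarted replica ends up with the latest value.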

2008-08-16T01:16:00.000-07:00

How does this compare with Hadoop?

2008-08-15T20:44:00.000-07:00

Hey Greg - Not sure if you take "requests", but I'd love to hear your opinions and insights on triple stores (in the semantic web sense). You have any experience? Systems like Cassandra, BigTable, or Dynamo seem like the early throes of triple stores - they're only missing one more attribute.

2008-08-15T18:25:00.000-07:00

Hi, Jeff. Thanks for coming by to chat about this!

The part I'm confused about is what happens if you make a write, then the machine goes down before the commit log is written to disk. Is there a window where data loss could occur?

On waiting for a majority of N nodes to return successfully, the wiki (which it appears you wrote) says, "The comments indicate that we wait for a reply from 'X <= N' endpoints, but I don't see this in the code." I took that to mean that the wait for the majority of N nodes was not implemented. Is that incorrect?

Thanks, Jeff!

2008-08-15T18:06:00.000-07:00

hey greg,

actually, cassandra has a configurable consistency model for writes. if throughput is a priority, you can dispatch writes asynchronously to the n nodes managing the n replicas. failures will be caught on the next read. you can also execute a blocking write and wait for a majority of the n nodes to return successfully. either way, all writes persist to a redo log in addition to being inserted into the in-memory table structure.

also, if any of the nodes that manage the key or its replicas are down at the time of a write, the system will find another node to take the write, and take note of the downed node. if that node comes up, the data will be passed back to its appropriate owner. this is the "hinted handoff" algorithm detailed in the dynamo paper.
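
a rough sketch of that handoff flow in python, purely for illustration. `Node`, `write_with_hints` and `deliver_hints` are made-up names, and choosing a single fixed fallback node is an assumption of the sketch, not something cassandra or the dynamo paper prescribes.

```python
class Node:
    """Toy node with local data and a list of hints held for other nodes."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # entries are (intended_owner_name, key, value)


def write_with_hints(owners, fallback, key, value):
    """Write to each owner; if an owner is down, leave a hint on `fallback`."""
    for owner in owners:
        if owner.up:
            owner.data[key] = value
        else:
            # Take note of the downed node so the write can be replayed later.
            fallback.hints.append((owner.name, key, value))


def deliver_hints(holder, nodes_by_name):
    """Pass hinted writes back to owners that are up again; keep the rest."""
    remaining = []
    for owner_name, key, value in holder.hints:
        owner = nodes_by_name[owner_name]
        if owner.up:
            owner.data[key] = value
        else:
            remaining.append((owner_name, key, value))
    holder.hints = remaining
```

if node b is down during a write, the fallback node keeps a hint for it; once b is back up, `deliver_hints` replays the write to b and clears the hint. a real system would also have to avoid overwriting newer data with an old hint, which this toy ignores.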

via the redo log and hinted handoff, the system is designed to never lose a write.

in addition, we're working to add support for more stringent consistency models, similar to the work done with megastore.

2008-08-15T17:53:00.000-07:00

Hi!

Have you looked at http://hypertable.org/ ?

Are you still working from home? Ping me if you want to go get some tea.

Cheers,
-Brian

2008-08-15T17:48:00.000-07:00

My first impression of Cassandra is that it is the love-child of Bigtable and Dynamo. You have Dynamo's gossip protocols and consistent hashing instead of Bigtable's table servers. However, unlike Dynamo, where you can tune the number of writers who need to ack a write, Cassandra does seem a bit cavalier about writes.