Comments on Geeking with Greg: Slides on LiveJournal architecture
(Greg Linden, http://www.blogger.com/profile/09216403000599463072)

Anonymous (2007-05-02 16:33):

I've been interested in how LJ scales for a while, and I think half of their success is managing to avoid the database as much as possible for simple reads of information that can be gotten with less overhead. The use of memcached is a great part of that: it literally stores the finished HTML, ready to be dropped into the outbound stream. When a post is first read from the DB, it has to be processed to get it screen-ready, including parsing custom HTML-style tags and other dynamic insertions.

If that had to be read from the database every time, even if it was in memory most of the time, it would be a significant overhead. Instead they can usually just pick a block of HTML text out of memcached, indexed by the original user and post number, and of course the most-read posts are almost invariably from the last 12-24 hours.

Though I'm not yet storing finished user profiles for the website I run (a dating site), I am caching the number of users, and some other simple information, for 30 seconds between counts. Even though the entire dataset is in memory, that still saves several million database requests a day. In earlier experiments, where I cached users' raw profile data more aggressively, I saw a huge decrease in the number of database hits, and the amount of network traffic dropped as much as 80% (at that time I was just caching into shared memory on a single webserver, from a local network database server).

Because there was no database overhead, the pages were also faster to fetch.
When I can cache the finished HTML in memcached, I expect further big reductions in the time to produce a page.

As for database partitioning, they already do it, with the user shards (page 4, right hand side). The global database is still a potential problem, and I understand they've been trying to decentralise that for a while, but with 12.8 million journals and around 200,000 posts per day (stats here: http://www.livejournal.com/stats.bml) it takes a lot of thought.

One final thought about partitioning data: bear in mind that any user could be viewing any other user's recent posts. That's a lot of unfinished data to throw around, potentially hundreds of times per minute for popular journals like JWZ's, or the RSS feeds for WilWheaton or xkcd.com.

Via dzone: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more -- includes a slightly older but more complete version (80 pages; my personal favourite: page 8) of how LiveJournal has been scaled up.

Greg Linden (2007-05-01 21:57):

Hi, Mike. Sorry, I was not clear. I did mean that each partition would be replicated. It would look like the UserDB clusters on slide 4.

The big difference between that and having one large database with a master and many read-only replicas is that the big database probably will hit disk, but the partitioned, replicated databases should serve out of memory.

That's a good point that read-only data that is infrequently updated, like Amazon's product data or Google's search index, may be handled differently. However, heavily partitioning read-only data to keep it in memory still makes sense.
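The replicated-partition layout described above (each user partition is a small cluster with a master and read replicas, like the UserDB clusters on slide 4) implies a simple routing rule: hash the user to a partition, send writes to that partition's master, and spread reads across its replicas. A sketch, with an entirely hypothetical cluster map and host names:

```python
import hashlib
import random

# Hypothetical cluster map: each partition is a master plus read replicas.
CLUSTERS = {
    0: {"master": "db0-master", "replicas": ["db0-r1", "db0-r2"]},
    1: {"master": "db1-master", "replicas": ["db1-r1", "db1-r2"]},
    2: {"master": "db2-master", "replicas": ["db2-r1", "db2-r2"]},
    3: {"master": "db3-master", "replicas": ["db3-r1", "db3-r2"]},
}


def partition_for(user):
    """Stable hash of the user name onto a partition number."""
    digest = hashlib.md5(user.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(CLUSTERS)


def host_for(user, write=False):
    """Writes go to the partition's master; reads to any of its replicas."""
    cluster = CLUSTERS[partition_for(user)]
    if write:
        return cluster["master"]
    return random.choice(cluster["replicas"])
```

Because each partition holds only its own users' data, its working set is small enough to stay in memory, which is the point Greg makes against one large master with many replicas.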
Google, for example, does that with its index shards on GFS.

Anonymous, /SH (2007-05-01 21:54):

For handling sudden traffic spikes, hotspot partitions, and failures, it's much easier to add a bunch of read-only replicas than to repartition the data.

LiveJournal does use partitions. They get two scaling knobs:

* The number of partitions scales with the number of active journal writers.
* The number of replicas scales with the amount of incoming traffic.
* The number of replicas for a given partition can be scaled up for hotspots.

In this case, cache consistency is no big deal; this is a blogging system used by teenagers, not PayPal.

/SH

Mike Dierken (2007-05-01 20:21):

What about the issue of availability? If a partition is unavailable, both the clients that update and the clients that read would be out of luck. If there are very many more readers than writers (like product data at amzn), you may want to decouple the systems purely for availability concerns.
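The replica-scaling knob from /SH's comment above (handle a hotspot partition by giving it more read replicas rather than repartitioning) amounts to a small sizing rule. The sketch below is illustrative only; the throughput figure and the minimum of two replicas (which also covers Mike's availability concern for reads, by leaving no partition with a single copy) are assumed numbers, not LiveJournal's.

```python
import math


def replicas_needed(read_qps, qps_per_replica=500, minimum=2):
    """Replica count for one partition, scaled with its read traffic.

    A hotspot partition simply gets more replicas; the data itself
    is never repartitioned. The floor of two replicas keeps reads
    available if one copy goes down.
    """
    return max(minimum, math.ceil(read_qps / qps_per_replica))
```

A quiet partition stays at the two-replica floor, while a journal as popular as JWZ's would push its partition's replica count up with its read rate.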