Wednesday, October 21, 2009

Advice from Google on large distributed systems

Google Fellow Jeff Dean gave a keynote talk at LADIS 2009 on "Designs, Lessons and Advice from Building Large Distributed Systems". Slides (PDF) are available.

Some of this talk is similar to Jeff's past talks but with updated numbers. Let me highlight a few things that stood out:

A standard Google server appears to have about 16G RAM and 2T of disk. If we assume Google has 500k servers (which seems like a low-end estimate given they used 25.5k machine years of computation in Sept 2009 just on MapReduce jobs), that means they can hold roughly 8 petabytes of data in memory and, after x3 replication, roughly 333 petabytes on disk. For comparison, a large web crawl with history, the Internet Archive, is about 2 petabytes and "the entire [written] works of humankind, from the beginning of recorded history, in all languages" has been estimated at 50 petabytes, so it looks like Google easily can hold an entire copy of the web in memory, all the world's written information on disk, and still have plenty of room for logs and other data sets. Certainly no shortage of storage at Google.

Jeff says, "Things will crash. Deal with it!" He then notes that Google's datacenter experience is that, in just one year, 1-5% of disks fail, 2-4% of servers fail, and each machine can be expected to crash at least twice. Worse, as Jeff notes briefly in this talk and expanded on in other talks, some of the servers can have slowdowns and other soft failure modes, so you need to track not just up/down states but whether the performance of the server is up to the norm. As he has said before, Jeff suggests adding plenty of monitoring, debugging, and status hooks into your systems so that, "if your system is slow or misbehaving" you can quickly figure out why and recover. From the application side, Jeff suggests apps should always "do something reasonable even if it is not all right" on a failure because it is "better to give users limited functionality than an error page."

Jeff emphasizes the importance of back of the envelope calculations on performance, "the ability to estimate the performance of a system design without actually having to build it." To help with this, on slide 24, Jeff provides "numbers everyone should know" with estimates of times to access data locally from cache, memory, or disk and remotely across the network. On the next slide, he walks through an example of estimating the time to render a page with 30 thumbnail images under several design options. Jeff stresses the importance of having an at least high-level understanding of the operation of the performance of every major system you touch, saying, "If you don't know what's going on, you can't do decent back-of-the-envelope calculations!" and later adding, "Think about how much data you're shuffling around."

Jeff makes an insightful point that, when designing for scale, you should design for expected load, ensure it still works at x10, but don't worry about scaling to x100. The problem here is that x100 scale usually calls for a different and usually more complicated solution than what you would implement for x1; a x100 solution can be unnecessary, wasteful, slower to implement, and have worse performance at a x1 load. I would add that you learn a lot about where the bottlenecks will be at x100 scale when you are running at x10 scale, so it often is better to start simpler, learn, then redesign rather than jumping into a more complicated solution that might be a poor match for the actual load patterns.

The talk covers BigTable, which was discussed in previous talks but now has some statistics updated, and then goes on to talk about a new storage and computation system called Spanner. Spanner apparently automatically moves and replicates data based on usage patterns, optimizes the resources of the entire cluster, uses a hierarchical directory structure, allows fine-grained control of access restrictions and replication on the data, and supports distributed transactions for applications that need it (and can tolerate the performance hit). I have to say, the automatic replication of data based on usage sounds particularly cool; it has long bothered me that most of these data storage systems create three copies for all data rather than automatically creating more than three copies of frequently accessed head data (such as the last week's worth of query logs) and then disposing of the extra replicas when they are no longer in demand. Jeff says they want Spanner to scale to 10M machines and an exabyte (1k petabytes) of data, so it doesn't look like Google plans on cutting their data center growth or hardware spend any time soon.

Data center guru James Hamilton was at the LADIS 2009 talk and posted detailed notes. Both James' notes and Jeff's slides (PDF) are worth reviewing.


Jeff Kubina said...

Great post Greg; any chance his talk was video taped?

Greg Linden said...

Hi, Jeff. I don't think this LADIS 2009 talk was taped -- James Hamilton might know for sure -- but Jeff Dean's WSDM 2009 talk was taped and is available:

A fair amount of the material appears to be the same in the two talks.

Hyunsik Choi said...

Really informative post! Thanks a lot.

Harisankar H said...

A more recent video lecture done at Stanford is available at