Friday, March 21, 2008

Designing for internet scale

James Hamilton wrote a LISA 2007 paper, "On Designing and Deploying Internet-Scale Services" (PDF) with a remarkable brain dump of good advice on building large scale services.

The paper does read like a laundry list, so let me point at what I consider the most important recommendations.

Expect failures and test for them, to the point of never shutting down the system normally -- "crash only software" -- and designing and testing your system so that the "operations team [is] willing and able to bring down any server in the service at any time without draining the workload first."

Automate everything and keep it simple. Failover should be automatic, nothing should require human intervention, and this should be constantly tested. Complexity and dependencies are your enemy when trying to manage all possible failure modes automatically, and James rightly insists on keeping it simple, to the point that he argues that anything that would add complexity without producing order of magnitude improvements should not be implemented.

Test live with rollback, versioning, and incremental rollouts. As long as problems can be quickly reverted, rolling out code live early and often is not expensive or dangerous. More dangerous is "big-bang changes" where many things change at once, making it hard to determine where problems lie and hard to revert to the older version of the code base.

While I agree with almost everything James wrote, I found a few areas where I disagree based on my own experience at Amazon.

First, James wrote that "we like shipping on a 3-month cycle." That hardly seems like early and often to me. I think code should be shipped constantly, multiple times a week or even multiple times a day across different services. On the Web, there is no reason to not ship frequently, and launching continuously forces developers to eliminate any dependencies with other code in their launch while correctly adding pressure on high quality tools for monitoring and debugging the health of the system, versioning, rapid rollbacks, incremental rollouts, and live experiments.

Second, James says we should not "affinitize requests or clients to specific servers". Depending on what he means here, we may disagree, because I think this potentially could conflict with simplifying database design. In particular, it is much easier to do database replication (and James acknowledges that "database scaling remains one of the hardest problems in designing internet scale services") if we stick a user to a specific server for their writes. Then, we can just guarantee the much simpler eventual consistency with rather loose requirements on what eventual means. The counter-argument here is that sticking users to specific servers can create hot spots, but I think that largely can be avoided if we pick the server to stick to for writes in part based on load.

Finally, James advocates for a manual "big red switch" that allows us to throttle our system manually. I think this is in conflict with the idea of automating everything. What I would prefer to see is that the system automatically responds to low load by using more resources and less load by using less, with monitoring, warnings, and ability to override. As James says, "People make mistakes. People need sleep. People forget things." I think it is optimistic to believe that, in the middle of a crisis, people could quickly decide how to optimally throttle back a system so it can maintain the highest level of quality of service at unusual load. But an automated system that has even limited understanding of the cost of its parts could do that.

In all, a great paper, thought-provoking and well worth reading.


Anonymous said...

By "affinitize requests or clients to specific servers", I'm sure that he refers to the common ad-hoc sharding that's not robust enough to handle system failures. Using DHT to hash to a replication group is often an OK solution (Amazon's Dynamo) but not the best to utilize system resource optimally.

New generation DBs like Bigtable handle the shard/range mapping dynamically base on load (similarly, Hypertable plans to use pluggable shard/range schedulers, including learning schedulers to rebalance load without affecting serving) is, IMHO, a more flexible and robust approach.

Anonymous said...

Absolutely agree that the paper is a good one. It's not often that you find in one place operational best practices like this.

I think that I agree on most if not all of his points.

About the 'big red switch' idea: no matter how awesome your capacity planning has proven itself in the past, there can be those times when instantaneous growth might hit your site unexpectedly, and you'll have to make a decision either for the site to be down, or be up with a reduced feature set.

I'm not suggesting that every type of site experiences massive spikes like that, but it can, and does, happen.

I think I agree with James that there are some times when manual work and thinking are needed, even if it's 0.01% of the site's uptime.

Leon Mergen said...


I've been a recent reader of your blog, and am very interested in the topics you talk about -- thanks for the posts.

Now, this paper touches the topic of "design for failure", where you are supposed to test your failure paths.

Currently in my company, we still operate on a very small scale, and have deployed monitoring systems that detect when a service has gone offline, and automatically fails over. However, there is a time difference between shutting down such a service, and the monitoring system detecting a shutdown.

Is this a design flaw in our system, or is this expected (I noticed that even hadoop uses a heartbeat system, which there may be a delay in detecting a service has gone down) ? If it is expected, this will of course disrupt the experience for users in this small timespan, and if you're supposed to test failure paths all the time when shutting down services.. you're disrupting your users/customers in the process.

Is this a trade off you're willing to make in exchange for continuously testing the failure paths, or are the failure paths at big companies like Microsoft a bit more sophisticated, where enough redundancy is built-in that the end-users will not notice ?

Unknown said...

The best way to avoid latency in failure detection is to make it a part of the call path of each request, i.e. let the first remote caller that finds his end-point down inform the monitoring system of the situation and engage a local, possibly non-optimal, fallback.

Then the monitoring system has time to reconfigure the call routing by taking that end-point out of order and restablish optimal routing (eventual consistency with defined SLAs).