Tuesday, September 19, 2006

Diagnosing failures in Amazon's systems

Robin Harris points out a paper, "Advanced Tools for Operators at Amazon.com" (PDF), that talks about tracking and diagnosing failures in Amazon.com's distributed backend of web services.

The paper describes some of the complexity of this task:
Complex dependencies among system components can cause failures to propagate to other components, triggering multiple alarms and complicating root-cause determination.

Maya is a new visualization tool ... [that] displays the dependencies among components as a directed graph. Each component -- web page, service, or database -- is represented as a black (healthy) or red (unhealthy) dot.


The operators thus clearly see which components that are reporting alarms may simply be suffering from a cascaded failure.
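
To make the idea concrete, here is a minimal sketch in Python of the kind of reasoning a dependency graph like this enables. It is not Maya's algorithm, which the quoted passage does not describe. The component names, the `depends_on` map, and the `classify_alarms` function are all made up for illustration. The assumption is simply that an alarming component is a likely cascade if something it depends on, directly or transitively, is also alarming.

```python
from typing import Dict, List, Set


def classify_alarms(
    depends_on: Dict[str, List[str]], alarming: Set[str]
) -> Dict[str, Set[str]]:
    """Split alarming components into likely root causes and likely cascades.

    A component counts as a likely cascade if anything it depends on,
    directly or transitively, is also alarming. (Illustrative rule only.)
    """

    def has_alarming_dependency(start: str) -> bool:
        # Depth-first walk over everything `start` depends on.
        seen = {start}
        stack = list(depends_on.get(start, []))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in alarming:
                return True
            stack.extend(depends_on.get(node, []))
        return False

    cascaded = {c for c in alarming if has_alarming_dependency(c)}
    return {"root_causes": alarming - cascaded, "cascaded": cascaded}


# Hypothetical graph: a web page calls two services, each backed by a database.
depends_on = {
    "product_page": ["catalog_service", "review_service"],
    "catalog_service": ["catalog_db"],
    "review_service": ["review_db"],
}
alarming = {"product_page", "catalog_service", "catalog_db"}

result = classify_alarms(depends_on, alarming)
print(sorted(result["root_causes"]))
print(sorted(result["cascaded"]))
```

Three components raise alarms here, but only the database has no alarming dependency of its own, so it is the one worth looking at first. One limitation of this simple rule: if alarming components depend on each other in a cycle, every one of them is classified as a cascade and no root cause is reported.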

This reminds me of a post by Werner Vogels a few months ago. Talking about Amazon's rearchitecture efforts, Werner had said, "A service-oriented architecture would give us the level of isolation that would allow us to build many software components rapidly and independently."

In the comments, I asked, "Why do web services necessarily eliminate dependencies? Can you not have the same complicated nest of dependencies in a cloud of web services, all of them talking to each other in an incomprehensible chatter?"

I do worry that Amazon may have taken a complex web of dependencies in libraries and moved that to a complex web of dependencies in web services. Since web service calls look like heavyweight library calls to developers, it is not clear to me that web services necessarily reduce dependencies, not without a special effort to make sure they do.

Moreover, as this paper describes, there is additional complexity in a distributed system of web services, in particular the difficulty of debugging and profiling across remote calls.

I noticed the authors of this paper include the widely respected Professor Michael Jordan at UC Berkeley in addition to several people at Amazon.com. Very interesting.

It is unusual to see Amazon publish academic papers on their work. Historically, they have not been as open as Google or Microsoft Research. I am excited to see more papers coming out of Amazon describing the challenging problems they are solving internally.

[Robin Harris post found via Dan Creswell]

1 comment:

Robin Harris said...

Greg,

Amazon can't eliminate all dependencies. What they can do is choose the dependencies. Presumably what Werner is saying is that by building a core of high-availability web services developers can concentrate on using those services efficiently rather than having to program around their occasional absence.

Of course, this creates a potentially unstable house of cards. Yet over time, using Maya and other tools, Amazon should be able to wring the bugs out of each service and increase its availability to MVS levels.

Is this really so different from the internet's reliance on distributed services such as DNS?

Regards,

Robin Harris
StorageMojo.com