Thursday, November 19, 2009

Continuous deployment at Facebook

E. Michael Maximilien has a post, "Extreme Agility at Facebook", on blog@CACM. The post reports on a talk at OOPSLA by Robert Johnson (Director of Engineering at Facebook) titled "Moving Fast at Scale".

Here is an interesting excerpt on very frequent deployment of software and how it reduces downtime:
Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and applied directly to parts of the infrastructure. The idea is to quickly find issues and their impacts on the rest of system and surely fixing any bugs that would result from these frequent small changes.

Second, there is limited QA (quality assurance) teams at Facebook but lots of peer review of code. Since the Facebook engineering team is relatively small, all team members are in frequent communications. The team uses various staging and deployment tools as well as strategies such as A/B testing, and gradual targeted geographic launches.

This has resulted in a site that has experienced, according to Robert, less than 3 hours of down time in the past three years.
For more on the benefits of deploying software very frequently, not just for Facebook but for many software companies, please see also my post on blog@CACM, "Frequent Releases Change Software Engineering".
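
The gradual, targeted launches mentioned in the excerpt can be illustrated with a small sketch. This is not Facebook's tooling, which the talk summary does not describe; it is a minimal Python example under assumed names (`rollout_bucket`, `is_enabled`, and the region codes are all hypothetical) showing one common way to expose a change to a fixed percentage of users in selected regions.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user deterministically to a bucket in 0..99 for a given feature.

    Hashing the feature name together with the user ID means the same user
    can land in different buckets for different features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int,
               region: str, allowed_regions: set) -> bool:
    """Return True if the feature should be shown to this user.

    Hypothetical policy: the user must be in an allowed region AND fall
    into the first `percent` buckets.
    """
    if region not in allowed_regions:
        return False
    return rollout_bucket(user_id, feature) < percent


if __name__ == "__main__":
    # Launch a hypothetical feature to 10% of users in one region.
    regions = {"NZ"}
    for uid in ("alice", "bob", "carol"):
        print(uid, is_enabled(uid, "new_profile", 10, "NZ", regions))
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the change at 10% still sees it when the percentage is raised, so a launch can be widened step by step and any problems are confined to the users already exposed.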

1 comment:

Unknown said...

>> "less than 3 hours of down time in the past three years"

It's admirable that Facebook believes in rapid development and "good enough" testing.

However, I'm a Facebook application developer and wanted to say that the claim of 3 hours of downtime is very deceptive. Facebook applications contribute an enormous number of pageviews for Facebook. I wouldn't be surprised if application pages contributed at least 20% of all Facebook pageviews. And the stability of the Facebook Application Platform is FAR worse than webpages that are entirely owned by Facebook.

I've been on the platform since 2007 and there hasn't been a single month where the Platform didn't have some sort of degradation. Degradation happens in terms of API errors, buggy functionality, and HTTP 404s. Just last night they had lots of problems.