Thursday, July 28, 2005

Google Sawzall

Google Labs has a new paper out, "Interpreting the Data: Parallel Analysis with Sawzall".

Sawzall is a high-level, parallel data-processing scripting language built on top of MapReduce. It lets Google run distributed, fault-tolerant analyses over very large data sets.
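The core idea is that a Sawzall script sees one input record at a time and "emits" values into aggregator tables (counts, sums, histograms); the MapReduce layer runs the script over shards in parallel and merges the tables. A rough Python sketch of that model, with illustrative names (run_shard, merge) that are mine, not Sawzall's:

```python
# Sketch of Sawzall's programming model: a per-record script emits
# values into sum-aggregator tables; the framework runs it over
# shards in parallel and merges the tables by addition.
from collections import Counter

def script(record, emit):
    # Per-record logic, analogous to a Sawzall program:
    #   emit count <- 1; emit total <- x; emit sum_of_squares <- x * x;
    emit("count", 1)
    emit("total", record)
    emit("sum_of_squares", record * record)

def run_shard(records):
    # "Map" side: run the script over one shard, aggregating locally.
    tables = Counter()
    def emit(table, value):
        tables[table] += value  # 'sum' aggregator semantics
    for r in records:
        script(r, emit)
    return tables

def merge(shard_tables):
    # "Reduce" side: sum-aggregators merge by addition.
    out = Counter()
    for t in shard_tables:
        out.update(t)  # Counter.update adds counts
    return out

# Two "machines", each processing a shard of the input:
result = merge([run_shard([1.0, 2.0]), run_shard([3.0])])
# result: count=3, total=6.0, sum_of_squares=14.0
```

The payoff is that the script author writes only the per-record body; parallelism, fault tolerance, and aggregation are the framework's job.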

Here's an excerpt from Section 14 "Utility" of the paper:
Although it has been deployed only for about 18 months, Sawzall has become one of the most widely used programming languages at Google.

One measure of Sawzall's utility is how much data processing it does. We monitored its use during the month of March 2005.

During that time, on one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB) ... The jobs collectively consumed almost exactly one machine-century.
2.8 petabytes in one month. Yowsers.
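The petabyte figure checks out if you assume binary (PiB-style) prefixes, i.e. 1 PB = 2^50 bytes:

```python
# Quick sanity check of the quoted read volume, assuming binary prefixes.
bytes_read = 3.2e15
petabytes = bytes_read / 2**50  # 1 PiB = 2^50 bytes
print(round(petabytes, 1))      # 2.8, matching the paper's figure
```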

Data. Yummy, yummy data. Gimme, gimme, gimme!

Update: Brian Dennis describes Sawzall, MapReduce, and other powerful tools at Google as "major force multipliers."
