Sunday, January 06, 2008

MapReducing 20 petabytes per day

Googlers Jeff Dean and Sanjay Ghemawat have an article, "MapReduce: Simplified Data Processing on Large Clusters", in the January 2008 issue of Communications of the ACM.

It is a great introduction to MapReduce, but what I found most interesting was the numbers they cite on usage of MapReduce at Google.

Jeff and Sanjay report that, on average, 100k MapReduce jobs are executed every day, processing more than 20 petabytes of data per day.

More than 10k distinct MapReduce programs have been implemented. In the month of September 2007, 11,081 machine years of computation were used for 2.2M MapReduce jobs. On average, these jobs used 400 machines and completed their tasks in 395 seconds.

What is so remarkable about this is how casual it makes large scale data processing. Anyone at Google can write a MapReduce program that uses hundreds or thousands of machines from their cluster. Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time.

It is an amazing tool for Google. Google's massive cluster and the tools built on top of it rightly have been called a "competitive advantage", the "secret source of Google's power", and a "major force multiplier".

By the way, there is another little tidbit in the paper about Google's machine configuration that might be of interest. They describe the machines as dual processor boxes with gigabit ethernet and 4-8G of memory. The question of how much memory Google has per box in its cluster has come up a few times, including in my previous posts, "Four petabytes in memory?" and "Power, performance, and Google".

Update: Others now have picked up on this paper and commented on it, including Niall Kennedy, Kevin Burton, Paul Kedrosky, Todd Huff, Ionut Alex Chitu, and Erick Schonfeld.

Update: Let me also point out Professor David Patterson's comments on MapReduce in the preceding article in the Jan 2008 issue of CACM. David said, "The beauty of MapReduce is that any programmer can understand it, and its power comes from being able to harness thousands of computers behind that simple interface. When paired with the distributed Google File System to deliver data, programmers can write simple functions that do amazing things."


Anonymous said...

The ACM charges for the full paper (linked-to in the article). You can get the paper free from google at MapReduce: simplified data processing on large clusters

Anonymous said...

Anonymous: The abstract from the paper you cite is different. Whether they've just updated the original for the ACM article I don't know!

Anonymous said...

FYI... I uploaded the ACM article on to blog:

Maxime said...

5 lectures about MapReduce and usage: Google: Cluster Computing and MapReduce