Friday, November 12, 2010

An update on Google's infrastructure

Google Fellow Jeff Dean gave a talk at Stanford for the EE380 class with fascinating details on Google's systems and infrastructure. Krishna Sankar has a good summary along with a link to video (Windows Media) of the talk.

Some fun statistics in there. Most amazing are the improvements in data processing that have gotten them to the point that, in May 2010, 4.4M MapReduce jobs consumed 39k machine years of computation and processed nearly an exabyte (1k petabytes) of data that month. Remarkable amount of data munching going on at Google.

The talk is an updated version of Jeff's other recent talks such as his LADIS 2009 keynote, WSDM 2009 keynote, and 2008 UW CS lecture.

[HT, High Scalability, for the pointer to Krishna Sankar's notes]


Alice said...

39k machine years in one month...doesn't that mean almost half a million MR machines (assuming an unrealistic 100% utilization)? That's a lot of hardware.

Greg Linden said...

Well, on the one hand, the machines probably have four cores (so 1/4 the machines), but, on the other hand, the average utilization rate is probably a lot lower than 100%, more like 20-30%. So, I'd guess that 500k+ machines is a decent rough estimate for machines dedicated to MapReduce in the Google data centers based on the data they released.

What do others think? Roughly 500k physical machines a good estimate?
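
For anyone who wants to check the arithmetic in this thread, here is a rough back-of-envelope sketch. The only input taken from the talk is the 39k machine years consumed in one month. The function name and parameters are made up for illustration, and the four cores per machine and 20-30% utilization are the guesses from the comment above, not published figures. It also assumes a "machine year" is really counted per core, which is itself uncertain.

```python
def machines_needed(machine_years, months=1.0, cores_per_machine=1, utilization=1.0):
    """Estimate physical machines needed to deliver `machine_years` of
    compute within `months` calendar months (hypothetical helper)."""
    # Compute units that would have to run continuously for the whole period.
    busy_units = machine_years * 12.0 / months
    # Fewer boxes if each holds several units; more if they sit partly idle.
    return busy_units / cores_per_machine / utilization


MACHINE_YEARS = 39_000  # from the talk: MapReduce usage in May 2010

# 100% utilization, one compute unit per machine (the first comment's reading).
naive = machines_needed(MACHINE_YEARS)

# Assumed: four cores per machine, 20-30% average utilization.
low = machines_needed(MACHINE_YEARS, cores_per_machine=4, utilization=0.30)
high = machines_needed(MACHINE_YEARS, cores_per_machine=4, utilization=0.20)

print(f"naive: {naive:,.0f} machines")
print(f"adjusted: {low:,.0f} to {high:,.0f} machines")
```

Under those assumptions, the naive figure is 468,000 machines and the adjusted range is about 390,000 to 585,000, which is where a rough guess of 500k comes from. Changing `cores_per_machine` or `utilization` moves the answer a lot, so treat it as an order-of-magnitude estimate only.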

Alice said...

I would guess they would have 8 to 16 cores. Is a machine year the same as a core year? It's an odd unit of measurement.

Anonymous said...

Let's say 8 to 24 cores per system, and more like 1 million motherboards.