Tuesday, June 27, 2006

Four petabytes in memory

One way to look at how the Google cluster changes the game is to look at how data access speeds change with a cluster of this size.

Google reportedly had an estimated 450k machines two months ago and adds machines at roughly 100k per quarter. In 2004, each of these machines had 2-4G of memory; two years later, they are likely up to 8G standard.

That means that Google can store roughly 500k * 8G = 4 petabytes of data in memory on their cluster.
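To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch. The machine count and per-machine memory are the rough estimates quoted above, not confirmed figures, and the function name and the decimal unit convention (1 petabyte = 1,000,000 gigabytes) are assumptions made for illustration.

```python
# Back-of-the-envelope estimate. Every input is a rough guess from the post,
# not a confirmed figure.
MACHINES = 500_000       # ~450k two months ago plus ~100k per quarter, rounded
GB_PER_MACHINE = 8       # assumed standard memory per machine today
GB_PER_PB = 1_000_000    # assumed decimal units: 1 PB = 10^6 GB


def cluster_memory_pb(machines: int, gb_per_machine: float) -> float:
    """Total memory across the cluster, in petabytes."""
    return machines * gb_per_machine / GB_PER_PB


total_pb = cluster_memory_pb(MACHINES, GB_PER_MACHINE)
print(f"{total_pb:g} PB")  # prints "4 PB"
```

The estimate is sensitive to both guesses. With 450k machines at 4G each, for example, the same function gives 1.8 petabytes.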

Four petabytes. How much is that?

It is twice the size of the entire Internet Archive, a historical copy of much of the Web. It is the same as the estimated size of all the data Google had in 2003 on disk.

It is a staggering amount of data. And it is all accessible at speeds orders of magnitude faster than anything available to those with punier clusters.

That is a game changer.


Anonymous said...

I believe Microsoft/MSN/Live were talking a couple of months ago about why they had a competitive advantage by starting later and having a completely 64bit setup before anyone else. How could Google have 8Gig of RAM if they didn't have 64bit? Were Microsoft wrong?

John K said...

Nice post - I agree with the 450k ballpark figure, but I think the memory is a bit lower. See here for my estimates and thoughts.

RichB: Google would likely use Linux kernels with PAE that would allow their 32bit servers to address up to 12-16GB of RAM without requiring 64bit CPUs...

Anonymous said...

Greg, I'm curious, what kind of things are made possible by this?

One might be online link analysis to re-rank the top N results, HITS-style. I guess that Teoma (Ask) has probably been using a link network stored this way to do that for quite some time now.

What else?

(Isn't it interesting that MapReduce, BigTable and GFS have gained so much interest lately, when all of them are very disk-oriented?)

Greg Linden said...

Hi, Bernhard. That's a little hard to say.

Being able to access vast amounts of data orders of magnitude faster allows doing things that couldn't be done before. Analyses that were impossible before because they would take years can now be done in days. Online features that were impossible before because they would have taken minutes to respond can now be launched.

On your other point, I don't think this is in opposition to disk-oriented tools. Being able to store vast amounts of data in memory means you can operate much more quickly over frequently or repeatedly accessed data, and over data for online features with short access time requirements. Infrequently accessed data is still going to be on disk.

Anonymous said...

Actually, while GFS manages disks, it also uses any available server memory as a large disk cache - well, it is really Linux doing that, but net-net that's it. With locality of reference, they probably get a major benefit.

As for MS saying they have a late adopter advantage - isn't it odd that a software company would hang their hat on a hardware advantage? GFS is a major SW win for Google, and not easily or quickly replicated. See more on GFS at http://storagemojo.com/?p=88

Anonymous said...

GFS is a major SW win for Google, and not easily or quickly replicated

=> flash disks are going to kill this advantage. With an access time of a few microseconds, flash is going to be the perfect hardware between RAM and classical hard disks.

Anonymous said...

Hmm.. isn't that a bit far-fetched?

I don't think they add computers AND more ram to their systems.

In fact it does not even make sense, because a 32-bit mainboard can't in any case take more than 4GB of RAM. This means that to add more RAM, they have to exchange the mainboard and CPU too. And this certainly does not make sense, because then you get a new system anyway.

So I'm thinking this is where the 100k new machines per quarter come from: upgrading/replacing old machines. And even if not, you're adding an additional 4GB per system that is impossible to add.

May I say "hype"? ;)

Greg Linden said...

Barefoot, what makes you think that Google is not using 64-bit boxes?

The Google BigTable paper says that Google currently is using machines with "two dual-core Opteron 2 GHz chips". The Opteron is a 64-bit chip.