Thursday, May 29, 2008

Yahoo builds two petabyte PostgreSQL database

James Hamilton writes about Yahoo's "over 2 petabyte repository of user click stream and context data with an update rate for 24 billion events per day".

It apparently is built on top of a modified version of PostgreSQL and runs on about 1k machines. In his post, James speculates on the details of the internals. Very interesting.
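
To get a feel for the scale, here is a rough back-of-envelope sketch using only the figures quoted above. The even spread of events over the day and across machines, the decimal petabyte, and the helper names are my assumptions, not anything Yahoo has stated.

```python
# Back-of-envelope arithmetic from the quoted figures.
# Assumptions (not from the source): load is spread evenly over the day
# and across machines, and "petabyte" means 10**15 bytes.
EVENTS_PER_DAY = 24_000_000_000   # "24 billion events per day"
REPOSITORY_BYTES = 2 * 10**15     # "over 2 petabyte", so a lower bound
MACHINES = 1_000                  # "about 1k machines"
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(events_per_day: int) -> float:
    """Average events per second over a full day."""
    return events_per_day / SECONDS_PER_DAY


def per_machine(total: float, machines: int) -> float:
    """Even share of a total across the given number of machines."""
    return total / machines


events_per_sec = per_second(EVENTS_PER_DAY)
events_per_sec_per_machine = per_machine(events_per_sec, MACHINES)
terabytes_per_machine = per_machine(REPOSITORY_BYTES, MACHINES) / 10**12

print(f"{events_per_sec:,.0f} events/sec overall")
print(f"{events_per_sec_per_machine:,.0f} events/sec per machine")
print(f"{terabytes_per_machine:.1f} TB per machine")
```

Under those assumptions, that works out to a few hundred thousand events per second overall, a few hundred per second per machine, and roughly 2 TB of data per machine. Real peak load is presumably well above the daily average.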

Please see also Eric Lai's article in ComputerWorld, "Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest". On that, note that the Google Bigtable paper from 2006 says Bigtable handles "petabytes of data", so the Yahoo claim may depend on what you consider a database.

3 comments:

Otis Gospodnetic said...

So an obvious question is: why is Y! not using HBase for this? Isn't this *exactly* what it's designed for? It is already distributed and runs on top of Hadoop's HDFS, which Y! is heavily involved in. So why not HBase? Not ready yet performance-wise?

That aside, I find it *extremely* interesting that Y! chose PGSQL and not MySQL, even though they are otherwise a big MySQL shop (or at least I think so).

Greg Linden said...

Hi, Otis. Wild speculation on my part, but I suspect Yahoo is walking multiple paths here to see which works best for which cases.

So, some teams are using HBase, some are using massively partitioned MySQL, and this team is using a modified PostgreSQL.

Probably not a bad idea, is it? After all, these are new technologies, so it is unclear which will work best for which applications. Exploration probably is a good thing.

jeff hammerbacher said...

Otis: HBase is being developed outside of Yahoo. Yahoo is working on their own transactional storage solutions, YDOT (ordered table) and YDHT (hash table). These projects involve the search (YST/YSM), platforms, and research groups.

Furthermore, the Everest project comes out of SDS and is meant for analytical query processing, not transactional query processing. It's definitely not doing "exactly" what HBase was designed for.

It seems to be the next evolution of their Myna warehouse that they've built internally. At Yahoo, you can consider Hadoop a client of this warehouse.

I'd also like to point out that this project is probably properly considered to be built "under" postgres, not "on top" of postgres, since they've basically taken pg's query parsing and plan construction components and plugged them on top of a new storage manager and query execution engine.