Thursday, September 29, 2005

Google's BigTable

Jeff Dean from Google will be giving a talk on October 18 at the University of Washington on a new large-scale distributed system at Google called BigTable.
BigTable is a system for storing and managing very large amounts of structured data. The system is designed to manage several petabytes of data distributed across thousands of machines, with very high update and read request rates coming from thousands of simultaneous clients.
Sounds very interesting. The talk will be broadcast live on the internet.

See also my previous posts on Google's cool distributed architecture including their file system, cluster, and the distributed data processing tools MapReduce and Sawzall.

By the way, if anyone can find a paper on BigTable, please let me know. I couldn't find one.

Update: The talk was interesting but a little different than I expected.

BigTable stores a distributed, replicated sparse matrix of data. For example, for their crawler, you might have a BigTable matrix with a row "com.cnn.www:WORLD/:http" that contains information about the world news page from CNN. A column for that row might be labeled "content:" and contain the content for that page. Another column might be "language:" and contain "EN" for English. BigTable allows each cell in the matrix to have timestamped data, so a history of changes for the cell can be maintained easily.
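To make the data model concrete, here is a minimal single-machine sketch in Python of a sparse matrix whose cells keep timestamped versions. It is only an illustration of the row / column / timestamp idea described in the talk: the class name `SparseTable` and its methods `put`, `get`, and `history` are made up for this example and are not Google's API, and nothing here is distributed or replicated.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory sparse table (hypothetical, not BigTable's real API)."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value), newest first.
        # Only cells that were actually written take up space (sparse).
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        """Add a new timestamped version of a cell."""
        versions = self._rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before timestamp."""
        versions = self._rows.get(row, {}).get(column, [])
        for ts, value in versions:
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def history(self, row, column):
        """Return every stored (timestamp, value) pair, newest first."""
        return list(self._rows.get(row, {}).get(column, []))


table = SparseTable()
row = "com.cnn.www:WORLD/:http"
table.put(row, "content:", "<html>v1</html>", timestamp=1)
table.put(row, "content:", "<html>v2</html>", timestamp=2)
table.put(row, "language:", "EN", timestamp=1)

print(table.get(row, "content:"))               # newest version
print(table.get(row, "content:", timestamp=1))  # version as of timestamp 1
print(table.history(row, "content:"))           # full change history
```

Keeping the versions of each cell sorted newest-first makes "latest value" the cheap case, and a read as of an older timestamp is a short scan down the list.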

It is not, as I was first expecting, a structured distributed database like some Googlized version of MySQL Cluster. That's not what Google needs.

The kinds of data processing tasks that Google has to do every day require extremely high performance and reliability, but only weak guarantees on data consistency. No database like this exists, so Google had to build their own, BigTable.

Looking at BigTable and Google's other tools, I think Brian Dennis was right when he called them "major force multipliers." Tools like these enable Google to move faster, build more, and learn more than their competitors.

Update: Andrew Hitchcock posted a nice summary of the talk.

Update: The talk is available on Google Video.

Update: Eleven months later, Google has published a paper on Bigtable.

9 comments:

Andrew Hitchcock said...

I'm excited, I'm definitely going to be there. I went to the Dean talk last year, just after reading the MapReduce paper. Most of the presentation was stuff I had already read online, but the part about the synonyms was neat and I had never seen that system demonstrated (it was their internal tools, it seemed).

I did a Google for [BigTable] and this was the top result. I poked around some and couldn't find much else on the internet. Is this the first big, public presentation of their system?

Otis Gospodnetic said...

I just finished watching this. Impressive and interesting stuff! I, too, am wondering where I could read more about this...

Andrew Hitchcock said...

I just wrote an overview of the talk. Since you saw the talk, if you read my overview, I'd love to hear about any mistakes I made.

Greg Linden said...

Thanks, Andrew. That's a great writeup!

James Thiele said...

There are a couple minor points I'd add to Mr. Hitchcock's overview.

The largest BigTable they have is currently about 200 TeraBytes, which is less than the petabyte design goal.

Another interesting data point (to me) was that Google Earth's satellite imagery is 100+ TB and growing as higher resolution pics become available.

As pointed out in the overview, various in-house BigTable users run on their own clusters. According to the talk, these clusters range in size from ten or twenty to hundreds of machines.

As a personal aside, the infrastructure that Google is building for large-scale distributed computing (Google File System, MapReduce, BigTable, etc.) is impressive.

Aram said...

Does anyone know why Google does not use Bigtable for Gmail?

Rob Kohr said...

@aram

How do you know they don't?

Aram said...

@ Rob

I don't remember where I first read it. But I asked Jeff Dean this question. He said maybe it is because Gmail is older than Bigtable. I don't think that is the real reason, because Orkut is even older than Gmail, and it does use Bigtable.
He also mentioned it could be because of the size of Gmail, which was not a surprising answer to me.

Edward J. Yoon said...

Bigtable for Gmail...

It seems very dangerous if they are working on the From/To/CC/Bcc links graph. :)