Tuesday, April 17, 2007

What is Microsoft's CloudDB?

It appears that Microsoft is working on a clone of Google's BigTable codenamed Blue/CloudDB.

Mary Jo Foley reports that "CloudDB is a file-system-based storage system ... 'Blue' also seems to refer to the query processor that builds on top of the cheap, file-system-based storage enabled by CloudDB."

That sounds quite similar to BigTable, a distributed database built on top of GFS.

It seems hard to track down any other information about this effort. I was able to find a Microsoft job posting for a senior SDE that says:
Our team is building a geo-distributed, reliable and high performance internet facing service. You will help internal and external partners to realize Microsoft's vision of seamless access to your data - anywhere, anytime, any device.

You will learn about the Blue/CloudDB storage system, you will work on an API definition, solve problems around security and Quality of Service guarantees.

Do you find it fascinating to build a service that is scalable, geo-distributed, has efficient queuing, load balancing? We will solve together problems about performance, reliability and intelligent optimization for different traffic patterns. The product will be used also by internal Windows Live Services and you will have the opportunity to interact with multiple groups in the company that need easy, efficient and simple storage in the cloud.

Join us and help Microsoft go head to head with Amazon, Google, and Yahoo in cloud storage.
From that description, it sounds like Microsoft's CloudDB may be intended to compete with several efforts, including Amazon S3, AOL's XDrive, the Yahoo-supported Hadoop, and Google's BigTable, Google File System, and GDrive.

I am not even sure if it is from Microsoft, but I did also find something that looks like a leak of some pieces of an internal document called "CloudDB Living Spec v0.51.doc". The web page mentions the needs of Microsoft services (including MSN Shopping and Fremont/Expo) and echos terminology used in Google's BigTable paper. Some selected excerpts:
SSTable's efficient data is less than 40%, column name occupy much room in each item. Maybe we should use the column id to reduce this part of overload.

Sparse columns baked into the app. This scenario arises when an MSN service defines a large set of potential attributes for a type of an object, but only very few of these attributes are expected to be set on any given object.

For example ... [in] MSN Shopping ... the total set of attributes that products can have (e.g. "Pixel Resolution") is very large, but any given product only has a few (a vacuum cleaner doesn't have 'Pixel Resolution').

[A] service wants to be able to add 'attributes' to people, without having to waste space in every person for every attribute ... The most prominent example of this is Freemont --- this service wants to allow users to add their own columns to their listings. The total number of columns used across all objects can be huge (millions).

Note: Google Base advertises support (and is optimized) for cases where even a single object can have millions of columns.


MeTheGeek said...

Very interesting information Greg.


Bala said...

Google's BigTable design might be adopted by Microsoft for CloudDB. The Family:qualifiers concept will e adopted here.
Microsoft is also venturing another product similar to Gdrive