Saturday, March 25, 2006

I want a big, virtual database

In my previous story about Amazon, I talked about a problem with an Oracle database.

You know, eight years later, I still don't see what I really want from a database. Hot standbys and replication remain the state of the art.

What I want is a robust, high-performance virtual relational database that runs transparently over a cluster, with nodes dropping in and out of service at will and with read-write replication and data migration all done automatically.

I want to be able to install a database on a server cloud and use it like it was all running on one machine.

Yes, I realize how hard this is to engineer, especially if you deal well with nasty issues like intermittent slowness or failures of nodes, major data inconsistencies after a network partition of the cluster, or real-time performance optimization based on access patterns.

I still want it. And I'm not the only one.

See also my previous posts, "Google's BigTable" and "Clustered databases".

10 comments:

mb said...

Greg, in your BigTable post, you mention Jeff Dean's presentation on Google's BigTable at the University of Washington.

A one-hour video of that talk is on Google Video at http://video.google.com/videoplay?docid=7278544055668715642&q=bigtable

Greg Linden said...

Thanks, mb!

Anonymous said...

If you can find the document, look for the Mariposa database project that I was part of in the mid-1990s at UC Berkeley.

I had just come from the failed RdbStar project at DEC and had some ideas on how to structure a real distributed database.

Bob Devine
(bob.devine (at) att.net)

Luke Lonergan said...

We've solved the first stage of this problem, namely the static parallel problem.

With Bizgres MPP, we scale arbitrarily complex queries and data size over data segments that can be distributed across machines and within machines. Arbitrarily complex schemas are fine - which was the hard part that required tricky interconnect software to make happen.

The problem with the static decomposition approach is that when you add a new machine, it requires downtime to reconfigure the new distribution.

To extend our current approach we plan to allow:
1) dynamic data distribution that can accept a change in number of segments
2) statement throughput scaling as you add more segments and machines

To implement (1), we will implement a map through which the hash distribution of records can be decoded. This will allow dynamic movement of data throughout the system.
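A minimal sketch of the kind of indirection map described in (1), not Bizgres MPP's actual design: records hash to a fixed set of buckets, and a small table maps buckets to segments, so adding a segment only moves the buckets that get reassigned. The names `NUM_BUCKETS`, `bucket_for`, and `SegmentMap` are hypothetical, chosen for illustration.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 64  # hypothetical fixed bucket count, independent of segment count


def bucket_for(key: str) -> int:
    """Hash a record key to a stable bucket number."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


class SegmentMap:
    """Maps hash buckets to segments so data can move without rehashing."""

    def __init__(self, num_segments: int):
        self.num_segments = num_segments
        # Start with buckets spread round-robin across segments.
        self.bucket_to_segment = [b % num_segments for b in range(NUM_BUCKETS)]

    def segment_for(self, key: str) -> int:
        return self.bucket_to_segment[bucket_for(key)]

    def add_segment(self) -> list:
        """Add one segment; return the buckets whose data must migrate to it."""
        new_segment = self.num_segments
        self.num_segments += 1
        target = NUM_BUCKETS // self.num_segments  # fair share for the new segment
        counts = Counter(self.bucket_to_segment)
        moved = []
        for bucket, segment in enumerate(self.bucket_to_segment):
            if len(moved) >= target:
                break
            if counts[segment] > target:  # take only from over-full segments
                self.bucket_to_segment[bucket] = new_segment
                counts[segment] -= 1
                moved.append(bucket)
        return moved
```

Only the records in the returned buckets need to be copied; every other record stays on the segment it was already on, which is what would let the change happen without a full redistribution.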

Our approach to (2) will allow execution planning and connection management to scale in the first stage. This will allow web databases of unlimited user community size. The second stage will tackle the scaling of actual transactions, which is much harder than statement scaling.

The key to all of this is the interconnect among segment databases, which will need to be low latency and fast - ideally a single ethernet cloud. We currently deploy on Gigabit Ethernet switches that can scale to huge sizes with low latency and full matrix bandwidth.

- Luke

Anonymous said...

Unfortunately that level of transparency is still a bit off for us... maybe some time in the future :)

We do have grand plans, although the big gotcha is going to be the speed of the interconnects. Slow interconnect, bad performance. Of course, you can sort of optimise, but shuffling data around dynamically (in ways that can survive node and system failure) is a non-trivial task.

If you ignore the possibility of failure, things get a lot simpler :)

- stewart
http://www.flamingspork.com/blog/

Greg Linden said...

Hi, Stewart. Yep, absolutely, this is really hard.

It's a much harder problem than querying the search index shards on the Google Cluster because you have R/W transactional data and need to support and optimize arbitrary SQL queries.

It is tempting to simplify the problem by narrowing the features of the database; Google's BigTable is an interesting example of that approach.

The MySQL Cluster is an impressive first step, I agree, though many issues have been simplified by doing it in-memory and keeping the cluster size small.

For one of many examples of problems to come, as you said, communication between nodes will be a huge issue in larger clusters. Minimizing that communication is really tricky since you have to try to make the nodes as independent as possible, probably by moving data around in response to real-time query patterns. Fun stuff, but a really hard problem.

Adam said...

Oracle has made significant improvements in this regard with RAC. While not suited for Google-size databases, it does pretty transparently turn a rack of blades into a single database, with good scalability, reasonable performance, and fair reliability.

Anonymous said...

Oracle requires a single very high performance storage machine to achieve this.

I believe that it just takes time to build this type of database, considering today's hardware cost.

Alaric said...

You're definitely not the only one...

I have a personal dream of such a distributed data store, too; I'd like this to be the default way of storing data. Filesystems, databases, the lot all replicated and distributed and virtualised.

Part of the problem is that there are various approaches to the problem, each of which is efficient for certain kinds of operation and not for others. Quorum locking, extended virtual synchrony, etc.

I'm trying to catalogue all the approaches I can find, to see if I can deduce a 'grand unified' data model (I'm not convinced tables with rows and columns are the best way to go, for a start; how about Prolog-style assertions?), in which data structures are declared by the programmer or user and tagged with 'hints' ("This will be read about 100 times as often as it's written", etc.) that the system can use to choose appropriate storage algorithms for each substructure in turn (possibly incorporating feedback from runtime profiling, too, for when the programmer or user was wrong). It would also allow the application or user to explicitly request "guaranteed up-to-date data" (requiring read locks and all that) or just "possibly a bit outdated data" (inviting the use of a lazy replica), making the application's consistency requirements explicit so the tradeoff between consistency and efficiency can be managed on a case-by-case basis.
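A toy sketch of what such hints and per-read consistency requests might look like, under heavy assumptions: the names `Hints`, `Consistency`, `choose_strategy`, and `Store` are hypothetical, the threshold of 10 is arbitrary, and a real system would use locks and networked replicas rather than two in-memory dictionaries.

```python
from dataclasses import dataclass
from enum import Enum


class Consistency(Enum):
    UP_TO_DATE = "guaranteed up-to-date"
    POSSIBLY_STALE = "possibly a bit outdated"


@dataclass
class Hints:
    reads_per_write: float  # e.g. 100 means "read about 100 times as often as written"


def choose_strategy(hints: Hints) -> str:
    """Pick a storage strategy from a declared hint (threshold is an assumption)."""
    if hints.reads_per_write >= 10:
        return "many lazy replicas"
    return "single primary"


class Store:
    """One primary copy plus one lazily updated replica, both in memory."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write(self, key, value):
        self.primary[key] = value  # the replica is NOT updated here

    def sync(self):
        self.replica = dict(self.primary)  # lazy replication step

    def read(self, key, consistency: Consistency):
        if consistency is Consistency.UP_TO_DATE:
            return self.primary.get(key)  # would need read locks in a real system
        return self.replica.get(key)  # cheap, but may lag behind the primary
```

The caller states its requirement on every read, so the cost of strict consistency is paid only where the application asks for it.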

Brad said...

I realize it's 2 years later, but I think you're getting your wish. Amazon has opened up its services, and Microsoft is following them into the space (with Windows Azure XTable stuff).
Sounds like you'll be able to order up a database that's as big as you want it, although it won't be able to do everything a normal RDBMS can do.