An upcoming paper, "The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM" (PDF), makes some interesting new arguments for shifting most databases to serving entirely out of memory rather than off disk.
The paper looks at Facebook as an example and points out that, due to aggressive use of memcached and of caches in MySQL, the memory Facebook already uses is about "75% of the total size of the data (excluding images)." The authors go on to argue that a system designed around in-memory storage, with disk used only for archival purposes, would be much simpler, more efficient, and faster. They also look at examples of smaller databases and note that, with servers reaching 64 GB of RAM and higher and most databases only a couple of terabytes, it doesn't take that many servers to get everything in memory.
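To make that last point concrete, here is a rough back-of-the-envelope sketch. The 2 TB dataset size, 64 GB of RAM per server, and the fraction of RAM usable for data are illustrative assumptions drawn from the numbers above, not figures from the paper:

```python
# Back-of-the-envelope estimate of how many servers it takes to hold a
# dataset entirely in DRAM. All numbers are illustrative assumptions.
import math

dataset_tb = 2.0        # total data size, terabytes ("a couple terabytes")
ram_per_server_gb = 64  # DRAM per server, gigabytes
usable_fraction = 0.75  # assume some RAM goes to OS, indexes, and overhead

servers = math.ceil(dataset_tb * 1024 / (ram_per_server_gb * usable_fraction))
print(f"{servers} servers to hold the dataset in DRAM")  # -> 43 servers
```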
An excerpt from the paper:
Developers are finding it increasingly difficult to scale disk-based systems to meet the needs of large-scale Web applications. Many people have proposed new approaches to disk-based storage as a solution to this problem; others have suggested replacing disks with flash memory devices.
In contrast, we believe that the solution is to shift the primary locus of online data from disk to random access memory, with disk relegated to a backup/archival role ... [With] all data ... in DRAM ... [we] can provide 100-1000x lower latency than disk-based systems and 100-1000x greater throughput .... [while] eliminating many of the scalability issues that sap developer productivity today.
One subtle but important point the paper makes is that the slow speed of current databases has made web applications both more complicated and more limited than they should be. From the paper:
Traditional applications expect and get latency significantly less than 5-10 μs ... Because of high data latency, Web applications typically cannot afford to make complex unpredictable explorations of their data, and this constrains the functionality they can provide. If Web applications are to replace traditional applications, as has been widely predicted, then they will need access to data with latency much closer to what traditional applications enjoy.
Random access with very low latency to very large datasets ... will not only simplify the development of existing applications, but they will also enable new applications that access large amounts of data more intensively than has ever been possible. One example is ... algorithms that must traverse large irregular graph structures, where the access patterns are ... unpredictable.
The authors point out that data access patterns currently have to be heavily optimized, carefully ordered, and designed to conservatively fetch extra data in case it is needed later, constraints that mostly go away if you are using a database where access has microsecond latency.
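A minimal sketch of why this is: when each lookup depends on the result of the previous one, as in the irregular graph traversals the paper mentions, the total time is simply the number of hops times the per-access latency, so the storage medium dominates. The latency figures below are rough illustrative assumptions, not measurements from the paper:

```python
# Dependent lookups (e.g., chasing edges through a graph) cannot be batched,
# so total time = hops * per-access latency. Latencies here are assumptions.
hops = 1_000            # dependent lookups in one request
disk_latency_s = 10e-3  # ~10 ms per random disk seek
dram_latency_s = 5e-6   # ~5 us per in-memory fetch over the network

print(f"disk: {hops * disk_latency_s:.1f} s")         # -> disk: 10.0 s
print(f"DRAM: {hops * dram_latency_s * 1e3:.0f} ms")  # -> DRAM: 5 ms
```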
While the authors do not go as far as arguing that memory-based databases are cheaper, they do argue that they are cost-competitive, especially once developer time is taken into account. It seems to me that you could go a step further and argue that very low latency databases bring such large productivity gains to developers and benefits to application users that they are in fact cheaper, but the paper does not try to make that case.
If you don't have time to read the paper, slides (PDF) from a talk by one of the authors are also available and are very quick to skim.
If you can't get enough of this topic, please see my older post, "Replication, caching, and partitioning", which argues that big caching layers, such as memcached, are overdone compared to having each database shard serve most data out of memory.
HT, James Hamilton, for first pointing to the RAMClouds slides.