Friday, September 03, 2010

Insights into the performance of Microsoft's big clusters

A recent article from Microsoft in IEEE Computing, "Server Engineering Insights for Large-Scale Online Services" (PDF), has surprisingly detailed information about the systems running Hotmail, Cosmos (Microsoft's MapReduce/Hadoop), and Bing.

For example, the article describes the scale of Hotmail's data as being "several petabytes ... [in] tens of thousands of servers" and the typical Hotmail server as "dual CPU ... two attached disks and an additional storage enclosure containing up to 40 SATA drives". The typical Cosmos server apparently is "dual CPU ... 16 to 24 Gbytes of memory and up to four SATA disks". Bing uses "several tens of thousands of servers" and "the main memory of thousands of servers" where a typical server is "dual CPU ... 2 to 3 Gbytes per core ... and two to four SATA disks".

Aside from disclosing what appear to be some previously undisclosed details about Microsoft's cluster, the article could be interesting because of insights into the performance of these clusters on the Hotmail, Bing, and Cosmos workloads. Unfortunately, the article suffers from taking too much as a given, not exploring the complexity of interactions between CPU, memory, flash memory, and disk in these clusters on these workloads, and not attempting to explain the many oddities in the data.

Those oddities are fun to think about though. To take a few that caught my attention:
  • Why are Bing servers CPU bound? Is it because, as the authors describe, Bing uses "data compression on memory and disk data ... causing extra processing"? Should Bing be doing so much data compression that it becomes CPU bound (when Google, by comparison, uses fast compression)? If something else is causing Bing servers to be CPU bound, what is it? In any case, does it make sense for the Bing "back-end tier servers used for index lookup" to be CPU bound?
  • Why do Bing servers have only 4-6G RAM each not have more memory when they mostly want to keep indexes in memory, appear to be hitting disk, and are "not bound by memory bandwidth"? Even if the boxes are CPU bound, even if it somehow makes sense for them to be CPU bound, would more memory across the cluster allow them to do things (like faster but weaker compression) that would relieve the pressure on the CPUs?
  • Why is Cosmos (the batch-based log processing system) CPU bound instead of I/O bound? Does that make sense?
  • Why do Cosmos boxes have more the same memory than as Bing boxes when Cosmos is designed for sequential data access? What is the reason that Cosmos "services maintain much of their data in [random access] memory" if they, like Hadoop and MapReduce, are intended for sequential log processing?
  • If Hotmail is mostly "random requests" with "insignificant" locality, why is it designed around sequential data access (many disks) rather than random access (DRAM + flash memory)? Perhaps the reason that Hotmail is "storage bound under peak loads" is that it uses sequential storage for its randomly accessed data?

Update: An anonymous commenter points out that the Bing servers probably are two quad core CPUs -- eight cores total -- so, although there is only 2-3G per core, there likely is a total of 16-24G of RAM per box. That makes more sense and would make them similar to the Cosmos boxes.

Even with the larger amount of memory per Bing box, the questions about the machines still hold. Why are the Bing boxes CPU bound and should they be? Should Cosmos boxes, which are intended for sequential log processing, have the same memory as Bing boxes and be holding much of their data in memory? Why are Cosmos machines CPU bound rather than I/O bound and should they be?

Update: Interesting discussion going on in the comments to this post.


Anonymous said...

2 cpus and 2-3 gb/core does not mean 4-6 gbytes/server.

Greg Linden said...

Ah, right, I was worried about that. The description was "dual CPU socket, about 2-3 Gbytes per core of DRAM".

I may have misinterpreted that. You're saying these are quad core CPUs, total of 8 cores, and a total of 16-24G of memory. That makes more sense.

Thanks, I'll add a note to the post.

Anonymous said...

Regarding Bing being CPU-bound: I attended a presentation by a researcher who did a project at Microsoft studying Bing's performance on a couple different types of CPUs: low-power, low-performance and high-power, high-performance. I can't remember his name, unfortunately, but I think he was from Harvard. He said the reason why Bing's index servers are so CPU-hungry is that they use neural networks to do ranking once the candidate results have been fetched. From the presentation it sounded like each query has a deadline and the CPU time devoted to each query is somewhat adaptive, where using more cycles when they are available yields better ranking.

Eas said...

My guess on the hotmail servers: most of the data is ice cold, and the rest never gets accessed enough to be worth caching. I/O ops per GB of data is probably quite low. If they are hooking 40 disks up per server, it seems pretty clear that they are optimizing for storage cost. At the same time, all those spindles should allow for fairly high total IOs/s though I'd guess that the controllers would be the bottleneck.

Not sure what to think about the rest of it.

checoivan said...

For mail storage there's the part of email metadata and the blob data. You can't keep blob data in memory since you're talking of multiple terabytes per cluster vs. a handful couple gigabytes of ram. Your suggestion does makes sense for frequently accessed data, but email is unique for every single user. Even a chain mail which multiple people receive has very different header data per user ,and usually people reads mail once and archive or delete. It's not a common case (and hard to predict) an email that a user keeps reading many times a day, so that it is kept in some sort of memcached storage.

Greg Linden said...

Thanks, Eas and Checiovan, it's a good point that it may just be too expensive to use anything other than disk for Hotmail. Nevertheless, I'd like to see more discussion of that in the article, including some thought about alternatives. I'd particularly like to see more justification of why massive arrays of 40 disks make sense when the system appears to have an I/O bottleneck.

By the way, Google apparently had problems with their mail system, GMail, too for exactly the same reason, that they layered a workload requiring random access on top of a data storage layer built for sequential access. More details on that here:

Joe Harris said...

Maybe you should turn you questions around. Why wouldn't Microsoft want to have all of their servers maxed out?

1) Given that CPU use is the de facto measurement of utilisation then they are actually running a very efficient operation. AFAIK, anything above 50% utilisation is pretty much gold standard.

2) Windows. Seriously, we know they dogfood Windows as much as possible and we know that Windows carries a performance burden. Presumably they're running the fabled "MinWin" but it's still going to cost them.

Greg Linden said...

Hi, Joe. I'd expect them to have their servers maxed out, but it is the way they are maxed out that I find surprising.

The Cosmos servers should be streaming huge amounts of data as fast as they can off disk, but they aren't able to that because the disk and network data have to wait for CPU to be free.

The Bing servers should be quick in-memory caches, doing nothing but responding immediately to a request for index data, but instead are waiting on CPU.

The Hotmail servers should be fast random access lookups for data, but instead spend all their time waiting on disk seeks.

Yes, I would expect the servers to be maxed out, but it is the way they are maxed out that I think is odd and requires more explanation.

Anonymous said...

regarding the comment about being CPU bound due to "neural networks". This would only be the case if they were performing training each time which wouldn't make any sense. A neural net is a supervised learning technique and classification is fast/low cost.

More than likely they are CPU bound due to Greg's comment about compression.

Mark said...

Bing does seem to use some pretty intensive learned ranking metrics (see the recent ICML workshop and the slides from the Microsost team). Those kind of models tend to be memory-hungry, but I would guess probably aren't the the main cause of CPU-load.