Monday, October 13, 2008

Challenges from large scale computing at Google

Google Fellow Jeff Dean gave a fun talk last week at University of Washington Computer Science titled "Research Challenges Inspired by Large-Scale Computing at Google".

The talk is a "collection of problems we think are interesting/difficult" and, since it is coming from Jeff, has a heavy bias toward infrastructure problems.

The talk starts with energy efficiency in large scale clusters. Jeff pointed out that most work on power optimization is on laptops, not servers, but servers in a cluster have drastically different power optimization needs. In particular, laptops optimize power by shutting down completely, but servers are often at 20%-30% utilization on CPU, memory, and disk, and it would be nice to have them use only 20-30% power in this state rather than about 80%.

At this point, I was wondering why they didn't just shut down part of their cluster to get the utilization of the remaining servers closer to 80% or so. Ed Lazowska apparently was wondering the same thing since, moments later, he asked why can't Google just use smarter scheduling to compress the workload in the cluster (and, presumably, then put the now idle part into a low power mode). Jeff said that didn't work because it would impact responsiveness due to locality issues. Jeff's answer was vague and I am still somewhat unclear on what Jeff meant, but, thinking about it, I suspect what he wants is to use all the memory across all the boxes in the entire cluster, have a box respond immediately, but still use a lot less power when executing no-ops than the 50% of peak an idle box currently uses. So, keeping all the memory on the cluster immediately accessible so we can maximize how much data we can keep in memory seems like it is a big part of what makes this a challenging problem.

Next, Jeff talked about the OS. He pointed out that the "design point of the original version of Linux is pretty far removed from [our] very large data centers" and wondered if an operating system would be designed differently if it was specifically made for running in a cluster of 10k+ machines. He gave a few examples such as not really needing paging to disk but maybe wanting remote paging to the memory of other machines, adapting the network stack to microsecond network distances between machines, and changing the security model to focus on isolating apps running on the same box to guarantee performance.

Moving up a level again, Jeff described wanting a consistent framework for thinking about and building distributed applications and the consistency of the data for distributed applications.

Up one more level, Jeff wanted people to start thinking of having very large scale systems of 10M machines split into 1k different locations and how these would deal with consistency, availability, latency, failure modes, and adaptively minimizing costs (especially power costs).

Finally, Jeff briefly mentioned very large scale information extraction, speech processing, image and video processing, and machine learning, mostly talking about scale, but also giving a few examples such as moving beyond N-grams to handle non-local dependencies between words and Google's efforts to understand the semi-structured data in tables in web pages and data hidden behind forms on the Web.

Coming away from the talk, the biggest points for me were the considerable interest in reducing costs (especially reducing power costs), the suggestion that the Google cluster may eventually contain 10M machines at 1k locations, and the call to action for researchers on distributed systems and databases to think orders of magnitude bigger than they often are, not about running on hundreds of machines in one location, but hundreds of thousands of machines across many locations.

The talk is available for download in a variety of formats. Light, enjoyable, and worth watching if you are interested in large scale computing.

8 comments:

jeremy said...

large scale information extraction, speech processing, image and video processing, and machine learning

I'm a bit surprised that he didn't spend more time talking about the actual usages to which such large scale clustering would be put, considering those applications are the very reason why you'd build large scale cluster to begin with.

Greg Linden said...

I was a bit surprised by that too -- in particular, there was nothing about search and advertising anywhere in the talk -- but Jeff is an infrastructure guy, so I think the focus of the talk was on infrastructure and algorithmic challenges.

It is true that the title seems to imply that it would be more general than that.

jeremy said...

What would have been interesting is some sort of discussion around how much the architecture itself is application dependent. Would you build one type of large scale architecture, if you were trying to really optimize for speech processing? Would you build another type of large scale architecture if you were trying to really optimize for image processing? For standard text IR? Etc.

It seems like each of these domains could have specialized requirements that might influence your architecture. What are the challenges, I wonder, in both staying as general as possible, but still being able to support as many difficult applications as possible.

Anonymous said...

I haven't watched the video so I don't know if he has addressed the challenges in putting this kind of infrastructure available in a university setting.

The "entrance cost" to this kind of research is getting higher and higher. It would be great if major players in this field developed some kind of interface with universities.

Doug Cutting said...

Another reason you can't shut down part of the cluster is that you need the data on its disks. If a computation's performance depends on having a copy of its input on the same rack, then you cannot shut down half the rack. Nor can you shut down half the cluster, since you may so remove all replicas of some data.

dorion said...

@Doug: true that you can't shut down the machines that have the copies of the data, but this assumes you don't make multiple copies of the data. If you were to optimize for the opportunity to shut machines down to conserve power, you might develop a bias for distributing the same data to more nodes thus increasing the chances that a node could be powered down if another node had more 'active' versions of the data. Given a node that is operating at 30% but has data for several jobs, it could be 'boosted' to process more of those data jobs and free the machines with less demand for power conservation.

Borislav Agapiev said...

Great post, regarding 10M machines (or even 100M?) at 1K locations, how about 10M machines @ 10M locations?

Disclaimer: I have a bias, we are working on it :) ...

Now about server utilization, what is important and not well understood is that power saving techniques for laptops are very low level (circuit) e.g. clock-gating, meaning very difficult and effort intensive.

Modern CPUs have multiple cores which are poorly utilized, in
addition even in active cores you have superscalar pipelines which are poorly utilized themselves e.g. 5 pipes and it basically never fully uses even two.

The point is that there is tremendous amount of microarchitecture-level parallelism in there to wring out last bits of performance but the flip side of that is you have tremendous amount of circuitry doing nothing most of the time and it is extermely difficult to design it so it is all gated by clocks that can shut down.

Trust me on that, when I worked @ Intel in microprocessor design, adding clock gating introduced all kinds of messy issues e.g. it would turn a simple combinational circuit (i.e bunch of AND-OR gates) into a messy dynamic circuit with timing issues related to clock.

Otis Gospodnetic said...

I have not watched the video yet, but I suspect the locality that Jeff refers to is what Doug referred to. Think Hadoop and MapReduce and the benefit of moving the process to wherever data lives. Yes, data could be replicated and you could then try to shut down only some of the servers hosting the replicas. But wait! The server you just shut down also hosts a replica of another segment of data, so maybe you can't shut it down. Now you have to have a complex mechanism for allocating segments to machines in order to avoid the above.

How about shutting down only a portion of the cores in each server? Shouldn't that be "easy"?