Wednesday, July 02, 2008

Hadoop and scheduling

A VLDB 2008 paper out of Yahoo Research, "Scheduling Shared Scans of Large Data Files" (PDF), looks at how "to maximize the overall rate of processing ... by sharing scans of the same file ... [in] Map-Reduce systems [like Hadoop]".
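
To make the idea of a shared scan concrete, here is a toy sketch. It is not the paper's scheduler, just a count of how many bytes get read when queued jobs that want the same file are grouped into a single scan versus each scanning on its own. The function name, job names, file names, and sizes are all made up for illustration.

```python
from collections import defaultdict

def scan_cost(jobs, file_sizes, share=True):
    """Total bytes read from disk for a batch of queued jobs.

    jobs: list of (job_name, file_name) pairs (hypothetical queue contents).
    file_sizes: dict mapping file_name to size in bytes (made-up numbers).
    """
    if not share:
        # Every job performs its own full scan of its input file.
        return sum(file_sizes[f] for _, f in jobs)
    # With sharing, all jobs reading the same file ride on one scan of it.
    by_file = defaultdict(list)
    for name, f in jobs:
        by_file[f].append(name)
    return sum(file_sizes[f] for f in by_file)

jobs = [("j1", "clicks"), ("j2", "clicks"), ("j3", "queries"), ("j4", "clicks")]
sizes = {"clicks": 100, "queries": 40}

unshared = scan_cost(jobs, sizes, share=False)
shared = scan_cost(jobs, sizes, share=True)
print(unshared, shared)
```

The interesting scheduling question, which this sketch ignores entirely, is how long to hold a job back in the hope that another job wanting the same file shows up.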

I felt the model used for simulations in the paper was a bit questionable. It seems to me the emphasis should be on newer data being accessed by many jobs simultaneously, most often by smaller jobs. But the specifics of the solution are probably of less interest than the paper's general discussion of the scheduling issues that come up as a large Hadoop cluster is put under load from many users.

Please see also my earlier post, "Hadoop summit notes", especially the problems with scheduling using Hadoop on Demand (which makes a cluster look like multiple virtual clusters) and the idea of sharing intermediate results (such as sorted extracts) from previous jobs on the cluster.
