Tuesday, July 15, 2008

Automatic optimization on large Hadoop clusters

Chris Olston, Benjamin Reed, Adam Silberstein, and Utkarsh Srivastava at Yahoo Research had a USENIX 2008 paper, "Automatic Optimization of Parallel Dataflow Programs" (PDF), that looks at optimizations for large-scale Hadoop clusters using Pig.

The paper says that it is only attempting "to suggest some jumping-off points for [optimization] work." With much of the paper spending a fair amount of time on "textbook ... optimizations" such as early projection and filtering, operator rewrites, and physical execution plans for joins, parts do read like a survey.

The part that looks at work sharing between jobs, however, goes beyond a survey and is quite interesting. I was particularly intrigued by the discussion of materialized views, which the authors suggest could be used to cache the results of common operations. One common early computation in these cluster is to extract a few fields from a file and then sort the extract based on one or more of the fields before doing additional processing. If that were cached, this could look a bit like lazily creating indexes over the data the first time they are needed. A fun idea, and one that could narrow the gap between Map-Reduce databases and more traditional databases.

One other interesting tidbit in the paper is that the authors argue that high level languages like Pig for programming large-scale parallel computations will become dominant over "direct Map-Reduce or Dryad programs" in a "mass migration" as "good optimization technology" is implemented, "analogous to the migration from assembly to C-style languages to Java-style languages."

Please see also a paper by some of the same authors, "Parallel Evaluation of Composite Aggregate Queries" (PDF). That paper looks at reducing expensive data combining operations in Hadoop clusters by implementing a "cross-node data redistribution strategy that takes into account the nested structure of a query ... such that aggregations can be generated locally" for a type of aggregation query.

Please see also my earlier post, "Hadoop and scheduling", that discusses another paper by some of the same authors and its look at a method of combining many jobs on a Hadoop cluster that all want to access the same raw data files.

1 comment:

Anonymous said...

"analogous to the migration from assembly to C-style languages to Java-style languages."

A much better analogy is the migration from CODASYL-style imperative queries to SQL-style declarative queries.

But then the web 2.0 crowd wouldn't have any idea what was going on.

Arg, get off my lawn, you damn kids.