Wednesday, July 30, 2008

Easy processing of massive data sets

Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou have an upcoming paper at VLDB 2008, "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets" (PDF), that describes a parallel data processing tool "being used daily ... inside Microsoft" on "large clusters ... of thousands of commodity servers" over "petabytes of data".

Scope is similar to Yahoo's Pig, a higher-level language on top of Hadoop, and to Google's Sawzall, a higher-level language on top of MapReduce. But, where Pig advocates a more imperative, step-by-step programming style, Scope looks much more like SQL.
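
To give a flavor of that SQL-like style, here is a small sketch of a Scope script along the lines of the paper's query-count example; the log file name and the LogExtractor name are illustrative, and the exact syntax may differ slightly from the paper's:

    SELECT query, COUNT(*) AS count
    FROM "search.log" USING LogExtractor
    GROUP BY query
    HAVING count > 1000
    ORDER BY count DESC;
    OUTPUT TO "qcount.result";

Other than the extractor clause and the output statement, someone who knows SQL can read this immediately, which is the point the paper makes about the low training cost.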

Some excerpts from the paper:
Internet companies store and analyze massive data sets, such as search logs, web content collected by crawlers, and click streams collected from a variety of web services .... Several companies have developed distributed data storage and processing systems on large clusters of shared-nothing commodity servers including Google's File System, Bigtable, MapReduce, Hadoop, Yahoo!'s Pig system, Ask.com's Neptune, and Microsoft's Dryad.

We present a new scripting language, SCOPE. Users familiar with SQL require little or no training to use SCOPE. Like SQL, data is modeled as a set of rows composed of typed columns. Every rowset has a well-defined schema ... It allows users to focus on the data transformations required to solve the problem at hand and hides the complexity of the underlying platform and implementation details.

SCOPE [also] is highly extensible ... users can easily define their own ... operators: extractors (parsing and constructing rows from a file), processors (row-wise processing), reducers (group-wise processing), and combiners (combining rows from two inputs) ... [which] allows users to solve problems that cannot easily be expressed in traditional SQL.
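As a rough illustration of that extensibility, a script can also be written step-by-step, naming intermediate rowsets and chaining user-defined operators. This is only a sketch under my reading of the paper; the operator names MyExtractor, MyProcessor, and MyReducer are hypothetical, and the precise syntax may not match the paper exactly:

    e = EXTRACT query, time FROM "search.log" USING MyExtractor;
    p = PROCESS e USING MyProcessor;
    r = REDUCE p ON query USING MyReducer;
    OUTPUT r TO "result.txt";

The extractor parses raw log lines into rows, the processor transforms rows one at a time, and the reducer does group-wise processing on rows sharing the same query, all while keeping the overall script short and declarative.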
The paper goes on to provide many examples of code written in Scope, describe the underlying Cosmos append-only distributed file system, discuss query plans for executing Scope programs, and touch on some of Scope's compile-time and run-time optimizations.

Though I probably shouldn't, let me add that, in my experience, Scope is great fun. It's like all the data and computation I can eat. And I can eat a lot.

Please see also my past posts on related work, including "Yahoo, Hadoop, and Pig Latin", "Sample programs in DryadLINQ", "Automatic optimization on large Hadoop clusters", and "Yahoo Pig and Google Sawzall".

Please see also my Dec 2005 post, "Making the impossible possible", which quotes Nobel Prize winner Tjalling Koopmans as saying, "Sometimes the solution to important problems ... [is] just waiting for the tool. Once this tool comes, everyone just flips in their head."

1 comment:

Anonymous said...

Interesting paper. SCOPE looks like a good fit for Dryad. Thanks.

I wonder why they published only the speedup ratios and not the elapsed times? After all, they published the hardware/OS combo (2.0GHz Xeon, 8GB RAM, 4 500GB SATA drives (what's the RPM?) running WS2003 x64 SP1). Afraid of getting their ass kicked by an open source implementation running on Linux with similar hardware? :)