Wednesday, May 14, 2008

Yahoo, Hadoop, and Pig Latin

Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins from Yahoo have an upcoming paper at SIGMOD 2008, "Pig Latin: A Not-So-Foreign Language for Data Processing" (PDF), that details Yahoo's work to build a powerful parallel processing language on top of Hadoop.

Some excerpts:
We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce.

At a growing number of organizations, innovation revolves around the collection and analysis of enormous data sets such as web crawls, search logs, and click streams ... For example, the engineers who develop search engine ranking algorithms spend much of their time analyzing search logs looking for exploitable trends.

A Pig Latin program is a sequence of steps ... each of which carries out a single data transformation ... Writing a Pig Latin program is similar to specifying a query execution plan ... This method is much more appealing than encoding [a] task as an SQL query, and then coercing the system to choose the desired plan through optimizer hints.

Pig ... is fully implemented and available as ... open-source. [Pig is executed] on Hadoop, an open-source, scalable map-reduce implementation. Pig has an active and growing user base inside Yahoo! and ... [is] beginning to attract users in the broader community.

The paper provides a number of examples of what Pig code looks like and how it executes across a cluster. The related work section of the paper is excellent and should not be missed; it compares Pig, Bigtable, MapReduce, map-reduce-merge, Dryad, and Sawzall.
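
To give a rough feel for the "sequence of steps" style described in the excerpts, here is a small sketch in plain Python, not Pig Latin and not code from the paper. Each step does one transformation and hands a named intermediate result to the next, much like writing out a query plan by hand. The data, the function names (filter_step, group_step, aggregate_step), and the thresholds are all made up for illustration, and everything runs in memory on one machine rather than across a cluster.

```python
from collections import defaultdict

# Hypothetical input: (url, category, pagerank) tuples. Made-up data.
urls = [
    ("a.example", "news", 0.75),
    ("b.example", "news", 0.5),
    ("c.example", "news", 0.125),
    ("d.example", "sports", 1.0),
    ("e.example", "sports", 0.5),
]


def filter_step(rows, min_pagerank):
    """Step 1: keep only rows whose pagerank exceeds the threshold."""
    return [row for row in rows if row[2] > min_pagerank]


def group_step(rows):
    """Step 2: collect the pageranks of the remaining rows by category."""
    groups = defaultdict(list)
    for _url, category, pagerank in rows:
        groups[category].append(pagerank)
    return dict(groups)


def aggregate_step(groups, min_size):
    """Step 3: average each group that has at least min_size members."""
    return {
        category: sum(ranks) / len(ranks)
        for category, ranks in groups.items()
        if len(ranks) >= min_size
    }


# The "program" is just the steps in order, one transformation per line.
good_urls = filter_step(urls, min_pagerank=0.25)
groups = group_step(good_urls)
result = aggregate_step(groups, min_size=2)

print(result)
```

In SQL the same task would be a single declarative statement with the filtering, grouping, and aggregation folded together, leaving the execution order to the optimizer. Writing it step by step makes the order explicit, which is the tradeoff the excerpts describe.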

Please see also my previous posts on Pig and Yahoo's Hadoop clusters: "Yahoo Pig and Google Sawzall", "Hadoop Summit notes", and "Yahoo deploys large scale Hadoop cluster".

1 comment:

Justin A said...

There is also http://www.jaql.org/ which is still pretty new, but it looks neat :)