Thursday, April 26, 2007

Yahoo Pig and Google Sawzall

It is hard to tell from the limited documentation available, but the Pig project at Yahoo Research seems to have a lot in common with Sawzall at Google. Both are high-level programming languages targeting massively parallel processing across huge clusters.

From the Pig project page:
We are creating infrastructure to support ad-hoc analysis of very large data sets. Parallel processing is the name of the game.

Our system runs on a cluster computing architecture, on top of which sit several layers of abstraction that ultimately bring the power of parallel computing into the hands of ordinary users.

The layers in between automatically translate user queries into efficient parallel evaluation plans, and orchestrate their execution on the raw cluster hardware.
This is similar to the motivation behind Sawzall. From the Sawzall paper:
To make effective use of large computing clusters in the analysis of large data sets, it is helpful to restrict the programming model to guarantee high parallelism ... Our approach includes a new programming language called Sawzall.

The language helps capture the programming model by forcing the programmer to think one record at a time, while providing an expressive interface to a novel set of aggregators that capture many common data processing and data reduction problems.

[Users can] write short, clear programs that are guaranteed to work well on thousands of machines in parallel ... The user needs to know nothing about parallel programming; the language and the underlying system take care of all the details.
Just as Google Sawzall is built on top of MapReduce, Yahoo Pig is built on top of Hadoop (an open source clone of MapReduce that is supported by Yahoo).

However, there do appear to be differences in the languages. Sawzall syntax appears heavily influenced by C and Pascal, whereas Pig appears to be motivated by an attempt to extend SQL. For example, the Sawzall paper says:
The syntax of statements and expressions is borrowed largely from C; for loops, while loops, if statements and so on take their familiar form. Declarations borrow from the Pascal tradition.
The Pig documentation spends some time talking about how the language differs from SQL and why SQL is not sufficient:
The analogue to Pig Latin in the SQL world is "relational algebra." Pig Latin differs from the relational algebra in the following important ways:

1. The "join" operator is decomposed into two separate operations: co-group and flatten.

2. Pig Latin has built-in support for nested data (it supports simple projection, filters, and sorting on nested constructs).

Why do we opt for Pig Latin over SQL? One reason is that many programmers don't "like" SQL, because it forces them to do acrobatics with their program logic, just to get it into a declarative "calculus" form. A final (but important) reason is that in general it can be difficult to convert complex SQL statements into efficient parallel programs.
Examples of some code in the two languages might be useful here. Here is an example of Sawzall code:
proto "querylog.proto"
static RESOLUTION: int = 5; # minutes; must be divisor of 60
log_record: QueryLogProto = input;
queries_per_degree: table sum[t: time][lat: int][lon: int] of int;
loc: Location = locationinfo(log_record.ip);
if (def(loc)) {
  t: time = log_record.time_usec;
  m: int = minuteof(t); # within the hour
  m = m - m % RESOLUTION;
  t = trunctohour(t) + time(m * int(MINUTE));
  emit queries_per_degree[t][int(loc.lat)][int(loc.lon)] <- 1;
}
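For readers who do not know Sawzall, here is a rough Python sketch of what I understand the program to compute: each record is handled on its own, and a "sum" table keyed by time bucket, latitude, and longitude does the aggregation. This is only an illustration, not how Sawzall is implemented. The record layout, the tiny `locationinfo` lookup table, and the sequential `run` driver are all my own assumptions; the real system spreads the per-record work across many machines.

```python
from collections import defaultdict
from datetime import datetime, timezone

RESOLUTION = 5  # minutes; must be a divisor of 60


def locationinfo(ip):
    """Hypothetical stand-in for Sawzall's locationinfo(): a tiny lookup table."""
    table = {"10.0.0.1": (37.4, -122.1), "10.0.0.2": (51.5, -0.1)}
    return table.get(ip)  # None plays the role of an undefined value


def process_record(record, emit):
    """Per-record logic: sees exactly one record and emits to the aggregator."""
    loc = locationinfo(record["ip"])
    if loc is None:
        return
    lat, lon = loc
    t = datetime.fromtimestamp(record["time_usec"] // 1_000_000, tz=timezone.utc)
    # Round the minute down to a multiple of RESOLUTION within the hour.
    minute = t.minute - t.minute % RESOLUTION
    bucket = t.replace(minute=minute, second=0, microsecond=0)
    emit((bucket, int(lat), int(lon)), 1)


def run(records):
    """Sequential stand-in for the cluster: a sum table keyed by (t, lat, lon)."""
    queries_per_degree = defaultdict(int)

    def emit(key, value):
        queries_per_degree[key] += value

    for record in records:
        process_record(record, emit)
    return dict(queries_per_degree)


# Made-up sample log: times are seconds after the Unix epoch, in microseconds.
sample_records = [
    {"ip": "10.0.0.1", "time_usec": 420 * 1_000_000},
    {"ip": "10.0.0.1", "time_usec": 540 * 1_000_000},
    {"ip": "10.0.0.2", "time_usec": 720 * 1_000_000},
    {"ip": "10.0.0.9", "time_usec": 60 * 1_000_000},  # unknown IP, skipped
]
counts = run(sample_records)
```

The point of the split is that `process_record` never sees more than one record, which is what lets the real system run it on thousands of machines and leave the summing to the aggregator.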
And here is an example of Pig code:
a = COGROUP QueryResults BY url, Pages BY url;
b = FOREACH a GENERATE FLATTEN(QueryResults.(query, position)), FLATTEN(Pages.pagerank);
c = GROUP b BY query;
d = FILTER c BY checkTop5(*);
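The Pig query is harder to read cold, so here is a Python sketch of how I read its four steps. The Pig documentation describes it as finding queries for which the highest-pagerank page in the result set did not appear among the top 5 results. The field layouts, the sample data, and the body of `check_top5` are my own guesses (the original `checkTop5` is a user-defined function whose code is not shown), and positions are assumed to start at 1.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key):
    """Group two lists of dict rows by a shared key, keeping one bag per input."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return groups


def check_top5(rows):
    """Guessed filter: True when the highest-pagerank result is ranked below 5."""
    best = max(rows, key=lambda r: r["pagerank"])
    return best["position"] > 5


def top_page_missed(query_results, pages):
    # a = COGROUP QueryResults BY url, Pages BY url;
    a = cogroup(query_results, pages, "url")
    # b = FOREACH a GENERATE FLATTEN(...), FLATTEN(...);
    # Flattening both bags yields their cross product, which is a join on url.
    b = [
        {"query": q["query"], "position": q["position"], "pagerank": p["pagerank"]}
        for q_bag, p_bag in a.values()
        for q, p in product(q_bag, p_bag)
    ]
    # c = GROUP b BY query;
    c = defaultdict(list)
    for row in b:
        c[row["query"]].append(row)
    # d = FILTER c BY checkTop5(*);
    return sorted(query for query, rows in c.items() if check_top5(rows))


# Made-up sample data.
query_results = [
    {"query": "pig latin", "url": "a.com", "position": 1},
    {"query": "pig latin", "url": "b.com", "position": 7},
    {"query": "sawzall", "url": "c.com", "position": 1},
    {"query": "sawzall", "url": "d.com", "position": 2},
]
pages = [
    {"url": "a.com", "pagerank": 0.2},
    {"url": "b.com", "pagerank": 0.9},
    {"url": "c.com", "pagerank": 0.8},
    {"url": "d.com", "pagerank": 0.1},
]
missed = top_page_missed(query_results, pages)
```

With this sample data, "pig latin" is the only query kept: its best page by pagerank (b.com) sits at position 7, while the best page for "sawzall" is already at position 1.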
I have to say, it is good to see Yahoo building these kinds of tools for large scale data manipulation.

Google's massive cluster and the tools built on top of it have been called a "competitive advantage", the "secret source of Google's power", and a "major force multiplier".

As Peter Norvig said, "It allows us to turn around the experiments much faster than the other guys ... We can get the answer in two hours which I think is a big advantage over someone else who takes two days."

Update: There is some discussion of a similar effort at Microsoft, the Dryad project, in the comments for this post.


Sriram said...

DryadLinq is something similar from us. It uses the extensible nature of Linq to make it run jobs on Dryad.

Unknown said...

I'm always curious about new languages, so I'm wondering what these two examples do. I get that the first one calculates # of queries per time slice by lat/long. But I can't read the second one at all... Any ideas?

Charles Gordon said...

It's worth noting that the syntax for Sawzall looks a lot like that of Scala. The Scala language is currently implemented on the JVM and interoperates nicely with Java. In addition it has libraries for Erlang-like concurrency constructs and support. More importantly, one of the authors on their OOPSLA 2005 paper was Matthias Zenger, who works for Google. That might not mean anything, but it's fun to speculate!

jsm said...

Interesting... hadn't heard about DryadLinq. I thought PLINQ (,guid,200c3151-fbd5-4bfe-bb1e-0d6b90c6442b.aspx) was Microsoft's entry into that field.

Greg Linden said...

Hi, Steve. Sorry, you're right, I probably should have put a header on the examples to explain what they do.

The Sawzall example looks "at a set of search query logs and construct[s] a map showing how the queries are distributed around the globe."

The Pig example finds "queries for which the highest-pagerank page in the result set did not appear among the top 5 results."

Greg Linden said...

Hi, Sriram and Jeff. Thanks for the references to Microsoft's DryadLINQ and PLINQ.

From what I can tell, it appears PLINQ is designed for running on a single multi-core system, not a massive cluster. Or am I misreading slide 3 of the presentation you linked to, Jeff? Unfortunately, it appears there are no published papers on PLINQ and no project page yet, so I am having a bit of a hard time figuring out the details.

On DryadLINQ, from the 2007 paper (PDF) on Dryad, it appears that Dryad is more complicated than MapReduce but has similar goals.

The Dryad project page has a nice, high-level summary of the goals:

Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

From the paper, it appears that Dryad also has a high-level scripting language called Nebula that may be similar in its goals to Sawzall.

As for DryadLINQ, it appears to be another language on top of Dryad, this time based on LINQ.

Is that all correct?

Sriram said...

Yup - that's a good summary. I'm a big fan of Linq's 'bring your own query processor' model - it is amazing to see so many extensions being built on top of it.

- Sriram Krishnan

Anonymous said...

Greg: I haven't looked at Y!'s little porky yet (though I point out pigs, horses, and other animals to my kid all the time these days), but Hadoop has Abacus:

Perhaps that's what Pig really is, under its Y! skin.

E14 said...

Abacus is a program contributed by a Yahoo developer that does simple aggregations. Pig is a more ambitious project.

Both are open source, so why speculate? Check them out.

Anonymous said...

Does it run on Linux or Solaris or BSD? If not, it is DOA for most of us in the internet business. Today, it is exceptionally rare to see a web start-up using Windows. Your development tools may be cool, but tying them to Windows is a deal-breaker for us.

Ed Burnette said...

Do you have a new link for abacus? The one above doesn't work.

Edward Vielmetti said...

Christopher Olston did a talk on Pig at the U of Michigan today - I took notes and will figure out some way to post them to Slideshare.

I think it's called "Pig" because the Yahoo Hadoop infrastructure is so slow; they sound like they are working in a batch job processing environment with the minimum job turnaround time measured in hours, rather than the Google turn time measured in minutes. Perhaps the Yahoo guys need to rent some Amazon EC2 time.