From the Pig project page:
We are creating infrastructure to support ad-hoc analysis of very large data sets. Parallel processing is the name of the game.
Our system runs on a cluster computing architecture, on top of which sit several layers of abstraction that ultimately bring the power of parallel computing into the hands of ordinary users.
The layers in between automatically translate user queries into efficient parallel evaluation plans, and orchestrate their execution on the raw cluster hardware.
This is similar to the motivation behind Sawzall. From the Sawzall paper:
To make effective use of large computing clusters in the analysis of large data sets, it is helpful to restrict the programming model to guarantee high parallelism ... Our approach includes a new programming language called Sawzall.
The language helps capture the programming model by forcing the programmer to think one record at a time, while providing an expressive interface to a novel set of aggregators that capture many common data processing and data reduction problems.
[Users can] write short, clear programs that are guaranteed to work well on thousands of machines in parallel ... The user needs to know nothing about parallel programming; the language and the underlying system take care of all the details.
Just as Google Sawzall is built on top of MapReduce, Yahoo Pig is built on top of Hadoop (an open source clone of MapReduce that is supported by Yahoo).
However, there do appear to be differences in the languages. Sawzall syntax appears heavily influenced by Java or Pascal, whereas Pig appears to be motivated by an attempt to extend SQL. For example, the Sawzall paper says:
The syntax of statements and expressions is borrowed largely from C; for loops, while loops, if statements and so on take their familiar form. Declarations borrow from the Pascal tradition.
The Pig documentation spends some time talking about how the language differs from SQL and why SQL is not sufficient:
The analogue to Pig Latin in the SQL world is "relational algebra." Pig Latin differs from the relational algebra in the following important ways:
1. The "join" operator is decomposed into two separate operations: co-group and flatten.
2. Pig Latin has built-in support for nested data (it supports simple projection, filters, and sorting on nested constructs).
Why do we opt for Pig Latin over SQL? One reason is that many programmers don't "like" SQL, because it forces them to do acrobatics with their program logic, just to get it into a declarative "calculus" form. A final (but important) reason is that in general it can be difficult to convert complex SQL statements into efficient parallel programs.
Examples of some code in the two languages might be useful here. Here is an example of Sawzall code:
proto "querylog.proto"
static RESOLUTION: int = 5; # minutes; must be divisor of 60
log_record: QueryLogProto = input;
queries_per_degree: table sum[t: time][lat: int][lon: int] of int;
loc: Location = locationinfo(log_record.ip);
if (def(loc)) {
  t: time = log_record.time_usec;
  m: int = minuteof(t); # within the hour
  m = m - m % RESOLUTION;
  t = trunctohour(t) + time(m * int(MINUTE));
  emit queries_per_degree[t][int(loc.lat)][int(loc.lon)] <- 1;
}
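For readers who do not know Sawzall, the per-record logic above can be roughly approximated in Python. This is only a sketch under assumptions: the record is a plain dict, `locate` and `FAKE_LOCATIONS` are hypothetical stand-ins for `locationinfo`, and an ordinary dictionary plays the role of the `sum` table. It runs on one machine and says nothing about how Sawzall distributes the work.

```python
from collections import defaultdict
from datetime import datetime, timezone

RESOLUTION = 5  # minutes; must be a divisor of 60


def bucket(time_usec):
    """Truncate a microsecond timestamp to the start of its RESOLUTION-minute bucket."""
    t = datetime.fromtimestamp(time_usec // 1_000_000, tz=timezone.utc)
    return t.replace(minute=t.minute - t.minute % RESOLUTION, second=0, microsecond=0)


def process(record, locate, table):
    """Handle exactly one record, mirroring Sawzall's one-record-at-a-time model."""
    loc = locate(record["ip"])  # hypothetical lookup: returns (lat, lon) or None
    if loc is not None:         # plays the role of def(loc)
        lat, lon = loc
        # Equivalent in spirit to: emit queries_per_degree[t][lat][lon] <- 1
        table[(bucket(record["time_usec"]), int(lat), int(lon))] += 1


# Toy data, invented purely for illustration.
FAKE_LOCATIONS = {"10.0.0.1": (37.4, -122.1)}
records = [
    {"ip": "10.0.0.1", "time_usec": 1_000_000 * (3600 + 7 * 60)},  # 01:07 UTC
    {"ip": "10.0.0.1", "time_usec": 1_000_000 * (3600 + 9 * 60)},  # 01:09 UTC
    {"ip": "10.0.0.9", "time_usec": 1_000_000 * 3600},             # unknown IP, skipped
]

queries_per_degree = defaultdict(int)  # stands in for the Sawzall sum table
for r in records:
    process(r, FAKE_LOCATIONS.get, queries_per_degree)
```

Because `process` never looks at more than one record and only adds to a sum, the records could in principle be split across many workers and the partial sums merged afterward, which is the property the Sawzall paper relies on.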
And, here is an example of Pig code:
a = COGROUP QueryResults BY url, Pages BY url;
b = FOREACH a GENERATE FLATTEN(QueryResults.(query, position)), FLATTEN(Pages.pagerank);
c = GROUP b BY query;
d = FILTER c BY checkTop5(*);
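To make the co-group and flatten decomposition mentioned in the Pig documentation more concrete, here is a small Python sketch of the first two Pig statements. It is an approximation, not Pig's implementation: `cogroup` and `flatten_both` are hypothetical helper names, the rows are invented, and the final GROUP and FILTER steps are left out because `checkTop5` is a user-defined function whose logic is not shown.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key):
    """Group two lists of dict rows by a shared key: key -> (left_bag, right_bag)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return dict(groups)


def flatten_both(groups):
    """Cross the two bags inside each group; an empty bag yields no rows."""
    for lbag, rbag in groups.values():
        for l, r in product(lbag, rbag):
            yield l, r


# Toy rows, invented purely for illustration.
query_results = [
    {"query": "pig", "url": "a.com", "position": 1},
    {"query": "pig", "url": "b.com", "position": 2},
    {"query": "hadoop", "url": "a.com", "position": 1},
]
pages = [
    {"url": "a.com", "pagerank": 0.9},
    {"url": "c.com", "pagerank": 0.1},
]

# a = COGROUP QueryResults BY url, Pages BY url;
a = cogroup(query_results, pages, "url")

# b = FOREACH a GENERATE FLATTEN(QueryResults.(query, position)), FLATTEN(Pages.pagerank);
b = [(l["query"], l["position"], r["pagerank"]) for l, r in flatten_both(a)]
```

In this sketch, `a` keeps the nested bags for every URL, including `b.com` and `c.com`, which appear on only one side. Only after flattening do those unmatched URLs disappear, leaving what an ordinary inner join on `url` would produce.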
I have to say, it is good to see Yahoo building these kinds of tools for large scale data manipulation.
Google's massive cluster and the tools built on top of it have been called a "competitive advantage", the "secret source of Google's power", and a "major force multiplier".
As Peter Norvig said, "It allows us to turn around the experiments much faster than the other guys ... We can get the answer in two hours which I think is a big advantage over someone else who takes two days."
Update: There is some discussion of a similar effort at Microsoft, the Dryad project, in the comments for this post.