Saturday, January 19, 2008

MapReduce a step backwards?

A post by database gurus David DeWitt and Michael Stonebraker claims MapReduce is "a giant step backwards".

Some excerpts:
As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

1. A giant step backward in the programming paradigm for large-scale data intensive applications

2. A sub-optimal implementation, in that it uses brute force instead of indexing

3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

4. Missing most of the features that are routinely included in current DBMS

5. Incompatible with all of the tools DBMS users have come to depend on
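For context, the programming paradigm in question is the simple two-function model from the MapReduce paper: a map function that emits key/value pairs and a reduce function that merges all values sharing a key. Here is a minimal single-machine sketch of word count, the paper's canonical example; this is my own toy illustration of the shape of the model, not Google's actual API:

```python
# A minimal, single-machine sketch of the MapReduce programming model,
# using word count. This mimics the shape of the model only; it is not
# Google's actual C++ API and runs on one machine, not a cluster.
from collections import defaultdict

def map_fn(document):
    # Map: emit (word, 1) for every word in the input record.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all the counts emitted for the same key.
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```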
The comments on the post are enjoyable and useful. Many rightly point out that it may not be fair to compare a system like MapReduce to a full database. DeWitt and Stonebraker do partially address this, though, by not limiting their criticism to GFS but also going after BigTable.

The most compelling part of the post for me is their argument that some algorithms require random access to data, which GFS does not support well, and that it is not always easy or efficient to restructure those algorithms to rely primarily on sequential scans.
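To make that distinction concrete, here is a toy Python sketch (the data and names are hypothetical, purely illustrative) contrasting the indexed point lookup a DBMS supports well with the full sequential scan an index-free store forces for the same query:

```python
# Hypothetical toy data, purely illustrative -- not Google code.
records = [("user%07d" % i, i * 10) for i in range(1_000_000)]

# DBMS-style random access: build an index once, then answer point
# queries without touching the rest of the data.
index = dict(records)
value = index["user0004242"]

# Index-free, scan-oriented access: with no index, the same point
# query becomes a sequential pass over the entire dataset.
value = next(v for k, v in records if k == "user0004242")
```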

However, as the slides from one of Google's talks on MapReduce say, "MapReduce isn't the greatest at iterated computation, but still helps run the 'heavy lifting'" (slide 33, lecture 5). And those slides in both lectures 4 and 5 give examples of how iterated algorithms like clustering and PageRank can be implemented in a MapReduce framework.
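As a rough illustration of what those slides describe, here is a minimal single-machine sketch of one PageRank iteration expressed as map and reduce steps; the graph and constants are toy values of my own, not Google's implementation:

```python
# One PageRank iteration expressed as map and reduce steps.
# Toy sketch of the idea in the lecture slides, not Google's code.
from collections import defaultdict

DAMPING = 0.85

# Toy link graph; in a real job this would be sharded across machines.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(graph) for page in graph}

def one_iteration(graph, ranks):
    # Map phase: each page emits rank / out-degree to every page it links to.
    contributions = []
    for page, links in graph.items():
        for dest in links:
            contributions.append((dest, ranks[page] / len(links)))
    # Shuffle + reduce phase: sum the contributions arriving at each page.
    summed = defaultdict(float)
    for dest, share in contributions:
        summed[dest] += share
    n = len(graph)
    return {page: (1 - DAMPING) / n + DAMPING * summed[page] for page in graph}

# The "iterated computation": an outer driver loop re-runs the job
# until the ranks converge, re-reading all the data on every pass.
for _ in range(20):
    ranks = one_iteration(graph, ranks)
```

Each pass of that driver loop corresponds to re-running the full job over all the data, which is exactly why the slides say MapReduce "isn't the greatest at iterated computation."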

Moreover, from the heavy usage of MapReduce inside of Google, it is clear that the class of computations that can be done in MapReduce reasonably efficiently (both in programmer time and computation) is quite substantial. It seems hard to argue that MapReduce is not supporting many of Google's needs for large scale computation.

Perhaps DeWitt and Stonebraker have a stronger point if the argument is that Google's needs are specialized and that others may find their computations more difficult to implement in a MapReduce framework.

[Found via François Schiettecatte]

Update: Good rebuttal from Googler Mark Chu-Carroll. [Found via Dare Obasanjo]

Update: I finally got around to reading Stonebraker et al., "The End of an Architectural Era" (PDF) from VLDB 2007. The article is mostly about Stonebraker's H-Store database, but I cannot help but notice that many of the points the authors make against RDBMSs would seem to undermine the new claim that MapReduce is "a giant step backwards."

For example, Stonebraker et al. write, "RDBMSs can be beaten handily ... [by] a specialized engine." They predict that "the next decade will bring domination by shared-nothing computer systems ... [and] DBMS should be optimized for this configuration .... We are heading toward a world ... [of] specialized engines and the death of the 'one size fits all' legacy systems." Finally, they write "SQL is not the answer" and argue for a system where the computation is done on the nodes near the data.

While the authors also make some more specific statements that do not apply quite as well to GFS/MapReduce/BigTable -- for example, that we will have "a grid of systems with main memory storage, built-in high availability, no user stalls, and useful transaction work in under 1 millisecond" -- I do not understand why they do not see MapReduce as a wonderful example of a specialized data store that runs over a massive shared-nothing system and keeps computation near the data.

Update: Stonebraker and DeWitt follow up in a new post, "MapReduce II". The new post can be summarized by this line from it: "We believe it is possible to build a version of MapReduce with more functionality and better performance." That almost certainly is true, but no one has done it yet, at least not at the scale at which Google is operating.

6 comments:

Anonymous said...

Hi!

The blog you are referencing is a marketing piece that Vertica has put together. Map/Reduce is a threat to the shared-nothing analytic solution that they have been shopping around.

They have some good points, but all of the articles found on that site are very self-serving. All of the posts are thinly veiled attempts at attacking anything that could remotely threaten their product.

Map/Reduce is just another tool to put to use on a particular set of problems.

Cheers,
-Brian

Anonymous said...

This is fairly odd given that Stonebraker has been saying that traditional databases are old hat and that we need new stream-oriented and column-oriented stores for large, distributed datasets and applications.

They may be right in saying that GFS+BigTable+MapReduce are not equivalent to a database. However, there is NO database out there (whether research or commercial) that scales to thousands of processors and petabytes of data, while remaining cheap and able to run on commodity hardware without any expensive RAID/SAN etc.

Anonymous said...

The Map/Reduce work has really been driven by practicality and may be specific to Google's set of issues. However, its practical basis is also what makes it compelling.

The Stonebraker work is a bit conceptual. I'd like to see them demonstrate their system at Internet scale.

Greg Linden said...

Hi, Brian. Yes, I know Stonebraker's blog is biased, but, as you said, they have some good points. Even if they are self-serving, it is interesting to consider them.

By the way, I am a little surprised you are so quick to dismiss their points. I thought I remembered you and me having a discussion a while back about how the lack of structured data, lack of indexes, and inability to do complicated data queries may become an issue for GFS and BigTable as search becomes more and more sophisticated.

But, absolutely, I think you are right that there is a large and interesting class of problems where MapReduce works very well.

Iván de Prado said...

I do not understand why they create this controversy. I do not think MapReduce is comparable to a database designed for OLTP. And if they are talking about using a database for massive data analysis, it is true that databases do not scale. In those cases, MapReduce is an incredibly good tool.

They talk about some "parallel databases" that are not being taken into account. I would like to know which databases they are talking about. I do not know of any famous parallel database that scales horizontally with good performance.

Anonymous said...

Not sure if you're still fixing links on 2008 posts, but this one seems not to work:

http://www.mit.edu/~dna/vldb07hstore.pdf