Friday, March 28, 2008

Hadoop Summit notes

James Hamilton posts detailed notes ([1] [2] [3] [4] [5]) on the talks at the Hadoop Summit.

Hadoop is an open source implementation of Google's GFS and MapReduce. Yahoo is pushing a lot of the development of Hadoop.

The notes cover talks on Hadoop, Yahoo Pig, Microsoft Dryad, Yahoo HBase, Facebook Hive, Yahoo ZooKeeper, and a few applications using these platforms.

James' notes are excellent, long, and detailed. I found several noteworthy tidbits in there.

Most of the Hadoop clusters are small, 2k nodes or fewer, often just tens of nodes. They have plans to expand to 5k. However, this seems to conflict with Yahoo's earlier announcement that they have Hadoop running on a "10,000 core Linux cluster".

They have hit problems with scheduling jobs -- "FIFO scheduling doesn't scale for large, diverse user bases", "We're investing heavily in scheduling to handle more concurrent jobs" -- but an attempt to deal with this by breaking one cluster into many virtual clusters called Hadoop on Demand is not working well -- "HoD scheduling implementation has hit the wall ... HoD was a good short term solution but not adequate for current usage levels. It's not able to handle the large concurrent job traffic Yahoo! is currently experiencing."

Joins are "hard to write in Hadoop" and even harder to optimize, which is part of the motivation for the higher level Pig language built on top of Hadoop. While the Pig team argues that their language is simpler and easier to use than SQL, they do have plans to write an SQL-like processing layer on top of Pig.
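To see why a join is awkward to express in raw MapReduce, here is a toy "reduce-side join" sketched in plain Python (simulating the map, shuffle, and reduce phases in memory). The data and names are made up for illustration; in Pig or SQL this whole dance collapses to a single JOIN statement.

```python
from itertools import groupby
from operator import itemgetter

users = [(1, "alice"), (2, "bob")]                   # (user_id, name)
clicks = [(1, "home"), (1, "search"), (2, "news")]   # (user_id, page)

# Map phase: tag each record with its source relation so the reducer
# can tell users and clicks apart after they are mixed together.
mapped = [(uid, ("U", name)) for uid, name in users] + \
         [(uid, ("C", page)) for uid, page in clicks]

# Shuffle phase: group records by the join key (here, a simple sort).
mapped.sort(key=itemgetter(0))

# Reduce phase: for each key, pair every user record with every click.
joined = []
for uid, group in groupby(mapped, key=itemgetter(0)):
    records = [v for v in (val for _, val in group)]
    names = [v for tag, v in records if tag == "U"]
    pages = [v for tag, v in records if tag == "C"]
    for name in names:
        for page in pages:
            joined.append((uid, name, page))

# joined == [(1, "alice", "home"), (1, "alice", "search"), (2, "bob", "news")]
```

All the bookkeeping (tagging records by source, sorting by key, buffering both sides in the reducer) is exactly the hand-written machinery that Pig's JOIN operator generates for you.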

On the topic of joins, there apparently was a brief discussion comparing Hadoop/MapReduce data processing to more traditional databases where the primary difference mentioned was that databases have indexes where Hadoop and MapReduce have to create the equivalent of those indexes ad hoc for each job. While that is not a new point, considering it again makes me wonder if there might be a middle ground here where we retain older extracts and other intermediate results and reuse them for similar computations over the same data. That would have a similar effect as an index, but would have the advantages of being built on demand and targeted to the current workloads.
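The "retain and reuse old extracts" idea above can be sketched as a cache keyed by dataset and extraction function, so repeated jobs over the same data skip the full scan, much like an index built on demand. This is only a minimal illustration of the idea; all names and data here are hypothetical.

```python
# Cache of previously built extracts, keyed by (dataset id, extractor name).
cache = {}

def extract(dataset_id, records, keyfn):
    """Group records by keyfn, reusing a prior extract over the same data."""
    cache_key = (dataset_id, keyfn.__name__)
    if cache_key not in cache:
        index = {}
        for rec in records:                     # the expensive full scan
            index.setdefault(keyfn(rec), []).append(rec)
        cache[cache_key] = index                # retain for later jobs
    return cache[cache_key]

logs = [{"user": "alice", "page": "home"},
        {"user": "bob", "page": "news"},
        {"user": "alice", "page": "search"}]

def by_user(rec):
    return rec["user"]

first = extract("logs-2008-03", logs, by_user)   # builds the extract
second = extract("logs-2008-03", logs, by_user)  # reuses it, no rescan
```

A real system would have to worry about invalidating the cache when the underlying data changes and about matching "similar" rather than identical computations, but the effect is the same as an index targeted to the current workload.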

If there are other tidbits you found of interest in the notes (or you attended the conference and found other things to be of interest), please add a comment to this post!

Update: James adds a summary post with a few more thoughts.

Update: As Doug Cutting and Chad Walters point out in the comments, a 2k node cluster easily could have 10k cores, so I struck my statement that only having 2k nodes seems to be in conflict with Yahoo's earlier announcement.

7 comments:

Chad said...

I think you are confusing nodes (a whole machine) and cores.

2000 nodes with 4 cores in each = 8000 cores.

2000 nodes with 8 cores in each = 16000 cores.


So they could easily have 10,000 or more cores with 2500 4-core nodes or with fewer nodes and a mix of 4-core and 8-core boxes.

Greg Linden said...

Thanks, Chad. I was thinking that too, but decided that an average of 5 cores per node sounded like a lot. You are probably right though.

This may mean that the original Yahoo announcement of 10k cores in their cluster was a little misleading (or, at least, easily subject to misinterpretation).

Doug Cutting said...

Or perhaps just a round number? Why assume malice?

A cluster may have 1400 eight-core nodes of which 1347 are currently running, which, in round figures, is around 10k cores.

Greg Linden said...

Hi, Doug, good to hear from you! Thanks for coming by!

Sorry, I didn't mean to imply malice. My mistake, I should have been more careful with my phrasing.

Andrew Hitchcock said...

I don't think it should be called Yahoo HBase. It has been part of Hadoop from the very early stages and is now its own subproject. If anything, judging by the mailing list and contributors, it should be called Powerset HBase.

Greg Linden said...

Good point, Andrew. I'll fix it, thanks!

eric baldeschwieler said...

Pig, Hadoop and HBase are all Apache projects. Apache Pig/Hadoop/HBase would be the most correct usage. We are proud to be the largest contributors to Hadoop and Pig. But we don't own the projects.

I'm very excited to see other Hadoop community members taking the lead on HBase, Mahout and other projects!