Thursday, November 09, 2006

Hadoop on Amazon EC2

Hadoop, an open source implementation modeled on Google's GFS and MapReduce, can run on top of Amazon EC2, a hosting service that leases servers by the hour.

The details of setting this up are available at the node "AmazonEC2" on the Lucene-Hadoop Wiki at Apache.org.

While looking for more on this, I noticed that the hyped-but-not-launched natural language search engine Powerset appears to be leading the charge on using Hadoop on EC2. From the Hadoop mailing list:

From: Gian Lorenzo Thione <thi...@powerset.com>
Date: Fri, 25 Aug 2006 23:04:16 GMT

At Powerset we have used EC2 and Hadoop with a large number of nodes, successfully running Map/Reduce computations and HDFS. Pretty much like you describe, we use HDFS for intermediate results and caching, and periodically extract data to our local network. We are not really using S3 at the moment for persistent storage.

A nice feature of Hadoop as measured against our use of EC2 has been the capability of fluidly changing the number of instances that are part of the cluster. Our instances are set up to join the cluster and the DFS as soon as they are activated and when - for any reason - we lose those machines, the overall process doesn't suffer. We have been quite happy with this, even at significant number of instances.

This adds interesting detail to the recent announcement that Powerset is a heavy user of Amazon's EC2.

I am not sure I have an immediate use for Hadoop on EC2, but it is nice to see. Developers may now be able to rapidly bring up hundreds of servers, run a massively parallel computation on them using Hadoop's MapReduce implementation, and then shut down all the instances, all with low effort and at low cost. Very cool.
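
For readers who have not seen the MapReduce model, here is a rough single-machine sketch of the idea in Python. It is not Hadoop's API and not what Powerset runs. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a local thread pool stands in for the EC2 instances that would each process a chunk of the input.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=4):
    # Hypothetical stand-in for a cluster: each worker thread plays the
    # role of one machine handling one chunk of the input.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["Hadoop on EC2", "hadoop on S3"]))
```

Because each chunk is mapped independently, the number of workers can change without changing the result, which is roughly why adding or losing instances mid-job is tolerable in this style of computation.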

[Wiki node found via John Krystynak]

Update: Eight months later, Tom White posts a tutorial, "Running Hadoop MapReduce on Amazon EC2 and Amazon S3". [Found via Todd Huff]

2 comments:

Anonymous said...

Yeah, I thought this was a pretty well-known tidbit, actually. I believe Amazon has the Powerset story somewhere on its site as a success story.

Kamal said...

Thanks for this tutorial, it was very helpful!

I've discovered an even faster way to run Hadoop (currently only on the cloud). I was able to set up a multi-node cluster in under one minute using Ubuntu's new cloud technology, "Ensemble".

Check the video at
http://cloud.ubuntu.com/2011/08/ensemble-meets-hadoop-on-the-cloud/

and more technical details at
http://cloud.ubuntu.com/2011/08/hadoop-cluster-with-ubuntu-server-and-ensemble/