Thursday, July 06, 2006

Yahoo building a Google FS clone?

The Hadoop open source project is building clones of two powerful Google cluster tools, the Google File System and MapReduce.

I was curious to see how much Yahoo appears to be involved in Hadoop. Doug Cutting, the primary developer of Lucene, Nutch, and Hadoop, is now working for Yahoo but, at the time, that hiring was described as supporting an independent open source project.

Digging further, it seems Yahoo's role is more complicated. Browsing through the Hadoop developers mailing list, I can see that more than a dozen people from Yahoo appear to be involved in Hadoop.

In some cases, the involvement is deep. One of the Yahoo developers, Konstantin Shvachko, produced a detailed requirements document for Hadoop. The document appears to lay out what Yahoo needs from Hadoop, including such tidbits as handling 10k+ nodes, 100k simultaneous clients, and 10 petabytes in a cluster.

Also noteworthy is Eric Baldeschwieler, a director of software development at Yahoo, who recently talked about direct support from Yahoo for Hadoop. Eric said, "How we are going to establish a testing / validation regime that will support innovation ... We'll be happy to help staff / fund such a testing policy."

There is nothing wrong with this, of course. If anything, it should be viewed as noble that Yahoo is supporting an open source version of these powerful tools and making them available to all.

But it is interesting. It is interesting that Yahoo is so involved in building a Google FS and MapReduce clone. It is interesting that Yahoo would choose to open source these tools. It is interesting to see this level of involvement from Yahoo in Hadoop.

6 comments:

johnrob said...

This type of operation is not that uncommon in the tech space. Essentially, Yahoo would like to commoditize search technology. Assuming such an effort were to succeed, Google would lose a competitive edge while Yahoo would not - their biggest barrier to entry is Brand and Community, which are not directly related to technology. Google's edge is still tech related.

Miles Barr said...

If that's true, I'd be interested to see if this patent does cover GFS. And if it does, will Google enforce it?

Otis Gospodnetic said...

I actually thought all this was widely known already, but it appears that it's still news. Yes, that patent may be a problem (the same name appears associated with both the patent and the MapReduce paper), although I imagine Yahoo folks looked into that before investing in Hadoop.

Ryan said...

Is it really all that different from the open source support Yahoo! throws behind projects like FreeBSD and PHP?

Alex said...

You're speaking of large-scale storage as something Yahoo! would do in the distant future, but the company currently offers:

Yahoo Mail with 1 GB of storage.
Yahoo Briefcase with an admittedly small 30 megs of storage.
Yahoo Photos with unlimited storage.
Flickr with limited storage.

The components are already in place and available to any Yahoo user pretty much from any location in the world. Sounds more like a business necessity for Yahoo today than a cool research project tomorrow.

Kevin said...

The patents might not be an issue.

There have been rumors that Google wants to open source their MapReduce implementation as well as other tools, so Yahoo might just end up putting pressure on them to do so....