Saturday, February 10, 2007

Can community help with data extraction?

Raghu Ramakrishnan from Yahoo Research gave a wide-ranging talk at University of Washington Computer Science titled "Community Systems: The World Online".

The lightweight talk covered topics from social search to advertising, but the primary focus was on using user communities to improve and extend information and structure extracted from the Web.

The motivation here is that search would be more useful if we had structured data (e.g. a book has a title, price, and number of pages) and understood relationships between data (e.g. an apple is a type of fruit). We may be able to extract that data from the Web (e.g. the KnowItAll project), but that extraction is noisy and unreliable.

So, what are we to do? Perhaps we can learn from people's actions and behavior. Not only is there much implicit information in the clickstream trail of what people do on the Web, but millions of people also seem surprisingly willing to do a fair amount of work on the Web to improve data (e.g. Wikipedia, ESPGame, Yahoo Answers) with only token rewards.

The main idea, then, appears to be to start by extracting structured information from the Web, then build tools that make it easy for people to improve the data. Give the community a starting point from which to work, then make it easy for them to build and improve.
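
To make that a bit more concrete, here is a minimal sketch of the extract-then-correct idea in Python. It is my own illustration, not anything shown in the talk: the `ExtractedField` class, the `min_votes` threshold, and the sample book title are all made up, and a real system would need much more (user reputation, spam handling, edit history).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExtractedField:
    """One attribute of a record, e.g. the title of a book (hypothetical)."""
    extracted: str  # value produced by the noisy automatic extractor
    votes: Counter = field(default_factory=Counter)  # community suggestions

    def suggest(self, value: str) -> None:
        """Record one user's suggested correction."""
        self.votes[value] += 1

    def best(self, min_votes: int = 2) -> str:
        """Return the top community value once it has at least min_votes,
        otherwise fall back to the extracted value."""
        if self.votes:
            value, count = self.votes.most_common(1)[0]
            if count >= min_votes:
                return value
        return self.extracted


if __name__ == "__main__":
    # Made-up example: the extractor produced a misspelled title.
    title = ExtractedField(extracted="Intro to Databses")
    title.suggest("Intro to Databases")
    print(title.best())  # one vote is below the threshold, so the extracted value stays
    title.suggest("Intro to Databases")
    print(title.best())  # two matching votes, so the community value wins
```

The extracted value is the starting point, and it is only replaced once enough people agree on a correction, so a single bad edit cannot overwrite the data.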

Raghu mentioned the DBLife project at a few points in his talk. The project page is not all that informative, but there is a little more information in a CIDR 2007 paper (PDF).

1 comment:

Anonymous said...

I'm biased since I work here, but I think Wesabe is a great example of this. We get very incoherent data about merchants from bank and credit card sites, and our users collaboratively edit that data so that everyone can see their bank transactions with names that they recognize and understand rather than the 'bank puke' versions normally available on bank sites. There are problems with this -- we have one merchant called "Whole Foods" and another called "Whole Paycheck," for instance -- but in general a very small contribution from each user adds up to a huge benefit for everyone. There's more description of the process here if you're interested:

--Marc Hedlund, Wesabe