Thursday, March 26, 2009

Semantic interpretation and the effectiveness of big data

Googlers Alon Halevy, Peter Norvig, and Fernando Pereira have an article, "The Unreasonable Effectiveness of Data" (PDF), in the April 2009 IEEE Intelligent Systems on semantic interpretation using big data.

Some excerpts:
The number of grammatical English sentences is theoretically infinite ... However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need.

We're left with ... interpreting the content, which is mainly that of learning as much as possible about the context of the content to correctly disambiguate it .... What we need are methods to infer relationships between ... entities in the world. These inferences may be incorrect at times, but if they're done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data.

Unlabeled data ... is so much more plentiful than labeled data ... With very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there.
The article talks in more detail about work at Google and elsewhere on extracting relationships from massive crawls of text, tables, and the deep web.
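
As a rough illustration of "tying together the words that are already there," here is a minimal Python sketch of counting word co-occurrences in unlabeled text. It is not taken from the article. The function name `cooccurrence_counts`, the `window` parameter, and the tiny corpus are all assumptions made up for this example, and real systems work on vastly larger crawls with far more careful tokenization.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that appear near each other.

    Hypothetical helper for illustration: each word is paired with up to
    `window` of the words that follow it in the same sentence.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    # Sort the pair so (a, b) and (b, a) share one counter.
                    counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy, made-up corpus standing in for unlabeled web text.
corpus = [
    "paris is the capital of france",
    "the capital of france is paris",
    "berlin is the capital of germany",
]
counts = cooccurrence_counts(corpus)
print(counts[("france", "paris")], counts[("berlin", "france")])
```

Even on this toy input, "paris" and "france" co-occur while "berlin" and "france" never do, which hints at how simple counts over enough text might surface relationships between entities without labeled data.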

On a related note, a couple of days ago Google announced improved query suggestions and snippets, new features that Googler Ori Allon apparently described as scanning pages "in real-time ... after a query is entered" and identifying "conceptually and contextually related sites/pages" using "an 'understanding' of content and context." Many news articles are referring to this as a step toward semantic search.

Please see also my April 2008 post, "GoogleBot starts on the deep web", which discusses related work by Alon Halevy on mining data in tables and the deep web.

Please see also my post on the WSDM 2008 keynote by Oren Etzioni on semantic interpretation. His work is mentioned a few times by Halevy et al.

[IEEE article found via the Google Research Blog]


jeremy said...

Alright, Greg. Three years ago you asked/encouraged me to start my own blog. Well, it took me a while, but I finally got around to it. Here's my response to this sort of approach :-)

Chintan said...

I agree with the hypothesis of big data obviating the need for rich models, but I do have a couple of concerns with this paper/article:

1. Other than Google, how many institutions/companies have the infrastructure and resources to store the data and perform such computations?

2. Also, the paper/article sounds very "retrospective" in the sense that the authors are merely telling us what has worked for them (Google) - I don't see many "predictions" of what to expect other than just co-occurrence calculations.

3. I'm getting a feeling Google is UNDOing all the computer science knowledge. First they tell us about MapReduce - a distribute-collect mechanism studied back in the 80s. Now they come and say NO NO don't use these rich models, we have this big data and just word co-occurrence works just fine.

PS. I really liked the clarification of "Semantics" in Semantic Web.

Marko said...


Slightly off-topic question. Do you know if anyone has made a PDF recommender, kind of a feed of interesting academic or serious articles (PDF denoting it isn't just a blog entry but something more considered)? I appreciate that every month I can check your blog and there's bound to be some interesting new article to read, and similarly I can keep an eye on the most popular ACM articles (they used to put these in the back of CACM, not sure if they are still there), or browse conference proceedings. I don't know if anyone collates the most blogged-about or twittered-about articles, for instance. Let alone personalised recommendations.

Any ideas?



Greg Linden said...

Mendeley is an example of this that has been in the news a bit lately.

CiteSeerX and others also recommend related papers for a given paper.

Hope that helps, Marko!