Friday, February 15, 2008

Oren Etzioni at WSDM 2008

University of Washington Professor Oren Etzioni gave a fun and well-done keynote talk on the second day of WSDM 2008, titled "Machine Reading at Web Scale".

Oren described his motivation as looking out at what search would be like in 2020. After quoting Alan Kay as saying, "The best way to predict the future is to invent it," Oren argued that "instead of merely returning pages", computers should "read 'em."

Obviously, we are not going to solve the entire natural language understanding problem today (or, for that matter, in the next few decades), but Oren pointed out that we can make a lot of progress with just "some level of understanding" of the pages.

Specifically, Oren discussed some of the ideas from TextRunner, which extracts facts in the form of (noun phrase, relation/verb, noun phrase) tuples. For example, TextRunner might learn "(Tesla, invented, coil transformer)" from scouring web documents. The demo and paper (PDF) have details.
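
To make the tuple format concrete, here is a minimal Python sketch. It is not how TextRunner works; it is a toy that splits a simple sentence around a known relation word. The Fact type, the RELATIONS list, and the extract_fact function are all hypothetical names made up for this illustration.

```python
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    """A (noun phrase, relation/verb, noun phrase) tuple."""
    subject: str
    relation: str
    obj: str


# Hypothetical relation vocabulary, for this toy example only.
RELATIONS = ("invented", "founded", "discovered")


def extract_fact(sentence: str) -> Optional[Fact]:
    """Split a simple sentence around the first known relation word.

    Returns None when no relation word has text on both sides of it.
    """
    words = sentence.rstrip(".").split()
    for i, word in enumerate(words):
        if word.lower() in RELATIONS and 0 < i < len(words) - 1:
            return Fact(" ".join(words[:i]), word.lower(), " ".join(words[i + 1:]))
    return None


print(extract_fact("Tesla invented the coil transformer."))
```

A real extractor would need to find noun phrase boundaries and handle far messier sentences; the point here is only the shape of the output.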

Oren then talked about how they achieved what appear to be very high accuracy rates (nearly 90% for concrete facts) using a combination of two techniques. The first is a voting scheme (called "urns" and detailed in an IJCAI 2005 paper (PDF)). The second, used when data is too sparse for urns, is what he called a "statistical 'type check'", which looks at whether the relationship appears to make sense when compared to similar data (e.g. "Does 'Pinkerton' behave like a mayor" in terms of having relations similar to those of other entities labeled as a mayor).
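
As a rough illustration of the two ideas, here is a Python sketch under heavy assumptions. The vote function is a plain frequency threshold, a much cruder stand-in for the urns model in the paper, and type_check uses Jaccard overlap of relation sets, which is my own choice of similarity measure and not necessarily what the talk described. All names and data below are made up.

```python
from collections import Counter


def vote(extractions, min_count=2):
    """Keep facts that were extracted at least min_count times.

    A simple frequency threshold, NOT the urns model itself.
    """
    counts = Counter(extractions)
    return {fact for fact, n in counts.items() if n >= min_count}


def type_check(candidate_relations, known_relation_sets):
    """Average Jaccard overlap between a candidate's relations and
    the relations of entities already labeled with the type."""
    if not known_relation_sets:
        return 0.0
    scores = []
    for known in known_relation_sets:
        union = candidate_relations | known
        overlap = candidate_relations & known
        scores.append(len(overlap) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


# Hypothetical extractions: the same fact seen three times, a stray one once.
extractions = [("Tesla", "invented", "coil transformer")] * 3
extractions.append(("Tesla", "invented", "telephone"))
print(vote(extractions))

# Hypothetical relation sets for entities already labeled as mayors.
mayors = [{"was elected", "governs", "vetoed"}, {"was elected", "governs"}]
pinkerton = {"was elected", "governs"}
print(type_check(pinkerton, mayors))
```

With this toy data, the repeated fact survives the vote and the stray one does not, and the candidate scores well against the mayors because it shares most of their relations.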

Update: Oren's talk is now available online.
