Sunday, April 01, 2007

Knowledge extraction from search queries

Googler Marius Pasca is first author on a couple of recent papers on knowledge extraction from the Web.

"Organizing and Searching the World Wide Web of Facts - Step One: the One-Million Fact Extraction Challenge" (PDF) is "a first concrete step towards building large searchable repositories of factual knowledge" from the information scattered across the World Wide Web:
A particularly useful type of knowledge for Web search consists in binary relations associated to named entities ... e.g. the facts that "the capital of Australia is Canberra", or "Mozart was born in 1756", or "Apple Computer is headquartered in Cupertino".

A search engine with access to hundreds of millions of such Web-derived facts can answer directly fact-seeking queries, including fully-fledged questions and database-like queries (e.g., "companies headquartered in Mountain View"), rather than providing pointers to the most relevant documents that may contain the answers.

Moreover, for queries referring to named entities, which constitute a large portion of the most popular Web queries, the facts provide alternative views of the search results, e.g., by presenting the birth year and bestselling album for singers, or headquarters, name of CEO and stock symbol for companies, etc.
As described, this Google paper seems to have a lot in common with the KnowItAll project (which is cited) and other past attempts to do knowledge extraction from the Web.
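To make the "binary relations associated to named entities" idea concrete, here is a minimal Python sketch of a fact store that answers both kinds of query mentioned in the quote: a direct fact lookup and a database-like query. It is only an illustration, not the paper's system. The `Fact` and `FactStore` names and the relation labels are my own, and the Google entry is added just so the "companies headquartered in Mountain View" example has something to return.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """One binary relation attached to a named entity (hypothetical schema)."""
    entity: str
    relation: str
    value: str


# The first three facts come from the quoted passage. The last one is added
# here for illustration so the database-like query has a match.
FACTS = [
    Fact("Australia", "capital", "Canberra"),
    Fact("Mozart", "born-in", "1756"),
    Fact("Apple Computer", "headquartered-in", "Cupertino"),
    Fact("Google", "headquartered-in", "Mountain View"),
]


class FactStore:
    """Toy in-memory index over facts; an assumption, not the paper's design."""

    def __init__(self, facts):
        self._values = defaultdict(list)    # (entity, relation) -> values
        self._entities = defaultdict(list)  # (relation, value) -> entities
        for fact in facts:
            self._values[(fact.entity, fact.relation)].append(fact.value)
            self._entities[(fact.relation, fact.value)].append(fact.entity)

    def lookup(self, entity, relation):
        """Fact-seeking query: 'what is the capital of Australia?'"""
        return list(self._values.get((entity, relation), []))

    def entities_with(self, relation, value):
        """Database-like query: 'companies headquartered in Mountain View'."""
        return list(self._entities.get((relation, value), []))


store = FactStore(FACTS)
print(store.lookup("Australia", "capital"))
print(store.entities_with("headquartered-in", "Mountain View"))
```

The hard part, of course, is not the lookup but filling a store like this with hundreds of millions of reliable facts, which is what the paper is about.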

Things get more interesting once we add in the second paper, "What You Seek is What You Get: Extraction of Class Attributes from Query Logs" (PDF). This paper focuses on using user behavior expressed in query logs for knowledge extraction. From the paper:
In a significant departure from previous approaches to large-scale information extraction, the target information (in this case, class attributes) is not mined from document collections. Instead, we explore the role of Web query logs, rather than documents, as an alternative source of class attributes.

To our knowledge this corresponds to the first endeavor in large-scale knowledge acquisition from query logs.

At first sight, choosing queries over documents as the source data may seem counterintuitive ... Indeed, common wisdom suggests that textual documents tend to assert information ... Comparatively, search queries can be thought of as ... approximations of often underspecified user information needs (interrogations).

However, users formulate their queries based on the common-sense knowledge that they already possess at the time of the search. Therefore, search queries play two roles simultaneously: in addition to requesting new information, they also indirectly convey knowledge in the process ... If knowledge is generally prominent or relevant, people will (eventually) ask about it.

The Web as a whole represents a huge repository of human knowledge ... Web search queries as a whole also mirror a significant amount of knowledge.
I love this idea of using the implicit information in people's behavior to assist with knowledge extraction. Searchers indirectly convey knowledge when they search. That knowledge is concealed within search histories and clickstreams.

Every day, millions seek knowledge on the Web. These actions, if we can understand them, can act as a guide to others, people helping people find the information they seek.
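As a rough illustration of how queries can "indirectly convey knowledge", the Python sketch below counts candidate class attributes in a toy query log. It is a deliberately simplified stand-in, not the method from the paper: it assumes a single "attribute of instance" pattern, a small hand-made set of seed instances per class, and an invented list of queries. All names (`CLASS_INSTANCES`, `QUERIES`, `extract_attributes`) are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical seed instances for each class (an assumption for this sketch).
CLASS_INSTANCES = {
    "drug": {"aspirin", "ibuprofen"},
    "country": {"australia"},
}

# Invented toy query log; a real log would be far larger and anonymized.
QUERIES = [
    "side effects of aspirin",
    "what are the side effects of ibuprofen",
    "generic equivalent of aspirin",
    "capital of australia",
    "cheap flights to canberra",
    "history of jazz",
]

# Single simplified pattern: "[what is/are] [the] <attribute> of <instance>".
PATTERN = re.compile(
    r"^(?:what (?:is|are) )?(?:the )?(?P<attr>.+?) of (?P<inst>.+)$"
)


def extract_attributes(queries, class_instances):
    """Count how often each candidate attribute is asked about per class."""
    instance_to_class = {
        inst: cls for cls, insts in class_instances.items() for inst in insts
    }
    counts = defaultdict(Counter)
    for query in queries:
        match = PATTERN.match(query.strip().lower())
        if match is None:
            continue  # query does not fit the pattern
        cls = instance_to_class.get(match.group("inst"))
        if cls is None:
            continue  # instance is not one of the known seeds
        counts[cls][match.group("attr")] += 1
    return counts


attribute_counts = extract_attributes(QUERIES, CLASS_INSTANCES)
for cls, counter in attribute_counts.items():
    print(cls, counter.most_common())
```

Even in this toy version, the attributes people ask about most often for a class rise to the top, which is the intuition behind "if knowledge is generally prominent or relevant, people will (eventually) ask about it."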

It appears Google will be expanding on both of these papers in an upcoming WWW 2007 paper, "Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds" (abstract). That paper will present a method for "weakly supervised extraction of class attributes (e.g., 'side effects' and 'generic equivalent' for drugs) from anonymized query logs" that has "accuracy levels significantly exceeding current state of the art."

Update: "Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds" (PDF) is now available and well worth reading. As the author says, "The most intriguing aspect of [search log] queries is ... their ability to indirectly capture human knowledge."

1 comment:

Anonymous said...

Greg, I don't know how to thank you. A very timely post; I felt as if it was written just for me.

I am working on a human intelligence layer for Google searches. I had the same idea: if I know what people are searching for, why should I give them documents when I should be able to give them what they want? I will be going through these PDFs; please keep writing such posts. Thanks again.