Friday, March 23, 2007

Google and the deep web

A few new papers out of Google cover some of their work on indexing the deep web.

A Dec 2006 article, "Structured Data Meets the Web: A Few Observations" (PS), appears to provide the best overview.

The paper starts by saying that "Google is conducting multiple efforts ... to leverage structured data on the web for better search." It goes on to talk about the scope of the problem as "providing access to data about anything" since "data on the Web is about everything."

The authors discuss three types of structured data: the deep web, accessible structured data such as Google Base, and annotation schemes such as tags.

Of most interest to me was the deep web, described in the paper as follows:
The deep (or invisible) web refers to content that lies hidden behind queryable HTML forms.

These are pages that are dynamically created in response to HTML form submissions, using structured data that lies in backend databases. This content is considered invisible because search engine crawlers rely on hyperlinks to discover new content. There are very few links that point to deep web pages and crawlers do not have the ability to fill out arbitrary HTML forms.

The deep web represents a major gap in the coverage of search engines: the content on the deep web is believed to be [vast] ... [and] of very high quality.

While the deep Web often has well-structured data in the underlying databases, the Google authors argue that their application -- web search -- makes it undesirable to expose the deep Web structure. From the paper:
The reality of web search characteristics dictates ... that ... querying structured data and presenting answers based on structured data must be seamlessly integrated into traditional web search.

This principle translates to the following constraints:
  • Queries will be posed ... as keywords. Users will not pose complex queries of any form. At best, users will pick refinements ... that might be presented along with the answers to a keyword query.
  • The general search engine [must] detect the correct user intention and automatically activate specialized searches.
  • Answers from structured data sources ... should not be distinguished from the other results. While the research community might care about the distinction between structured and unstructured data, the vast majority of search users do not.

The authors discuss two major approaches to exposing the deep Web: virtual schemas and surfacing.

Virtual schemas reformulate queries "as queries over all (or a subset of) the forms" of the actual schemas. The virtual schema must be manually created and maintained. As the paper discusses, this makes it impractical for anything but narrow vertical search engines, since there are a massive number of potential domains on the Web ("data on the Web encompasses much of human knowledge"), the domains are poorly delineated, a large scale virtual schema would be "brittle and hard to maintain", and the performance and scalability of the underlying data sources is insufficient to support the flood of real-time queries.
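
To make the maintenance burden concrete, here is a minimal sketch of what reformulating a query over a virtual schema might look like. Everything in it is my own illustration, not from the paper: the site URLs, the field names, and the `reformulate` helper are all hypothetical. The point is that someone has to write and maintain a mapping like `FORM_MAPPINGS` by hand for every form in every domain.

```python
from urllib.parse import urlencode

# Hypothetical hand-maintained mappings for a used-car vertical:
# virtual-schema attribute -> the parameter name each form actually uses.
# Every URL and field name here is invented for illustration.
FORM_MAPPINGS = {
    "https://cars-a.example/search": {
        "make": "mk", "model": "mdl", "max_price": "price_to",
    },
    "https://cars-b.example/find": {
        "make": "manufacturer", "max_price": "maxprice",
    },
}


def reformulate(virtual_query):
    """Translate one query over the virtual schema into one URL per form."""
    urls = []
    for action, mapping in FORM_MAPPINGS.items():
        # Keep only the attributes this particular form understands.
        params = {
            mapping[attr]: value
            for attr, value in virtual_query.items()
            if attr in mapping
        }
        if params:
            urls.append(action + "?" + urlencode(params))
    return urls


for url in reformulate({"make": "honda", "model": "civic", "max_price": 5000}):
    print(url)
```

Note that the second form has no field for the model, so that part of the query is silently dropped. Multiply that kind of mismatch by every domain on the Web and the "brittle and hard to maintain" complaint is easy to believe.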

Therefore, the authors favor a "surfacing" approach. In this approach:
Deep web content is surfaced by simulating form submissions, retrieving answer pages, and putting them into the web index.

The main advantage of the surfacing approach is the ability to reuse existing indexing technology; no additional indexing structures are necessary. Further, a search is not dependent on the runtime characteristics of the underlying sources because the form submissions can be simulated offline and fetched by a crawler over time. A deep web source is accessed only when a user selects a web page that can be crawled from that source.

Surfacing has its disadvantages, the most significant one is that we lose the semantics associated with the pages we are surfacing by ultimately putting HTML pages into the web index. [In addition], not all deep web sources can be surfaced.

Given that Google is treating the deep web as more data to add to their existing web crawl, it is not surprising that they are tending toward the more straightforward approach of simply surfacing deep web data into that crawl.
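
As a rough illustration of the offline half of surfacing, the sketch below enumerates the GET URLs a crawler could later fetch for a single form. This is my own simplification, not the method described in the paper: the `surface_urls` helper, the example form, and its candidate values are hypothetical, and it assumes a GET form whose inputs have a small, known set of values (select menus, say). It makes no network requests.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action, inputs, limit=100):
    """Enumerate GET URLs for combinations of candidate input values.

    `inputs` maps each form field name to the candidate values to try.
    `limit` caps the number of URLs so one form cannot flood the crawl.
    """
    names = list(inputs)
    urls = []
    for combo in product(*(inputs[name] for name in names)):
        if len(urls) >= limit:
            break
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
    return urls


# Hypothetical real-estate search form with two select menus.
form_inputs = {"state": ["WA", "OR"], "type": ["house", "condo"]}
urls = surface_urls("https://listings.example/search", form_inputs)
for url in urls:
    print(url)
```

The resulting URLs look like any other links to a crawler, which is why no new indexing structures are needed. The cap matters because the number of combinations grows multiplicatively with each additional input, and free-text inputs have no obvious set of values to try at all.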

The paper also discusses structured data from Google Base and tags from annotation schemes. Of particular note there is that they discuss how the structured data in Google Base can be useful for query refinement.

At the end, the authors talk about the long-term goal of "a database of everything" where more of the structure of the structured deep web might be preserved. Such a database could "lead to better ranking and refinement of search results". It would have "to handle uncertainty at its core" because of inconsistency of data, noise introduced when mapping queries to the data sources, and imperfect information about the data source schemas. To manage the multitude of schemas and uncertainty in the information about them, they propose only using a loose coupling of the data sources based on analysis of the similarity of their schemas and data.
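
One simple way to picture a loose coupling based on schema similarity is to score pairs of sources by how many attribute names they share. The sketch below is only my illustration under that assumption: the paper does not say it uses Jaccard similarity, the source names and schemas are made up, and it ignores the data itself, which the authors say would also be analyzed.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two attribute-name collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sources and their attribute names.
SCHEMAS = {
    "cars_a": ["make", "model", "year", "price"],
    "cars_b": ["make", "model", "price", "mileage"],
    "recipes": ["title", "ingredients", "cook_time"],
}


def related_sources(schemas, threshold=0.5):
    """Return (source, source, score) for pairs at or above the threshold."""
    pairs = []
    for (name1, attrs1), (name2, attrs2) in combinations(schemas.items(), 2):
        score = jaccard(attrs1, attrs2)
        if score >= threshold:
            pairs.append((name1, name2, score))
    return pairs


print(related_sources(SCHEMAS))
```

Nothing here forces the two car sources into a single shared schema. They are merely noted as related, with a score that expresses how uncertain that relationship is.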

This specific paper is only one of several on this topic out of Google recently. In particular, "Web-scale Data Integration: You can only afford to Pay As You Go" (PDF) goes further into this idea of a loose coupling between disparate structured database sources. Of most interest to me was the idea implied by the title, that PayGo attempts to "incrementally evolve its understanding of the data it encompasses as it runs ... [understanding] the underlying data's structure, semantics, and relationships between sources", including learning from implicit and explicit feedback from users of the system.

Two more papers, "From Databases to Dataspaces" (PDF) and "Principles of Dataspace Systems" (PDF), further discuss the idea of "a data co-existence approach" where the system is not in full control of its data and returns "best-effort" answers, but does try to "provide base functionality over all data sources, regardless of how integrated they are."

7 comments:

Eric Goldman said...

"Surfacing" could pose even more legal problems than their standard data collection practices to date, so surfacing could be a good technical solution without any legal support.

JT_Kane said...

Greg, I've followed your blog for several years now and value your opinion. I have saved and read the PDF version of this Google paper since it was published and I'm curious about what you think of the following quote and its effects on personalized search relative to the possible future use of Google's Custom Search Engines (CSE):

"In the case of Google Co-op, customized search engines can specify query patterns that trigger specific facets as well as provide hints for re-ranking search results. The annotations that any customized search engine specifies are visible only within the context of that search engine. However, as we start seeing more custom search engines, it would be desirable to point users to different engines that might be relevant."

from "Structured Data Meets the Web: A Few Observations" - http://www.cs.berkeley.edu/~jeffery/pubs/debull06-2.pdf

John

Greg Linden said...

Hi, John. I think the authors are referring more to query refinement than to personalized search there.

If I have read it correctly, they are suggesting that they might point a user to a couple of customized search engines that might help them narrow their search.

That strikes me as more closely related to query refinement (where a search engine suggests new or modified search terms) than to personalized search (where the search engine shows different results based on your past behavior).

Shane said...

Greg, it looks like Gregory Piatetsky-Shapiro has noticed the Google changes on his KDnuggets website.

Pranav said...

Here's another article in the May edition of CACM on the topic: http://eagle.cs.uiuc.edu/tr/dwsurvey-tr-hpzc-jul04.pdf

Pranav said...

Sorry, link I posted is broken. Here's the link to the article on the ACM portal:
http://delivery.acm.org/10.1145/1250000/1241670/p94-he.html?key1=1241670&key2=9014858711&coll=ACM&dl=ACM&CFID=15151515&CFTOKEN=6184618

Not sure if this too has restricted access.

Mark Papadakis said...

Google can rely on its ubiquitous AdSense code for discovering a good percentage of the 'invisible' portion of the Web.
Google is crawling pages supplemented with AdSense code and gets access to content it would not be able to access otherwise (no links pointing to them).
The same can be said for their Google Analytics project. Same benefits. Those two systems combined can provide Google with all the information it would ever need for uncovering 'invisible' content.