A few new papers out of Google cover some of their work on indexing the
deep web.
A Dec 2006 article, "Structured Data Meets the Web: A Few Observations" (PS), appears to provide the best overview.
The paper starts by saying that "Google is conducting multiple efforts ... to leverage structured data on the web for better search." It goes on to talk about the scope of the problem as "providing access to
data about anything" since "data on the Web is about
everything."
The authors discuss three types of structured data: the deep web, accessible structured data such as Google Base, and annotation schemes such as tags.
Of most interest to me was the deep web, described in the paper as follows:
The deep (or invisible) web refers to content that lies hidden behind queryable HTML forms.
These are pages that are dynamically created in response to HTML form submissions, using structured data that lies in backend databases. This content is considered invisible because search engine crawlers rely on hyperlinks to discover new content. There are very few links that point to deep web pages and crawlers do not have the ability to fill out arbitrary HTML forms.
The deep web represents a major gap in the coverage of search engines: the content on the deep web is believed to be [vast] ... [and] of very high quality.
While the deep web often has well-structured data in its underlying databases, the Google authors argue that the characteristics of their application -- web search -- make it undesirable to expose that structure directly to searchers. From the paper:
The reality of web search characteristics dictates ... that ... querying structured data and presenting answers based on structured data must be seamlessly integrated into traditional web search.
This principle translates to the following constraints:
- Queries will be posed ... as keywords. Users will not pose complex queries of any form. At best, users will pick refinements ... that might be presented along with the answers to a keyword query.
- The general search engine [must] detect the correct user intention and automatically activate specialized searches.
- Answers from structured data sources ... should not be distinguished from the other results. While the research community might care about the distinction between structured and unstructured data, the vast majority of search users do not.
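To make that seamless blending concrete, here is a rough sketch of what it might look like: detect whether a keyword query falls into a structured vertical, pull answers from that source, and rank them alongside ordinary web results rather than in a separate, labeled block. This is my own toy illustration, not anything from the paper; the trigger words, data, and scoring are all invented.

```python
# Toy sketch of the constraints above: keyword queries only, automatic
# intent detection, and structured answers blended into the ordinary
# result list. Nothing here is Google code; data and triggers are stand-ins.

WEB_INDEX = [
    {"title": "Honda Civic review", "score": 0.8},
    {"title": "History of the automobile", "score": 0.6},
]

STRUCTURED_SOURCES = {
    "cars": [{"title": "2004 Honda Civic, 120k miles, $4,500", "score": 0.7}],
}

def detect_domain(query):
    """Very crude intent detection: keyword triggers per vertical."""
    triggers = {"cars": ["used car", "civic", "sedan"]}
    q = query.lower()
    for domain, words in triggers.items():
        if any(w in q for w in words):
            return domain
    return None

def blended_search(query):
    results = [r for r in WEB_INDEX if any(
        tok in r["title"].lower() for tok in query.lower().split())]
    domain = detect_domain(query)
    if domain:
        # Structured answers are merged and ranked like any other result,
        # not shown to the user as a distinct kind of answer.
        results += STRUCTURED_SOURCES[domain]
    return sorted(results, key=lambda r: r["score"], reverse=True)

print(blended_search("used honda civic"))
```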
The authors discuss two major approaches to exposing the deep web: virtual schemas and surfacing.
Virtual schemas reformulate queries "as queries over all (or a subset of) the forms" of the underlying sources. The virtual schema must be manually created and maintained. As the paper discusses, this makes the approach impractical for anything but narrow vertical search engines: there is a massive number of potential domains on the Web ("data on the Web encompasses much of human knowledge"), the domains are poorly delineated, a large-scale virtual schema would be "brittle and hard to maintain", and the underlying data sources lack the performance and scalability to handle a flood of real-time queries.
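As a toy illustration of why this is hard to scale, here is a sketch of a virtual schema for a single domain, with hand-written mappings from the mediated attributes to each source's form fields. The schema, sites, and field names are all invented; the point is that every new source, and every change to an existing form, means editing these mappings by hand, and every query hits the sources live.

```python
# Hypothetical virtual-schema sketch: one mediated schema for one domain,
# manually mapped onto each source's form fields.

MEDIATED_SCHEMA = {"make": str, "model": str, "max_price": int}

# Hand-written mappings from mediated attributes to each site's form fields.
SOURCE_MAPPINGS = {
    "cars-site-a.example.com/search": {
        "make": "mk", "model": "mdl", "max_price": "price_to"},
    "cars-site-b.example.com/find": {
        "make": "manufacturer", "model": "model", "max_price": "budget"},
}

def reformulate(query: dict) -> list[tuple[str, dict]]:
    """Rewrite a query over the mediated schema as one form submission
    per underlying source, to be executed live at query time."""
    submissions = []
    for url, mapping in SOURCE_MAPPINGS.items():
        form_fields = {mapping[k]: v for k, v in query.items() if k in mapping}
        submissions.append((url, form_fields))
    return submissions

print(reformulate({"make": "Honda", "model": "Civic", "max_price": 5000}))
```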
Given these problems, the authors instead favor a "surfacing" approach. In this approach:
Deep web content is surfaced by simulating form submissions, retrieving answer pages, and putting them into the web index.
The main advantage of the surfacing approach is the ability to reuse existing indexing technology; no additional indexing structures are necessary. Further, a search is not dependent on the runtime characteristics of the underlying sources because the form submissions can be simulated offline and fetched by a crawler over time. A deep web source is accessed only when a user selects a web page that can be crawled from that source.
Surfacing has its disadvantages, the most significant one is that we lose the semantics associated with the pages we are surfacing by ultimately putting HTML pages into the web index. [In addition], not all deep web sources can be surfaced.
Given that Google is approaching the deep web as additional data for their current web crawl, it is not surprising that they are tending toward the more straightforward approach of simply surfacing deep web data into that crawl.
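For the sake of illustration, here is a rough sketch of surfacing in its simplest form: enumerate plausible input combinations for a form offline, fetch the resulting HTML pages, and hand them to the ordinary indexer. The form URL, field values, and in-memory "index" are placeholders of mine; a real system would also need politeness limits, deduplication, and some way to choose good input values.

```python
# Rough surfacing sketch: simulate form submissions offline and put the
# resulting pages into a plain document index, losing their structure.
import itertools
import urllib.parse
import urllib.request

FORM_URL = "https://example.com/search"          # hypothetical deep web form
FIELD_VALUES = {
    "state": ["CA", "NY", "WA"],
    "category": ["restaurants", "hotels"],
}

def surface(index: dict) -> None:
    for combo in itertools.product(*FIELD_VALUES.values()):
        params = dict(zip(FIELD_VALUES.keys(), combo))
        url = FORM_URL + "?" + urllib.parse.urlencode(params)
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue                              # source unavailable; retry later
        # The surfaced page is just another document in the web index.
        index[url] = html

index = {}
surface(index)
print(len(index), "surfaced pages indexed")
```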
The paper also discusses structured data from Google Base and tags from annotation schemes. Of particular note is the discussion of how the structured data in Google Base can be useful for query refinement.
At the end, the authors talk about the long-term goal of "a database of everything" where more of the structure of the structured deep web might be preserved. Such a database could "lead to better ranking and refinement of search results". It would have "to handle uncertainty at its core" because of inconsistencies in the data, noise introduced when mapping queries to the data sources, and imperfect information about the data source schemas. To manage the multitude of schemas and the uncertainty in the information about them, they propose using only a loose coupling of the data sources, based on analysis of the similarity of their schemas and data.
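The paper does not spell out how that similarity analysis works, but as a minimal sketch of the loose-coupling idea, something as simple as attribute-name overlap could be used to decide which sources to relate to each other, rather than mapping everything into one global schema. The measure and the schemas below are stand-ins of mine, not the paper's.

```python
# Minimal sketch of loose coupling via schema similarity. Jaccard overlap
# of attribute names is just a simple placeholder measure.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

SOURCE_SCHEMAS = {
    "usedcars.example": {"make", "model", "year", "price"},
    "autotrader.example": {"manufacturer", "model", "year", "price"},
    "recipes.example": {"ingredient", "cuisine", "cook_time"},
}

def related_sources(threshold: float = 0.4):
    """Pairs of sources whose schemas look similar enough to couple loosely."""
    names = list(SOURCE_SCHEMAS)
    pairs = []
    for i, s1 in enumerate(names):
        for s2 in names[i + 1:]:
            sim = jaccard(SOURCE_SCHEMAS[s1], SOURCE_SCHEMAS[s2])
            if sim >= threshold:
                pairs.append((s1, s2, round(sim, 2)))
    return pairs

print(related_sources())
```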
This paper is only one of several out of Google recently on this topic. In particular, "Web-scale Data Integration: You can only afford to Pay As You Go" (PDF) goes further into this idea of a loose coupling between disparate structured database sources. Of most interest to me was the idea implied by the title, that PayGo attempts to "incrementally evolve its understanding of the data it encompasses as it runs ... [understanding] the underlying data's structure, semantics, and relationships between sources", including learning from implicit and explicit feedback from users of the system.
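Here is my loose reading of that "pay as you go" idea as a sketch, not PayGo's actual algorithm: start with uncertain, automatically guessed correspondences between attributes, then nudge the confidence of each guess up or down as implicit feedback (such as clicks on results produced through it) arrives.

```python
# Speculative sketch of evolving mappings from user feedback; the
# mappings, update rule, and rate are all illustrative assumptions.

candidate_mappings = {
    # (source attribute, mediated attribute): confidence in [0, 1]
    ("manufacturer", "make"): 0.5,
    ("cook_time", "year"): 0.5,     # a bad guess that feedback should demote
}

def record_feedback(mapping, clicked, rate=0.1):
    """Move confidence toward 1 on a click, toward 0 otherwise."""
    target = 1.0 if clicked else 0.0
    c = candidate_mappings[mapping]
    candidate_mappings[mapping] = c + rate * (target - c)

# Simulated feedback: users click results from the first mapping
# and ignore results from the second.
for _ in range(20):
    record_feedback(("manufacturer", "make"), clicked=True)
    record_feedback(("cook_time", "year"), clicked=False)

print(candidate_mappings)
```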
Another two papers, "From Databases to Dataspaces" (PDF) and "Principles of Dataspace Systems" (PDF), also discuss the idea of "a data co-existence approach" where the system is not in full control of its data and returns "best-effort" answers, but does try to "provide base functionality over all data sources, regardless of how integrated they are."