Saturday, January 31, 2009

How Google crawls the deep web

A googol of Googlers published a paper at VLDB 2008, "Google's Deep-Web Crawl" (PDF), that describes how Google pokes and prods at web forms to see if it can find things to submit in the form that yield interesting data from the underlying database.

An excerpt from the paper:
This paper describes a system for surfacing Deep-Web content; i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index.

Our objective is to select queries for millions of diverse forms such that we are able to achieve good (but perhaps incomplete) coverage through a small number of submissions per site and the surfaced pages are good candidates for selection into a search engine's index.

We adopt an iterative probing approach to identify the candidate keywords for a [generic] text box. At a high level, we assign an initial seed set of words as values for the text box ... [and then] extract additional keywords from the resulting documents ... We repeat the process until we are unable to extract further keywords or have reached an alternate stopping condition.

A typed text box will produce reasonable result pages only with type-appropriate values. We use ... [sampling of] known values for popular types ... e.g. zip codes ... state abbreviations ... city ... date ... [and] price.
Table 5 in the paper shows the effectiveness of the technique: the authors are able to retrieve a significant fraction of the records in small and normally hidden databases across the Web with 500 or fewer submissions per form. The authors also say that "the impact on our search traffic is a significant validation of the value of Deep-Web content."
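
To make the iterative probing idea from the excerpt concrete, here is a minimal sketch in Python. It is not the paper's implementation: the tiny in-memory RECORDS list, the submit_form function, and the first-in-first-out keyword queue are all assumptions made for illustration, and the paper selects candidate keywords more carefully than this. The sketch only shows the loop itself: start from seed words, submit them, harvest new words from the result pages, and stop when no keywords remain or a submission budget runs out.

```python
import re

# Hypothetical stand-in for the hidden database behind a search form.
RECORDS = [
    "sherlock holmes and the hound of the baskervilles",
    "the hound that barked at midnight",
    "midnight train to georgia",
    "georgia peach recipes",
    "technical report on database indexing",
]


def submit_form(keyword):
    """Simulate one form submission: return records containing the keyword."""
    return [record for record in RECORDS if keyword in record.split()]


def iterative_probe(seeds, max_submissions=500):
    """Probe the form, harvesting new keywords from each result page.

    Returns the set of surfaced records and the number of submissions made.
    Stops when no untried keywords remain or the budget is exhausted.
    """
    queue = list(seeds)   # candidate keywords, tried in FIFO order (an assumption)
    tried = set()         # keywords already submitted
    surfaced = set()      # result "pages" seen so far
    submissions = 0

    while queue and submissions < max_submissions:
        keyword = queue.pop(0)
        if keyword in tried:
            continue
        tried.add(keyword)
        submissions += 1

        for page in submit_form(keyword):
            surfaced.add(page)
            # Extract additional candidate keywords from the result page.
            for word in re.findall(r"[a-z]+", page):
                if word not in tried and word not in queue:
                    queue.append(word)

    return surfaced, submissions


if __name__ == "__main__":
    pages, used = iterative_probe(["holmes"])
    print(f"surfaced {len(pages)} of {len(RECORDS)} records in {used} submissions")
```

Starting from the single seed "holmes", the loop reaches every record that shares a word with something already surfaced, but it never finds the technical report record because no surfaced page mentions any of its words. That is the "good (but perhaps incomplete) coverage" trade-off the excerpt describes.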

Please see also my April 2008 post, "GoogleBot starts on the deep web".

6 comments:

Anonymous said...

I would be pissed if Google was probing my web form 500+ times to get at information.

"Do no evil" my ass.

Jason Cartwright said...

Anon. If your website can't handle a public form being submitted 500 times then you've either:

* Got some pitifully small, under-powered, and unpopular website that Google probably wouldn't bother indexing that much content
* You need to hire someone to fix your code/db/infrastructure/whatever so that when 500 users use your form it doesn't die

Have you thought that the benefit of this is that you'll get more traffic to your site?

Personally - Google can pull whatever they want from my sites, the cost of providing this to them is thousands of times less than the amount of money we then make off the users they send us.

Panos Ipeirotis said...

@Jason: Having built some deep-web crawlers myself, I can tell you that many query-driven websites do not appreciate random queries submitted to their websites while Google tries to find the keywords that match the underlying database content. It considerably slows down the service for their "human" users.

Yes, the owners of such websites can add the appropriate entry to robots.txt to prevent querying, but since such behavior was not the norm in the past, webmasters did not have that in place.

(I remember our frightened librarian calling me to figure out what are these queries about Sherlock Holmes submitted to our technical report repository.)

jeremy said...

I haven't read the paper yet, but I did peruse the list of citations. There was a lot of work on this type of query-based deep web sampling from 1997-2000, and I see absolutely none of this work cited. If anyone is interested, I could spend some time re-finding those citations. But it does strike me as rather odd that this paper cites none of that entire body of work.

I had a professor in grad school who would often lament that the computer science field had a very short-term memory; it would reinvent itself approximately every 10 years, mainly because one decade seemed to be about the length of its institutional memory. As ideas came and went in cyclic popularity, by the time an idea came around again, there weren't any people working on it who had been there the first time around, so everything would be rediscovered, and no literature would be cited.

This seems to be what is happening here, too.

It makes me feel like we're not really putting to good use all of the search and discovery tools that everyone has been developing.

Anonymous said...

Jeremy, I share your thoughts.

Jan-Willem Bobbink said...

I found a paper by Google engineers on how they use this to index enormous content websites such as hotel.com.

Read more on http://dollar.biz.uiowa.edu/~street/he11.pdf (PDF)