Federated search (or metasearch) sends a search query out to many other search engines, then merges and reranks the results.
In more sophisticated forms, the federated search engine may build a virtual schema that merges all the underlying databases, map the original query to the different query languages of the individual sources, query only the databases likely to return good answers, resolve inconsistencies between the databases, and combine results from multiple sources to produce the final answers.
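To make the mechanics concrete, here is a minimal sketch of the simplest form in Python: fan the query out, normalize each engine's scores, then merge and rerank. The two source adapters are made-up stand-ins for real search engine APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source adapters: each takes a query string and returns a
# list of (url, score) pairs in that engine's own scoring scale.
def search_engine_a(query):
    return [("http://a.example/1", 0.9), ("http://a.example/2", 0.4)]

def search_engine_b(query):
    return [("http://b.example/1", 7.0), ("http://a.example/1", 3.0)]

def normalize(results):
    """Rescale one engine's scores to [0, 1] so they can be compared."""
    if not results:
        return []
    top = max(score for _, score in results)
    return [(url, score / top) for url, score in results]

def federated_search(query, sources):
    # 1. Fan the query out to every source in parallel.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda source: source(query), sources))
    # 2. Merge: collapse duplicate URLs, keeping the best normalized score.
    merged = {}
    for results in result_lists:
        for url, score in normalize(results):
            merged[url] = max(score, merged.get(url, 0.0))
    # 3. Rerank the combined list by normalized score.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

print(federated_search("used cars", [search_engine_a, search_engine_b]))
```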
The Googlers behind the Dec 2006 IEEE paper "Structured Data Meets the Web: A Few Observations" (PS) make several arguments against this approach succeeding at large scale:
The typical solution promoted by work on web data integration is based on creating a virtual schema for a particular domain and mappings from the fields of the forms in that domain to the attributes of the virtual schema. At query time, a user fills out a form in the domain of interest and the query is reformulated as queries over all (or a subset of) the forms in that domain.
For general web search, however, the approach has several limitations that render it inapplicable in our context. The first limitation is that the number of domains on the web is large, and even precisely defining the boundaries of a domain is often tricky ... Hence, it is infeasible to design virtual schemata to provide broad web search on such content.
The second limitation is the amount of information carried in the source descriptions. Although creating the mappings from web form fields to the virtual schema attributes can be done at scale, source descriptions need to be much more detailed in order to be of use here. Especially with the number of queries on a major search engine, it is absolutely critical that we send only relevant queries to the deep web sites; otherwise, the high volume of traffic can potentially crash the sites. For example, for a car site, it is important to know the geographical locations of the cars it is advertising, and the distribution of car makes in its database. Even with this additional knowledge, the engine may impose excessive loads on certain web sites.
The third limitation is our reliance on structured queries. Since queries on the web are typically sets of keywords, the first step in the reformulation will be to identify the relevant domain(s) of a query and then map the keywords in the query to the fields of the virtual schema for that domain. This is a hard problem that we refer to as query routing.
Finally, the virtual approach makes the search engine reliant on the performance of the deep web sources, which typically do not satisfy the latency requirements of a web search engine.
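The query routing problem from that third limitation is easy to state and hard to do well. A toy sketch of the virtual-schema idea might look like the following; the domains, schemas, and keyword matching here are invented purely for illustration.

```python
# Toy virtual schemas: one per domain, mapping attribute names to example
# values we can match keywords against. Real systems would need far richer
# source descriptions, which is exactly the paper's point.
VIRTUAL_SCHEMAS = {
    "cars": {"make": {"honda", "toyota", "ford"},
             "location": {"seattle", "portland"}},
    "jobs": {"title": {"engineer", "nurse"},
             "location": {"seattle", "portland"}},
}

def route_query(keywords):
    """Guess the domain of a keyword query and map keywords to schema fields."""
    best_domain, best_hits = None, 0
    for domain, schema in VIRTUAL_SCHEMAS.items():
        hits = sum(1 for kw in keywords
                   if any(kw in values for values in schema.values()))
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    if best_domain is None:
        return None, {}
    # Map each keyword to the first schema attribute whose values contain it.
    structured = {}
    for kw in keywords:
        for attr, values in VIRTUAL_SCHEMAS[best_domain].items():
            if kw in values:
                structured[attr] = kw
                break
    return best_domain, structured

print(route_query(["honda", "seattle"]))
# ('cars', {'make': 'honda', 'location': 'seattle'})
```

Even this toy version shows the ambiguity: "seattle" alone could belong to either domain, and a real router would need the much richer source descriptions the authors describe.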
Google instead prefers a "surfacing" approach which, put simply, is making a local copy of the deep web on Google's cluster. Not only does this provide Google the performance and scalability necessary to use the data in their web search, but it also allows them to easily compare the data with other data sources and transform the data (e.g. to eliminate inconsistencies and duplicates, determine the reliability of a data source, simplify the schema or remap the data to an alternative schema, reindex the data to support faster queries for their application, etc.).
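Here is a rough sketch of the kind of offline cleanup that copying the data makes possible; the record formats and normalization rules are invented for illustration only.

```python
# Records as surfaced from two hypothetical deep web sources, each with
# its own field names and formats.
source_a = [{"Make": "Honda", "Price ($)": "12,500", "City": "Seattle"}]
source_b = [{"manufacturer": "honda", "price_usd": 12500, "location": "Seattle, WA"}]

def normalize_a(rec):
    return {"make": rec["Make"].lower(),
            "price": int(rec["Price ($)"].replace(",", "")),
            "city": rec["City"].lower()}

def normalize_b(rec):
    return {"make": rec["manufacturer"].lower(),
            "price": int(rec["price_usd"]),
            "city": rec["location"].split(",")[0].lower()}

def merge(*normalized_lists):
    """Remap every source to one local schema, then drop duplicate listings."""
    seen, merged = set(), []
    for records in normalized_lists:
        for rec in records:
            key = (rec["make"], rec["price"], rec["city"])
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

print(merge([normalize_a(r) for r in source_a],
            [normalize_b(r) for r in source_b]))
# One listing, not two: both sources described the same car.
```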
Google's move away from federated search is particularly intriguing given that Udi Manber, former CEO of A9, is now at Google and leading Google's search team. A9, started and built by Udi with substantial funding from Amazon.com, was a federated web search engine. It supported queries out to multiple search engines using the OpenSearch API format they invented and promoted. A9 had not yet solved the hard problems with federated search -- they made no effort to route queries to the most relevant data sources or do any sophisticated merging of results -- but A9 was a real attempt to do large scale federated web search.
If Google is abandoning federated search, the move may also have implications for APIs and mashups in general. After all, many of the reasons the Google authors give for preferring copying the data over accessing it in real time apply to all APIs, not just OpenSearch APIs and search forms. The lack of uptime and performance guarantees, in particular, is a serious problem for any large scale effort to build a real application on top of APIs.
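To see why, consider what a caller has to do when one underlying API is slow: either let the slowest source set the latency for everyone, or enforce a per-source deadline and accept partial results. A minimal sketch, with stand-in functions rather than real APIs:

```python
from concurrent.futures import ThreadPoolExecutor, wait
import time

def fast_api(query):
    return ["result from fast source"]

def slow_api(query):
    time.sleep(5)  # stands in for a source with no latency guarantee
    return ["result from slow source"]

def query_with_deadline(sources, query, deadline_seconds=0.5):
    """Query every source in parallel, keep only what arrives before the deadline."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = [pool.submit(source, query) for source in sources]
    done, not_done = wait(futures, timeout=deadline_seconds)
    pool.shutdown(wait=False)  # don't block the caller on the stragglers
    results = []
    for future in done:
        results.extend(future.result())
    return results  # partial answer: slow sources are simply dropped

print(query_with_deadline([fast_api, slow_api], "anything"))
```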
Lastly, as law professor Eric Goldman commented, the surfacing approach to the deep web may be the better technical solution, but it has the potential to run into legal issues. Copying entire databases may push the limits of what is allowed under current copyright law. While Google is known for pushing the envelope, yet another legal challenge may not be what they need right now.