Sunday, January 21, 2007

Yahoo Research on distributed web search

Gary Price at ResourceShelf points to a good, new paper out of Yahoo Research by Baeza-Yates et al., "Challenges in Distributed Information Retrieval" (PDF).

The paper is a summary of the challenges in building a web search engine that can scale to handle the expected growth of the web in the next 3-5 years.

Some excerpts:
In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters.

The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need of fully distributed search engines.

In this paper we discuss the main issues with the design of a distributed Web retrieval system, including discussions on the state of the art.
The paper goes on to discuss the problems that come up for crawling, indexing, and query processing when, instead of all being in the same data center, clusters are distributed across a WAN. It is a great overview and a worthwhile read.

I wanted to highlight a few small but interesting points in the paper that I suspect would be easy to overlook.

First, I really like the idea mentioned on page 5 of the paper of optimizing the distributed index by "using query logs to partition the document collection and to route queries." What would be really fun to see is this optimization also being done continuously, replicating and moving index shards in real-time in response to access patterns.
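To make the idea concrete, here is a minimal sketch of query-log-driven routing. All names and the log format are my own assumptions, not anything from the paper: the idea is simply to learn, from past traffic, which shard tends to answer queries containing a given term, and to route future queries there instead of broadcasting to every shard.

```python
from collections import Counter, defaultdict

def build_router(query_log):
    """Learn a term -> shard routing table from a query log.

    query_log: list of (query_terms, shard_that_answered) pairs.
    This log format is a hypothetical stand-in for real access patterns.
    """
    term_shard_counts = defaultdict(Counter)
    for terms, shard in query_log:
        for t in terms:
            term_shard_counts[t][shard] += 1
    # Route each term to the shard that answered it most often.
    return {t: counts.most_common(1)[0][0]
            for t, counts in term_shard_counts.items()}

def route(query_terms, router, num_shards):
    """Return the shards to query: those favored by terms seen in the
    log, falling back to broadcasting to all shards for unseen terms."""
    shards = {router[t] for t in query_terms if t in router}
    return sorted(shards) if shards else list(range(num_shards))
```

The continuous version I am imagining would simply rebuild (or incrementally update) this table as new log entries arrive, moving or replicating the underlying shards to match.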

Second, the paper very briefly mentions the idea of a peer-to-peer model for a web search engine. On the one hand, the paper makes this idea sound intriguing by talking about all the problems of doing distributed web search and the number of machines required (possibly 1.5M in 2010). On the other hand, there is a big difference between 30+ clusters on a WAN with 50k+ fast-interconnect, trusted machines in each cluster and using millions of untrusted machines on a P2P network.

The Baeza-Yates paper does not explore P2P web search further, but it does cite "P2P Content Search: Give the Web Back to the People" (PDF) which, in turn, cites "On the Feasibility of Peer-to-Peer Web Indexing and Search" (PS). This second paper argues that P2P web search, at least in a naive implementation, would consume most of the bandwidth on the internet and therefore is not feasible. Worth a peek if you, like me, think this might be an intriguing idea to explore.
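A quick back-of-envelope shows why the naive design is so expensive. In the naive scheme, answering a multi-term query means shipping entire posting lists between peers so they can be intersected. Every parameter below is an illustrative assumption of mine, not a figure from either paper:

```python
# Back-of-envelope: bandwidth for naive P2P keyword search, where
# answering a query means shipping whole posting lists between peers.
# All parameters are made-up but plausible; none come from the papers.

queries_per_sec = 1_000          # global query load (assumed)
terms_per_query = 2.5            # average terms per query (assumed)
postings_per_term = 10_000_000   # docs matching a common term (assumed)
bytes_per_posting = 8            # compressed docID + payload (assumed)

# Each query ships one posting list per term across the network.
bytes_per_query = terms_per_query * postings_per_term * bytes_per_posting
total_gbps = queries_per_sec * bytes_per_query * 8 / 1e9
print(f"{total_gbps:,.0f} Gbit/s")
```

Even with these modest numbers, the traffic runs to hundreds or thousands of Gbit/s, which is why the feasibility paper concludes that naive posting-list shipping cannot work at web scale.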

Finally, the Baeza-Yates paper mentions personalized search. I think that is interesting both because Yahoo researchers are thinking about personalized search and because, like folks at Microsoft Research, they seem to be leaning toward a client-side implementation:
When query processing involves personalization of results, additional information from a user profile is necessary at search time, in order to adapt the search results according to the interests of the user.

An additional challenge related to personalization of Web search engines is that each user profile represents a state, which must be the latest state and be consistent across replicas.

Alternatively, a system can implement personalization as a thin layer on the client-side. This last approach is attractive because it deals with privacy issues related to centrally storing information about users and their behavior.
However, a client-side approach, as the MSR researchers say, has the "disadvantage of ... [no] direct access to details of the Web corpus" (which limits the extent of the personalization) and, as the Yahoo researchers point out, "limits the user to always using the same terminal."


Steve Flinn said...

Thanks for passing along these great references and thoughts.

Regarding client-side personalization of search, one has to wonder if Yahoo and Microsoft are pursuing this line more because of competitive weaknesses versus Google than because of the security red herring.

Similar to our recent discussion on recommendations, personalization is a much more fundamental concept than search. One can envision not only personalization of searches, but also personalization of recommendations, personalization of the user interface itself, etc. To try to force-fit the locus of personalization onto the client side when it will have to serve so many purposes strikes me as old-think.

Deep said...

The Bender paper is especially interesting. Thinking about the limitations of true peer-to-peer search made me think of a related idea: what if some of the myriad "vertical search" providers proliferating on the web could follow a standard for communicating the contents of their verticals, along the lines of the thinking in the Bender paper? It's not peer-to-peer, but to some extent it addresses the monopolistic concerns about the big 4, and it does address the dependability-of-a-search-node problem. In other words, create a meta search infrastructure, with built-in economic accommodations of some sort, that vertical search providers could plug into; unlike the meta crawlers of yore that focused on general search engines (e.g. Metacrawler, Dogpile, ...), it would metacrawl niche vertical engines constrained by topic. If done right, this might lower the barrier to entry for small players to make an impact in broader search.

Greg Linden said...

Good point, Deep. Meta or federated search is a possibility. As you said, Metacrawler, Dogpile, A9, and many others have built systems that query many search engines and databases, then merge the results.

Several problems come up when you try to do this though. First, relevance rank can be poor because of the limited information returned from each database or search engine (usually just the top N, ranked by widely differing relevance functions).

Second, speed can be an issue. You can only be as fast as the slowest database you query, and you are probably much slower due to the additional computation needed to merge and rerank the results.

Third, at very large scale, you can no longer query all the databases on every request, so you have the hard problem of determining which databases are most likely to return a useful answer to a query, a problem that may be almost as hard as building your own search index.
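The first problem above, merging top-N lists scored on incompatible scales, can at least be sketched. This is a minimal min-max normalization merge of my own devising, not anything from the papers or from the meta search engines named; real federated systems use far more information than the raw scores:

```python
def normalize(results):
    """Min-max normalize one engine's scores to [0, 1] so lists scored
    on different scales can be compared. results: list of (doc, score)."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero for uniform lists
    return [(doc, (s - lo) / span) for doc, s in results]

def merge(result_lists, k=10):
    """Merge per-engine top-N lists into one ranking, keeping the best
    normalized score for documents returned by several engines."""
    best = {}
    for results in result_lists:
        for doc, s in normalize(results):
            best[doc] = max(best.get(doc, 0.0), s)
    return sorted(best.items(), key=lambda kv: -kv[1])[:k]
```

The weakness Greg describes is visible even here: the merge only sees each engine's opaque scores, so two engines with very different notions of relevance get flattened onto the same [0, 1] scale.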

Even so, I agree that it is an interesting direction to pursue, especially for smaller players who cannot hope to duplicate the massive crawl and indexing clusters of the search giants.

Deep said...

All great points and totally correct. However, these are all problems found in any peer-to-peer search as well.

Ultimately for either to get much traction, the system would need to embrace the strengths of a narrow but deep and distributed search architecture. The system may never be able to answer a document keyword search in a few milliseconds, but may be able to answer a question requiring a more exhaustive analysis of text with an answer quality worth waiting for.

Fabrizio Silvestri said...

Hi Greg,

nice post! (I know I'm a little bit biased :P)...

About the idea mentioned on page 5... Actually we have published a couple of papers on that topic. You can find them on my home page... They contain some preliminary ideas and results, but... if you want to see some outcomes of what happens playing around with shards... well... there you'll find some...

Incremental Caching for Collection Selection Architectures
Mining Query Logs to Optimize Index Partitioning in Parallel Web Search Engines


Greg Linden said...

Thanks, Fabrizio!