Comments on Geeking with Greg: Yahoo Research on distributed web search


2007-08-02T07:54:00.000-07:00

Thanks, Fabrizio!


2007-08-02T07:06:00.000-07:00

Hi Greg,

nice post! (I know I'm a little bit biased :P)...

About the idea mentioned on page 5... Actually we have published a couple of papers on that topic. You can find them on my home page... They contain some preliminary ideas and results but... if you want to see some outcomes of what happens when playing around with shards... well... there you'll find some...

Incremental Caching for Collection Selection Architectures
Mining Query Logs to Optimize Index Partitioning in Parallel Web Search Engines

Enjoy!


2007-01-22T09:54:00.000-08:00

All great points and totally correct. However, these are all problems found in any peer to peer search as well.

Ultimately for either to get much traction, the system would need to embrace the strengths of a narrow but deep and distributed search architecture. The system may never be able to answer a document keyword search in a few milliseconds, but may be able to answer a question requiring a more exhaustive analysis of text with an answer quality worth waiting for.


2007-01-22T08:05:00.000-08:00

Good point, Deep. Meta or federated search is a possibility. As you said, Metacrawler, Dogpile, A9, and many others have built systems that query many search engines and databases, then merge the results.

Several problems come up when you try to do this though. First, relevance rank can be poor because of the limited information returned from each database or search engine (usually top N by widely differing relevance ranking functions).
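To make the ranking problem concrete: when each engine returns only a ranked top N and the scores are not comparable, one common workaround is to merge on rank position alone. The sketch below uses reciprocal rank fusion, which is a technique swapped in here for illustration and not something the systems named above are known to use. The function name, the constant `k=60`, and the document ids are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists using only rank positions, ignoring engine scores.

    result_lists: list of ranked lists of document ids (best first).
    k: damping constant (an assumed, conventional default).
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            # A document gains more from appearing near the top of a list.
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical top-3 lists from two engines with unrelated scoring functions.
engine_a = ["doc1", "doc2", "doc3"]
engine_b = ["doc3", "doc1", "doc4"]
merged = reciprocal_rank_fusion([engine_a, engine_b])
print(merged)
```

Documents that appear high in both lists float to the top, but the merge still knows nothing about why each engine ranked a document where it did, which is the limitation described above.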

Second, speed can be an issue. You can only be as fast as the slowest database you query, and probably much slower due to the additional computation to merge and rerank the results.
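A small simulation shows the latency point. Even when all backends are queried concurrently, the response waits for the slowest one unless a deadline cuts it off, and a deadline means giving up that backend's results. This is only a sketch: the backend names, latencies, and the 0.2 second timeout are made-up values, and `asyncio.sleep` stands in for a real network call.

```python
import asyncio


async def query_backend(name, latency, results):
    # Stand-in for a network call to one search engine or database.
    await asyncio.sleep(latency)
    return name, results


async def federated_search(backends, timeout):
    """Query all backends concurrently and keep whatever answers by the deadline."""
    tasks = [
        asyncio.create_task(query_backend(name, latency, results))
        for name, latency, results in backends
    ]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # Drop backends that missed the deadline.
    return dict(task.result() for task in done)


# Hypothetical backends: (name, simulated latency in seconds, results).
backends = [
    ("fast", 0.01, ["a"]),
    ("medium", 0.05, ["b"]),
    ("slow", 0.5, ["c"]),
]
answers = asyncio.run(federated_search(backends, timeout=0.2))
print(sorted(answers))
```

Without the timeout, the call would take as long as the "slow" backend. With it, the response is faster but incomplete, and the merge and rerank step still has to run afterwards.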

Third, at very large scale, you can no longer query all the databases on every request, so you have the hard problem of determining which databases are most likely to return a useful answer to a query, a problem that may be almost as hard as building your own search index.
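As a rough illustration of the selection problem, suppose each database publishes a small summary of its contents (document count and per-term document frequencies), and the meta search layer uses those summaries to pick the few databases worth querying. The scoring below is a deliberately naive assumption made for this sketch, not an established selection algorithm, and the summaries are invented.

```python
def score_collection(query_terms, summary):
    """Sum, over query terms, the fraction of documents containing the term."""
    doc_count = summary["doc_count"]
    doc_freq = summary["doc_freq"]
    return sum(doc_freq.get(term, 0) / doc_count for term in query_terms)


def select_collections(query_terms, summaries, k):
    """Return the names of the k collections that look most promising."""
    ranked = sorted(
        summaries,
        key=lambda name: (-score_collection(query_terms, summaries[name]), name),
    )
    return ranked[:k]


# Hypothetical summaries that each database would have to publish.
summaries = {
    "recipes": {"doc_count": 1000, "doc_freq": {"pasta": 300, "python": 1}},
    "programming": {"doc_count": 5000, "doc_freq": {"python": 2000, "search": 500}},
    "travel": {"doc_count": 2000, "doc_freq": {"pasta": 20, "search": 10}},
}
chosen = select_collections(["python", "search"], summaries, k=2)
print(chosen)
```

Keeping such summaries accurate and fresh for many databases starts to resemble maintaining an index of your own, which is why this problem can be almost as hard as building one.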

Even so, I agree that it is an interesting direction to pursue, especially for smaller players who cannot hope to duplicate the massive crawl and indexing clusters of the search giants.


2007-01-21T20:53:00.000-08:00

The Bender paper is especially interesting. Thinking about the limitations of true peer to peer made me think of a related idea. What if some of the myriad "vertical search" providers proliferating on the web could follow a standard for communicating the contents of their verticals, along the lines of the thinking in the Bender paper? It's not peer to peer, but to some extent it addresses the monopolistic concerns about the big 4, and it does address the problem of search node dependability. In other words, create a meta search infrastructure, with built-in economic accommodations of some sort, that vertical search providers could plug into. Unlike the meta crawlers of yore, which focused on meta crawling general search engines (e.g. metacrawler, dogpile, ...), it would metacrawl niche vertical engines constrained by topic. If done right, this might lower the barrier to entry for small players to make an impact in broader search.


2007-01-21T18:06:00.000-08:00

Greg,
Thanks for passing along these great references and thoughts.

Regarding client-side personalization of search, one has to wonder if Yahoo and Microsoft are pursuing this line more because of competitive weaknesses versus Google than because of the security red herring.

Similar to our recent discussion on recommendations, personalization is a much more fundamental concept than search. One can envision not only personalization of searches, but also personalization of recommendations, personalization of the user interface itself, etc. To try to force-fit the locus of personalization onto the client side when it will have to serve so many purposes strikes me as old-think.