Here's an interesting technical problem. You work at a major search engine crawling billions of pages. How do you identify that two sites are copies of each other?
This problem turns out to be quite important to relevance ranking. If two sites are copies of each other, they'll tend to distort PageRank (and similar link-based relevance rankings). The obvious solution, computing pairwise similarity of all documents, is completely impractical on a dataset of this size.
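Just to put a rough number on it: with even a few billion documents, all-pairs comparison means something like (3 x 10^9)^2 / 2, or around 5 x 10^18 similarity computations. No cluster is going to chew through that.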
This question also seems to be a popular interview question at Google. When I talked to them over a year ago, three different people asked me how I'd solve this problem. At the time, I said I'd use a content-based technique that, briefly, would compute several signatures for samples of the text of the documents and then look for matches for those signatures.
Turns out our good friend Jeff Dean co-authored a paper (PDF) on this very topic. The paper analyzes the performance of several techniques for detecting mirrors, from simple approaches like matching IP addresses or hostnames to more complicated and quite clever analysis of the link structure of sites. The paper concludes that a content-based approach (called "shingles" in the paper) works well but that a combination of several approaches works best.
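If you're curious what the content-based idea looks like in practice, here's a rough sketch in Python. It's my own illustration of shingling with a bottom-k signature, not code from the paper, and the function names and parameters are all just for exposition:

```python
import hashlib


def shingles(text, k=8):
    """Hash every k-word window (shingle) of the text to an integer."""
    words = text.lower().split()
    return {
        int(hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest(), 16)
        for i in range(max(len(words) - k + 1, 1))
    }


def signature(text, num_hashes=64):
    """Keep only the num_hashes smallest shingle hashes as a compact signature."""
    return sorted(shingles(text))[:num_hashes]


def resemblance(sig_a, sig_b, k=64):
    """Bottom-k estimate of Jaccard similarity: among the k smallest hashes
    drawn from both signatures combined, count the fraction present in both."""
    union_smallest = sorted(set(sig_a) | set(sig_b))[:k]
    both = set(sig_a) & set(sig_b)
    return sum(1 for h in union_smallest if h in both) / max(len(union_smallest), 1)
```

The point of the fixed-size signature is that you never compare full documents against each other: you index the hash values and only compare documents that already share some of them, which sidesteps the all-pairs problem entirely.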
The paper is accessible. Worth a peek if you have any interest in the topic.