Thursday, August 07, 2008

Clever method of near duplicate detection

Martin Theobald, Jonathan Siddharth, and Andreas Paepcke from Stanford University have a cute idea in their SIGIR 2008 paper, "SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections" (PDF). They focus near duplicate detection on the important parts of a web pages by using the next few words after a stop word as a signature.

An extended excerpt:
The frequent presence of many diverse semantic units in individual Web pages makes near-duplicate detection particularly difficult. Frame elements for branding, and advertisements are often freely interspersed with other content elements. The branding elements tend to be deliberately replicated across the pages of individual sites, creating a "noise level" of similarity among pages of the same site.

SpotSigs provides a robust scheme for extracting characteristic signatures from Web documents, thus aiming to filter natural-language text passages out of noisy Web page components ... [It] is much more efficient, easier to implement, and less error-prone than ... layout analysis .... [and] we are able to show very competitive runtimes to the fastest known but more error-prone schemes like I-Match and Locality Sensitive Hashing (LSH).

The points (or "spots") in the page at which spot signatures are generated are all the locations where ... anchor words [occur]. We call the anchor words antecedents, which are typically chosen to be frequent within the corpus. The most obvious, largely domain-independent choices ... are stopwords like is, the, do, have, etc. ... which are distributed widely and uniformly within any snippet of natural-language text.

A spot signature ... consists of a chain of words that follow an antecedent word .... Spot signatures ... [focus on] the natural-language passages of Web documents and skip over advertisements, banners, and navigational components.
The paper gives an example of a generating a spot signature where a piece of text like "at a rally to kick off a weeklong campaign" produces two spots: "a:rally:kick" and "a:weeklong:campaign".

What I particularly like about this paper is that they take a very hard problem and find a beautifully simple solution. Rather than taking on the brutal task of tearing apart the page layout to discard ads, navigation, and other goo, they just noted that the most important part of the page tends to be natural language text. By starting all their signatures at stopwords, they naturally focus the algorithm on the important parts of the page. Very cool.

If you can't get enough of this topic, please see also my previous post, "Detecting near duplicates in big data", that discusses a couple papers out of Google that look at this problem.

Update: Martin Theobald posted a detailed summary of the paper on the Stanford InfoBlog.


Anonymous said...

"By starting all their signatures at stopwords, they naturally focus the algorithm on the the important parts of the page."

Ironically you have a duplicate stopword in there.

Greg Linden said...

Oops! Thanks, fixed now!

Abdur said...

That is a nice piece of work. When I came up with I-match I was trying to focus the duplicate signatures at the important pieces of the document/information used via IDF filtering. I like how they are using this approach to refine what is used to find a duplicate rather than the other works that simply try and find ways to project the problem into some similarity space.