Comments on Geeking with Greg: Detecting near duplicates in big data (Greg Linden)

JD (2011-10-06):
Interesting. I would like to know what the weight of a feature means and how to calculate it. If I understand correctly, a feature can be just a token from the document.

Dennis Gorelik (2009-02-12):
Greg, am I right that Google's paper is basically saying that the most efficient way to find near-duplicate documents is to count the number of matching triplets in two documents? (A triplet is a 3-word phrase.)

jeff.dalton (2008-04-17):
Very elegant technique for dimension reduction with document similarity. Google also patented it: Methods and apparatus for estimating similarity (US Patent 7,158,961).
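To illustrate the weighting question JD raises: in the simhash technique these comments discuss, each feature (e.g. a token) hashes to a bit pattern, and its weight scales that feature's vote on each bit of the fingerprint. The sketch below is a minimal illustration, not Google's implementation; using term frequency as the weight is an assumption for demonstration (in practice weights might come from tf-idf or other schemes).

```python
import hashlib
from collections import Counter

def simhash(tokens, bits=64):
    # Assumption for illustration: each feature is a token and its
    # weight is its term frequency within the document.
    weights = Counter(tokens)
    v = [0] * bits
    for token, w in weights.items():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Each feature votes +w or -w on bit i, depending on
            # whether bit i of its hash is set.
            v[i] += w if (h >> i) & 1 else -w
    # Fingerprint bit i is 1 when the weighted vote is positive.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")
```

Near-duplicate documents then show up as fingerprints with small Hamming distance, so changing one token only nudges the vote tallies rather than scrambling the whole hash.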