Comments on Geeking with Greg: Detecting near duplicates in big data (Greg Linden)

JD (2011-10-06):
Interesting. I would like to know what the weight of a feature means and how to calculate it. If I understand correctly, a feature can be just a token from the document.

Dennis Gorelik (2009-02-12):
Greg, am I right that Google's paper is basically saying that the most efficient way to find near-duplicate documents is to count the number of matching triplets in two documents? (A triplet is a 3-word phrase.)

jeff.dalton (2008-04-17):
Very elegant technique for dimension reduction with document similarity. Google also patented it: Methods and apparatus for estimating similarity (US Patent 7,158,961).
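To illustrate the weighting question JD raises: in the simhash technique these comments discuss, each feature (e.g. a token) hashes to a bit pattern, and its weight scales that feature's vote on each bit of the fingerprint. The sketch below is a minimal illustration, not Google's implementation; using term frequency as the weight is an assumption for demonstration (in practice weights might come from tf-idf or other schemes).

```python
import hashlib
from collections import Counter

def simhash(tokens, bits=64):
    # Assumption for illustration: each feature is a token and its
    # weight is its term frequency within the document.
    weights = Counter(tokens)
    v = [0] * bits
    for token, w in weights.items():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Each feature votes +w or -w on bit i, depending on
            # whether bit i of its hash is set.
            v[i] += w if (h >> i) & 1 else -w
    # Fingerprint bit i is 1 when the weighted vote is positive.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")
```

Near-duplicate documents then show up as fingerprints with small Hamming distance, so changing one token only nudges the vote tallies rather than scrambling the whole hash.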