Comments on Geeking with Greg: Optimizing broad match in web advertising (Greg Linden)
Bob Carpenter (http://lingpipe-blog.com/), 2009-06-18T14:43:32.397-07:00:
The record linkage literature is full of blocking strategies like the first one you described (must match on a word with low document frequency, DF).<br /><br />If you need to match every word, then search engines will often do intersections starting with the lowest-DF terms so that searches fail fast.<br /><br />Google's actually slow if you give it half a dozen medium- to low-frequency words that show up together in only a few documents. For instance, I just tried [frappe moka banquette pomerol], which had 42 total hits and took 0.6s.<br /><br />Nutch has a built-in option to do phrase search for common stop-words for just the reasons this post mentions (so it's no surprise Yahoo! is thinking about it, because Doug Cutting, lead developer for Nutch, works there).
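The "intersect starting with the lowest-DF terms" idea from the comment can be sketched roughly as follows. This is only a minimal illustration, not how any particular search engine implements it: the function names `build_index` and `match_all` are hypothetical, the toy documents are made up, and Python sets stand in for real posting lists.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def match_all(index, query):
    """Return ids of documents containing every query word.

    Terms are intersected from lowest to highest document frequency (DF),
    so the candidate set starts small and an impossible query fails fast.
    """
    # index.get() avoids creating empty entries in the defaultdict.
    terms = sorted(set(query.lower().split()),
                   key=lambda t: len(index.get(t, ())))
    if not terms:
        return set()
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        if not result:  # nothing can match; skip the more frequent terms
            return set()
    return result


# Hypothetical toy collection, for illustration only.
docs = {
    1: "frappe moka banquette pomerol",
    2: "moka pot coffee",
    3: "banquette seating and coffee",
    4: "coffee frappe recipe",
}
index = build_index(docs)
print(match_all(index, "coffee frappe"))   # {4}
print(match_all(index, "coffee pomerol"))  # set()
```

In the second query the rarest term ("pomerol", DF 1) is examined first, so the candidate set is already down to one document before the common term ("coffee") is touched. The same ordering is what makes a blocking rule such as "must match on a low-DF word" cheap: the rare term bounds the work up front.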