Wednesday, May 14, 2008

Yahoo, Hadoop, and Pig Latin

Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins from Yahoo have an upcoming paper at SIGMOD 2008, "Pig Latin: A Not-So-Foreign Language for Data Processing" (PDF), that details Yahoo's work to build a powerful parallel processing language on top of Hadoop.

Some excerpts:

We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce.

At a growing number of organizations, innovation revolves around the collection and analysis of enormous data sets such as web crawls, search logs, and click streams ... For example, the engineers who develop search engine ranking algortihms spend much of their time analyzing search logs looking for exploitable trends.

A Pig Latin program is a sequence of steps ... each of which carries out a single data transformation ... Writing a Pig Latin program is similar to specifying a query execution plan ... This method is much more appealing than encoding [a] task as an SQL query, and then coercing the system to choose the desired plan through optimizer hints.

Pig ... is fully implemented and available as ... open-source. [Pig is executed] on Hadoop, an open-source, scalable map-reduce implementation. Pig has an active and growing user base inside Yahoo! and ... [is] beginning to attract users in the broader community.
The paper provides a number of examples of what Pig code looks like and how it executes across a cluster. The related work section of the paper is excellent and should not be missed; it compares Pig, Bigtable, MapReduce, map-reduce-merge, Dryad, and Sawzall.

Please see also my previous posts on Pig and Yahoo's Hadoop clusters, "Yahoo Pig and Google Sawzall", "Hadoop Summit notes", and "Yahoo deploys large scale Hadoop cluster".

Advertising, search, and drive-by malware

Googlers Niels Provos, Panayiotis Mavrommatis, Moheeb Rajab, and Fabian Monrose have an amusingly titled 2008 tech report, "All Your iFrames Are Point to Us" (PDF), with frightening results on the sophistication and ubiquity of malware distribution networks.

Some excerpts:

Various browser vulnerabilities [can] automatically download and run -- i.e. unknowingly to the visitor -- [a malware] binary upon visiting a website ... The potential victim base from these so-called drive-by downloads can be far greater than other forms of exploitation because traditional defenses (e.g. firewalls, dynamic addressing, proxies) pose no barriers to infection.

Malware serving networks are composed of tree-like structures ... [that] deliver the malware to the victim after a number of indirection steps ... to lure users .... Our results reveal that ad serving networks are increasingly being used as hops in the malware distribution chain.

About 0.6% of the top million URLs that appeared most frequently in Google's search results led to exposure to malicious activity at some point.

In a set of 2,000 well known advertising networks ... 2% of the landing sites were delivering malware via advertisements ... [often with] more than 6 redirection steps.

1.3% of the incoming search queries to Google's search engine return at least one link to a malicious site [either in the results or in an ad].
The paper goes on to describe the honeypot they used to detect malware, some of the properties of the malware, more detail and examples on the malware distribution networks and how they obscure the malicious final landing site, and how antivirus products fail to detect much of the malware.

Please see also "Ghost turns Zombie: Exploring the Life Cycle of Web-based Malware" (PDF), a fascinating USENIX 2008 paper by some of the same authors that details the "post-infection network behavior of web-based malware". Some of the botnets "demonstrated surprising sophistication" including sending "memory dumps of failed installations" back to the malware developers.

Please see also Niels Provos' post on the Official Google Security Blog and a post by Bruce Schneier, both on the first paper.

Friday, May 09, 2008

Starting Findory: Funding

[This is a continuation of the posts in my Starting Findory series describing my experiences building my first company, Findory.]

I had the wrong strategy with trying to get funding for Findory. I pursued VCs instead of angels.

I should have realized, VCs would never fund Findory. From their point of view, there was just too much risk.

Findory was lead by a first time entrepreneur and was missing critical expertise in building a company. Findory lacked a full team; in particular, there was no finance, marketing, or business development talent. And, Findory had an unproven technology -- a specific technique for personalizing news, search, and advertising -- that is difficult for a VC firm to evaluate.

Had any one of these three been different, there might have been a better chance. A proven founder may be able to get funding with nothing more than an idea on a napkin. A strong team ready to go is an attractive investment for a VC. A proven technology that just needs the wind of capital behind it is an easy bet. But, with all three missing, there was never much of a chance.

Perhaps in the frothier SF Bay Area, this might have been different. Perhaps someone might have taken a gamble. In Seattle, there are only a half a dozen VCs doing substantial internet investing; most firms out of Seattle will not fund without a local firm joining in the round.

I did pitch dozens of venture capitalists in and outside of Seattle on Findory. It was a fascinating experience, educational, exciting, and often humbling. But, ultimately, I have to say it was a wasteful distraction from building things to help Findory readers.

I should have focused on angels. Not only would angels have brought the right amount of cash for short-term needs, but also the addition of experienced angels to the board would have provided valuable advice and connections. It would have been just what Findory needed.

I had thought that, with how far I had bootstrapped Findory, that I could skip a step and jump directly to VCs. That was foolish. Findory remained a long way from the point where the momentum would be so obvious that it would overwhelm the other concerns.

Instead of looking to VCs, I should have looked for an experienced angel, someone passionate about personalizing information, who would have given the startup the advice it needed while enjoying the experience of joining with us to grow the company.

Please see also Paul Graham's article, "A Hacker's Guide to Investors".

Please see also Paul Graham's article, "Why There Aren't More Googles", especially his discussion of VCs as tending to be surprisingly "conservative" and "driven by consensus".

Please also see the other posts in the Starting Findory series.

Scaling Facebook's databases

Impressive numbers from Facebook on their architecture: 1,800 MySQL servers (900 pairs of master/slave) holding a heavily partitioned data set managed by just 2 DBAs.

Failures are handled by promoting a slave to master and then putting in a new slave quickly. It all must be heavily scripted and automated to be managed by just two people. Very nice.

Please see also the garbled but tolerable video from the panel, "Scaling MySQL -- Up or Out?", from the 2008 MySQL Conference where these numbers were disclosed.

I have to say, YouTube looked pretty bad sitting up there on the panel refusing to provide any numbers at the same time Facebook was being so forthcoming. But, Cuong Do Cuong's talk last year, "YouTube Scalability", from the Google Scalability Conference has some nice details on YouTube's architecture and lessons learned, though no specific numbers.

Please see also comments on the numbers from James Hamilton and Om Malik.

Netflix-KDD Workshop

The "Second KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition" will be on August 24 in Las Vegas.

Last year's workshop was a gold mine of work on large scale recommender systems using the Netflix data set. It promises to be interesting again this year.

Learning to love customers like you

Michael Schrage at MIT Technology Review writes a fun article, "Recommendation Nation: Learning to love customers like you".

Some excerpts:

Recommendations are everywhere on the Internet.

Pop-ups and context-sensitive advertisements have been supplemented by this low, seductive whisper of automated suggestion. The truth is that I now get more good recommendations about more things, more often, from Bayesian algorithms than from my best friends.

Perhaps this should make me wistful, but it doesn't. Better tech­nology doesn't mean worse friends.

Unlike human recommenders, Apple.com, ­Amazon.com, and Google.com never insult me by implying that I spend my time, money, or attention on the wrong things. They're simply making relevant--and occasionally novel--recommendations based on my past choices and the things I attend to in real time.

The focus of digital personalization has shifted from what I am interested in now to what I might be interested in next. All the choices I make in the moment are absorbed into a sphere of suggestion where, after they have been statistically weighted, they are reborn as offers and advice.
The article goes on to talk about several ways recommendations could be improved so they would be more helpful.

I am briefly quoted in the article talking about the early motivation behind building Amazon.com's recommendations.

Wednesday, May 07, 2008

Crawling is harder than it looks

The best paper award at WWW 2008 went to a paper on large-scale crawling titled "IRLbot: Scaling to 6 Billion Pages and Beyond" (PDF) by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov.

I have never been that interested in crawling, so I almost missed this talk. But, I am glad I didn't.

The paper is a fascinating look at the difficulty of doing very large scale web crawling, focusing on avoiding web spam and pathological link structures that could trap a crawler, making sure to not overwhelm crawled webservers, and using high performance techniques for duplicate detection.

Some extended excerpts:

The web has changed significantly since the days of early crawlers, mostly in the area of dynamically generated pages and web spam. With server-side scripts that can create infinite loops, high-density link farms, and unlimited number of hostnames, the task of web crawling has changed from simply doing a BFS scan of the WWW to deciding in real time which sites contain useful information and giving them higher priority as the crawl progresses.

The first performance bottleneck we faced was caused by the complexity of verifying uniqueness of URLs and their compliance with robots.txt. As N scales into many billions ... [prior] algorithms ... no longer keep up with the rate at which new URLs are produced by our crawler (i.e., up to 184K per second) ... [A] new technique called Disk Repository with Update Management (DRUM) ... can store large volumes of arbitrary hashed data on disk and implement very fast check, update, and check+update operations using bucket sort ... DRUM can be thousands of times faster than prior disk-based methods.

In order to determine the legitimacy of a given domain, we use a very simple algorithm based on the number of incoming links from assets that spammers cannot grow to infinity. Our algorithm, which we call Spam Tracking and Avoidance through Reputation (STAR), dynamically allocates the budget of allowable pages for each domain and all of its subdomains in proportion to the number of in-degree links from other domains.

We ran IRLbot on a [single] quad-CPU AMD Opteron 2.6 GHz server (16 GB RAM, 24-disk RAID-5) attached to a 1-gb/s link ... [for a] total active crawling span of 41.27 days. During this time, IRLbot attempted 7,606,109,371 connections and received 7,437,281,300 valid HTTP replies ... IRLbot ended up with N = 6,380,051,942 responses with status code 200 and content-type text/html.

The average download rate during this crawl was 319 mb/s (1,789 pages/s) with the peak 10-minute average rate of 470 mb/s (3,134 pages/s). The crawler received 143 TB of data, out of which 254 GB were robots.txt files, and transmitted 1.8 TB of HTTP requests. The parser processed 161 TB of HTML code.
At very large scale and from a single box, those are pretty remarkable crawling rates, 2k pages/second. The details of the bottlenecks they encountered and how they overcame them made this paper a quite enjoyable read.

I did end with a question about part of the work. The authors say (in 7.1 and Figure 6) that the probability of finding a unique page never dropped below .11 and, earlier (in 1.4) say that "we believe a good fraction of the 35B URLs not crawled in this experiment [contain] useful content." However, they define a unique page as the same URL, so it seems like it could easily be the case that those 35B URLs seen but not crawled could have duplicate or near duplicate content to the 6B crawled.

Update: One topic the IRLbot paper ignored was how frequently we should recrawl a page. Another WWW 2008 paper, "Recrawl Scheduling Based on Information Longevity" (PDF) by Chris Olston and Sandeep Pandey, has a nice overview of that issue and then extends the prior work by focusing on the persistence of a web page. The authors end up arguing that it is not worth recrawling pages that change very rapidly very frequently because it is impossible to ever get the index to match the actual content of the page.

Update: On the question of what content is useful, both in terms of crawling and recrawling, I always find myself wondering how much we should care about pages that never get visited by anyone. Shouldn't our focus in broadening a crawl beyond 6B pages and in recrawling pages in our index depend on how much users seem to be interested in the new content?

Finding the location of interests and objects from search logs

A paper at WWW 2008, "Spatial Variation in Search Engine Queries" (PDF), by Lars Backstrom, Jon Kleinberg, Ravi Kumar, and Jasmine Novak offered many clever examples of using where people are when they do a web search both to determine when interest in a topic is geographically isolated and to estimate the physical location of objects.

The most dramatic example the latter in the paper was estimating the location of a hurricane over time by the queries for the name of the hurricane. It worked surprisingly well.

I intended to write much more about this paper, but Danny Sullivan already covered much of what I planned to say in his excellent post, "Yahoo Paper: Finding The Local 'Center' Of Search Queries".

Danny excerpts many of the pictures from the article and comes to the same conclusion that I did, that it is fascinating, motivating paper, but it leaves open the question how to apply this to improve local search.

Please see also the papers at the LocWeb Workshop from WWW 2008, several of which also looked at using IP addresses in search logs to determine local intent.

Thursday, May 01, 2008

Random walks of the click graph

I recently have been coming across several interesting efforts that use random walks of a click graph for relevance or keyword suggestions in search and advertising.

The basic idea is to build a bipartite graph with queries on one side, documents or ads on the other, and weighted links representing clicks after making the query on the documents or ads. Then, we start at one point of the graph, randomly choose a link to walk across, and keep doing that for several steps, seeing where we end up most frequently.

One nice example of this is the WWW 2008 short paper "Simrank++: Query Rewriting through Link Analysis of the Click Graph" (PDF + tech report PDF) by Ioannis Antonellis, Hector Garcia-Molina, and Chi-Chao Chang.

An excerpt from the longer tech report:

We focus on query rewrites based on the recent history of ads displayed and clicked on.

The back-end generates a historical click graph that records the clicks that were generated by ads when a user inputs a given query. The click graph is a weighted bi-partite graph, with queries on one side and ads on the other.

Our techniques identify not only queries that are directly connected by an ad (e.g., users that submit either "mp3" or "i-tunes" click on ad an for "iPod") but also queries that are more indirectly related.

Notice that our query rewriting problem is a type of collaborative filtering (CF) problem. We can view the queries as "users" who are recommending "ads" by clicking on them. When we identify similar queries, we are finding queries that have similar recommendations, just like in CF, where one finds users that have similar tastes.
Another short paper at WWW 2008, this one by Martin Szummer and Nick Craswell, "Behavioral Classification on the Click Graph" (PDF), looked at using random walks of the click graph for classifying adult websites.

An excerpt:
We present a method for exploiting user browsing activity to
classify web pages ... as adult (inappropriate for minors) or not.

When a user types a query and then clicks a search result, they create a query-URL association. By logging a large number of click events, the search engine can amass a large number of query-URL pairs. These can be viewed as a bipartite graph, where each query is adjacent to one or more URLs and each URL is adjacent to one or more queries.

Our method exploits the structure of this graph. We implicitly look for clusters of nodes (representing queries or URLs) that share clicks to/from each other. We assume that nodes that cluster well are likely to belong to the same underlying category.

The implicit clustering is done via a Markov random walk on the graph. The walk captures the transitivity of class similarity on the graph: if A is co-clicked with B and B is co-clicked with C, then A is also likely to be related to C.
To me, this had the feel of a click graph-based version of TrustRank (PDF), propagating reputation across a click graph instead of a link graph.

Please see also Nick and Martin's SIGIR 2007 paper, "Random Walks on the Click Graph" (PDF), which looks using click graph random walks for relevance rank in image search.

Please see also yet another WWW 2008 paper, "Using the Wisdom of the Crowds for Keyword Generation" (PDF), by Ariel Fuxman, Panayiotis Tsaparas, Kannan Achan, and Rakesh Agrawal, which also looks at using random walks of the click graph in order to suggest keywords for advertising.

Size matters? Or simplicity?

An amusing WWW 2008 poster by Joshua Blumenstock, "Size Matters: Word Count as a Measure of Quality on Wikipedia" (PDF) found that a very simple measure of "quality" of article on Wikipedia, the number of words in the article, performed nearly as well as much more complicated classifiers.

Word count with a simple threshold achieved 96.5% accuracy in the classification task. The other more complicated techniques ranged from 81-98% accuracy.

Joshua is careful in the paper to not "exaggerate the importance of this metric", but I believe there is a lesson here: Try the simple things first. It can often be surprising how well something simple works.

Monday, April 28, 2008

Questioning Yahoo Answers

Zoltan Gyongyi, Georgia Koutrika, Jan Pedersen, and Hector Garcia-Molina published a paper at WWW 2008, "Questioning Yahoo Answers", that has an interesting analysis of how people are using Yahoo Answers.

Some excerpts:

The majority of users are askers ... [Only] a small portion of the user population that provides answers or votes .... In practice, a large fraction of best answers are of questionable value .... Most answerers are not domain experts focusing on a particular topic, but rather individuals with diverse interests quite eager to explore questions of different nature.

Many questions are meant to trigger discussions, encourage the users to express their opinions, etc. The expected answers are subjective. Thus, in a traditional sense, there is no bad or good (or best) answer to such "questions." Consider, for example, "What do you think of Red Hot Chili Peppers?"

Note, however, that Yahoo! Answers was not designed with discussions in mind, and so it is an imperfect medium for this purpose. Users cannot answer their own questions, thus cannot participate in a discussion. Similarly, a user may only post at most one answer to a question. Yahoo! Answers has no support for threads that would be essential for discussions to diverge.

Much of the interaction on Yahoo! Answers is just noise. People post random thoughts as questions, perhaps requests for instant messaging. With respect to the latter, the medium is once again inappropriate because it does not support real-time interactions.

This evidence and our results point to the following conclusion: A question answering system, such as Yahoo! Answers, needs appropriate mechanisms and strategies that support and improve the question-answering process per se. Ideally, it should facilitate users finding answers to their information needs and experts providing answers.
Please see also my October 2007 post, "Revisiting Yahoo Answers", where I look at Yahoo Answers not as a question answering site, but as what it primarily seems to be used for, a discussion forum.

Search trails and relevance

Misha Bilenko and Ryen White from Microsoft Research had a paper at WWW 2008, "Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity" (PDF), that is a fascinating look at going beyond the first click on search results for improving search results and toward considering all the pages people visit.

An excerpt from the paper:

While query and clickthrough logs from search engines have been shown to be a valuable source of implicit supervision for training retrieval methods, the vast majority of users' browsing behavior takes place beyond search engine interactions.

This paper proposes exploiting a combination of searching and browsing activity of many users to identify relevant resources for future queries. To the best of our knowledge, previous approaches have not considered mining the history of user activity beyond search results, and our experimental results show that comprehensive logs of post-search behavior are an informative source of implicit feedback for inferring resource relevance.

Web browser toolbars have become increasingly popular ... Examples of popular toolbars include those affiliated with search engines (e.g., Google Toolbar, Yahoo! Toolbar, and Windows Live Toolbar) ... Most popular toolbars log the history of users' browsing behavior on a central server for users who consented to such logging. Each log entry includes an anonymous session identifier, a timestamp, and the URL of the visited Web page. From these and similar interaction logs, user trails can be reconstructed.

Training retrieval algorithms on interaction behavior from navigation trails following search engine result click-through leads to improved retrieval accuracy over training on only result click-through or search destinations ... Our research has profound implications for the design of Web search ranking algorithms and the improvement of the search experience for all search engine users.
Please see also Googler Daniel Russel's JCDL 2007 talk, "What are they thinking? Searching for the mind of the searcher" (PDF), which shows, starting on slide 33, that Google is using their toolbar data to analyze user behavior.

Thanks, Ionut Alex Chitu, for the pointer to Daniel's JCDL talk.

Keynotes at WWW 2008

The keynote talks at WWW 2008 promised to be exciting. With names like Google VP Kai-Fu Lee (who's departure from Microsoft famously revealed Ballmer's hidden passion for hurling chairs), Microsoft VP Harry Shum, inventor of the Web Tim Berners-Lee, and AT&T Chief Scientist David Belanger, how could you go wrong?

Surprisingly, it did go wrong. Everyone I talked to gave reviews of the keynotes that ranged from lukewarm to blisteringly negative.

The general conclusion, which also agrees with my perception, seemed to be that Kai-Fu Lee and Harry Shum gave pitches for Google and Microsoft rather than vision. And, in the Great Hall of the People, Tim Berners-Lee seemed woefully unprepared, rambling with such a lack of coherent theme that many buried themselves in alcohol to escape the pain.

Nevertheless, there were tidbits that could be extracted. Comparing Kai-Fu Lee's and Harry Shum's talks, I was struck by the way Harry Shum focused entirely on search while Kai-Fu seemed to be looking beyond search.

Kai-Fu talked mostly about cloud computing and showed Google products as examples of a shift toward cloud computing. For example, Kai-Fu pointed at GMail and Google Docs as applications moving to the browser, and integration of those applications as examples of how the browser can make applications more "task-focused" (an argument I found quite weak, by the way, even though I tend to support the ideas behind cloud computing).

Harry Shum focused on improving search, noting that "40% of queries go unanswered" and "50% require refinement", and then saying we need to look at concepts, topics, entities, and tasks to try to figure out "what the user really wants to do".

On the one hand, both talks seemed odd to me that the search giant Google offered a keynote at a Web conference not about search whereas Microsoft used their keynote to talk about little else.

On the other hand, both talks may have been looking at what the companies need to do to expand beyond where they currently are on the Web. Google needs to move beyond search and advertising. Microsoft needs to get into search and advertising. In that framework, the two speeches make a lot of sense.

David Belanger's sparsely attended talk was about how we can integrate data across three distinct interfaces, mobile, PC, and television. Unlike some who claim these three are converging, David argued that TV, PC, and mobile will stay separate and the key question will be interfacing them with each other. I was pleased that David pushed recommendations for information as necessary "in the not to distant future" when we have "access to any video ever made" and only weak ability to specify our interests beyond a desire to watch something fun and interesting. Sadly, he did not expand much on that point, instead devoting much of the talk to promoting AT&T and discussing low level networking details of data sharing and synchronization.

It is hard to say much about Tim's talk other than that he was enthusiastic about globalization, internationalization, and the semantic web. Tim pointed that Chinese was poised to become the dominant language on the Web and urged everyone to think about the changes that would cause. He promoted the semantic web, but offered nothing new. The remainder of talk was so badly unstructured -- nearly a stream of consciousness -- that it was hard to follow despite the passion he tried to put behind it.

Friday, April 18, 2008

GoogleBot starts on the deep web

Jayant Madhavan and Alon Halevy have the post on the Official Google Blog, "Crawling through HTML forms", announcing that the crawler will start submitting queries in forms in order to pull data out of databases that otherwise would not be visible.

Please see also my March 2007 posts, "The end of federated search?" and "Google and the deep web", which discuss several papers out of Google (that Jayant and Alon authored) that appear to describe Google's thoughts on what to do about the deep web.

Mousetracking in web search

Googlers Kerry Rodden, Xin Fu, Anne Aula, and Ian Spiro had a short paper last week at CHI 2008, "Eye-mouse coordination patterns on web search results pages" (ACM site), with some some very nice graphs showing common behaviors using the mouse on web search results and how they correspond to eye movements over the page (see Figure 1, 2, and 3).

Unfortunately, their results seem to indicate that mouse tracking is only weakly correlated with eye tracking, so mouse tracking might be less useful than one would hope for determining which items on a web search result page were never even seen by a searcher.

For more on using mouse tracking as a replacement for eye tracking to understand what people look at on web search result pages, please see my earlier post, "Cheap eyetracking using mouse tracking", and its discussion of a SIGIR 2007 paper by some of the same Google authors.

Udi Manber interview in Popular Mechanics

Google VP Udi Manber has an interesting interview in Popular Mechanics.

An excerpt on learning to rank from searcher behavior:

Our goal is very simple: We want to return to the user the answer that they need.

The results we show you are based not only on what we know of the Web, but also what other people have searched for .... Signals from people are the best signals.
An excerpt on personalization:
If you allow us to keep your Web history, we will improve your search ... in two ways. One is, we will tune the result for you slightly. We're not going to change the whole page -- we might change position 5 to position 3 here and there, but we'll use whatever we can from your previous searches to adapt the current search to you.

The second is, we allow you to search within your Web history ... You may remember something you did three months ago and you don't remember exactly how you did it.
And, an excerpt on how improving overall relevance requires accumulating many small twiddles that apply to specific types of searches and needs:
Last year we made over 450 improvements to the algorithm .... We have to find what weakness in the algorithm caused that result and find a general solution to that, evaluate whether a general solution really works and if it's better, and then launch a general solution.

I'll give you an example of something that came last week. We were evaluating a certain algorithm that adds diversity to the result. We did live experiments, which means we launched the algorithm to a very small percentage of users and then see how that compares to the result without the algorithm.

One of the queries that made a difference: The query was, New York Times address ... The first result right there on the snippet gives you The New York Times. It turns out that's not what the user was looking for. They were looking for an address given out by a New York Times reporter the day before. And because of this diversity and because of our emphasis on freshness and highlighting fresh results, that particular address appeared somewhere in the results, and that's what the user wanted -- that's what they went to and got the result.

That was something that surprised even us. You don’t think that when someone searches for New York Times address that they’re not looking for the address. Language is like that. Intention can be ambiguous.
There is a re-occurring theme in Udi's comments, learning from searcher's behavior. Google learns from what people searched for. Google learns from a searcher's history. Google learns what works and what does not from what people click on. Google learns from what people do when they use Google.

For more on that, please see also my previous post, "Actively learning to rank".

Please see also my earlier post, "The perils of tweaking Google by hand", which talks about whether these thousands of twiddles to the search engine and variations of them probably should be constantly tested rather than just evaluated at the time they are created.

Wednesday, April 16, 2008

Detecting near duplicates in big data

I finally got to a WWW 2007 paper out of Google I have been meaning to read, "Detecting Near-Duplicates for Web Crawling" (PDF) by Gurmeet Manku, Arvind Jain, and Anish Sarma.

The paper takes more theoretical work from Moses Charikar back in 2002, "Similarity Estimation Techniques from Rounding Algorithms" (ACM site), which describes a form of locality sensitive hashing, and applies it at very large scale (8B documents), dealing with all the practical issues along the way.

In that sense, this Google WWW 2007 paper has a lot of similarities with Monika Henzinger's excellent SIGIR 2006 paper, "Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms" (ACM site). That paper evaluated the shingles method of creating fingerprints of substrings from a document against Charikar's technique on a 1.6B document data set and found the latter to be superior.

Some excerpts from the Google WWW 2007 paper:

Documents that are exact duplicates of each other ... are easy to identify ... A more difficult problem is the identification of near-duplicate documents ... [that] differ in a small portion of the document such as advertisements, counters and timestamps.

Charikar's simhash ... is a fingerprinting technique that enjoys the property that fingerprints of near-duplicates differ in a small number of bit positions.

We maintain an f-dimensional vector V, each of whose dimensions is initialized to zero. A feature is hashed into an f-bit hash value. These f bits (unique to the feature) increment/decrement the f components of the vector by the weight of that feature as follows: if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature; if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature. When all features have been processed, some components of V are positive while others are negative. The signs of components determine the corresponding bits of the final fingerprint.

Simhash possesses two conflicting proporties: (A) The fingerprint of a document is a "hash" of its features, and (B) Similar documents have similar has values.
The paper goes on to describe how they efficiently find nearby but different hash values that represent similar documents, how they rapidly build the simhash using MapReduce, and metrics on the effectiveness of the technique.

Going to WWW

I will be at the WWW 2008 Conference in Beijing next week.

Very much looking forward to it! If you see me there, please say hello!

Google engEdu talk on Bloom filters

Ely Porat gave a Google engEdu talk, "The Bloom filter", with a good survey of Bloom filters and variants on Bloom filters.

Bloom filters let you quickly determine if an item is a member of a set with no false negatives but some false positives. They work well for very large data sets since they require about 10 bits per item for a 1% false positive rate and offer constant time inserts and lookups.

Ely's talk provides some useful motivation at the beginning, then has a nice, light introduction to Bloom filters starting at 8:40. The discussion of variants that support deletion or compression is interesting but, if you find yourself dragging, make sure to at least jump to 30:40 for the cute discussion of Cukoo Hashing and the Cukoo Bloom Filter.

Please see also the "Bloom filter" Wikipedia page which contains a nice summary of Bloom filters and many of the variants described in Ely's talk.

Tuesday, April 15, 2008

Contextual advertising and social networks

Anand Rajaraman makes a counter-intuitive but compelling argument that having a close relationship between people who view a page reduces the value of the page for contextual advertising.

An excerpt:

Consider the difference between a Facebook profile and a TripAdvisor travel review. A typical [Facebook] pageview ... is by someone known very well to the creator of the profile – a close friend or acquaintance ... A TripAdvisor travel review is seen by people completely unrelated in any way to the person or persons who wrote the reviews on the page.

The "affinity" of a social media service is the average closeness of relationship between a content creator and someone who views that content. The affinity of Facebook is very high, while the affinity of TripAdvisor is very low.

There is an inverse relationship between the affinity of a social media service and its targetability. Why is this true? The act of viewing a Facebook profile gives us very little information about the viewer, other than the fact that she is friends with the profile creator; when someone views a TripAdvisor travel review, she is definitely interested in traveling to that location.
When a page draws from a wide audience, the people who manage to get to that page often will come from search and will be interested in the content of that page. So, they can easily be targeted by targeting the content of the page.

When a page draws from a narrow audience, like a group of friends, those people are not as interested in the content as the person. So, targeting the content, as contextual advertising does, will be less effective than targeting the people.

I have said before that lack of commercial intent is the problem on social networking sites, but Anand makes the good point that diversity of interest may be a better way of describing the problem. By summarizing why people come to a page, affinity may explain the difficulty of targeting advertising to that page.