Wednesday, September 07, 2011

Blending machines and humans to get very high accuracy

A paper by six Googlers from the recent KDD 2011 conference, "Detecting Adversarial Advertisements in the Wild" (PDF), is a broadly useful example of how to succeed at tasks requiring very high accuracy by combining many different machine learning algorithms, high quality human experts, and lower quality human judges.

Let's start with an excerpt from the paper:
A small number of adversarial advertisers may seek to profit by attempting to promote low quality or untrustworthy content via online advertising systems .... [For example, some] attempt to sell counterfeit or otherwise fraudulent goods ... [or] direct users to landing pages where they might unwittingly download malware.

Unlike many data-mining tasks in which the cost of false positives (FP's) and false negatives (FN's) may be traded off, in this setting both false positives and false negatives carry extremely high misclassification cost ... [and] must be driven to zero, even for difficult edge cases.

[We present a] system currently deployed at Google for detecting and blocking adversarial advertisements .... At a high level, our system may be viewed as an ensemble composed of many large-scale component models .... Our automated ... methods include a variety of ... classifiers ... [including] a single, coarse model ... [to] filter out ... the vast majority of easy, good ads ... [and] a set of finely-grained models [trained] to detect each of [the] more difficult classes.

Human experts ... help detect evolving adversarial advertisements ... [through] margin-based uncertainty sampling ... [often] requiring only a few dozen hand-labeled examples ... for rapid development of new models .... Expert users [also] search for positive examples guided by their intuition ... [using a custom] tool ... [and they have] surprised us ... [by] developing hand-crafted, rule-based models with extremely high precision.

Because [many] models do not adapt over time, we have developed automated monitoring of the effectiveness of each ... model; models that cease to be effective are removed .... We regularly evaluate the [quality] of our [human experts] ... both to assess the performance of ... raters and measure our confidence in these assessments ... [We also use] an approach similar to crowd-sourcing ... [to] calibrate our understanding of real user perception and ensure that our system continues to protect the interest of actual users.
I love this approach, using the human intuition of experts to help guide, assist, and correct algorithms running over big data. These Googlers used an ensemble of classifiers, trained with labels from experts who focused on the edge cases, and ran them over features extracted from a massive data set of advertisements. They then built custom tools to make it easy for experts to search over the ads, follow their intuition, dig in deep, and fix the hardest cases the classifiers missed. Because the bad guys never quit, the Googlers not only constantly add new models and rules, but also constantly evaluate existing rules, models, and human experts to make sure they are still useful. Excellent.
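To make the cascade concrete: the coarse, high-recall model lets the vast majority of obviously good ads through cheaply, the fine-grained per-class models only ever see the harder remainder, and models that stop earning their keep get retired. Here is a minimal sketch of that shape (my own illustration, with hypothetical class names, thresholds, and method names, not the paper's actual code), assuming fitted classifiers with a scikit-learn-style predict_proba/predict interface:

import numpy as np

class AdCascade:
    """Toy two-stage cascade: a coarse filter passes easy, good ads;
    fine-grained per-class models score only the harder remainder."""

    def __init__(self, coarse_model, fine_models, pass_threshold=0.05):
        self.coarse = coarse_model        # scores P(ad is adversarial at all)
        self.fine = fine_models           # e.g. {"counterfeit": clf1, "malware": clf2}
        self.pass_threshold = pass_threshold

    def classify(self, x):
        x = np.asarray(x).reshape(1, -1)
        p_bad = self.coarse.predict_proba(x)[0, 1]
        if p_bad < self.pass_threshold:
            return "good"                 # the vast majority of easy ads stop here
        # Harder cases: each fine-grained model votes on its own difficult class.
        scores = {name: m.predict_proba(x)[0, 1] for name, m in self.fine.items()}
        worst, score = max(scores.items(), key=lambda kv: kv[1])
        return worst if score > 0.5 else "needs_human_review"

    def retire_ineffective(self, fresh_X, fresh_y, min_precision=0.99):
        """Toy version of the automated monitoring: drop any fine-grained
        model whose precision on freshly labeled traffic has decayed."""
        kept = {}
        for name, model in self.fine.items():
            flagged = model.predict(fresh_X) == 1
            precision = (np.asarray(fresh_y)[flagged] == name).mean() if flagged.any() else 1.0
            if precision >= min_precision:
                kept[name] = model
        self.fine = kept

The design point is simply that the expensive, specialized models never have to be accurate on the whole ad stream, only on the small, hard slice the coarse filter lets through.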

I think the techniques described here are applicable well beyond detecting naughty advertisers. For example, I suspect a similar approach could be applied to mobile advertising, a hard problem where limited screen space and attention make relevance critical, but where we usually have very little data on each user's interests, each user's intent, and each advertiser. Combining human experts with machines, as these Googlers have done, could be particularly useful for bootstrapping and for overcoming sparse and noisy data, two problems that make it so difficult for startups to succeed at problems like mobile advertising.
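The margin-based uncertainty sampling mentioned in the excerpt is also the piece that matters most in a sparse-data setting like that: rather than paying experts to label examples at random, you hand them only the ones the current model is least sure about, which is how a few dozen labels can be enough to stand up a new model. A minimal sketch of the selection step (my own illustration, not the paper's code), assuming any fitted classifier with predict_proba:

import numpy as np

def margin_uncertainty_sample(model, unlabeled_X, batch_size=50):
    """Pick the unlabeled examples with the smallest gap between the top two
    predicted class probabilities -- the cases the model is least sure about,
    and the best use of scarce expert labeling time (illustrative sketch)."""
    proba = model.predict_proba(unlabeled_X)      # shape: (n_examples, n_classes)
    top_two = np.sort(proba, axis=1)[:, -2:]      # two largest probabilities per row
    margins = top_two[:, 1] - top_two[:, 0]       # small margin = high uncertainty
    return np.argsort(margins)[:batch_size]       # indices of the most uncertain examples

Each round, those examples go to the human raters, the new labels go back into the training set, the model is refit, and the kind of monitoring the paper describes decides whether the resulting model keeps its place in the ensemble.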

1 comment:

jeremy said...

I love this approach, using the human intuition of experts to help guide, assist, and correct algorithms running over big data. These Googlers used an ensemble of classifiers, trained with labels from experts who focused on the edge cases, and ran them over features extracted from a massive data set of advertisements. They then built custom tools to make it easy for experts to search over the ads, follow their intuition, dig in deep, and fix the hardest cases the classifiers missed. Because the bad guys never quit, the Googlers not only constantly add new models and rules, but also constantly evaluate existing rules, models, and human experts to make sure they are still useful. Excellent.

Isn't this exactly the stance I was advocating 5-6 years ago, when we were having our "tools vs. personalization" debates? Granted, the approach described above is for use only by Googlers internally, to improve their own system, whereas I want to be able to use similar tools for general web search, so as to really be able to dig deep on a particular topic of interest. So maybe we still have disagreement on the scope of the applicability. But the general idea, that of a machine/system/learning algorithm that supports active human exploration, interaction, and non-lazy investigation, is exactly the approach I've wanted for years now.