Thursday, May 28, 2009
Danny Sullivan has a couple great posts ([1] [2]) on Microsoft renaming its search engine to Bing, the new features and improvements in relevance coming along with the launch, and what it might mean for Google and Yahoo.
Danny is wise in the ways of search. His words are well worth reading.
Update: To be clear, just because I think Danny's posts are well worth reading doesn't mean I agree with him on everything. In fact, quite the opposite, I usually find the most interesting articles to be the ones that make a convincing argument on something with which I disagree.
One particular point of disagreement I have with Danny is on whether Microsoft's Bing needs to be different or better than Google to succeed. Many, including Danny, have said that being good enough isn't enough because of the power of Google's brand. I'm not sure that's true.
Until now, we haven't been able to tell who is right since Microsoft's search has been noticeably weaker than Google's. But, as both Rafe Needleman and Danny noted, the name change to Bing comes with substantial improvements in relevance.
So, I think we're about to get the first real test of whether being about as good as Google is enough to see people using Microsoft's search engine.
Update: A week later, Danny Sullivan has a Q&A with Nick Eaton. Interesting how low Danny sets the bar for success for Microsoft's Bing, saying, "They're roughly at 10 percent [market share], I think that they could count themselves really successful if they got themselves to 15 or 20."
Update: Three weeks later, an article in the NY Post claims, "Co-founder Sergey Brin is so rattled by the launch of Microsoft's rival search engine that he has assembled a team of top engineers to work on urgent upgrades to his Web service." Frankly, I find that a little hard to believe, at least as phrased. Seeing as Microsoft is getting closer, I would expect that Googlers are redoubling their efforts to stay ahead in core search, but I doubt that Sergey, Larry, or Eric is deeply rattled by some new competition.
Tuesday, May 26, 2009
Google Suggest and the right ad
In "Google Kills a Sacred Cow", Anders Bylund at the Motley Fool has an interesting take on Google's decision to show advertising in their search suggestion feature.
An excerpt:
You know how you get suggestions on ... the query you're typing into a Googlish search box? That suggestion list just became context-sensitive, more personalized, and more likely to send you to your destination without ever seeing a page of search results.
Oh, and you'll see advertising in that box, too. If you never see a results page, you might still send revenue Big G's way.
Yahoo recently promised to kill the typical "10 blue links" results in favor of more dynamic presentations. Google immediately took that idea one step further and did away with results altogether for many searches.
While Yahoo, Microsoft's MSN/Live/Kumo/whatever, Time Warner's AOL, and IAC/InterActiveCorp's Ask.com depend on exposing users to as many ads as possible by keeping them on their sites, Google is going the other way. All it takes is one ad -- the right ad in the right place -- and Google's cash flow is secure.
All it takes is one ad, the right ad in the right place.
Long ago, when I was at Amazon working on personalization, we used to joke that the ideal Amazon site would not show a search box, navigation links, or lists of things you could buy. Instead, it would just display a giant picture of one book, the next book you want to buy.
When you overwhelm people with choices, the important becomes lost in the mediocrity. Showing as many ads as possible just encourages ad fatigue. The focus of web advertising should be on getting the right ad in the right place. The focus should be on relevance.
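If it helps to make that concrete, here is a toy sketch of what picking one right ad for a partially typed query might look like. This is entirely my own illustration with made-up data and scoring, not how Google actually does it:

# Toy sketch: query suggestions plus at most one ad, chosen by relevance.
# All data and the scoring formula are invented for illustration.

SUGGESTIONS = {
    "digital cam": ["digital camera reviews", "digital camera deals"],
}

ADS = [
    {"text": "Acme Cameras - Free Shipping", "keywords": {"digital", "camera"}, "bid": 0.40},
    {"text": "Bob's Books", "keywords": {"books"}, "bid": 0.90},
]

def suggest_with_one_ad(prefix):
    completions = SUGGESTIONS.get(prefix, [])
    terms = set(prefix.split())
    best, best_score = None, 0.0
    # Score ads by relevance to the query terms, not just by bid, and show at most one.
    for ad in ADS:
        relevance = len(terms & ad["keywords"]) / len(ad["keywords"])
        score = relevance * ad["bid"]
        if relevance > 0 and score > best_score:
            best, best_score = ad, score
    return {"completions": completions, "ad": best["text"] if best else None}

print(suggest_with_one_ad("digital cam"))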
Monday, May 18, 2009
The datacenter is the new mainframe
For a while, I have been planning on writing a post comparing large scale clusters with the mainframes of yore, a piece that would be full of colorful references to timesharing, scheduling and renting compute resources, and other tales that would date me as the fossil that I am.
Fortunately, Googlers Luiz Andre Barroso and Urs Holzle recently wrote a fantastic long paper, "The Datacenter as a Computer" (PDF), that not only spares me from this task, but covers it with much more data and insight than I ever could.
Some excerpts:
New large datacenters ... cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in [our] facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment.
In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC).
Much like an operating system layer is needed to manage resources and provide basic services in a single computer, a system composed of thousands of computers, networking, and storage also requires a layer of software that provides an analogous functionality at this larger scale.
[For example] resource management ... controls the mapping of user tasks to hardware resources, enforces priorities and quotas, and provides basic task management services. Nearly every large-scale distributed application needs ... reliable distributed storage, message passing, and cluster-level synchronization.
The paper goes on to describe the challenges of making an entire datacenter behave like it is one large compute resource to applications and programmers, including discussing the existing application frameworks and need for further tools.
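To give a feel for what that cluster-level resource management layer does, here is a toy sketch of a scheduler that maps tasks to machines while enforcing priorities and per-user quotas. This is my own illustration, not code from the paper or from Google; the structures and numbers are made up:

# Toy cluster resource manager: assigns tasks to machines by priority,
# enforcing per-user CPU quotas. Purely illustrative, not a real scheduler.

from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    free_cpus: int

@dataclass
class Task:
    user: str
    cpus: int
    priority: int  # higher runs first

def schedule(tasks, machines, quotas):
    """Greedily place tasks, highest priority first, respecting per-user quotas."""
    used = {user: 0 for user in quotas}           # CPUs already granted per user
    placements = []
    for task in sorted(tasks, key=lambda t: -t.priority):
        if used.get(task.user, 0) + task.cpus > quotas.get(task.user, 0):
            continue                              # over quota, skip this task for now
        machine = next((m for m in machines if m.free_cpus >= task.cpus), None)
        if machine is None:
            continue                              # no machine has room right now
        machine.free_cpus -= task.cpus
        used[task.user] = used.get(task.user, 0) + task.cpus
        placements.append((task, machine.name))
    return placements

machines = [Machine("m1", 8), Machine("m2", 8)]
tasks = [Task("alice", 4, priority=2), Task("bob", 6, priority=1), Task("alice", 8, priority=3)]
quotas = {"alice": 10, "bob": 8}
for task, machine in schedule(tasks, machines, quotas):
    print(task.user, task.cpus, "->", machine)

Real systems also have to handle machine failures, memory and locality constraints, and queuing rather than just skipping tasks, but the core job of the layer -- mapping tasks to hardware under priorities and quotas -- looks roughly like this.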
Do not miss Figure 1.3 that shows latency, bandwidth, and capacity to resources in the data center. It includes an insightful look at latency to local and remote memory and the equivalent latencies but drastically different capacities of local and remote disk for programs running in the cluster. As the authors say, a key challenge is to expose these differences when they matter and hide them when they don't.
There are also thought-provoking tidbits on minimizing interference between jobs on the cluster and maximizing utilization, two goals that often are at odds with each other.
Much of the rest of the paper covers the cost and efficiency issues of data centers more generally, nicely and concisely summarizing much of the recent publicly known work on power, cooling, and infrastructure.
One small thing it does not mention is the ability to rent resources (e.g. EC2) on a WSC, much like buying time on the old mainframes, and the impact that could have on utilization, especially once we allow variable pricing and priorities.
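For example, a simple spot-market scheme might price idle capacity as a function of utilization and preempt low priority jobs whose bids no longer cover the going rate. Again, a toy sketch of my own, with an invented pricing formula, not anyone's actual pricing model:

# Toy variable pricing for idle cluster capacity: the price rises with utilization,
# and low priority (spot) jobs run only while their bid covers the current price.

BASE_PRICE = 0.10  # dollars per CPU-hour when the cluster is idle (made-up number)

def spot_price(used_cpus, total_cpus):
    """Price climbs steeply as the cluster fills up."""
    utilization = used_cpus / total_cpus
    return BASE_PRICE * (1.0 + 4.0 * utilization ** 2)

def run_or_preempt(spot_jobs, used_cpus, total_cpus):
    """Keep spot jobs whose bid still covers the price; preempt the rest."""
    price = spot_price(used_cpus, total_cpus)
    keep = [job for job in spot_jobs if job["bid"] >= price]
    preempt = [job for job in spot_jobs if job["bid"] < price]
    return keep, preempt, price

jobs = [{"name": "batch-index", "bid": 0.15}, {"name": "log-crunch", "bid": 0.30}]
keep, preempt, price = run_or_preempt(jobs, used_cpus=500, total_cpus=1000)
print("price=%.2f keep=%s preempt=%s" % (price, [j["name"] for j in keep],
                                         [j["name"] for j in preempt]))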
[Google paper found via James Hamilton]
Update: A couple weeks later, security guru Bruce Schneier writes, "Cloud computing is nothing new. It's the modern version of the timesharing model from the 1960s."
Update: Seven months later, Amazon launches variable pricing for low priority jobs on EC2.
Monday, May 11, 2009
The potential of behavioral targeted advertising
An important WWW 2009 paper out of MSR Asia, "How much can Behavioral Targeting Help Online Advertising?" (PDF), looks at how much value we can get from targeting ads to past behavior. It is a must read for anyone interested in personalized advertising.
Some excerpts:
To our best knowledge, this work is the first systematic study for [behavioral targeting] (BT) on real world ad click-through [logs].
[We] empirically answer ... whether BT truly has the ability to help online advertising ... how much BT can help ... [and] which BT strategy can work better than others.
We observe that the users who clicked the same ad can be over 90 times more similar than the users who clicked different ads .... [which verifies] the basic assumption of BT.
We observe that .... the ads CTR can be improved as high as 670% by the simple user segmentation strategies used for behavioral targeted advertising .... [More] advanced user representation and user segmentation algorithms [yielded improvements of] more than 1,000%.
Through comparing different user representation strategies for BT, we draw the conclusion that user search behavior, i.e. user search queries, can perform several times better than user browsing behavior, i.e. user clicked pages. Moreover, only tracking the short term user behaviors are more effective than tracking the long term user behaviors.
One issue with the study is that it looks only at coarse-grained user segments, at most 160 segments, not fine-grained, one-to-one, personalized advertising. I would suspect there would be added benefit from fine-grained, personalized targeting to past behavior rather than clustering users into large groups. In fact, as Figure 5 in the paper shows, they do not appear to have hit the point of diminishing returns even on splitting into more segments.
Another issue is that they explicitly did not look at using demographic data or locality. It is possible that many of the BT user segments might be grouped roughly by locality, having derived it from search or browsing behavior. If that is true, then many of the gains they saw from BT could be reaped much more easily by targeting the ads to implicit local information. And, if we target to locality, then the additional gains we could expect from behavioral targeting might be much smaller.
But, limitations aside, it is a great paper. The authors clearly and cleanly state their key question -- what is the value of behavioral targeting for advertising? -- and then analyze a massive historical log to convincingly derive the likely value. The paper also provides much guidance for those who might seek to build these or similar systems.
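As a rough illustration of the simplest strategy in the paper, one could bucket users into segments by their recent search terms and then compare an ad's clickthrough rate within each segment to its overall rate. This is my own toy sketch with made-up data and a deliberately crude segmentation rule, not the authors' code:

# Toy behavioral targeting sketch: segment users by their recent search terms,
# then compare an ad's CTR within each segment to its overall CTR.

from collections import Counter, defaultdict

# (user, recent search terms, clicked the camera ad) -- invented log
log = [
    ("u1", ["camera", "lens", "tripod"], 1),
    ("u2", ["camera", "memory", "card"], 1),
    ("u3", ["mortgage", "rates"], 0),
    ("u4", ["camera", "reviews"], 0),
    ("u5", ["mortgage", "refinance"], 0),
]

def segment(terms):
    """Crude one-feature segmentation: bucket users by their most frequent term."""
    return Counter(terms).most_common(1)[0][0]

overall_ctr = sum(click for _, _, click in log) / len(log)

by_segment = defaultdict(list)
for user, terms, click in log:
    by_segment[segment(terms)].append(click)

for seg, clicks in sorted(by_segment.items()):
    ctr = sum(clicks) / len(clicks)
    lift = ctr / overall_ctr if overall_ctr else 0.0
    print("segment=%-10s users=%d ctr=%.2f lift=%.1fx" % (seg, len(clicks), ctr, lift))

The paper's segments come from much richer user representations and clustering than a single dominant term, but the measurement -- per-segment CTR lift over the overall CTR -- is the same idea.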
The paper has an interesting conclusion that recent search queries are the most useful indicators of people's interests when targeting ads. Yahoo's Andrei Broder said something similar recently when thinking about targeting advertising. It is also worth noting that others who were looking at the value of personalized search came to similar conclusions ([1] [2]).
For more on fine-grained personalized advertising, please see my earlier posts, "What to advertise when there is no commercial intent?" and "A brief history of Findory".
Wednesday, May 06, 2009
Exploiting spammers to make computers smarter
Googlers Rich Gossweiler, Maryam Kamvar, and Shumeet Baluja had a fun paper at WWW 2009, "What's Up CAPTCHA? A CAPTCHA Based On Image Orientation" (PDF), that asks people to rotate images correctly to prove they are human rather than the norm of deciphering distorted text.
Some brief excerpts from their paper:
We present a novel CAPTCHA which requires users to adjust randomly rotated images to their upright orientation ... Rotating images to their upright orientation is a difficult task for computers .... [Our] system ... results in an 84% human success rate and .009% bot success rate.
The main advantages of our CAPTCHA technique over traditional text recognition techniques are that it is language-independent, does not require text-entry (e.g. for mobile devices), and employs another domain for CAPTCHA generation beyond character obfuscation.
The paper goes on to say that "no algorithm has yet been developed to successfully rotate the set of images used in our CAPTCHA system." The key word there is "yet." As soon as there is a strong incentive for people to develop better algorithms for this problem, better algorithms will be developed.
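To give a feel for the mechanics, here is a tiny sketch of the bookkeeping a server might do: secretly rotate an image known to be upright by a random angle, then accept an answer if the user's correction brings it back close enough to upright. This is my own toy version with an arbitrary tolerance, not the authors' implementation, and the actual image rotation and display are omitted:

# Toy rotation CAPTCHA bookkeeping: only the angle math is shown.

import random

TOLERANCE_DEGREES = 16  # how close to upright the answer must land (made-up value)

def new_challenge(rng=random):
    """Pick a secret rotation, avoiding angles so small the image looks upright already."""
    return rng.choice([a for a in range(0, 360, 5) if 30 <= a <= 330])

def is_human(secret_rotation, user_correction):
    """True if rotating the shown image by user_correction leaves it roughly upright."""
    final_angle = (secret_rotation + user_correction) % 360
    off_upright = min(final_angle, 360 - final_angle)
    return off_upright <= TOLERANCE_DEGREES

secret = new_challenge()
print(is_human(secret, 360 - secret))   # perfect correction -> True
print(is_human(secret, 90))             # arbitrary guess -> usually False

The hard part, of course, is not this bookkeeping but choosing images whose upright orientation is obvious to people and hard for software to recover.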
But, as Luis von Ahn insightfully pointed out in a recent interview in New Scientist, it is a perfectly fine outcome if spammers find a way to break this new image-based CAPTCHA technique. By doing so, they are helping us make computers smarter.
From the New Scientist article:
"If [the spammers] are really able to write a programme to read distorted text, great – they have solved an AI problem," says von Ahn. The criminal underworld has created a kind of X prize for OCR.
Security groups ... [then] can ... switch for an alternative CAPTCHA system -- based on images, for example -- presenting the eager spamming community with a new AI problem to crack ... Image orientation is difficult for computers. But if [image-based] CAPTCHA becomes common, it won't be long before spammers turn their attention to cracking the problem, with potential fringe benefits to cameras and image editing software.
Speech recognition CAPTCHAs are already being used, and image labelling ones could follow, says von Ahn. AI researchers are already working in both these areas, but they could soon be joined by spammers also helping advance the technology.
Perhaps it is time to start designing CAPTCHAs in a different way -- pick problems that need solving and make them into targets to be solved by resourceful criminals.
Tuesday, May 05, 2009
More Microsoft layoffs
It is being widely reported ([1] [2] [3] [4] [5]) that Microsoft did another large round of layoffs today.
Microsoft never has done layoffs on this scale. It is a difficult thing to do well. For more on that, please see my earlier post, "Layoffs and tech layoffs".