Googler Tushar Chandra recently gave a talk at the LADIS 2010 workshop on "Sibyl: A system for large scale machine learning" (slides available as PDF), which discusses a system that Google has built to do large scale machine learning on top of MapReduce and GFS.
This system probably is just one of many inside the secretive search giant, but we don't often get these kinds of peeks inside. What I found most interesting was the frank discussion of the problems the Google team encountered and how they overcame them.
In particular, the Googlers talk about how they building on top of a system intended for batch log processing caused some difficulties, which they overcame by using a lot of local memory and being careful about arranging data and moving data around. Even so, the last few slides mention how they kept causing localized network and GFS master brownouts, impacting other work on the cluster.
That last problem seems to have been an issue again and again in cloud computing systems. That pesky network is a scarce, shared resource, and it often takes a network brownout to remind us that virtual machines are not all it takes to get everyone playing nice.
On a related topic, please see my earlier post, "GFS and its evolution", and its discussion of the pain Google hit when trying to put other interactive workloads on top of GFS. And, if you're interested in Google's work here, you might also be interested in the open source Mahout, which is a suite of machine learning algorithms mostly intended for running on top of Hadoop clusters.
Tuesday, September 07, 2010
Friday, September 03, 2010
Insights into the performance of Microsoft's big clusters
A recent article from Microsoft in IEEE Computing, "Server Engineering Insights for Large-Scale Online Services" (PDF), has surprisingly detailed information about the systems running Hotmail, Cosmos (Microsoft's MapReduce/Hadoop), and Bing.
For example, the article describes the scale of Hotmail's data as being "several petabytes ... [in] tens of thousands of servers" and the typical Hotmail server as "dual CPU ... two attached disks and an additional storage enclosure containing up to 40 SATA drives". The typical Cosmos server apparently is "dual CPU ... 16 to 24 Gbytes of memory and up to four SATA disks". Bing uses "several tens of thousands of servers" and "the main memory of thousands of servers" where a typical server is "dual CPU ... 2 to 3 Gbytes per core ... and two to four SATA disks".
Aside from disclosing what appear to be some previously undisclosed details about Microsoft's cluster, the article could be interesting because of insights into the performance of these clusters on the Hotmail, Bing, and Cosmos workloads. Unfortunately, the article suffers from taking too much as a given, not exploring the complexity of interactions between CPU, memory, flash memory, and disk in these clusters on these workloads, and not attempting to explain the many oddities in the data.
Those oddities are fun to think about though. To take a few that caught my attention:
Update: An anonymous commenter points out that the Bing servers probably are two quad core CPUs -- eight cores total -- so, although there is only 2-3G per core, there likely is a total of 16-24G of RAM per box. That makes more sense and would make them similar to the Cosmos boxes.
Even with the larger amount of memory per Bing box, the questions about the machines still hold. Why are the Bing boxes CPU bound and should they be? Should Cosmos boxes, which are intended for sequential log processing, have the same memory as Bing boxes and be holding much of their data in memory? Why are Cosmos machines CPU bound rather than I/O bound and should they be?
Update: Interesting discussion going on in the comments to this post.
For example, the article describes the scale of Hotmail's data as being "several petabytes ... [in] tens of thousands of servers" and the typical Hotmail server as "dual CPU ... two attached disks and an additional storage enclosure containing up to 40 SATA drives". The typical Cosmos server apparently is "dual CPU ... 16 to 24 Gbytes of memory and up to four SATA disks". Bing uses "several tens of thousands of servers" and "the main memory of thousands of servers" where a typical server is "dual CPU ... 2 to 3 Gbytes per core ... and two to four SATA disks".
Aside from disclosing what appear to be some previously undisclosed details about Microsoft's cluster, the article could be interesting because of insights into the performance of these clusters on the Hotmail, Bing, and Cosmos workloads. Unfortunately, the article suffers from taking too much as a given, not exploring the complexity of interactions between CPU, memory, flash memory, and disk in these clusters on these workloads, and not attempting to explain the many oddities in the data.
Those oddities are fun to think about though. To take a few that caught my attention:
- Why are Bing servers CPU bound? Is it because, as the authors describe, Bing uses "data compression on memory and disk data ... causing extra processing"? Should Bing be doing so much data compression that it becomes CPU bound (when Google, by comparison, uses fast compression)? If something else is causing Bing servers to be CPU bound, what is it? In any case, does it make sense for the Bing "back-end tier servers used for index lookup" to be CPU bound?
- Why do Bing servers
have only 4-6G RAM eachnot have more memory when they mostly want to keep indexes in memory, appear to be hitting disk, and are "not bound by memory bandwidth"? Even if the boxes are CPU bound, even if it somehow makes sense for them to be CPU bound, would more memory across the cluster allow them to do things (like faster but weaker compression) that would relieve the pressure on the CPUs? - Why is Cosmos (the batch-based log processing system) CPU bound instead of I/O bound? Does that make sense?
- Why do Cosmos boxes have
morethe same memorythanas Bing boxes when Cosmos is designed for sequential data access? What is the reason that Cosmos "services maintain much of their data in [random access] memory" if they, like Hadoop and MapReduce, are intended for sequential log processing? - If Hotmail is mostly "random requests" with "insignificant" locality, why is it designed around sequential data access (many disks) rather than random access (DRAM + flash memory)? Perhaps the reason that Hotmail is "storage bound under peak loads" is that it uses sequential storage for its randomly accessed data?
Update: An anonymous commenter points out that the Bing servers probably are two quad core CPUs -- eight cores total -- so, although there is only 2-3G per core, there likely is a total of 16-24G of RAM per box. That makes more sense and would make them similar to the Cosmos boxes.
Even with the larger amount of memory per Bing box, the questions about the machines still hold. Why are the Bing boxes CPU bound and should they be? Should Cosmos boxes, which are intended for sequential log processing, have the same memory as Bing boxes and be holding much of their data in memory? Why are Cosmos machines CPU bound rather than I/O bound and should they be?
Update: Interesting discussion going on in the comments to this post.
Monday, August 30, 2010
What is the benefit of freaking customers out?
Miguel Helft and Tanzina Vega at the New York Times have a front page article today, "Retargeting Ads Follow Surfers to Other Sites", on a form of personalized web advertising now being called retargeting.
An excerpt:
If the goal of the advertising is to be useful and relevant, though, I think both of these are missing the mark. What you want to do is help people discover something they want to buy. Since the item they looked at before obviously wasn't quite right -- they didn't buy it after all -- showing that again doesn't help. Showing closely related alternatives, items that people might buy after rejecting the first item, could be quite useful though.
As marketing exec Alan Pearlstein says at the end of the NYT article, "What is the benefit of freaking customers out?" Remarketing freaks people out. If we are going to do personalized advertising, the goal should be to have the advertising be useful, either by sharing value with consumers using coupons as Pearlstein suggests, or by helping consumers find something interesting that they wouldn't have discovered on their own.
But, publishers should be careful when working with these new ad startups. A startup has a huge incentive to maximize short-term revenue and little incentive to maximize relevance. For the startup, as long as it brings in more immediate revenue, it is perfectly fine to show annoying ads that freak customers out and drive many away. Publishers need to force the focus to be on the value of the ads to the consumer so their customers are happy, satisfied, and keep coming back.
An excerpt:
People have grown accustomed to being tracked online and shown ads for categories of products they have shown interest in, be it tennis or bank loans.The article later goes on to contrast this technique of following you around with products you looked at before with behavioral targeting like Google is doing, which learns your broader category interests and shows ads from those categories.
Increasingly, however, the ads tailored to them are for specific products that they have perused online. While the technique, which the ad industry calls personalized retargeting or remarketing, is not new, it is becoming more pervasive as companies like Google and Microsoft have entered the field. And retargeting has reached a level of precision that is leaving consumers with the palpable feeling that they are being watched as they roam the virtual aisles of online stores.
In remarketing, when a person visits an e-commerce site and looks at say, an Etienne Aigner Athena satchel on eBags.com, a cookie is placed into that person’s browser, linking it with the handbag. When that person, or someone using the same computer, visits another site, the advertising system creates an ad for that very purse.
If the goal of the advertising is to be useful and relevant, though, I think both of these are missing the mark. What you want to do is help people discover something they want to buy. Since the item they looked at before obviously wasn't quite right -- they didn't buy it after all -- showing that again doesn't help. Showing closely related alternatives, items that people might buy after rejecting the first item, could be quite useful though.
As marketing exec Alan Pearlstein says at the end of the NYT article, "What is the benefit of freaking customers out?" Remarketing freaks people out. If we are going to do personalized advertising, the goal should be to have the advertising be useful, either by sharing value with consumers using coupons as Pearlstein suggests, or by helping consumers find something interesting that they wouldn't have discovered on their own.
But, publishers should be careful when working with these new ad startups. A startup has a huge incentive to maximize short-term revenue and little incentive to maximize relevance. For the startup, as long as it brings in more immediate revenue, it is perfectly fine to show annoying ads that freak customers out and drive many away. Publishers need to force the focus to be on the value of the ads to the consumer so their customers are happy, satisfied, and keep coming back.
Thursday, August 19, 2010
Measuring online brand advertising without experiments
A few Googlers recently published a paper with a terribly dull title, "Evaluating Online Ad Campaigns in a Pipeline: Causal Models at Scale" (abstract, PDF), at the KDD 2010 conference.
The paper turns out to be a quite interesting attempt to measure the impact of online display advertising -- a notoriously difficult problem -- by looking at how it changes people's searching and browsing online. That's hard enough, but these crazy Googlers also are trying to do this without using A/B testing. To do that last trick, they separate people into those the exposed who have seen the ad and the controls who have not seen the ad while carefully limiting the controls only to people who are similar to the exposed.
From the paper:
For more on related topics, please see also my November 2008 post, "Measuring offline ads by their online impact", and my July 2008 post, "Google Toolbar data and the actual surfer model".
The paper turns out to be a quite interesting attempt to measure the impact of online display advertising -- a notoriously difficult problem -- by looking at how it changes people's searching and browsing online. That's hard enough, but these crazy Googlers also are trying to do this without using A/B testing. To do that last trick, they separate people into those the exposed who have seen the ad and the controls who have not seen the ad while carefully limiting the controls only to people who are similar to the exposed.
From the paper:
Traditionally, online campaign effectiveness has been measured by "clicks" ... However, many display ads are not click-able ... and some campaigns hope to build longer-term interest in the brand rather than drive immediate response. Counting clicks alone then misses much of the value of the campaign.By the way, some have speculated in the past ([1] [2]) that Google toolbar data is being used for Google's advertising, but there was no public confirmation of that from Google. To my knowledge, this is the first public confirmation that data from Google's ubiquitous toolbar is being used by them in at least some way in their advertising.
Better measures of campaign effectiveness are based on the change in online brand awareness ... [due] to the display ad campaign alone. We ... [find] the change in probability that a user searches for brand terms or navigates to brand sites that can be attributed to an online ad campaign.
Randomized experiments ... are the gold standard for estimating treatment effects ... [but it] requires an advertiser to forego showing ads to some users ... [which] advertisers are not keen to [do] ... Estimation without randomization is more difficult but not always impossible .... Simply put, the controls [we pick] were eligible to be served campaign ads but were not.
Our estimates require summary (not personally identifiable) data on exposed and controls. The summary data are obtained from several sources, including the advertiser's own campaign information, ad serving logs, and sampled data from users who have installed Google toolbar and opted in to enhanced features.
For more on related topics, please see also my November 2008 post, "Measuring offline ads by their online impact", and my July 2008 post, "Google Toolbar data and the actual surfer model".
Monday, August 16, 2010
Human computation and lemons
NYU Professor Panos Ipeirotis has an insightful post, "Mechanical Turk, Low Wages, and the Market for Lemons", that looks at why wages are so low, usually well below minimum wage, on Amazon's MTurk.
His theory is that spammers and cheaters have turned MTurk into a market for lemons. The quality is now so bad that buyers demand a risk premium and require redundant work for quality checks, splitting what might be a risk-reduced fair wage three to five ways among the workers.
An excerpt from his post:
Please see also my 2005 post, "Amazon Mechanical Turk?", where I wrote, "If I scale up by doing cheaper answers, I won't be able to filter experts as carefully, and quality of the answers will be low. Many of the answers will be utter crap, just made up, quick bluffs in an attempt to earn money from little or no work. How will they deal with this?"
His theory is that spammers and cheaters have turned MTurk into a market for lemons. The quality is now so bad that buyers demand a risk premium and require redundant work for quality checks, splitting what might be a risk-reduced fair wage three to five ways among the workers.
An excerpt from his post:
A market for lemons is a market where the sellers cannot evaluate beforehand the quality of the goods that they are buying. So, if you have two types of products (say good workers and low quality workers) and cannot tell who is whom, the price that the buyer is willing to pay will be proportional to the average quality of the worker.If you like Panos' post, you might also be interested in GWAP guru and CMU Professor Luis von Ahn's recent post, "Work and the Internet", where Luis bemoans the low wages on MTurk and questions whether they amount to exploitation. Panos' post is a response to Luis'.
So the offered price will be between the price of a good worker and a low quality worker. What a good worker would do? Given that good workers will not get enough payment for their true quality, they leave the market. This leads the buyer to lower the price even more towards the price for low quality workers. At the end, we only have low quality workers in the market (or workers willing to work for similar wages) and the offered price reflects that.
This is exactly what is happening on Mechanical Turk today. Requesters pay everyone as if they are low quality workers, assuming that extra quality assurance techniques will be required on top of Mechanical Turk.
So, how can someone resolve such issues? The basic solution is the concept of signalling. Good workers need a method to signal to the buyer their higher quality. In this way, they can differentiate themselves from low quality workers.
Unfortunately, Amazon has not implemented a good reputation mechanism. The "number of HITs worked" and the "acceptance percentage" are simply not sufficient signalling mechanisms.
Please see also my 2005 post, "Amazon Mechanical Turk?", where I wrote, "If I scale up by doing cheaper answers, I won't be able to filter experts as carefully, and quality of the answers will be low. Many of the answers will be utter crap, just made up, quick bluffs in an attempt to earn money from little or no work. How will they deal with this?"
Friday, July 09, 2010
Big redesign at Google News
It has been widely reported that Google News has done a major redesign -- its first since 2002 apparently -- to more prominently feature personalization and customization.
Before I comment on it, in the interest of full disclosure, I should say that I am absurdly biased on this particular topic, having run Findory and talked at length over the years with Google, their partners, and their competitors about news personalization.
That said, I don't like what they've done. And I'm not the only one. Thomas Claburn at InformationWeek catalogs the complaints he is seeing at InformationWeek and elsewhere, summarizing it all by comparing it to the "New Coke" flop.
I think what the Google team has done is a lovely example of personalization done poorly, by people who really should know better. They change navigation links based on personalization even when confidence is low (one of my links in the left hand nav is for "Lindsay Lohan", which is hard to stomach). The article recommendations are often off, cannot be corrected, do not change in real time as you read articles, and there is no explanation of why something was recommended. There is no ability to see, edit, or rate your reading history. The ability to exclude or favor sources appears to be hacked on; the only way to do it is to manually type in the names of sources.
Under the surface, there appears still to be a lot of implicit personalization based on past behavior, but, from what someone using it sees, the focus is entirely on customization. I can "edit personalization" and "add sections" to put categories on my page. And that is about the limit of my control and the limit of the explanations of why articles are appearing. People like to be in control. They like to understand why something happens, especially if they don't agree with it. And Google News offers very little control or explanations.
Adding to the other problems, the design seems really busy and confused to me, like the Googlers can't decide what they are doing and -- in a fashion more typical of Microsoft -- just keep adding features. Hey, look, it's your fast flipping, clustered, personalized, customizable, widget-complete newspaper! Love it, it's Googly! C'mon, Google, what happened to keeping it clean, simple, and relevant?
Before I comment on it, in the interest of full disclosure, I should say that I am absurdly biased on this particular topic, having run Findory and talked at length over the years with Google, their partners, and their competitors about news personalization.
That said, I don't like what they've done. And I'm not the only one. Thomas Claburn at InformationWeek catalogs the complaints he is seeing at InformationWeek and elsewhere, summarizing it all by comparing it to the "New Coke" flop.
I think what the Google team has done is a lovely example of personalization done poorly, by people who really should know better. They change navigation links based on personalization even when confidence is low (one of my links in the left hand nav is for "Lindsay Lohan", which is hard to stomach). The article recommendations are often off, cannot be corrected, do not change in real time as you read articles, and there is no explanation of why something was recommended. There is no ability to see, edit, or rate your reading history. The ability to exclude or favor sources appears to be hacked on; the only way to do it is to manually type in the names of sources.
Under the surface, there appears still to be a lot of implicit personalization based on past behavior, but, from what someone using it sees, the focus is entirely on customization. I can "edit personalization" and "add sections" to put categories on my page. And that is about the limit of my control and the limit of the explanations of why articles are appearing. People like to be in control. They like to understand why something happens, especially if they don't agree with it. And Google News offers very little control or explanations.
Adding to the other problems, the design seems really busy and confused to me, like the Googlers can't decide what they are doing and -- in a fashion more typical of Microsoft -- just keep adding features. Hey, look, it's your fast flipping, clustered, personalized, customizable, widget-complete newspaper! Love it, it's Googly! C'mon, Google, what happened to keeping it clean, simple, and relevant?
Tuesday, June 22, 2010
Google to personalize metashopping
In an interview with CNet, Googler Sameer Samat talks about Google's future plans for shopping search, including personalization and recommendations.
An excerpt:
An excerpt:
One thing Google doesn't do very well is provide the shopping-as-adventure experience ... You might go to the mall with a specific product in mind, but a well-designed mall ... forces you to discover -- and hopefully purchase -- other products that you might not have even known you wanted: the marketing types like to call this "serendipity." Google wants to be known as a destination for that kind of experience, said Sameer Samat, director of product management.On a related note, Google is pushing aggressively to get retailers to use Google's commerce search engine to run their search experience. Each deal Google signs gives them more detailed information about another retailing vertical. It's all about the data.
After years of trying and failing to reach that goal, Google plans to give it another go over the coming months. Don't expect Google to turn into a full-blown online retailer among the likes of Amazon.com or Buy.com just yet. But the combination of personalized features for product search pages and what Samat thinks is "the largest database of products that has been created" could entice people to actually shop on Google.
Google's current approach works best for those who are on a mission when they shop, shoppers who already know what they want and are just looking for additional information before sealing the deal ... [But] there are millions of other people who treat shopping as leisure, rather than a simple transaction. These are people who ... prefer browsing to targeted shopping, knowing that every now and then they'll discover something totally unique or completely unexpected.
Google wants to serve more of those people ... [by making] recommendations based on that list of products and lists submitted by others to help you discover new products: sort of like Amazon's recommendations page meets Pandora's radio stations meets Google.
"Shopping is not just about search, it's not just about intent, it's about discovery," Samat said. "If we can do it, and do it well, we will have built something that's really amazing; it should be the most comprehensive experience for shopping you could ever find."
Monday, June 14, 2010
Google on presentation bias in search
A few Googlers at WWW 2010 had a paper, "Beyond Position Bias: Examining Result Attractiveness as a Source of Presentation Bias in Clickthrough Data" (PDF), that explores how much people tend to click on eye-catching search results rather than seeking the most relevant search results.
The work itself was pretty simple -- just looking at how bolding title and abstract terms changes clickthough rates in A/B tests -- but I think the paper is worth a peek for two reasons. First, it is a decent survey of some of the current work on position and presentation bias. Second, it exposes some of Google's struggles with the difficulty of deriving searcher satisfaction from the noisy proxies that we have available like click data.
By the way, I love the fact, noted in the paper, that people tend to click on the last result much more than you would expect. The reason is that people don't linearly scan down a page, but often jump to the bottom and focus attention there. A decade ago at Amazon, the personalization team exploited this effect and seized the space at the bottom of most pages on the site for our features. You see, when we saw no one had built tools to track click and conversion data, we built them, and then we used them. No one else realized the value of the space at the bottom of the page, but we did.
For more on the struggle to evaluate search results from noisy click data, please see some of my older posts, "Modeling how searchers look at search results ", "Finding task boundaries in search logs", and "Testing rankers by interleaving search results".
The work itself was pretty simple -- just looking at how bolding title and abstract terms changes clickthough rates in A/B tests -- but I think the paper is worth a peek for two reasons. First, it is a decent survey of some of the current work on position and presentation bias. Second, it exposes some of Google's struggles with the difficulty of deriving searcher satisfaction from the noisy proxies that we have available like click data.
By the way, I love the fact, noted in the paper, that people tend to click on the last result much more than you would expect. The reason is that people don't linearly scan down a page, but often jump to the bottom and focus attention there. A decade ago at Amazon, the personalization team exploited this effect and seized the space at the bottom of most pages on the site for our features. You see, when we saw no one had built tools to track click and conversion data, we built them, and then we used them. No one else realized the value of the space at the bottom of the page, but we did.
For more on the struggle to evaluate search results from noisy click data, please see some of my older posts, "Modeling how searchers look at search results ", "Finding task boundaries in search logs", and "Testing rankers by interleaving search results".
Tuesday, June 08, 2010
Travel itineraries from Flickr photo trails
Every once in a while, you hit a paper that seems like a startup waiting to happen. A paper that will be presented next week at HT 2010 by Yahoo Research is one of these.
The paper, "Automatic Construction of Travel Itineraries using Social Breadcrumbs" (PDF), cleverly uses the data often embedded in Flickr photos (e.g. timestamp, tags, sometimes GPS) to produce trails of where people have been in their travels. Then, they combine all those past trails to generate high quality itineraries for future tourists that tell them what to see, where to go, how long to expect to spend at each sight, and how long to allow for travel times between the sights.
Some excerpts from the paper:
Fun idea, nicely implemented, and very convincing results. Definitely worth a read. Don't miss the thoughts at the end on expansions to the idea, such as changing how the trails are filtered and aggregated based on individual preferences to generate personalized itineraries.
The paper, "Automatic Construction of Travel Itineraries using Social Breadcrumbs" (PDF), cleverly uses the data often embedded in Flickr photos (e.g. timestamp, tags, sometimes GPS) to produce trails of where people have been in their travels. Then, they combine all those past trails to generate high quality itineraries for future tourists that tell them what to see, where to go, how long to expect to spend at each sight, and how long to allow for travel times between the sights.
Some excerpts from the paper:
Shared photos can be seen as billions of geo-temporal breadcrumbs that can promisingly serve as a latent source reflecting the trips of millions of users ... [We] automatically construct travel itineraries at large scale from those breadcrumbs.This reminds me quite a bit of the work on using GPS trails from mobile devices like phones (e.g. [1] or [2]) or search histories on maps (e.g. [3]). But, the use of Flickr photos as the data source is clever, especially for this application where the photos are also useful in the final output and the gaps in the data stream are not important.
By analyzing these breadcrumbs associated with a person's photo stream, one can deduce the cities visited by a person, which Points of Interest (POI) that the person took photos at, how long that person spent at each POI, and what the transit time was between POIs visited in succession.
By aggregating such timed paths of many users, one can construct itineraries that reflect the "wisdom" of touring crowds. Each such itinerary is composed of a sequence of POIs, with recommended visit times and approximate transit times between them.
[In surveys] users perceive our automatically generated itineraries to be as good as (or even slightly better than) itineraries provided by professional tour companies.
Fun idea, nicely implemented, and very convincing results. Definitely worth a read. Don't miss the thoughts at the end on expansions to the idea, such as changing how the trails are filtered and aggregated based on individual preferences to generate personalized itineraries.
Monday, June 07, 2010
A Findory buyout offer from Yahoo?
I just received this check from Yahoo payable to Findory.com:

I've never seen a check for $0.00 before. Is that a buyout bid, Yahoo? Your offer for Findory?
In all seriousness, I'm sure this is just an mistake over in Yahoo-land, perhaps leftovers from the experiments Findory did with layering its personalized advertising engine over ads pulled from the Yahoo Publisher Network.
Still, years after Findory ended, years after Findory talked to Yahoo, and seeing how Yahoo now struggles to personalize content, I can't help but laugh at this check for $0.00 from Yahoo to Findory.

I've never seen a check for $0.00 before. Is that a buyout bid, Yahoo? Your offer for Findory?
In all seriousness, I'm sure this is just an mistake over in Yahoo-land, perhaps leftovers from the experiments Findory did with layering its personalized advertising engine over ads pulled from the Yahoo Publisher Network.
Still, years after Findory ended, years after Findory talked to Yahoo, and seeing how Yahoo now struggles to personalize content, I can't help but laugh at this check for $0.00 from Yahoo to Findory.
Tuesday, June 01, 2010
How Bing predicts the CTR of ads
An upcoming ICML 2010 paper, "Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine", describes the algorithm actually used in the Bing search engine to predict the clickthrough rates of ads.
From the paper:
Don't miss the tidbit at the end where they say that they are "investigating the use of more powerful models, such as the feature-based collaborative filtering method Matchbox (Stern, Herbrich, & Graepel, 2009) for latent feature discovery and personalisation."
From the paper:
Recognising the importance of CTR estimation for online advertising ... Bing/adCenter decided to run a competition to entice people across the company to develop the most accurate and scalable CTR predictor.The paper goes on to describe why the problem is important, the algorithm used, and some of the nastiness of getting something that works in the lab to run on the live site.
The algorithm described in this publication tied for first place in the first competition and won the subsequent competition based on prediction accuracy. As a consequence, it was chosen to replace Bing's previous CTR prediction algorithm, a transition that was completed in the summer of 2009.
Don't miss the tidbit at the end where they say that they are "investigating the use of more powerful models, such as the feature-based collaborative filtering method Matchbox (Stern, Herbrich, & Graepel, 2009) for latent feature discovery and personalisation."
Thursday, May 20, 2010
Geeking with Greg administrivia
This blog passed some fun milestones recently. It just had its six year anniversary, logging over 1500 posts in its lifetime.
When I started this blog in 2004, I wanted to bring personalization to information. Just as personalization and recommendations help people discover what they want in a massive product catalog, I thought personalization and recommendations could help tame the information streams that flood over us. Over the next five years, I wrote on this blog as I worked on personalized news, personalized search, and personalized advertising.
Regular readers may have noticed that posts on this blog have slowed down a fair bit over the last year. In large part, this is because I am no longer working on personalized search or personalized advertising, nor does my work still benefit from tracking what is going on in the information retrieval research community.
I try hard to keep this blog on topic. My plan is to have posts here continue to focus on personalized information, perhaps a bit on research papers, but mostly tracking the increasingly aggressive moves of the internet giants toward personalized search and advertising. But, that means posts will continue to be fairly infrequent.
If for some reason you can't get enough of geeking with me, if you really must have more, you might consider tracking what I am reading more broadly by following me in Google Reader. I use the shared items there as a link blog and comment broadly there on many topics.
Also, on this sixth anniversary, I welcome feedback on what you might enjoy seeing more of on this blog, especially from long-time readers. Do you mostly like the reviews of research papers (which, many have told me, are a great time-saver)? Comments on events in the industry (which might be snark, but perhaps useful snark)? Do you find the posts to be too long or too infrequent? Do you care if this blog stays on the topic of personalized information? If there is anything you want more of, please let me know in the comments!
When I started this blog in 2004, I wanted to bring personalization to information. Just as personalization and recommendations help people discover what they want in a massive product catalog, I thought personalization and recommendations could help tame the information streams that flood over us. Over the next five years, I wrote on this blog as I worked on personalized news, personalized search, and personalized advertising.
Regular readers may have noticed that posts on this blog have slowed down a fair bit over the last year. In large part, this is because I am no longer working on personalized search or personalized advertising, nor does my work still benefit from tracking what is going on in the information retrieval research community.
I try hard to keep this blog on topic. My plan is to have posts here continue to focus on personalized information, perhaps a bit on research papers, but mostly tracking the increasingly aggressive moves of the internet giants toward personalized search and advertising. But, that means posts will continue to be fairly infrequent.
If for some reason you can't get enough of geeking with me, if you really must have more, you might consider tracking what I am reading more broadly by following me in Google Reader. I use the shared items there as a link blog and comment broadly there on many topics.
Also, on this sixth anniversary, I welcome feedback on what you might enjoy seeing more of on this blog, especially from long-time readers. Do you mostly like the reviews of research papers (which, many have told me, are a great time-saver)? Comments on events in the industry (which might be snark, but perhaps useful snark)? Do you find the posts to be too long or too infrequent? Do you care if this blog stays on the topic of personalized information? If there is anything you want more of, please let me know in the comments!
Friday, May 14, 2010
Yahoo as an internet information filter
Yahoo CEO Carol Bartz is talking up personalization again, this time in an amusingly titled Esquire article:
[Esquire article found Nick Carr]
When you talk about the Internet growing to 225 million sites, you've got to ask: Who's parsing all that? How do you make sense of all that stuff?Please see also my June 2009 post, "Yahoo CEO Carol Bartz on personalization", for more on Yahoo's vision of recommending information.
I mean, who has time to wander all over the Internet?
Tomorrow's Yahoo! is going to be really tailored. I'm not talking about organization — organizing means that you already know what you want and somebody's just putting it in shape for you. I'm talking about both smart science and people culling through masses of information on the fly and figuring out what people want to know.
We will be delivering your interests to you. For instance, if you're a sports fan but have no interest in tennis, we won't show you tennis. We would know that you do things in a certain sequence, so we'd say, "Here's your portfolio. Here's some news you might like. Oh, you went to this movie last week, here's some other movies you might want to check out."
I call it the Internet of One. I want it to be mine, and I don't want to work too hard to get what I need. In a way, I want it to be HAL. I want it to learn about me, to be me, and cull through the massive amount of information that's out there to find exactly what I want.
[Esquire article found Nick Carr]
Tuesday, May 11, 2010
Google tries to save the news
James Fallows at The Atlantic has a new article out, "How to Save the News", on what Googlers think about the future of the news industry.
Some key excerpts:
Please see also my Oct 2009 post, "Google CEO on personalized news".
Some key excerpts:
"If you were starting from scratch, you could never possibly justify [the current] business model," [Google Chief Economist] Hal Varian said ... "Grow trees -- then grind them up, and truck big rolls of paper down from Canada? Then run them through enormously expensive machinery, hand-deliver them overnight to thousands of doorsteps, and leave more on newsstands, where the surplus is out of date immediately and must be thrown away? Who would say that made sense?" The old-tech wastefulness of the process is obvious, but Varian added a less familiar point. Burdened as they are with these "legacy" print costs, newspapers typically spend about 15 percent of their revenue on what, to the Internet world, are their only valuable assets: the people who report, analyze, and edit the news.The entire article is well worth reading. It gives a great feel for how Googlers are thinking about the future of news (and is mostly in line with my own thoughts).
"Nothing that I see suggests the 'death of newspapers,'" [Google CEO] Eric Schmidt told me. The problem was the high cost and plummeting popularity of their print versions. "Today you have a subscription to a print newspaper," he said. "In the future model, you'll have subscriptions to information sources that will have advertisements embedded in them, like a newspaper. You'll just leave out the print part. I am quite sure that this will happen ... As print circulation falls, the growth of the online audience is dramatic ... Newspapers don't have a demand problem; they have a business-model problem." Many of his company’s efforts are attempts to solve this, so that newspaper companies can survive, as printed circulation withers away.
The three pillars of the new online business model, as I heard them invariably described, are distribution, engagement, and monetization. That is: getting news to more people, and more people to news-oriented sites; making the presentation of news more interesting, varied, and involving; and converting these larger and more strongly committed audiences into revenue, through both subscription fees and ads.
The best monetizing schemes are of course ones that people like -- ads they enjoy seeing, products for which they willingly pay. Online display ads should be better on these counts too, [Google VP Neal] Mohan said. "here are things we can do online that we simply can't do in print," he said. An ad is "intrusive" mainly if it is not related to what you care about at that time ... "The online world will be a lot more attuned to who you are and what you care about" ... Advertising has been around forever, Mohan said, "but until now it has always been a one-way conversation."
Please see also my Oct 2009 post, "Google CEO on personalized news".
Friday, April 30, 2010
Facebook's moves and personalized advertising
It has been widely reported that Facebook has launched Open Graph and Implicit Personalization, which, among other things, give Facebook information about people's movements and what they like on the web. The service was launched opt-out and, even if you do want to opt-out, requires diving into confusing privacy settings to opt-out.
The prolific discussion of this elsewhere has thoroughly exhausted most of what there is to say, but I wanted to emphasize two things about this launch.
First, the fact that Facebook is so aggressively seeking this "treasure trove" of browsing behavior data may signal a major shift in its revenue model. Prior claims aside, the company now may be realizing that it is hard to target advertising to profile information and status updates because there is no commercial intent. This new source of data -- the websites people are visiting and what they like -- contains the purchase intent that Facebook so desperately needs.
Second, as Steve Lohr at the NYT reported today, other companies considering heavy use of personalized advertising have been waiting for someone else to take the first step and bear the brunt of any privacy-related backlash. It will be interesting to see if Facebook's latest move -- which probably is aggressive enough to count as the first step everyone was waiting for -- will result in a backlash or will open the floodgates.
The prolific discussion of this elsewhere has thoroughly exhausted most of what there is to say, but I wanted to emphasize two things about this launch.
First, the fact that Facebook is so aggressively seeking this "treasure trove" of browsing behavior data may signal a major shift in its revenue model. Prior claims aside, the company now may be realizing that it is hard to target advertising to profile information and status updates because there is no commercial intent. This new source of data -- the websites people are visiting and what they like -- contains the purchase intent that Facebook so desperately needs.
Second, as Steve Lohr at the NYT reported today, other companies considering heavy use of personalized advertising have been waiting for someone else to take the first step and bear the brunt of any privacy-related backlash. It will be interesting to see if Facebook's latest move -- which probably is aggressive enough to count as the first step everyone was waiting for -- will result in a backlash or will open the floodgates.
Wednesday, April 28, 2010
Google launches web search similarities
In another aggressive move toward personalized search, Google has added similarities to some web search results to the bottom of the page, as Mike Melanson at ReadWriteWeb reports.
For example, if I search for [engadget], at the bottom of the page, I see:
This is a fairly big deal. Similarities based aggregate behavior and targeted to the immediate context is a big step toward personalization. The next step is to tie the data to individual history and target similarities and recommendations both to the context and each person's past history.
Google has done a version of personalized search for some time, but the technique used mostly was based on biasing search results toward people's long-term interests. More recently, they also started boosting previously clicked search results to help support re-finding.
Google's latest move should let Google add fine-grained personalization based on current missions and short-term trends, which, in combination with their current search personalization, is likely to improve Google's ability to help people find what they need.
Update: Here is the announcement of the new feature on the Official Google Blog.
For example, if I search for [engadget], at the bottom of the page, I see:
Pages similar to: www.engadget.comThe similarity algorithm also appears to have changed, with noticeably better quality in the spot checks I did.
Gizmodo ... gizmodo.com
Ubergizmo ... ubergizmo.com
Wired ... wired.com
Lifehacker ... lifehacker.com
This is a fairly big deal. Similarities based aggregate behavior and targeted to the immediate context is a big step toward personalization. The next step is to tie the data to individual history and target similarities and recommendations both to the context and each person's past history.
Google has done a version of personalized search for some time, but the technique used mostly was based on biasing search results toward people's long-term interests. More recently, they also started boosting previously clicked search results to help support re-finding.
Google's latest move should let Google add fine-grained personalization based on current missions and short-term trends, which, in combination with their current search personalization, is likely to improve Google's ability to help people find what they need.
Update: Here is the announcement of the new feature on the Official Google Blog.
Sunday, April 25, 2010
Google News hybrid recommendations
Three Googlers published a paper, "Personalized News Recommendation Based on Click Behavior" (ACM), at the recent IUI 2010 conference that describes a hybrid recommender system combining user-based and content-based recommendations. This new hybrid recommender now appears to be deployed on Google News.
Some excerpts from the paper:
One thing that is surprising in this paper is the use of high-level topics rather than fine-grained topics. I would think that you would be better off getting as specific as possible on the profile, then branching out to related topics. The paper briefly addresses this, arguing that "specializing the user profile may limit the recommendations to news that the user already knew", but that seems like it would only happen if you rather foolishly only used read topics rather than including topics that appear to be related to read topics.
By the way, when you have as much data as Google should have, it is not at all clear you want to fall back on a content approach like they did in this paper here. Yehuda Koren, for example, has convincingly argued that, when you have big data, latent factor models extract these content-based relationships automatically in much more detail and much more accurately than you could hope to do with a manually constructed model.
Finally, I cannot quite let this one go by without mentioning that Findory was a hybrid news recommender, launched in January 2004, that dealt with the cold start and sparsity problems of a collaborative recommender, the same problems the Google News team apparently is still struggling with six years later. Findory is not mentioned in this paper in the related work, but I know the Google team is quite aware of Findory.
Some excerpts from the paper:
[The] previous Google News recommendation system was developed using a collaborative filtering method. It recommends news stories that were read by users with similar click history. This method has two major drawbacks ... First, the system cannot recommend stories that have not yet been read by other users ... Second ... news stories [that] are generally very popular ... are constantly recommended to most of the users, even for those users who never [are interested because] ... there are always enough clicks ... to make the recommendation.Hybrid recommenders are not that new. In the past, as in this paper, they usually were motivated by trying to deal with the sparsity and cold start problems that challenge collaborative filtering recommenders. Hybrid systems also have been used to deal with the so-called Harry Potter problem -- recommendations that focus too much on popular items -- by constraining the collaborative recommendations to the interests expressed in the profile, though that often can be better dealt with by tuning a collaborative recommender to discourage correlations between unpopular and popular items.
A solution to these two problems would be to build profiles of user's genuine interests and use them to make news recommendations. The profiles ... filter out the stories that are not of interest ... [and recommend stories] even if [they have] not been clicked on by other users ... Based on a user's news reading history, the recommender predicts the topic categories of interest ... News articles in those categories are ranked higher in the candidate list.
On average, the hybrid method ... improves the CTR [of] the existing collaborative method by 30.9% ... [and increased] the frequency of website visits in the test group [by] 14.1%.
One thing that is surprising in this paper is the use of high-level topics rather than fine-grained topics. I would think that you would be better off getting as specific as possible on the profile, then branching out to related topics. The paper briefly addresses this, arguing that "specializing the user profile may limit the recommendations to news that the user already knew", but that seems like it would only happen if you rather foolishly only used read topics rather than including topics that appear to be related to read topics.
By the way, when you have as much data as Google should have, it is not at all clear you want to fall back on a content approach like they did in this paper here. Yehuda Koren, for example, has convincingly argued that, when you have big data, latent factor models extract these content-based relationships automatically in much more detail and much more accurately than you could hope to do with a manually constructed model.
Finally, I cannot quite let this one go by without mentioning that Findory was a hybrid news recommender, launched in January 2004, that dealt with the cold start and sparsity problems of a collaborative recommender, the same problems the Google News team apparently is still struggling with six years later. Findory is not mentioned in this paper in the related work, but I know the Google team is quite aware of Findory.
Wednesday, March 24, 2010
Asking questions of your social network
Google recently acquired the startup Aardvark for $50 million. The idea behind Aardvark is to provide a way to ask complicated or subjective questions of your friends and colleagues.
There are two papers that will be published at upcoming conferences that provide useful details on this idea. The first is actually by two members of the Aardvark team -- co-founder and Aardvark CTO Damon Horowitz and ex-Googler and Aardvark advisor Sep Kamvar -- and will be published at the upcoming WWW 2010 conference. The paper is called, "The Anatomy of a Large-Scale Social Search Engine" (PDF). An excerpt:
To get a bit more illumination on that question, turn to another upcoming paper, this one out of Microsoft Research and to be published at CHI 2010. The paper is "What Do People Ask Their Social Networks, and Why? A Survey Study of Status Message Q&A Behavior" (PDF). Some excerpts:
There are two papers that will be published at upcoming conferences that provide useful details on this idea. The first is actually by two members of the Aardvark team -- co-founder and Aardvark CTO Damon Horowitz and ex-Googler and Aardvark advisor Sep Kamvar -- and will be published at the upcoming WWW 2010 conference. The paper is called, "The Anatomy of a Large-Scale Social Search Engine" (PDF). An excerpt:
With Aardvark, users ask a question, either by instant message, email, web input, text message, or voice. Aardvark then routes the question to the person in the user's extended social network most likely to be able to answer that question.The paper is well written and convincing, establishing that this idea works reasonably well for a small (50k) group of enthusiastic early adopters. The paper does not answer whether this will work at large scale on a less motivated, lower quality mainstream audience. It also does not provide data to be able to evaluate the common criticism of asking questions of social networks, which is that, at large scale, the burden from a flood of often irrelevant incoming questions creates too much pain for too little benefit.
Aardvark queries tend to be long, highly contextualized, and subjective -- in short, they tend to be the types of queries that are not well-serviced by traditional search engines. We also find that the vast majority of questions get answered promptly and satisfactorily, and that users are surprisingly active, both in asking and answering.
To get a bit more illumination on that question, turn to another upcoming paper, this one out of Microsoft Research and to be published at CHI 2010. The paper is "What Do People Ask Their Social Networks, and Why? A Survey Study of Status Message Q&A Behavior" (PDF). Some excerpts:
50.6% ... used their status messages to ask a question .... [on] sites like Facebook and Twitter .... 24.3% received a response in 30 minutes or less, 42.8% in one hour or less, and 90.1% within one day .... 69.3% ... who received responses reported they found the responses helpful.The key insight in the CHI paper is that people view asking questions of their social network as fun, entertaining, part of building relationships, and as a form of a gift exchange. The Aardvark paper focuses on a topical relevance rank of your social network, but maintaining relevance is going to be difficult at large scale when you have an unmotivated, lower quality, mainstream audience. The CHI paper might offer a path forward, suggesting we instead focus on game playing, entertainment, and the social rewards people enjoy when answering questions from their network.
The most common reason to search socially, rather than through a search engine, was that participants had more trust in the responses provided by their friends [24.8%]. A belief that social networks were better than search engines for subjective questions, such as seeking opinions or recommendations, was also a common explanation [21.5%].
The most common motivation given for responding to a question was altruism [37.0%]. Expertise was the next biggest factor [31.9%], with respondents being motivated because they felt they had special knowledge of the topic ... Nature of the relationship with the asker was an important motivator [13.7%], with closer friends more likely to get answers ... [as well as] the desire to connect socially [13.5%] ... free time [12.3%] ... [and] earning social capital [10.5%].
Many indicated they would prefer a face-to-face or personal request, and ignored questions directed broadly to the network-at-large .... [But] participants enjoyed the fun and social aspects of posing questions to their networks.
Tuesday, March 23, 2010
Security advice is wrong
An insightful, funny, and thought-provoking paper by Cormac Herley at Microsoft Research, "So Long, And No Thanks for the Externalities: The Rational Rejection of Security Advice by Users" (PDF), looks at why people ignore security advice.
The surprising conclusion is that some security advice we give to people -- such as inspect URLs carefully, pay attention to https certificate warnings, and use complicated passwords that change frequently -- does more harm than good. It actually costs someone far more to follow the advice than the benefit that person should expect to get.
Extended excerpts from the paper:
[Paper found via Bruce Schneier]
The surprising conclusion is that some security advice we give to people -- such as inspect URLs carefully, pay attention to https certificate warnings, and use complicated passwords that change frequently -- does more harm than good. It actually costs someone far more to follow the advice than the benefit that person should expect to get.
Extended excerpts from the paper:
It is often suggested that users are hopelessly lazy and unmotivated on security ... [This] is entirely rational from an economic perspective ... Most security advice simply offers a poor cost-benefit tradeoff to users and is rejected.The paper also has a great discussion of password policies and how they appear to be counter-productive. When system administrators require passwords with weird special characters than need to be changed regularly, they make passwords difficult to remember and impose a substantial burden on users, but the benefit from this policy appears to be minimal.
[Security] advice offers to shield [people] from the direct costs of attacks, but burdens them with increased indirect costs ... Since victimization is rare, and imposes a one-time cost, while security advice applies to everyone and is an ongoing cost, the burden ends up being larger than that caused by the ill it addresses.
To make this concrete, consider an exploit that affects 1% of users annually, and they waste 10 hours clearing up when they become victims. Any security advice should place a daily burden of no more than 10/(365 * 100) hours or 0.98 seconds per user in order to reduce rather than increase the amount of user time consumed. This generated the profound irony that much security advice ... does more harm than good.
We estimate US annual phishing losses at $60 million ... Even for minimum wage any advice that consumes more than ... 2.6 minutes per year to follow is unprofitable [for users] from a cost benefit point of view ... Banks [also] have more to fear from ... indirect losses such as support costs ... than direct losses. For example ... an agent-assisted password reset by 10% of their users would cost $48 million, easily dwarfing Wells Fargo's share of the overall $60 million in phishing losses.
Users are effectively trained to ignore certificate warnings by seeing them repeatedly when there is no real security threat .... As far as we can determine, there is no evidence of a single user being saved from harm by a certificate error, anywhere, ever ... The idea that certificate errors are a useful tool in protecting [users] from harm ... is entirely abstract and not evidence-based. The effort we ask of [users] is real, while the harm we warn them of is theoretical.
Advice almost always ignores the cost of user effort ... The main reason security advice is ignored is that it makes an enormous miscalculation: it treats as free a resource that is actually worth $2.6 billion an hour ... Advice-givers and policy-mandaters demand far more effort than any user can give .... User education is a cost borne by the whole population, while offering benefit only to the fraction that fall victim ... The cost of any security advice should be in proportion to the victimization rate .... [lest] in trying to defend against everything we end up defending nothing.
It is not users who need to be better educated on the risks of various attacks, but the security community. Security advice simply offers a bad cost-benefit tradeoff to users .... We must respect users' time and effort. Viewing the user's time as worth $2.6 billion an hour is a better starting point than valuing it at zero ... When we exaggerate all dangers we simply train users to ignore us.
[Paper found via Bruce Schneier]
Sunday, March 21, 2010
World of Goo on Google
This is too funny not to post. The clever and tremendously fun puzzle game World of Goo, in one of its last levels, has a hilarious jab at the search engines, personalized search, and personalized advertising.
An excerpt from the dialogue with "MOM ... [the] Automated Search Engine Companion":
An excerpt from the dialogue with "MOM ... [the] Automated Search Engine Companion":
What would you like to ask MOM today?Video is available as well as a full transcript (search for [Conversation with MOM transcript]).
Do I smell cookies? ... Yes! I am baking your personal information into each one.
What personal information is in your cookies? ... Cookies may contain:your locationThat's MOM's special recipe for offering free, convenient, relevant information to valued users like you.
complete search history
whipped soy product
online purchases
medical records
telephone logs
web logs
email logs
streaming video from your current location
and more!
Is my personal information safe with you? ... Your personal information is stored in a secure database and will never be shared with anyone*
* Unless they ask.
** Or if someone says they are you and takes your cookies
*** Or if the venture firm finds out it's profitable.
**** Or if unhappy employees release copies of your cookies to other online databases.
***** Or if outsourced data centers sell illegal copies of your cookies to other sites.
****** Or if my parent corporation is acquired and my data including your cookies becomes part of a larger aggregate system without your knowledge or consent.
Delete my cookies ... Are you sure? ... Yes ... Your cookies have been deleted*.
* Cookies may not actually be deleted.
** Cookies may be stored indefinitely for evaluation and training purposes to better serve you.
*** MOM knows best.
Everyone loves receiving special offers from MOM and the MOM's affiliate network of adver-bots.
Subscribe to:
Posts (Atom)
