Googler Jake Brutlag recently published a short study, "Speed Matters for Google Web Search" (PDF), which looked at how important it is to deliver and render search result pages quickly.
Specifically, Jake added very small delays (100-400ms) to the time to serve and render Google search results. He observed that even these tiny delays, which are low enough to be difficult for users to perceive, resulted in measurable drops in searches per user (declines of 0.2% to 0.6%).
Please see also my Nov 2006 post, "Marissa Mayer at Web 2.0", which summarizes a claim by Googler Marissa Mayer that Google saw a 20% drop in revenue from an accidentally introduced 500ms delay.
Update: To add to the Marissa Mayer report above, Drupal's Dries Buytaert summarized the results of a few A/B tests at Amazon, Google, and Yahoo on the impact of speed on user satisfaction. As Dries says, "Long story short: even the smallest delay kills user satisfaction."
Update: In the comments, people are asking why the effect in this study oddly appears to be an order of magnitude lower than the effects seen in previous tests. Good question there.
Update: By the way, this study is part of a broader suite of tools and tutorials Google has gathered as part of an effort to "make the web faster".
Monday, June 29, 2009
Friday, June 26, 2009
The $1M Netflix Prize has been won
An ensemble of methods from four teams has passed the criteria to win the Netflix Prize.
Other teams have 30 days to beat it, but, no matter what happens, the $1M prize will be claimed in the next month.
Congratulations to the winning team and all the competitors. It was a goal that some thought impossible without additional data, but remarkable persistence has proven the impossible possible.
Please see also my earlier post, "On the front lines of the Netflix Prize", which summarizes an article that describes some of the algorithms that brought the winning team to where it is now.
Tuesday, June 23, 2009
Is mobile search going to be different?
In an amusingly titled WWW 2009 paper, "Computers and iPhones and Mobile Phones, oh my!" (PDF), a quartet of Googlers offer some thoughts on where mobile search may be going.
In particular, based on log analysis of iPhone searches, they claim search on mobile devices is not likely to differ from normal web search once people upgrade to the latest phones. They go on to predict that an important future feature for mobile search will be providing history and personalization synchronized across all of a person's computers and mobile devices.
Some excerpts:
We have consistently found that search patterns on an iPhone closely mimic search patterns on computers, but that mobile search behavior [on older phones] is distinctly different.
We hypothesize that this is due to the easier text entry and more advanced browser capabilities on an iPhone than on mobile phones. Thus we predict that as mobile devices become more advanced, users will treat mobile search as an extension of computer-based search, rather than approaching mobile search as a tool for a distinct subset of information needs.
For [newer] high end phones, we suggest search be a highly integrated experience with computer-based search interfaces .... in terms of personalization and available feature set .... For example, content that was searched for on a computer should be easily accessible through mobile search (through bookmarks, search summaries), and vice versa.
This similarity in queries [also] indicates that we can use the vast wealth of knowledge amassed about conventional computer based search patterns, and apply it to the emerging high-end phone search market, to quickly gain improvements in search quality and user experience.
Thursday, June 18, 2009
Optimizing broad match in web advertising
A paper out of Microsoft Research, "A Data Structure for Sponsored Search" (PDF), has a couple simple but effective optimizations in it that are fun and worth thinking about.
First, a bit of explanation. When trying to match advertisers to search queries, search engines often (90%+ of the time) use broad match, which only requires a subset of the query terms to match. For example, if an advertiser bids for their ads to show on searches for "used books", the ad might also show on searches for [cheap used books], [gently used books], and [used and slightly traumatized books].
You can do this match using a normal inverted index, but it is expensive. For example, on the search for [cheap used books], you need to find all ads that want any of the terms (cheap OR used OR books), then filter out ads that wanted other terms (e.g. an ad that bid on "new books" needs to be filtered out at this step). For very popular keywords such as "cheap", many ads can end up being retrieved in the first step only to be filtered in the second.
The clever and nicely simple idea in this paper is to use past data on the frequency of keywords to drop popular terms from the index whenever we possibly can. For example, if an advertiser bids for "the great wall of china", we can index the ad only under the lowest frequency word (e.g. "china") and not under any of the other words. Then, on a search for [the awesomeness], we no longer have to retrieve and then filter that ad.
The authors then extend this to deal with bids where all the keywords are popular (e.g. "the thing"). In the paper, they discuss a model that attempts to estimate the cost of throwing multiple words into the index (e.g. indexing "the thing") versus doing the query for each word separately.
The end result is an order of magnitude improvement in performance. Indexing every term wastefully pulls in many irrelevant ad candidates that then need to be filtered out. In this case as well as others, it very much pays off to be careful about what you index.
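To make the rarest-term idea concrete, here is a minimal Python sketch. It is my own illustration, not the data structure from the paper: the class name `BroadMatchIndex`, the helper `rarest_term`, and the keyword frequencies are all made up, and it does not attempt the paper's cost model for bids made up entirely of popular words.

```python
from collections import defaultdict

# Hypothetical keyword frequencies from past query logs (higher = more common).
TERM_FREQUENCY = {
    "the": 1000, "of": 900, "cheap": 400, "new": 150, "great": 120,
    "used": 80, "books": 60, "wall": 30, "china": 20,
}


def rarest_term(bid_terms, freq):
    """Pick the least frequent term of a bid; unseen terms count as frequency 0."""
    return min(bid_terms, key=lambda term: (freq.get(term, 0), term))


class BroadMatchIndex:
    """Toy index that files each ad under only its rarest bid term."""

    def __init__(self, freq):
        self.freq = freq
        self.postings = defaultdict(list)  # term -> [(ad_id, bid_terms)]

    def add(self, ad_id, bid_phrase):
        bid_terms = frozenset(bid_phrase.lower().split())
        self.postings[rarest_term(bid_terms, self.freq)].append((ad_id, bid_terms))

    def match(self, query):
        """Return (matching ad ids, number of candidates retrieved)."""
        query_terms = set(query.lower().split())
        matches, candidates = [], 0
        for term in query_terms:
            for ad_id, bid_terms in self.postings.get(term, ()):
                candidates += 1
                # Broad match: every bid term must appear in the query.
                if bid_terms <= query_terms:
                    matches.append(ad_id)
        return sorted(matches), candidates


index = BroadMatchIndex(TERM_FREQUENCY)
index.add("ad_wall", "the great wall of china")  # filed under "china" only
index.add("ad_used", "used books")               # filed under "books" only
index.add("ad_new", "new books")                 # filed under "books" only

print(index.match("cheap used books"))  # retrieves two candidates, keeps ad_used
print(index.match("the awesomeness"))   # retrieves nothing at all
```

Because each ad lives under exactly one term, a query never retrieves the same ad twice, and a query made of popular words like "the" touches no postings unless some ad has nothing rarer to be filed under.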
Please see also my earlier post, "Caching, index pruning, and the query stream", that discusses a SIGIR 2008 paper out of Yahoo Research that explores some vaguely related ideas on index pruning for core search.
Wednesday, June 17, 2009
How much can you do with one server?
At a time when many of us are working with thousands of machines, Paul Tyma provides a remarkable example of how much you can do with just one.
Paul runs the clever Mailinator and Talkinator services. Mailinator lets people receive e-mail to arbitrary addresses under mailinator.com, mostly for disposable e-mail for avoiding spammers or annoying registration requirements. Talkinator is an instant messaging system for easily setting up and joining chat rooms.
Both are examples of removing the login friction normally associated with an application (in this case, mail and instant messaging) to discover a new application with different properties and uses.
In an older post, "The Architecture of Mailinator", Paul describes how he optimized the single server running the service to handle 6M e-mails/day. To summarize, the system is custom-built to the task, all unnecessary goo removed, and focuses on robustness while accepting a very small probability of message loss.
Recently, Paul updated that older post. Now they are processing 2M e-mails/hour on a single machine, 3.1T of data per month, and might have to expand to a second machine, not because the one machine cannot handle the load, but because two machines is the cheapest way of getting the extra bandwidth they need.
Paul also posted details on the architecture of Talkinator, which also is highly optimized to run well on a single server. If you take a peek at that, don't miss his amusing second post where he tested running on a 5W tiny plug-in server.
I am not advocating spending as much effort as Paul does on minimizing server costs. But his work is an inspirational counter-example to the common tendency to throw hardware at the problem. It is an enjoyable tale of building much by staying simple, both in the application features and the underlying architecture.
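For a rough sense of scale, the figures quoted above can be turned into per-second rates with a little arithmetic. This is a back-of-the-envelope sketch under my own assumptions, not numbers from Paul's posts: it assumes a steady load, a 30-day month, that "3.1T" means 3.1 decimal terabytes, and that all of that traffic is e-mail. The function name `mailinator_estimates` is made up.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def mailinator_estimates(old_per_day=6_000_000,
                         new_per_hour=2_000_000,
                         monthly_bytes=3.1e12):  # assumption: decimal terabytes
    """Derive rough rates from the quoted figures, assuming a steady load."""
    new_per_day = new_per_hour * HOURS_PER_DAY
    return {
        "emails_per_second": new_per_hour / SECONDS_PER_HOUR,
        "emails_per_day": new_per_day,
        "growth_factor": new_per_day / old_per_day,
        "avg_bytes_per_email": monthly_bytes / (new_per_day * DAYS_PER_MONTH),
    }


for name, value in mailinator_estimates().items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions, the single machine is handling several hundred messages per second, roughly eight times the load described in the older post, at an average of a couple of kilobytes per message.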
Tuesday, June 09, 2009
Approaching the limit on recommender systems?
An article at the upcoming UMAP 2009 conference, "I like it... I like it not: Evaluating User Ratings Noise in Recommender Systems" (PDF), looks at inconsistencies when people rate movies.
What is particularly interesting about the article is that the authors argue that "state-of-the-art recommendation algorithms" are nearing the limit on accuracy imposed by inconsistencies in ratings, which they and previous work refer to as the "magic barrier".
Natural variability in people's opinions limits how accurate recommender systems can be. According to this paper, we may be almost at that limit already.
Update: For more on this topic, it turns out one of the authors of the paper, Xavier Amatriain, has a blog post, "Netflix Prize: What if there is no Million $ ?", with a good comment thread.
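The intuition behind a noise-imposed limit can be seen in a small simulation. This is my own sketch, not the paper's methodology: it assumes each observed rating is a person's true opinion plus independent Gaussian noise, and the function names and the noise level of 0.5 stars are made up for illustration.

```python
import math
import random


def rmse(predictions, observed):
    """Root mean squared error between predictions and observed ratings."""
    squared = sum((p - o) ** 2 for p, o in zip(predictions, observed))
    return math.sqrt(squared / len(observed))


def simulate_noise_floor(noise_std, n=100_000, seed=0):
    """RMSE of a perfect predictor when ratings carry noise of the given std."""
    rng = random.Random(seed)
    # Assumption: true opinions are spread evenly between 1 and 5 stars.
    true_opinions = [rng.uniform(1.0, 5.0) for _ in range(n)]
    # Assumption: each observed rating is the true opinion plus Gaussian noise.
    observed = [t + rng.gauss(0.0, noise_std) for t in true_opinions]
    # Even predicting the true opinion exactly cannot explain the noise.
    return rmse(true_opinions, observed)


print(simulate_noise_floor(0.5))  # close to 0.5, the noise level itself
```

Under these assumptions, even an oracle that knows everyone's true opinion ends up with an error about equal to the noise in the ratings, so no algorithm evaluated against the noisy ratings can be expected to do better.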
Friday, June 05, 2009
On the front lines of the Netflix Prize
Robert Bell, Jim Bennett, Yehuda Koren, and Chris Volinsky have an article in the May 2008 IEEE Spectrum, "The Million Dollar Programming Prize", with fun tales of their work in the Netflix Prize along with a summary of the techniques that are performing best.
Some excerpts:
The nearest neighbor method works on the principle that a person tends to give similar ratings to similar movies. [For example, if] Joe likes three movies ... to make a prediction for him, [we] find users who also liked those movies and see what other movies they liked.
A second, complementary method scores both a given movie and viewer according to latent factors, themselves inferred from the ratings given to all the movies by all the viewers .... Factors for movies may measure comedy versus drama, action versus romance, and orientation to children versus orientation toward adults. Because the factors are determined automatically by algorithms, they may correspond to hard-to-describe concepts such as quirkiness, or they may not be interpretable by humans at all .... The model may use 20 to 40 such factors to locate each movie and viewer in a multidimensional space ... then [we predict] a viewer's rating of a movie according to that movie's score on the dimensions that person cares about most.
Neither approach is a panacea. We found that most nearest-neighbor techniques work best on 50 or fewer neighbors, which means these methods can't exploit all the information a viewer's ratings may contain. Latent-factor models have the opposite weakness: They are bad at detecting strong associations among a few closely related films, such as The Lord of the Rings trilogy.
Because these two methods are complementary, we combined them, using many versions of each in what machine-learning experts call an ensemble approach. This allowed us to build systems that were simple ... easy to code and fast to run.
Another critical innovation involved focusing on which movies a viewer rated, regardless of the scores. The idea is that someone who has rated a lot of fantasy movies will probably like the Lord of the Rings, even if that person has rated the other movies in the category somewhat low ... This approach nicely complemented our other methods.
Please see also my earlier post, "Netflix Prize at KDD 2008", which points at papers with more details than the IEEE article, including another recent paper by Yehuda Koren.
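As a rough illustration of how the two kinds of predictors described in the excerpts might be combined, here is a toy Python sketch. It is not the team's code: the function names, the similarity scores, the factor values, and the equal blending weights are all invented, and a real ensemble would learn its weights over many models rather than averaging two.

```python
def neighbor_score(target_movie, user_ratings, similarity, k=50):
    """Similarity-weighted average of the user's ratings on the k most similar movies."""
    scored = [(similarity.get((target_movie, movie), 0.0), rating)
              for movie, rating in user_ratings.items()]
    top = [(s, r) for s, r in sorted(scored, reverse=True)[:k] if s > 0]
    total = sum(s for s, _ in top)
    if total == 0:
        return None  # no similar movies rated, so no neighbor-based prediction
    return sum(s * r for s, r in top) / total


def latent_factor_score(user_factors, movie_factors, baseline):
    """Baseline rating plus the dot product of user and movie factor vectors."""
    return baseline + sum(u * m for u, m in zip(user_factors, movie_factors))


def blend(predictions, weights):
    """Weighted average of several predictors' outputs (a very simple ensemble)."""
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


# Hypothetical data for one viewer and one target movie.
user_ratings = {"fellowship": 5.0, "two_towers": 4.0, "notebook": 2.0}
similarity = {
    ("return_king", "fellowship"): 0.9,
    ("return_king", "two_towers"): 0.8,
    ("return_king", "notebook"): 0.1,
}

knn = neighbor_score("return_king", user_ratings, similarity)
mf = latent_factor_score([0.5, -0.2], [1.0, 0.5], baseline=3.6)
print(knn, mf, blend([knn, mf], [0.5, 0.5]))
```

The neighbor predictor leans heavily on the two closely related films, while the factor predictor only sees the viewer's broad tastes, which is the kind of complementarity the excerpt describes.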
Thursday, June 04, 2009
The siren song of startups
Over at the blog of the Communications of the ACM, I have a new post on "The Siren Song of Startups".
The article tries to get readers thinking more carefully about why they might and might not want to join a startup.
If you like the post, you might enjoy my other posts over at blog@CACM: "Enjoying Reading Research", "What To Do With Those Idle Cores?", and "What is a Good Recommendation Algorithm?"
Monday, June 01, 2009
Yahoo CEO Carol Bartz on personalization
Interesting tidbit on personalization in a Q&A with Yahoo CEO Carol Bartz:
Yahoo! is the place where millions of people come every day to see what is happening with the people and the things that matter most to them.
That could mean what's happening in the world -- like breaking news, sports scores, stock quotes, last night's TV highlights -- and your world -- like your email, photos, groups, fantasy leagues.
Based on what we know about you ... we can bring you both those worlds. So I think our clear strength is "relevance" -- whether that means knowing what weather to give you or serving up headlines you'll be interested in. It's all about really getting you.
So Yahoo wants to filter and recommend relevant content based on what it knows about you. Is this a new goal for Yahoo?
Two years ago, Yahoo co-founder Jerry Yang spoke about "better tailoring Yahoo's iconic Web portal to individual users, with the help of technology that predicts what they want."
Four years ago, Yahoo CEO Terry Semel said that one of the "four pillars" of Yahoo was "personalization technology to help users sort through vast choices to find what interests them" and Yahoo executive Lloyd Braun called it one of Yahoo's "secret weapons".
Five years ago, Yahoo CEO Terry Semel said, "Personalization will also play more of a role on the Yahoo home page in the coming months" and "painted a picture in which users could tailor the Yahoo home page to suit their particular interests," adding, "We want the home page to be totally personalized."
Making Yahoo content more relevant and useful to people would be fantastic. Personalization is a way to make the site more relevant and useful. But this idea clearly has been around the halls of Yahoo for some time. The key is going to be executing on it quickly.
Please see also my May 2006 post, "Yahoo home page cries out for personalization".
Update: Three weeks later, Forbes Magazine quotes Carol Bartz as admitting that Yahoo's troubles have been "an execution problem ... we are really working on (moving) from 'we can do this, we can do this' to 'we did do this.'"
She also has several other quotes in that same article that seem quite promising for the future of Yahoo, including that "none of us hate ads ... we just hate crappy ads" and that we should see Yahoo doing personalization and recommendations of news and other home page content soon. Great to hear it.
Thursday, May 28, 2009
Danny Sullivan on Microsoft search
Danny Sullivan has a couple great posts ([1] [2]) on Microsoft renaming its search engine to Bing, the new features and improvements in relevance coming along with the launch, and what it might mean for Google and Yahoo.
Danny is wise in the ways of search. His words are well worth reading.
Update: To be clear, just because I think Danny's posts are well worth reading doesn't mean I agree with him on everything. In fact, quite the opposite, I usually find the most interesting articles to be the ones that make a convincing argument on something with which I disagree.
One particular point of disagreement I have with Danny is on whether Microsoft's Bing needs to be different or better than Google to succeed. Many, including Danny, have said that being good enough isn't enough because of the power of Google's brand. I'm not sure that's true.
Until now, we haven't been able to tell who is right since Microsoft's search has been noticeably weaker than Google's. But, as Rafe Needleman noted, as did Danny, the name change to Bing comes with substantial improvements to relevance.
So, I think we're about to get the first real test of whether being about as good as Google is enough to see people using Microsoft's search engine.
Update: A week later, Danny Sullivan has a Q&A with Nick Eaton. Interesting how low Danny sets the bar for success for Microsoft's Bing, saying, "They're roughly at 10 percent [market share], I think that they could count themselves really successful if they got themselves to 15 or 20."
Update: Three weeks later, an article in the NY Post claims, "Co-founder Sergey Brin is so rattled by the launch of Microsoft's rival search engine that he has assembled a team of top engineers to work on urgent upgrades to his Web service." Frankly, I find that a little hard to believe, at least as phrased. Seeing as Microsoft is getting closer, I would expect Googlers are redoubling their efforts to stay ahead in core search, but I doubt that Sergey, Larry, or Eric are deeply rattled by some new competition.
Tuesday, May 26, 2009
Google Suggest and the right ad
In "Google Kills a Sacred Cow", Anders Bylund at the Motley Fool has an interesting take on Google's decision to show advertising in their search suggestion feature.
An excerpt:
You know how you get suggestions on ... the query you're typing into a Googlish search box? That suggestion list just became context-sensitive, more personalized, and more likely to send you to your destination without ever seeing a page of search results.
Oh, and you'll see advertising in that box, too. If you never see a results page, you might still send revenue Big G's way.
Yahoo recently promised to kill the typical "10 blue links" results in favor of more dynamic presentations. Google immediately took that idea one step further and did away with results altogether for many searches.
While Yahoo, Microsoft's MSN/Live/Kumo/whatever, Time Warner's AOL, and IAC/InterActiveCorp's Ask.com depend on exposing users to as many ads as possible by keeping them on their sites, Google is going the other way. All it takes is one ad -- the right ad in the right place -- and Google's cash flow is secure.
All it takes is one ad, the right ad in the right place.
Long ago, when I was at Amazon working on personalization, we used to joke that the ideal Amazon site would not show a search box, navigation links, or lists of things you could buy. Instead, it would just display a giant picture of one book, the next book you want to buy.
When you overwhelm people with choices, the important becomes lost in the mediocrity. Showing as many ads as possible just encourages ad fatigue. The focus of web advertising should be on getting the right ad in the right place. The focus should be on relevance.
Monday, May 18, 2009
The datacenter is the new mainframe
For a while, I have been planning on writing a post comparing large scale clusters with the mainframes of yore, a piece that would be full of colorful references to timesharing, scheduling and renting compute resources, and other tales that would date me as the fossil that I am.
Fortunately, Googlers Luiz Andre Barroso and Urs Holzle recently wrote a fantastic long paper, "The Datacenter as a Computer" (PDF), that not only spares me from this task, but covers it with much more data and insight than I ever could.
Some excerpts:
New large datacenters ... cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in [our] facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment.
In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC).
Much like an operating system layer is needed to manage resources and provide basic services in a single computer, a system composed of thousands of computers, networking, and storage also requires a layer of software that provides an analogous functionality at this larger scale.
[For example] resource management ... controls the mapping of user tasks to hardware resources, enforces priorities and quotas, and provides basic task management services. Nearly every large-scale distributed application needs ... reliable distributed storage, message passing, and cluster-level synchronization.
The paper goes on to describe the challenges of making an entire datacenter behave like it is one large compute resource to applications and programmers, including discussing the existing application frameworks and need for further tools.
Do not miss Figure 1.3 that shows latency, bandwidth, and capacity to resources in the data center. It includes an insightful look at latency to local and remote memory and the equivalent latencies but drastically different capacities of local and remote disk for programs running in the cluster. As the authors say, a key challenge is to expose these differences when they matter and hide them when they don't.
There are also thought-provoking tidbits on minimizing interference between jobs on the cluster and maximizing utilization, two goals that often are at odds with each other.
Much of the rest of the paper covers the cost and efficiency issues of data centers more generally, nicely and concisely summarizing much of the recent publicly known work on power, cooling, and infrastructure.
One small thing it does not mention is the ability to rent resources (e.g. EC2) on a WSC, much like buying time on the old mainframes, and the impact that could have on utilization, especially once we allow variable pricing and priorities.
[Google paper found via James Hamilton]
Update: A couple weeks later, security guru Bruce Schneier writes, "Cloud computing is nothing new. It's the modern version of the timesharing model from the 1960s."
Monday, May 11, 2009
The potential of behavioral targeted advertising
An important WWW 2009 paper out of MSR Asia, "How much can Behavioral Targeting Help Online Advertising?" (PDF), looks at how much value we can get from targeting ads to past behavior. It is a must read for anyone interested in personalized advertising.
Some excerpts:
To our best knowledge, this work is the first systematic study for [behavioral targeting] (BT) on real world ad click-through [logs].
[We] empirically answer ... whether BT truly [has] the ability to help online advertising ... how much BT can help ... [and] which BT strategy can work better than others.
We observe that the users who clicked the same ad can be over 90 times more similar than the users who clicked different ads .... [which verifies] the basic assumption of BT.
We observe that .... the ads CTR can be improved as high as 670% by the simple user segmentation strategies used for behavioral targeted advertising .... [More] advanced user representation and user segmentation algorithms [yielded improvements of] more than 1,000%.
Through comparing different user representation strategies for BT, we draw the conclusion that user search behavior, i.e. user search queries, can perform several times better than user browsing behavior, i.e. user clicked pages. Moreover, only tracking the short term user behaviors are more effective than tracking the long term user behaviors.
One issue with the study is that it looks only at coarse-grained user segments, at most 160 segments, not fine-grained, one-to-one, personalized advertising. I would suspect there would be added benefit from fine-grained, personalized targeting to past behavior rather than clustering users into large groups. In fact, as Figure 5 in the paper shows, they do not appear to have hit the point of diminishing returns even on splitting into more segments.
Another issue is that they explicitly did not look at using demographic data or locality. It is possible that many of the BT user segments might be grouped roughly by locality, having derived it from search or browsing behavior. If that is true, then many of the gains they saw from BT could be reaped much more easily by targeting the ads to implicit local information. And, if we target to locality, the additional gains we could expect from behavioral targeting might then be much smaller.
But, limitations aside, it is a great paper. The authors clearly and cleanly state their key question -- what is the value of behavioral targeting for advertising? -- and then analyze a massive historical log to convincingly derive the likely value. The paper also provides much guidance for those who might seek to build these or similar systems.
The paper has an interesting conclusion that recent search queries are the most useful indicators of people's interests when targeting ads. Yahoo's Andrei Broder said something similar recently when thinking about targeting advertising. It is also worth noting that others who were looking at the value of personalized search came to similar conclusions ([1] [2]).
For more on fine-grained personalized advertising, please see my earlier posts, "What to advertise when there is no commercial intent?" and "A brief history of Findory".
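As a footnote on reading the CTR figures quoted above, here is a minimal sketch of the arithmetic. It assumes the conventional relative-change definition of improvement; the paper may define its metric differently. The function names and the click and impression counts are hypothetical, chosen only so the numbers come out round.

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: fraction of ad impressions that were clicked."""
    return clicks / impressions


def improvement_pct(baseline_ctr: float, targeted_ctr: float) -> float:
    """Relative improvement of the targeted CTR over the baseline, in percent."""
    return (targeted_ctr - baseline_ctr) / baseline_ctr * 100.0


def multiplier(improvement: float) -> float:
    """Convert a percent improvement into a 'times the baseline' factor."""
    return 1.0 + improvement / 100.0


# Hypothetical numbers: the ad shown to everyone vs. shown to one user segment.
baseline = ctr(clicks=1_000, impressions=1_000_000)  # 0.1% CTR
segment = ctr(clicks=77, impressions=10_000)         # 0.77% CTR

lift = improvement_pct(baseline, segment)
print(f"improvement: {lift:.0f}%, i.e. {multiplier(lift):.1f}x the baseline CTR")
```

Under this reading, a 670% improvement means the segment clicks at 7.7 times the baseline rate, and an improvement of more than 1,000% means more than 11 times.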
Wednesday, May 06, 2009
Exploiting spammers to make computers smarter
Googlers Rich Gossweiler, Maryam Kamvar, and Shumeet Baluja had a fun paper at WWW 2009, "What's Up CAPTCHA? A CAPTCHA Based On Image Orientation" (PDF), that asks people to rotate images correctly to prove they are human rather than the norm of deciphering distorted text.
Some brief excerpts from their paper:
We present a novel CAPTCHA which requires users to adjust randomly rotated images to their upright orientation ... Rotating images to their upright orientation is a difficult task for computers .... [Our] system ... results in a 84% human success rate and .009% bot success rate.
The main advantages of our CAPTCHA technique over traditional text recognition techniques are that it is language-independent, does not require text-entry (e.g. for mobile devices), and employs another domain for CAPTCHA generation beyond character obfuscation.
The paper goes on to say that "no algorithm has yet been developed to successfully rotate the set of images used in our CAPTCHA system." The key word there is "yet." As soon as there is a strong incentive for people to develop better algorithms for this problem, better algorithms will be developed.
But, as Luis von Ahn insightfully pointed out in a recent interview in New Scientist, it is a perfectly fine outcome if spammers find a way to break this new image-based CAPTCHA technique. By doing so, they are helping us make computers smarter.
From the New Scientist article:
"If [the spammers] are really able to write a programme to read distorted text, great – they have solved an AI problem," says von Ahn. The criminal underworld has created a kind of X prize for OCR.
Security groups ... [then] can ... switch for an alternative CAPTCHA system -- based on images, for example -- presenting the eager spamming community with a new AI problem to crack ... Image orientation is difficult for computers. But if [image-based] CAPTCHA becomes common, it won't be long before spammers turn their attention to cracking the problem, with potential fringe benefits to cameras and image editing software.
Speech recognition CAPTCHAs are already being used, and image labelling ones could follow, says von Ahn. AI researchers are already working in both these areas, but they could soon be joined by spammers also helping advance the technology.
Perhaps it is time to start designing CAPTCHAs in a different way -- pick problems that need solving and make them into targets to be solved by resourceful criminals.
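For readers curious what the orientation check in such a CAPTCHA might look like, here is a minimal sketch. It is not the paper's implementation: the 15-degree tolerance, the three-image challenge, and all function names are my own assumptions for illustration.

```python
import random


def make_challenge(rng: random.Random) -> int:
    """Pick a random starting rotation, in whole degrees, for an upright image."""
    return rng.randrange(0, 360)


def angular_error(angle: float) -> float:
    """Smallest absolute distance, in degrees, between `angle` and upright (0)."""
    a = angle % 360
    return min(a, 360 - a)


def is_upright(initial_rotation: float, user_adjustment: float,
               tolerance: float = 15.0) -> bool:
    """True if the user's adjustment brings the image within `tolerance` of upright."""
    return angular_error(initial_rotation + user_adjustment) <= tolerance


def random_guess_success(tolerance: float = 15.0, images: int = 3) -> float:
    """Chance that uniformly random rotations pass every image in a challenge."""
    per_image = (2 * tolerance) / 360
    return per_image ** images


rng = random.Random(2009)
start = make_challenge(rng)
print(is_upright(start, -start))  # undoing the rotation exactly always passes
print(f"blind guessing passes {random_guess_success():.4%} of challenges")
```

The sketch only covers blind guessing. A bot that can estimate orientation even roughly would do far better, which is exactly the AI problem von Ahn is happy to have spammers work on.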
Tuesday, May 05, 2009
More Microsoft layoffs
Sunday, April 26, 2009
Leaving Microsoft
With the dissolution of much of Live Labs, I have decided to resign from Microsoft. I will be leaving at the end of April.
Working at Microsoft on search and advertising turned out to be a lot of fun. High impact, useful, and interesting problems are everywhere.
For example, during my time there, I had a chance to work a bit on advertising relevance (for work with similar motivation, see [1] in Section 6.2 and [2]), search relevance (closest public example of something vaguely similar might be [1]), improving the quality of human judgments (vaguely similar to the ideas published in [1] and [2]), looking at new evaluation methods for search (motivated by [1]), ubiquitous online experimentation (same goals as ExP), personalized web search (like Findory and motivated by [1] and [2]), personalized advertising (see [1]), and large scale data analyses (see [1]).
And just as fun as the problems were the people. I had a chance to talk with so many at Microsoft, from celebrated researchers to the hard-working talent pounding on the code. It was very enjoyable to work on such a breadth of problems and with so many different people. I did much, learned much, and I will miss it.
As for next steps, I will be taking some time before settling into anything new. I still hold the same passion for taming information overload, for personalizing the data streams of the Web to make them relevant, helpful, and useful. Whoever manages to change the nature of content display on the Web from a search problem to a recommender problem will reap tremendous rewards. I hope to play my part in that shift.
Saturday, April 25, 2009
Google server and data center details
At the Efficient Data Center Summit, Google and others were discussing techniques to reduce energy consumption from massive clusters. In the process, the Googlers offered some very fun peeks into how they designed some of their servers and data centers.
For example, Chris Malone and Ben Jai offered a talk, "Insights Into Google's PUE", that, starting on slide 8, describes how Google uses single-voltage power and an on-board uninterruptible power supply to raise efficiency at the motherboard from the norm of 65-85% to 99.99%. There is a picture of the board on slide 17.
Amazon's James Hamilton attended the conference and elaborated on this:
The server design Google showed was clearly a previous generation ... a 2005 board ... [but] was a very nice design.
The board is a 12volt only design ... 12V only supplies are simpler, distributing on-board the single voltage is simpler and more efficient, and distribution losses are lower.
The most innovative aspect of the board design is the use of a distributed UPS. Each board has a 12V VRLA battery that can keep the server running for 2 to 3 minutes during power failures. This is plenty of time to ride through the vast majority of power failures ... [and] it avoids the expensive [and less efficient] central UPS system.
The server was designed to be rapidly serviced with the power supply, disk drives, and battery all being Velcro attached and easy to change quickly.
Also fun is a video tour of one of Google's data centers. The video is short and worth watching for a look at how cooling and wiring are done these days as well as checking out how they used servers in shipping containers.
For more details, Amazon's James Hamilton has additional notes ([1] [2] [3]) and CNet's Stephen Shankland links to all the videos (including video of the talks at the summit).
Friday, April 24, 2009
Serendipity, diversity, and personalized search
Aside from the amusing double entendre in its title, a recent paper out of Microsoft Research, "From X-Rays to Silly Putty via Uranus: Serendipity and its Role in Web Search" (PDF), is notable for its take on two topics that seem to be attracting increasing attention lately: personalized search and improving the diversity of search results.
Some excerpts:
Partially-relevant search results, identified as "containing multiple concepts, [or] on target but too narrow," play an important role in a user's information seeking process and problem definition.
By studying Web search query logs and the results people judge relevant and interesting, we find many of the queries people perform return interesting (potentially serendipitous) results that are not directly relevant .... More than a fifth of all search results were judged interesting but not highly relevant to the search task.
Serendipity was more likely to occur in diverse result sets .... Personalization scores correlate with both relevance and also with interestingness, suggesting that information about personal interests and behaviour may be used to support serendipity.
So, the paper suggests that there may be multiple benefits to personalized search. Not only do we get the benefits of improved understanding of query intent and increased relevance, but also we can improve diversity and discovery.
For a discussion of yet another benefit, reducing the payoff to web spammers, please see also my July 2006 post, "Combating web spam with personalization".
Friday, April 10, 2009
MapReduce using Amazon's cluster and differential pricing
Amazon recently launched Elastic MapReduce, a web service that lets people run MapReduce jobs on Amazon's cluster.
Elastic MapReduce appears to handle almost all the details for you. You upload data to S3, then run a MapReduce job. All the work of firing up EC2 instances, getting Hadoop on them, getting the data out of S3, and putting the results back in S3 appears to be done for you. Pretty cool.
Even so, I have a big problem with this new service: the pricing. MapReduce jobs are batch jobs that could run at idle times on the cluster, but there appears to be no effort to run them during idle times, nor is there any discount on the pricing. In fact, you actually pay a premium for MapReduce jobs above the cost of the EC2 instances used during the job.
It is a huge missed opportunity. Smoothing out peaks and troughs in cluster load improves efficiency. Using the idle time of machines in Amazon's EC2 cluster should be essentially free. The hardware and infrastructure costs are all sunk. In a non-peak time, only the marginal cost of the additional electricity used by a busy box over an idle box is a true cost.
What Amazon should be doing is offering a steep discount on EC2 pricing for interruptible batch jobs like MapReduce jobs, then only running those jobs in the idle capacity of non-peak times. This would allow Amazon to smooth the load on their cluster and improve utilization while passing on the savings to others.
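To make the idea concrete, here is a minimal sketch of filling off-peak idle capacity with interruptible batch work. It is only an illustration, not a description of how Amazon schedules anything: the hourly demand figures, the cluster size, and the greedy "idlest hour first" policy are all assumptions of mine.

```python
from typing import List


def idle_capacity(demand: List[int], capacity: int) -> List[int]:
    """Machines left idle in each hour once regular demand is served."""
    return [max(capacity - d, 0) for d in demand]


def schedule_batch(demand: List[int], capacity: int, batch_hours: int) -> List[int]:
    """Greedily place interruptible batch work into the idlest hours first.

    Returns the batch machine-hours assigned to each hour. Work that does
    not fit in idle capacity is simply left unscheduled.
    """
    idle = idle_capacity(demand, capacity)
    assigned = [0] * len(demand)
    remaining = batch_hours
    for hour in sorted(range(len(demand)), key=lambda h: idle[h], reverse=True):
        if remaining == 0:
            break
        take = min(idle[hour], remaining)
        assigned[hour] = take
        remaining -= take
    return assigned


def utilization(load: List[int], capacity: int) -> float:
    """Fraction of available machine-hours actually used."""
    return sum(load) / (capacity * len(load))


# Hypothetical cluster: 100 machines, six time slots of regular demand.
demand = [30, 20, 10, 40, 80, 100]
capacity = 100
batch = schedule_batch(demand, capacity, batch_hours=200)
combined = [d + b for d, b in zip(demand, batch)]

print("batch placed per slot:", batch)
print(f"utilization before: {utilization(demand, capacity):.0%}")
print(f"utilization after:  {utilization(combined, capacity):.0%}")
print("peak load unchanged:", max(combined) == max(demand))
```

In this toy example the batch work raises utilization substantially without raising the peak, which is why the hardware for it is effectively already paid for.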
For more on this topic, please see also my Jan 2007 post, "I want differential pricing for Amazon EC2".
Please see also Amazon VP James Hamilton's recent post, "Resource Consumption Shaping", which also talks about smoothing load on a cluster. Note that James argues that the marginal cost of making an idle box busy is near zero because of the way power and network use are billed (at the 95th percentile).
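A small sketch shows why 95th percentile billing makes trough-filling nearly free. The exact percentile convention and sampling interval vary by provider, so the "drop the top 5% of samples" rule below, the sample values, and the function names are assumptions for illustration only.

```python
from typing import List


def billed_level(samples: List[int]) -> int:
    """95th-percentile billing: drop the top 5% of samples, bill the highest left."""
    ordered = sorted(samples)
    # Integer arithmetic for ceil(0.95 * n) - 1, avoiding float rounding.
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


def fill_troughs(samples: List[int], ceiling: int) -> List[int]:
    """Raise every sample below `ceiling` up to it, as if idle time ran batch work."""
    return [max(s, ceiling) for s in samples]


# Hypothetical usage samples over a billing period (arbitrary units).
usage = [10] * 8 + [40] * 6 + [70] * 5 + [100]

before = billed_level(usage)
filled = fill_troughs(usage, ceiling=before)
after = billed_level(filled)

print("billed level before:", before)
print("billed level after: ", after)
print("total usage before: ", sum(usage))
print("total usage after:  ", sum(filled))
```

In this example total usage nearly doubles while the billed level does not move, because everything added sits at or below the level already being paid for.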
For some history on past efforts to run Hadoop on EC2, please see my Nov 2006 post, "Hadoop on Amazon EC2".
Thursday, March 26, 2009
Semantic interpretation and the effectiveness of big data
Googlers Alon Halevy, Peter Norvig, and Fernando Pereira have an article, "The Unreasonable Effectiveness of Data" (PDF), in the April 2009 IEEE Intelligent Systems on semantic interpretation using big data.
Some excerpts:
The number of grammatical English sentences is theoretically infinite ... However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need.
We're left with ... interpreting the content, which is mainly that of learning as much as possible about the context of the content to correctly disambiguate it .... What we need are methods to infer relationships between ... entities in the world. These inferences may be incorrect at times, but if they're done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data.
Unlabeled data ... is so much more plentiful than labeled data ... With very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there.
The article talks in more detail about work at Google and elsewhere on extracting relationships from massive crawls of text, tables, and the deep web.
On a related note, Google announced some new features a couple days ago, improved query suggestions and snippets, that Googler Ori Allon apparently described as scanning pages "in real-time ... after a query is entered" and identifying "conceptually and contextually related sites/pages" using "an 'understanding' of content and context." Many news articles are referring to this as a step toward semantic search.
Please see also my April 2008 post, "GoogleBot starts on the deep web", which discusses related work by Alon Halevy on mining data in tables and the deep web.
Please see also my post on the WSDM 2008 keynote by Oren Etzioni on semantic interpretation. His work is mentioned a few times by Halevy et al.
[IEEE article found via the Google Research Blog]
