Tuesday, February 28, 2006
Sean Rowe at MSN posts that MSN Virtual Earth is demoing what they call "street-side drive-by", a nifty UI on top of street-level photos from the front and side of some locations in some major cities.
Though limited in scope, buggy, and IE only, this new feature looks like it could be a major one-up on the BlockView images in A9 Maps. When it works, it's a super-slick, jaw-dropping experience, very fun.
Monday, February 27, 2006
Different visions of the future of search
Ingrid Marson at ZDNet nicely captures the differences between the visions of MSN, Yahoo, and Google for the future of web search.
Saleel Sathe (Lead PM, MSN Search) argues that searchers will have to change their behavior and learn how to use search better:
"Search engines have shot themselves in the foot by providing a search box, where users provide relatively little information," Sathe said.
"The average search query is 2.3 words... but if you asked a librarian for information you would not just give them 2.3 words -- you would give them the opportunity to give you the rich detailed answer you want."
Bradley Horowitz (head of Technology Development Group at Yahoo Search) argues that we should throw out existing web search and replace it with social search:
"What we think is the next major breakthrough is social search. It basically democratises the notion of relevance and lets ordinary users decide what's important for themselves and other users," said Horowitz.
"Where is the next big breakthrough that gets beyond PageRank? PageRank confers a privilege to Webmasters who vote by proxy for all of us."
Matthew Glotzbach (Director, Google) argues that the computer should do the work and figure out what people need from whatever information is available:
"Larry Page [the co-founder] of Google often says, 'the perfect search engine would understand exactly what you mean and give back exactly what you want'."
"In the distant future we will not be able to get you to take more action ... We will get close enough with what you give us. A lot of emphasis will continue on doing that in the background -- getting the technology to figure out [what you want]," [Matthew] said.
MSN (and, until recently, A9) wants to give you more powerful tools. Yahoo wants the community of users to help improve search. Google wants computers to do all the work to get you what you need.
But all of these strategies face formidable challenges.
There certainly is promise in treating search as a dialogue -- an iterative process rather than a one-shot deal -- but I think any attempt by MSN to get users to do more work is doomed from the start. People are lazy, appropriately so. They want what they want and they want it now. If you don't find it for them quickly and easily, they'll switch to a tool that will.
Yahoo's social search, as I've said before, faces two major hurdles: spam and non-participation.
Spam, oh glorious spam. Letting ordinary users decide what others see is great until those ordinary users discover the profit in promoting their own sites. Whether the wisdom of the crowd can overpower the hucksters of the bazaar remains to be seen.
And these lofty notions of replacing a supposed "vote by proxy" tyranny of webmasters with a democracy of widespread participation will succumb to the reality that, again, users are lazy, and few will participate if it requires effort. At best, vote by one proxy will be replaced by vote by another proxy, something Bradley himself has acknowledged.
But, Google's chosen path is a daunting one as well. It is very hard to figure out what someone wants given only limited, noisy information about their intent. The key likely will be to use all the information available, including the searcher's history of past searches, but scalable, high-quality, personalized search is still in its infancy.
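To make the idea concrete, here is a toy sketch of reranking results using a searcher's history. Everything below (the function, the weighting scheme) is my own invention for illustration, not anything Google has described:

```python
from collections import Counter

def personalize(results, history):
    """Rerank search results by boosting documents whose terms overlap
    with the user's past queries.

    results: list of (doc_id, base_score, set_of_terms)
    history: list of past query strings
    A toy illustration of biasing ranking with noisy history signals.
    """
    # Build a profile of how often each term appears in past queries.
    profile = Counter(term for q in history for term in q.lower().split())
    total = sum(profile.values()) or 1

    def score(result):
        doc_id, base, terms = result
        # Boost is the fraction of the user's history covered by this doc.
        boost = sum(profile[t] for t in terms) / total
        return base * (1.0 + boost)

    return sorted(results, key=score, reverse=True)

# Two equally relevant pages; history about programming breaks the tie.
results = [("d1", 1.0, {"python", "tutorial"}),
           ("d2", 1.0, {"snake", "care"})]
history = ["python list comprehension", "python decorators"]
print(personalize(results, history)[0][0])  # d1, the programming page
```

Even a crude profile like this shows the core tension: with little history, the boost is noise; with rich history, it starts to look like intent.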
No matter how you look at it, it's an exciting time for search. Much will happen in the coming years.
Manual vs. automated tagging
Rich Skrenta of Topix.net posted a good critique of manual tagging of documents:
Tags aren't a panacea, since they're excessively vulnerable to spam, and the items which should belong to the same categories will get different tags from different users. Which is it, "topixnet"? or "topix"?
They're uniquely valuable in a system like Flickr since photos don't have any text of their own to keyword search, so getting the user to add any searchable text at all is a big win. You can ask users to caption their photos but often putting just a word or two is easier so the participation level is higher.
But if you have the full text of the web, or blogosphere, or whatever, the marginal utility of the "keywords" tag on the document seems to be rather low. To deal with spam and relevance issues, the search interface for a large collection needs to be appropriately skeptical about what documents are claiming to be about.
All the interest (dare I say hype) is largely ignoring the fact that we've had tagging on the web for going on 10 years, and the experience on the search side is that it can't be trusted.
This reminds me of what Danny Sullivan said about manually tagging documents:
The meta keywords tag has been around for nearly a decade. The idea behind it in part was that people could use the tag to classify what their pages are about.
The data is largely useless ... Thinking that tagging would lead to top rankings, some people misused the tag. Other people didn't misuse the tag intentionally, but they might poorly describe their pages.
Wide-open tagging, where anyone can get their pages to the top of a list just by labeling it so, is going to be a giant spam magnet.
Stephen Green at Sun summarizes it well:
[Tagging is] not really a new way of indexing documents, it's actually an old way that didn't work very well.
The real test of manual tagging of documents will be when these tagging tools become large enough to drive substantial traffic to websites.
Right now, these tools mostly are used by early adopters. This small, dedicated, loyal audience tends to behave well because there is little incentive to do otherwise.
If the tools become more popular, the incentive to manipulate them will increase. The community-generated tags will become less and less reliable as spammers enter seeking traffic and profit.
This will be the real challenge to manual document tagging. It remains to be seen whether the wisdom of the crowd can prevail over the deceptions of the scammers.
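To make the point about skepticism concrete, here is a toy scoring function that treats self-applied tags as weak evidence and lets body text dominate. The weights are invented for illustration, not taken from any real engine:

```python
def relevance(query_terms, body_terms, tag_terms,
              body_weight=1.0, tag_weight=0.2):
    """Score a document, discounting self-applied tags.

    Tags count as weak evidence (tag_weight) because anyone can apply
    them; matches in the actual body text count at full weight.
    """
    score = 0.0
    for term in query_terms:
        if term in body_terms:
            score += body_weight
        elif term in tag_terms:       # tag-only match: heavily discounted
            score += tag_weight
    return score

# A spammy page tagged "ipod" with no supporting body text scores far
# lower than a page that actually discusses the topic.
spam = relevance({"ipod"}, body_terms=set(), tag_terms={"ipod"})
real = relevance({"ipod"}, body_terms={"ipod", "review"}, tag_terms=set())
print(spam, real)  # 0.2 1.0
```

The exact discount doesn't matter much; what matters is that a label alone can never outrank actual content.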
Saturday, February 25, 2006
Amazon and personalized ads
In a Red Herring article, "Amazon's A9 Mystery", the author asks:
What's stopping Amazon from capitalizing on a system for pairing data on consumer buying habits with search engine-powered ads?
Good question.
See also my previous post, "Amazon version of AdSense?"
Thursday, February 23, 2006
Google Voice Search
I finally got around to reading the "Searching the Web by Voice" paper that describes the techniques behind the now defunct Google Voice Search.
It's an interesting idea. Apparently, when this demo was active, you could call a phone number, speak the Google Search you want performed, and then hear some summary of the search results, all over your phone.
But, the paper makes it clear that this is very hard to do. They have to understand an unrestricted vocabulary, cut through noise and accents, and do it all in real-time. The paper reports that their accuracy -- accurate transcriptions of what was said -- was below 50%. Improving accuracy was blocked by the fact that more complicated models took too much time for real-time responses.
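As an aside, transcription accuracy like this is usually measured with word error rate, the word-level edit distance between the system's transcript and a reference of what was actually said. A minimal sketch of the metric:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the length of the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, over words not characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("search the web by voice",
                      "search the web by choice"))  # 0.2
```

Under 50% accuracy means roughly every other word comes out wrong, which gives a feel for how hard the noisy, unrestricted-vocabulary setting is.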
Sounds like a fun project. Perhaps an insurmountable challenge, but that shouldn't mean it isn't worth pursuing.
Wednesday, February 22, 2006
New Official Google Research Blog
There's a new weblog from Google, the "Official Google Research Blog".
The first post is from AI guru Peter Norvig, who I believe used to be Director of Search Quality but apparently now is a Director of Google Research.
On the reason for starting the blog, he says, "We've been asked what Google Research is like, and we thought the best way to answer is with a blog."
Sounds like we'll be seeing posts from many of the top researchers at Google. I hope that includes people like Steve Lawrence, Krishna Bharat, and Rob Pike.
I also hope we'll be seeing more pointers to and discussion of recent papers and technical reports produced by Google. The current list of papers available from Google Labs seems like it is getting out of date, with no papers from 2005, only one from 2004, and only one from 2003.
[Found via Barry Schwartz]
Tuesday, February 21, 2006
Early Amazon: Interviews
Fast growth means lots of hiring. And there was a lot of hiring at Amazon. At times, I was doing 1-3 interviews every day.
Different people had different interview techniques. I spent a lot of time refining mine. I did much reading on different strategies, tried borrowing ideas from Microsoft and elsewhere, and pulled other Amazonians into long discussions and debates about hiring.
In the end, I settled on looking for three things: enthusiasm, creativity, competence.
On enthusiasm, I wanted to see that the candidate had done at least minimal research into Amazon, the business, and the website. I was shocked by how many people came in to interview at Amazon who had never used the website. Here you are, considering spending the next few years at Amazon, and you didn't do some basic investigation beforehand? C'mon, that's just lame.
On creativity, I usually looked for ideas on how to improve the site. It didn't matter if we had already thought of the ideas. I realize how hard it is to come up with cool ideas from the outside. But I wanted to see some exploration of how things could be done better.
On competence, I merely attempted to verify what they said on their resume. If they claimed to be an expert in C++, could they answer a couple of introductory-level questions about C++? If they claimed to know Python, could they write a trivial 5-10 line program in Python? If they had a degree in computer science, could they talk about algorithms and data structures? For some project they said they did in the past, could they talk about it in depth and with enthusiasm for all the little details?
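For what it's worth, here is the kind of trivial 5-10 line Python program I have in mind. This particular question is a made-up example, not an actual Amazon interview question:

```python
def word_counts(text):
    """Count how often each word appears, ignoring case."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

print(word_counts("the quick fox jumped over the lazy dog"))
```

Anyone who claims to know the language should knock this out in a couple of minutes; watching how they do it tells you far more than the answer itself.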
By the way, exploring someone's knowledge doesn't necessarily require knowledge of it yourself. You can just keep asking questions, diving deeper and deeper. If they really understand the problem, they should be able to explain it to others, to teach people about the problem.
Eventually, you should get to a point where they say "I don't know" to a question. That's a great sign. Knowing what you know isn't as important as knowing what you don't know. It is a sign of real understanding when someone can openly discuss where their knowledge ends.
Reading over these questions now, it may sound like an easy filter. Minimal enthusiasm for and research of Amazon.com, a couple creative ideas on improving Amazon, and don't lie or exaggerate on your resume. Wouldn't everyone pass that?
As it turns out, no. Offhand, I'd guess that I rejected about 90% of candidates during phone screens and another 90% during face-to-face interviews. A filter for enthusiastic, creative, competent people seemed to be a surprisingly harsh one.
Of these three things, I think the single biggest predictor of success at Amazon.com was enthusiasm.
Amazon was a chaotic environment. Getting things done required initiative. Amazon needed people who would grab problems by the throat and never let go. Part of my work every day was to find them.
Monday, February 20, 2006
Small new features at Findory
In response to a few suggestions and requests, I've added a few small new features to Findory.
First, several people who enjoy the Findory feed reader, Findory Favorites, wanted an RSS feed for the Top Stories from My Favorites.
Top Stories from My Favorites is a personalized selection of articles picked from your favorite feeds. It's designed to emphasize interesting articles based on your reading habits rather than forcing you to read every single post from every single blog in your feed reader.
To use this nifty new RSS feed, go to Findory Favorites by clicking on the "[N] Favorites" link at the top right corner of every page on Findory. If you don't have any favorites yet, you can add them by uploading an OPML file, importing from Bloglines, adding RSS feeds individually, or picking favorite sources by clicking source names on the Findory site. Make sure you are signed in, then look for the "RSS" button at the bottom of the Favorites page.
Second, others who like Findory Favorites wanted to see more top stories than the 20 or so we list by default. So, Findory Favorites now allows you to read up to 200 stories, 20 at a time. Look for the "Read more" link at the bottom of the Favorites page.
Third, I've seen several requests for clustering in Findory similar to the clustering seen in Google News or Memeorandum. I added a first step toward exposing some of our clustering data in a new feature called Findory Similar Articles. You can see it from the article landing page you see when you click on any article from a Findory RSS feed or from Findory Inline.
For example, on Findory's article page for my earlier post, "Google and The Happy Searcher", there is a link to the cloud of similar articles.
Fourth, for no particularly good reason, I added a feature called Findory Tags that automatically extracts common keywords used by a news source or weblog. For example, see the tag cloud for this weblog or for Bruce Schneier's security weblog, Schneier on Security.
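I won't go into the details of the real implementation, but the general flavor of this kind of feature is picking terms that are frequent in one source and rare everywhere else. Here is a rough tf-idf toy sketch, my own illustration and not Findory's actual code:

```python
import math
from collections import Counter

def extract_tags(source_posts, background_posts, k=5):
    """Pick characteristic keywords for one source: terms frequent in
    that source but rare in a background corpus (a rough tf-idf)."""
    source = Counter(w for post in source_posts for w in post.lower().split())
    # Document frequency of each word across the background corpus.
    df = Counter()
    for post in background_posts:
        df.update(set(post.lower().split()))
    n = len(background_posts)

    def score(word):
        # Frequent in the source, rare in the background => high score.
        return source[word] * math.log((n + 1) / (df[word] + 1))

    return [w for w, _ in sorted(source.items(),
                                 key=lambda kv: score(kv[0]),
                                 reverse=True)[:k]]

blog = ["schneier on security and cryptography",
        "more security analysis of the new cryptography standard"]
web = ["the weather is nice", "new restaurant on the corner",
       "the new movie is out"]
print(extract_tags(blog, web, k=3))  # security and cryptography rank high
```

The background corpus is what keeps common words like "the" and "new" out of the tag cloud.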
A link to both Findory Neighbors (which shows similar sources) and Findory Tags is on every source page. If your blog is in Findory's database, you can see your neighbors and tags. If you aren't listed yet and want to be, please enter a request to add your blog.
Hope you enjoy the new features!
Sunday, February 19, 2006
Blogger and database inconsistency
Blogger recently had trouble with some of the databases that support the service.
Apparently, some of the boxes in their database cluster had the right data and some did not. This is a relatively common problem in a database cluster. It can happen if replication fails because the network gets partitioned or whatever.
The normal solution is to backfill the databases that missed updates, resolve any conflicts, and go on your merry way, all without impacting users.
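For the curious, here is a toy sketch of that kind of repair using last-writer-wins by version number. Real replication protocols are far more subtle; this is just to illustrate that reconciliation can happen without involving users:

```python
def backfill(replicas):
    """Bring all replicas to a consistent state without user involvement.

    Each replica maps key -> (version, value). The repair rule here is
    last-writer-wins by version number -- a deliberately simple stand-in
    for real conflict resolution.
    """
    merged = {}
    for replica in replicas:
        for key, (version, value) in replica.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    # Every replica is backfilled with the winning version of each key.
    for replica in replicas:
        replica.clear()
        replica.update(merged)
    return merged

a = {"post1": (2, "edited text")}                       # saw the latest write
b = {"post1": (1, "old text"), "post2": (1, "hello")}   # missed an update
backfill([a, b])
print(a == b, a["post1"])  # True (2, 'edited text')
```

The point is not the merge rule, which is trivial here; it is that the system, not the user, owns the job of making replicas agree.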
Instead, Google is asking their users to correct their database for them. My jaw dropped. From the post:
I'm very sorry to say that, if your blog was on this database, posts and template changes made in the last 18 hours or so were not saved. They may appear on your blog now, but will disappear if you republish. If you made a post between Friday afternoon and now, we suggest that you look at your list of posts ("Posting" tab, "Edit posts" sub-tab) and compare it with what is published on your blog. If posts are missing, copy them from your blog pages before you republish.
Whaaaa? Google, king of the cluster, you want your users to cut-and-paste entries from their blog to fix your problems with database consistency?
In general, Google seems to have what borders on disdain for modern databases. They seem to see rigid file system and database consistency as the paranoid ramblings of the anal retentive, preferring the very loose consistency guarantees offered by their core systems such as Google File System and BigTable.
I think consistency does matter. Databases shouldn't lose data. Problems with the database shouldn't be seen by users. Databases should do what they're supposed to do, store data and give it back again when you want it.
For now, the problem is in Blogger. No financial transactions are involved. But what happens when Google expands their payment system (GBuy) or moves into e-commerce? If Google doesn't start caring about data consistency, this problem will bite them again and again.
See also my previous post, "Lowered uptime expectations?"
Friday, February 17, 2006
Google and The Happy Searcher
I recently happened across a paper by four Googlers called "The Happy Searcher: Challenges in Web Information Retrieval" (PDF).
It includes discussions of some of the challenges in web, image, audio, and video search, dealing with spammers, evaluating relevance rank quality, and search, reputation, and recommendations in Google Groups.
It's a great read and an easy, light read. Even if you don't normally read technical papers, this one is definitely worth a peek.
Here are some selected extended excerpts:
One particularly intriguing problem in web IR arises from the attempt by some commercial interests to unduly heighten the ranking of their web pages by engaging in various forms of spamming .... Classification schemes must work in an adversarial context as spammers will continually seek ways of thwarting automatic filters. Adversarial classification is an area in which precious little work has been done.
Proper evaluation of [relevance rank] improvements is a non-trivial task ... Recent efforts in this area have examined interleaving the results of two different ranking schemes and using statistical tests based on the results users clicked on to determine which ranking scheme is "better". There has also been work along the lines of using decision theoretic analysis (i.e., maximizing users' utility when searching, considering the relevance of the results found as well as the time taken to find those results) as a means for determining the "goodness" of a ranking scheme.
At the article or posting level, one can similarly rank not just by content relevance, but also take into account aspects of articles that [are] not normally associated with web pages, such as temporal information (when a posting was made), thread information, the author of the article, whether the article quotes another post, whether the proportion of quoted content is much more than the proportion of original content, etc .... Furthermore, one can also attempt to compute the inherent quality or credibility level of an author independent of the query.
Content Similarity Assessment ... [attempts] to find images (audio tracks) that are similar to the query items. For example, the user may provide an image (audio snippet) of what the types of results that they are interested in finding, and based on low-level similarity measures, such as (spatial) color histograms, audio frequency histograms, etc, similar objects are returned. Systems such as these have often been used to find images of sunsets, blue skies, etc. and have also been applied to the task of finding similar music genres.
The Google spelling corrector takes a Machine Learning approach that leverages an enormous volume of text to build a very fine grained probabilistic context sensitive model for spelling correction ... By employing a context sensitive model, the system will correct the text "Mehran Salhami" to "Mehran Sahami" even though "Salami" is a common English word and is the same edit distance from "Salhami" as "Sahami." Such fine grained context sensitivity can only be achieved through analyzing very large quantities of text.
Web information retrieval presents a wonderfully rich and varied set of problems where AI techniques can make critical advances ... We hope to stimulate still more research in this area that will make use of the vast amount of information on the web.
If you like this paper, there's another older paper that might also be of interest, "Web Information Retrieval - an Algorithmic Perspective" by Monika Henzinger from Google. It is a nice, short, light overview of the components of a web search engine and some of the challenges of web search.
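The spelling correction excerpt is fun to caricature in code. Here is a toy context-sensitive corrector that picks among candidate corrections using bigram counts with the preceding word; all counts and candidates below are invented for illustration, nothing like the scale of Google's actual model:

```python
# Toy context-sensitive spelling correction: choose among candidates
# using bigram counts from a (tiny, invented) corpus rather than raw
# word frequency alone.
BIGRAM_COUNTS = {
    ("mehran", "sahami"): 50,   # the name, common in this "corpus"
    ("sliced", "salami"): 40,   # the sausage, common in food contexts
}

UNIGRAM_COUNTS = {"sahami": 60, "salami": 400}

def correct(previous_word, word, candidates):
    """Pick the candidate maximizing the bigram count with the previous
    word, falling back to unigram frequency on a tie.

    'word' is the misspelling; candidate generation (e.g., all strings
    within one edit of 'word') is omitted here for brevity.
    """
    def score(cand):
        return (BIGRAM_COUNTS.get((previous_word, cand), 0),
                UNIGRAM_COUNTS.get(cand, 0))
    return max(candidates, key=score)

# "Salhami" is one edit from both "sahami" and "salami"; context decides.
print(correct("mehran", "salhami", ["sahami", "salami"]))  # sahami
print(correct("sliced", "salhami", ["sahami", "salami"]))  # salami
```

Even this cartoon shows why huge text volumes matter: the bigram table only helps if it actually contains the contexts you encounter.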
Wednesday, February 15, 2006
Office Live is not Office Live?
Richard MacManus reports that Microsoft launched Office Live today.
The first thing that struck me about Office Live is the comment from Jupiter Research analyst Joe Wilcox: "Office Live absolutely is not a hosted version of Microsoft Office."
Sure enough, looking at the About Office Live page reveals that Office Live is little more than some tools for creating a hosted website for a business, on first glance quite similar to Yahoo Small Business and other existing hosting products.
Why is it called Office Live then? Taking a strong brand name, Microsoft Office, and adding a modifier to it, the word Live, would lead most people to conclude that Office Live should be some nifty "Live" version of MS Office. It is not.
This is more branding foolishness from Microsoft. In their attempt to hype Windows Live, they have tossed the label on everything. In the process, they are diluting the brand of MS Office and causing customer confusion.
See also my previous post, "Is it Live or MSN?"
Update: Robert Scoble is confused too: "Office Live isn't what you think it is. It wasn't what I thought it was ... Damn the marketers who are extending the Office brand."
Update: Don't miss this great post by David Hunter, "Microsoft relaunches bCentral, calls it Office Live".
Update: Matthew Ingram asks, "Why cheapen a potentially hot brand idea like Office Live by pasting it on something that looks like a bag of warmed-over, also-ran features?"
Tuesday, February 14, 2006
iTunes music recommendations
Alyce Lomax at the Motley Fool notes that Apple's iTunes Music Store is now making music recommendations based on past purchases.
The iTunes Music Store has a beta feature called Just For You, which suggested albums I might be interested in based on past iTunes purchases.
Alyce goes on to say the feature is comparable to Amazon.com's recommendations and that "accurate recommendations add much more than convenience to Internet-based shopping."
Was it accurate? You bet. I bought a few albums, EPs, and singles.
When I fired up iTunes and took a look, my recommendations seemed reasonable. In my case, the recommendations overemphasized some older purchases that no longer reflect my tastes in music, but that's probably a minor issue that they'll iron out soon enough.
I'm pleased that Apple is doing this. More than once, I had enough trouble finding things to buy on iTunes that I've hopped over to Amazon to look at their similarities and recommendations, then returned to iTunes to buy. Of course, most people wouldn't bother doing that.
It's about time Apple added more personalization and discovery features to the iTunes Music Store. Let's hope there's more to come.
[Found on Findory]
Update: The WSJ reports that Amazon.com soon will offer iTunes-like digital music downloads. Maybe I won't have to wait for Apple. [via TechDirt]
Update: Kate Moser at the CSM just wrote an interesting article on music recommendations. It includes a prediction that "taste-sharing applications" will drive 25% of online music sales by 2010.
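Recommending from purchase history is the same problem Amazon solved with item-to-item collaborative filtering. As a rough illustration only (the album names, the plain co-purchase counting, and the scoring are all made up here; Apple has not published how Just For You works), it can be sketched like this:

```python
from collections import defaultdict
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each pair of albums appears in the same purchase history."""
    counts = defaultdict(int)
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(owned, counts, top_n=3):
    """Score unowned albums by how often they co-occur with albums already owned."""
    scores = defaultdict(int)
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
        elif b in owned and a not in owned:
            scores[a] += n
    return [album for album, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

# Toy purchase histories, purely for illustration
baskets = [
    ["Kid A", "OK Computer"],
    ["Kid A", "OK Computer", "In Rainbows"],
    ["OK Computer", "In Rainbows"],
]
counts = co_purchase_counts(baskets)
picks = recommend({"Kid A"}, counts)  # ["OK Computer", "In Rainbows"]
```

Real systems normalize these counts for popularity and scale to millions of items, but the core "people who bought X also bought Y" idea is this simple.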
Monday, February 13, 2006
Oracle to buy Sleepycat?
According to BusinessWeek, Oracle may be close to buying Sleepycat Software, maker of the excellent BerkeleyDB database.
That would be sad. I'd hate to see one of my favorite little companies disappear into the distended folds of the Oracle behemoth.
[via Matt Marshall]
Update: Sigh. The rumor is true. Oracle bought Sleepycat Software.
Friday, February 10, 2006
Ratings in Google Groups
Reto Meier noticed that Google now is allowing users to rate posts in Google Groups.
At this point, it isn't clear for what purpose the ratings will be used. In the long-term, I could imagine that they might allow them to filter out spam and crap, filter in the best posts (like Slashdot), or even recommend articles based on what you tend to like.
If you're interested in this kind of thing, there is some fun previous work here. Don't miss the old U of Minnesota paper "GroupLens: Applying Collaborative Filtering to Usenet News". And I really enjoyed the paper "Slash(dot) and Burn" by Cliff Lampe and Paul Resnick.
When looking at this, I also noticed that the Google Groups home page now makes group recommendations, a list of "Suggested groups". I'm not sure what information Google uses for this, but I was amused and a little taken aback to see that my top recommendation was alt.support.cancer.prostate. Does Google know something I don't know?
[Found via Philipp Lenssen]
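Slashdot-style rating filters are simple to sketch. Purely as an illustration (Google has not said how, or whether, it will use these ratings), here is a threshold filter over per-post scores, where each reader picks the minimum average rating they want to see:

```python
def visible_posts(ratings, threshold):
    """Slashdot-style browsing: show only posts whose average rating meets
    the reader's threshold; ratings maps post_id -> list of scores."""
    return sorted(
        post for post, scores in ratings.items()
        if scores and sum(scores) / len(scores) >= threshold
    )

# A reader browsing at threshold 3 skips the poorly rated post
shown = visible_posts({"p1": [5, 4], "p2": [1, 2], "p3": [3]}, 3)
```

The GroupLens work went further, predicting a personalized score per reader rather than using one global average, but even this crude filter is enough to bury most spam.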
Thursday, February 09, 2006
Web 2.0 and garbage in, garbage out
In a review of Zillow, Nicholas Carr makes some great points about services that republish bad data:
Entrepreneurs are launching all sorts of sites and services that are built on data that they're siphoning out of third-party sites and databases. Sometimes, the secondhand data is good; sometimes, it's not ... Unfortunately, to the user, the inaccuracies are invisible.
It's not enough to mash up some data streams, remix them, throw some pretty AJAX on top, and spew it out on the web. For a product to be useful, the data has to be clean, correct, and reliable.
There's one line of thinking about the unreliability of web information that says, essentially, "get used to it." People are just going to have to become more sophisticated consumers of information.
That's nice in theory, but it doesn't wash in the real world. It's like selling wormy apples and telling customers that they're just going to have to become more sophisticated eaters of apples. Fruit buyers don't like worms, and information seekers don't like bad facts.
To take an example from Findory, people might think that crawling RSS feeds is easy, but I've been amazed by the crap that people throw into their feeds. I see entire HTML web pages thrown in to the description section. I see Javascript. I see all kinds of turds. The data is dirty, dirty. Most of the code in Findory's crawl is devoted to cleaning the data.
Dirty data is useless to users. If you're not going to put the effort in to make sure your data is good, people aren't going to put the effort in to use your website.
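The cleaning work is unglamorous but unavoidable. As a tiny sketch of the kind of scrubbing a feed crawler ends up doing (this is illustrative, not Findory's actual code), here is a pass that strips tags and drops anything inside script or style blocks from a feed's description field, using Python's standard-library HTML parser:

```python
from html.parser import HTMLParser

class FeedScrubber(HTMLParser):
    """Collect visible text, dropping tags and anything inside <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def clean_description(raw):
    """Strip markup and collapse whitespace in a feed description."""
    scrubber = FeedScrubber()
    scrubber.feed(raw)
    return " ".join(" ".join(scrubber.parts).split())

dirty = '<div><script>alert("spam")</script><p>Real   article\n summary.</p></div>'
print(clean_description(dirty))  # Real article summary.
```

A production crawler also has to cope with broken encodings, unclosed tags, and feeds that lie about their content type, which is where most of the code ends up going.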
Motivating switching from Google
Mike at TechDirt posts some good thoughts on Yahoo, MSN, and Amazon paying users or thinking about paying users to switch from Google:
With all three of these companies talking about bribery as a method to steal users away from Google, it becomes increasingly clear just how "sticky" ... Google really has become.
As Mike points out, one way to get people to switch is to be obviously better. That's what Google did to AltaVista to steal the crown.
Getting people to switch is not easy. It can't just be about catching up, or being marginally better -- but about being so overwhelmingly better that people can't afford not to switch.
But there are other ways. In Microsoft's case, being the default search engine in the default browser in the default operating system means that, to use Google, people already need to make the effort to switch to Google. The trick might be just not to suck so bad that people make that switch when they get a new computer. Just being good enough could work here.
I think Yahoo has it harder, but, like Microsoft, Yahoo does have a group of people who use Yahoo every day for services like Yahoo Mail, Yahoo Finance, Yahoo News, and My Yahoo. These people are already on Yahoo, but are leaving to do their searches. Why? If you're on Yahoo, shouldn't it be easier to use Yahoo Search than to go to Google? Right now, it doesn't seem easier to me, but it should be.
With Amazon, Ask Jeeves, and the other smaller players, life is more difficult. These folks are unlikely to generate a better relevance rank than Google -- not unless Google trips and stops innovating like AltaVista did -- so a frontal assault is not a good option. Given that, if they look anything like Google, there's no reason to use them. They need to be different. I think they should go after something -- question answering, personalization, social search, verticals -- that makes them obviously and clearly different from Google. It's the only way to get noticed.
But, bribing people to use their search engine? C'mon, that's just lazy, short-term thinking. Bribes are no way to inspire real loyalty. They would be much better off spending those dollars on making their product worth using.
[Found on Findory]
CEO and CTO both left A9
After John Battelle broke the story, A9 CEO Udi Manber's departure to Google is being widely reported in the news media. A9 CTO Ruben Ortega, a friend and colleague from Amazon, also recently left the company.
When I asked him about his and Udi's departure, Ruben sent me this statement:
I was CTO from inception to September 2005.
I returned to Seattle and Amazon.com because, after 20 months of commuting weekly between Seattle and Palo Alto, I felt I was missing too much of my children's daily lives.
It was a great job and I highly recommend being a CTO of a small growing company if you get a chance :)
I was quite surprised by the news of Udi leaving. I knew we were doing executive level recruiting, but I was under the impression we were looking for my replacement, not Udi's.
Tuesday, February 07, 2006
Don Dodge interviews Findory
Don Dodge posted an interview with me about Findory.
Don was Director of Engineering at AltaVista. He is now Director of Business Development in Microsoft's Emerging Business team.
Don's weblog is a good read. Don't miss this old post where he talks about AltaVista and Google.
Udi Manber leaves A9
John Battelle reports that A9 CEO Udi Manber is leaving the company to join... yes, wait for it... Google.
A9 is an Amazon-owned company offering web search. When A9 launched, I was expecting them to go after personalized search, but they instead pursued a form of distributed metasearch, combining search results from many search engines onto one page.
A9 does have some clever and unusual features, including A9 Maps, which can show you pictures of the storefronts on each side of a street for some locations and some cities.
Amazon offers discounts at Amazon.com for using A9, but, despite giving away free money, traffic growth on A9.com has been weak since launch.
Update: Fourteen months later, Udi Manber is now a VP at Google "responsible for core search" while A9 suffers an 80%+ drop in traffic.
Is it Live or MSN?
Richard MacManus touches on some of the brand confusion over whether Microsoft web properties should carry the MSN brand or the new Windows Live brand:
Microsoft intends to re-brand MSN as 'MSN Media Network'.
I think there is quite a bit of brand confusion here.
The name 'MSN Media Network' seems a little odd to me. Microsoft has re-branded a lot of other things with the "Live" banner, so I'd imagine something like 'Live Media Network' (LMN) would make more sense.
Perhaps I'm underestimating the value of the MSN brand.
With Microsoft slapping the Live label on everything and its mother and promoting the Windows Live brand as the future of Microsoft's web effort, I'm not sure what happens to the existing MSN properties and well-established MSN brand.
Will MSN Search become Windows Live Search? Will MSN.com redirect to Live.com? If not, will Microsoft try to maintain two brands, Windows Live and MSN? Where is the dividing line? What is the difference? Will users understand that difference?
Back in December 2005, I rashly predicted that "Microsoft will abandon Windows Live." After a bit of a ruckus about that, I elaborated by saying that there is "too much confusion between live.com and msn.com" and that "the MSN brand is too valuable to be diluted with an expensive effort to build up a new Windows Live brand."
Perhaps I am overestimating the value of the MSN brand. Perhaps, at the end of the day, it will be Windows Live that is left standing.
Either way, there can be only one. Few outside of the digerati know about Windows Live right now but, when Microsoft tries to promote this to the mainstream, the brand confusion is going to be severe. Something will have to be done.
Update: Richard MacManus posts a nice followup article, "Microsoft's brand confusion - MSN or Live?"
Update: Another followup from Richard MacManus, "Microsoft admits brand confusion between MSN and Live".
Monday, February 06, 2006
GBuy vs. PayPal
Mylene Mangalindan at the Wall Street Journal reports that Google is working on a payment system called GBuy that will allow customers to buy from merchants through Google:
For the last nine months, Google has recruited online retailers to test GBuy, according to one person briefed on the service. GBuy will feature an icon posted alongside the paid-search ads of merchants, which Google hopes will tempt consumers to click on the ads, says this person. GBuy will also let consumers store their credit-card information on Google.
There was a rumor seven months ago that Google was developing a payment system called Google Wallet. Since then, Google Wallet appears to have been used for payment internally by Google for Google Video and other products. Now, it appears that Google Wallet may be about to drive a new external fixed-price marketplace, GBuy.
This seems to be part of an ongoing trend where Google gets further and further into eBay's business. A year ago, Bambi Francisco at CBS Marketwatch reported:
Will search advertising ultimately become a better mousetrap for sellers? Or to what extent will search advertising take away the potential dollars that were once expected to flow onto eBay's marketplace?
Already, many small merchants sell directly using Google AdWords to direct traffic to their site. But closing the transaction has required setting up a website and payment system, and customers often don't like entering their credit card on small merchant websites.
As one eBay seller said to me via e-mail: "Most sellers are like me and didn't really think there was life outside of eBay, but we are getting educated fast."
Another seller said he's decided to put up his own Web site for $19.95 per month and buy keywords on Google and Yahoo's Overture.
"Products on Google are not under as much pressure as eBay so you can typically get 5 to 10 percent more for your products on Google."
Now, with GBuy, Google may be able to cleanly handle the entire transaction, no fuss, no effort.
[via Inside Google]
Update: A month later, Barry Schwartz is part of a closed beta test of Google Payments and says, "The best way to describe Google Payments is calling it a PayPal alternative."
Sunday, February 05, 2006
Amazon version of AdSense?
Chris Beasley claims Amazon.com soon will test their own advertising network to compete with Google's AdSense.
See also my earlier post, "Kill Google, Vol. 2", where I talk about "going after Google's lifeblood, advertising."
See also my earlier post, "Google wants to change advertising", where I talk about a future of "personalized advertising" that is "helpful and relevant." Amazon.com, with their expertise in personalization, is well positioned here.
[Found on Findory]
Saturday, February 04, 2006
Lowered uptime expectations?
Recently, a lot of popular websites seem to be having long planned and unplanned downtimes.
Google's Blogger has been taking many planned and unplanned outages over the last few weeks. Yahoo My Web just announced they're going down "for a few hours." Salesforce.com just had multiple outages, including one that lasted almost a day. Gap decided to close Gap.com, OldNavy.com, and BananaRepublic.com for over two weeks. Bloglines took an outage instead of trying to switch to a new data center without any interruption of service. Technorati did something similar a year ago, but took a longer weekend outage for the move. And there are many, many other examples.
Back in the late 1990s at Amazon, I remember we used to think of any downtime as unacceptable. Code was always written to do a smooth migration, old and new boxes interoperating to keep operations seamless. If downtime was taken, it was very early in the morning and short, minimizing the impact to a spattering of insomniacs and international users.
Lately, this view seems downright quaint. Sites are taken down casually for long periods of time. The Gap.com example seems particularly egregious.
Perhaps this is part of a general decline in quality of service. When you outsource customer service to someone who cares even less than you do, when you treat customers with neglect that borders on hostility, perhaps taking downtime is just part of the package.
It is true that customers seem to have grown to accept these outages as the norm. But maybe we should demand more from web companies.
See also Om Malik's post, "The Web 2.0 hit by outages".
See also Nicholas Carr's post, "Salesforce.com's hiccups".
See also my earlier post, "The folly of ignoring scaling".
Friday, February 03, 2006
Early Amazon: Pagers, pagers
I loved a lot of things about Amazon, but one thing I absolutely hated was my pager.
After the website split, I somehow became mistaken for an expert on the website. In that lofty role, I had to carry a pager.
Any time the website or tools related to the website had a problem, that pager yelled. Like a crying baby, the cause was not always immediately obvious, sometimes requiring hours of investigation. The screams of the Amazon website interrupted sleep, life, and happiness.
It felt like slow torture.
I began to hate my pager. I dreamed of hurling it into the ocean, smashing it under heavy objects, burying it deep never to be found. It became personal. This annoying, buzzing thing seemed determined to steal my remaining fragments of sanity.
I must not have been the only one to feel this way. A good friend of mine accidentally dropped his pager into Lake Washington. It sleeps with the fishes now. His delight at that thought makes me question whether it was no accident.
That pager, that irritating little demon, remained strapped to my waist for years, but I finally shed it in 2001. Ah, pagerless. Life was good.
With Findory, the pager had to return. But, I have finally learned the lesson. If the website never has a problem, the pager never goes off. And that's the way I like it.
Thursday, February 02, 2006
Early Amazon: Splitting the website
Now that commodity Linux servers are so commonplace, it is easy to forget that, during the mid and late 1990s, there was a lively debate about whether sites should scale up using massive, mainframe-like servers (Sun's Starfire being a good example) or scale out using cheap hardware (e.g. Intel desktop PCs running Linux).
When I joined Amazon in early 1997, there was one massive database and one massive webserver. Big iron.
As Amazon grew, this situation became less and less desirable. Not only was it costly to scale up big iron, but we didn't like having a beefy box as a single point of failure. Amazon needed to move to a web cluster.
Just a few months after I started at Amazon, I was part of a team of two responsible for the website split. I was working with an extremely talented developer named Bob.
Bob was a madman. There was nothing that seemed to stop him. Problem on the website no one else could debug? He'd attach to a live obidos process and figure it out in seconds. Database inexplicably hanging? He'd start running process traces on the live Oracle database and uncover bugs in Oracle's code. Crazy. He seemed to know no limits, nothing that he wouldn't attack and figure out. Working with him was inspirational.
Bob and I were going to take Amazon from one webserver to multiple webservers. This was more difficult than it sounds. There were dependencies in the tools and in obidos that assumed one webserver. Some systems even directly accessed the data stored on the webserver. In fact, there were so many dependencies in the code base that, just to get this done in any reasonable amount of time, it was necessary to maintain backward compatibility as much as possible.
We designed a rough architecture for the system. There would be two staging servers, development and master, and then a fleet of online webservers. The staging servers were largely designed for backward compatibility. Developers would share data with development when creating new website features. Customer service, QA, and tools would share data with master. This had the added advantage of making master a last wall of defense where new code and data would be tested before it hit online.
Read-only data would be pushed out through this pipeline. Logs would be pulled off the online servers. For backward compatibility with log processing tools, logs would be merged so they looked like they came from one webserver and then put on a fileserver.
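The log-merging step can be sketched in a few lines. Assuming each line begins with a sortable timestamp (which was not the only format in play, so treat this as illustrative), merging the already-sorted per-server logs is a standard k-way merge:

```python
import heapq

def merge_logs(per_server_logs):
    """Merge already-sorted per-server log lines into one stream, ordered by a
    leading sortable timestamp, so downstream tools see what looks like a
    single webserver's log."""
    return list(heapq.merge(*per_server_logs))

# Toy logs from two webservers, each already in timestamp order
web1 = ["1997-11-02 00:01 GET /", "1997-11-02 00:05 GET /book/123"]
web2 = ["1997-11-02 00:02 GET /cart", "1997-11-02 00:04 GET /"]
merged = merge_logs([web1, web2])
```

`heapq.merge` is lazy and never holds more than one line per server in memory, which matters when the logs are gigabytes.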
Stepping out for a second, this is a point where we really would have liked to have a robust, clustered, replicated, distributed file system. That would have been perfect for read-only data used by the webservers.
NFS isn't even close to this. It isn't clustered or replicated. It freezes all clients when the NFS server goes down. Ugly. Options that are closer to what we wanted, like Coda, were (and still are) in the research stage.
Without a reliable distributed file system, we were down to manually giving each webserver a local copy of the read-only data. Again, existing tools failed us. We wanted a system that was extremely fast and would do versioning and rollback. Tools like rdist were not sufficient.
So we wrote it ourselves. Under enormous time pressure. The current big iron was melting under "get big fast" load, a situation that was about to get much worse as Christmas approached. We needed this done and done yesterday.
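The core of a versioned push with rollback is small. As a minimal sketch of the idea (this is not the tool we wrote, and it ignores the hard parts like partial failures and fleet-wide coordination): copy each data set into a timestamped directory, then atomically flip a `current` symlink; rollback is just repointing the symlink at an older version.

```python
import os
import shutil
import time

def push(src_dir, dest_root):
    """Copy a read-only data set into a new versioned directory, then
    atomically flip a 'current' symlink to point at it."""
    version = time.strftime("%Y%m%d%H%M%S")
    target = os.path.join(dest_root, version)
    shutil.copytree(src_dir, target)
    _point_current(dest_root, version)
    return version

def rollback(dest_root, version):
    """Repoint 'current' at an older, still-on-disk version."""
    _point_current(dest_root, version)

def _point_current(dest_root, version):
    # Build the new link off to the side, then rename over the old one;
    # rename(2) is atomic, so readers never see a missing 'current'.
    tmp_link = os.path.join(dest_root, "current.tmp")
    os.symlink(version, tmp_link)
    os.replace(tmp_link, os.path.join(dest_root, "current"))
```

The symlink flip means a webserver mid-request keeps reading a consistent version, and a bad push can be backed out in seconds without recopying anything.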
We got it done in time. We got Amazon running on a fleet of four webservers. It all worked. Tools continued to function. Developers even got personal websites, courtesy of Bob, that made testing and debugging much easier than before.
Well, it sort of just worked. I am embarrassed to admit that some parts of the system started to break down more quickly than I had hoped. My poor sysadmin knowledge bit me badly as the push and pull tools I wrote failed in ways a more seasoned geek would have caught. Worse, the system was not nearly robust enough against unexpected outages in the network and machines, partially because we were not able to integrate the system with the load balancer (an early version of Cisco LocalDirector) that we were using to take machines automatically in and out of service.
But it did work. Amazon never would have been able to handle 1997 holiday traffic without what we did. I am proud to have been a part of it.
Of course, all of this is obsolete. Amazon switched to Linux (as did others), substantially increasing the number of webservers, and eventually Amazon switched to a deep services-based architecture. Needs changed, and the current Amazon web cluster bares little resemblance to what I just described.
But, when building Findory's webserver cluster, I again found myself wanting a reliable, clustered, distributed file system, and again found the options lacking. I again wanted tools for fast replication of files with versioning and rollback, and again found those missing. As I looked to solve those problems, the feeling of deja vu was unshakeable.
When I joined Amazon in early 1997, there was one massive database and one massive webserver. Big iron.
As Amazon grew, this situation became less and less desirable. Not only was it costly to scale up big iron, but we didn't like having a beefy box as a single point of failure. Amazon needed to move to a web cluster.
Just a few months after I started at Amazon, I was part of a team of two responsible for the website split. I was working with an extremely talented developer named Bob.
Bob was a madman. There was nothing that seemed to stop him. Problem on the website no one else could debug? He'd attach to a live obidos process and figure it out in seconds. Database inexplicably hanging? He'd start running process traces on the live Oracle database and uncover bugs in Oracle's code. Crazy. He seemed to know no limits, nothing that he wouldn't attack and figure out. Working with him was inspirational.
Bob and I were going to take Amazon from one webserver to multiple webservers. This was more difficult than it sounds. There were dependencies in the tools and in obidos that assumed one webserver. Some systems even directly accessed the data stored on the webserver. In fact, there were so many dependencies in the code base that, just to get this done in any reasonable amount of time, it was necessary to maintain backward compatibility as much as possible.
We designed a rough architecture for the system. There would be two staging servers, development and master, and then a fleet of online webservers. The staging servers were largely designed for backward compatibility. Developers would share data with development when creating new website features. Customer service, QA, and tools would share data with master. This had the added advantage of making master a last wall of defense where new code and data would be tested before it hit online.
Read-only data would be pushed out through this pipeline. Logs would be pulled off the online servers. For backward compatibility with log processing tools, logs would be merged so they looked like they came from one webserver and then put on a fileserver.
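That log merge can be sketched as a k-way merge: each per-server file is already in time order, so interleaving the files by timestamp yields one stream that looks like it came from a single webserver. This is a reconstruction of the general technique, not the tool we actually built, and the timestamp format is an assumption:

```python
import heapq

def merge_logs(paths):
    """Merge per-server log files into one time-ordered stream.

    Assumes each file is already sorted by time (true for a log
    appended by a single server) and that each line starts with a
    lexically sortable timestamp such as ISO-8601. A real access-log
    format would need a timestamp-parsing key here instead.
    """
    files = [open(p) for p in paths]
    try:
        # heapq.merge lazily interleaves already-sorted iterables,
        # so this never holds more than one line per file in memory.
        for line in heapq.merge(*files):
            yield line
    finally:
        for f in files:
            f.close()
```

Downstream log-processing tools can then consume the merged stream exactly as if it had come from one machine.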
Stepping out for a second, this is a point where we really would have liked to have a robust, clustered, replicated, distributed file system. That would have been perfect for read-only data used by the webservers.
NFS isn't even close to this. It isn't clustered or replicated. It freezes all clients when the NFS server goes down. Ugly. Options that are closer to what we wanted, like Coda, were (and still are) in the research stage.
Without a reliable distributed file system, we were down to manually giving each webserver a local copy of the read-only data. Again, existing tools failed us. We wanted a system that was extremely fast and would do versioning and rollback. Tools like rdist were not sufficient.
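The versioning and rollback we wanted can be sketched with what has since become a standard pattern: push each dataset into its own versioned directory and flip a `current` symlink to activate it, so readers never see a half-copied version and rollback is just repointing the link. A minimal sketch of that pattern, not our actual tool:

```python
import os
import shutil

def push(src, dest_root, version):
    """Install a new version of read-only data under dest_root.

    Data is copied into a versioned directory first, and only then
    made live by flipping the 'current' symlink, so readers always
    see a complete version.
    """
    vdir = os.path.join(dest_root, 'v%s' % version)
    shutil.copytree(src, vdir)
    activate(dest_root, version)

def activate(dest_root, version):
    """Point the 'current' symlink at the given version.

    The symlink is created under a temporary name and then renamed
    into place; on POSIX, rename atomically replaces the old link.
    Rollback is just activate() with an older version number.
    """
    tmp = os.path.join(dest_root, '.current.tmp')
    current = os.path.join(dest_root, 'current')
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink('v%s' % version, tmp)
    os.rename(tmp, current)
```

Because webservers only ever read through `current`, a failed copy leaves the previous version live, and rolling back is a single `activate()` call.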
So we wrote it ourselves. Under enormous time pressure. The current big iron was melting under "get big fast" load, a situation that was about to get much worse as Christmas approached. We needed this done and done yesterday.
We got it done in time. We got Amazon running on a fleet of four webservers. It all worked. Tools continued to function. Developers even got personal websites, courtesy of Bob, that made testing and debugging much easier than before.
Well, it sort of just worked. I am embarrassed to admit that some parts of the system started to break down more quickly than I had hoped. My poor sysadmin knowledge bit me badly as the push and pull tools I wrote failed in ways a more seasoned geek would have caught. Worse, the system was not nearly robust enough to handle unexpected outages in the network and machines, partially because we were not able to integrate the system with the load balancer (an early version of Cisco Local Director) that we were using to take machines automatically in and out of service.
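What we were missing, in modern terms, was a health check the load balancer could act on: take a machine out of rotation when its data push failed or it stopped answering. A minimal sketch of that idea; the specific checks are assumptions, and the Local Director of that era could only probe basic connectivity:

```python
import os
import socket

def healthy(data_root, required_files, port=80):
    """Return True if this webserver should stay in rotation.

    Checks that the pushed read-only data is complete and that the
    webserver is actually accepting connections. A load balancer (or
    a script driving one) would poll this and pull failing machines
    out of service automatically.
    """
    for name in required_files:
        if not os.path.exists(os.path.join(data_root, name)):
            return False
    try:
        with socket.create_connection(('127.0.0.1', port), timeout=1):
            return True
    except OSError:
        return False
```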
But it did work. Amazon never would have been able to handle 1997 holiday traffic without what we did. I am proud to have been a part of it.
Of course, all of this is obsolete. Amazon switched to Linux (as did others), substantially increasing the number of webservers, and eventually Amazon switched to a deep services-based architecture. Needs changed, and the current Amazon web cluster bears little resemblance to what I just described.
But, when building Findory's webserver cluster, I again found myself wanting a reliable, clustered, distributed file system, and again found the options lacking. I again wanted tools for fast replication of files with versioning and rollback, and again found those missing. As I looked to solve those problems, the feeling of deja vu was unshakeable.
The future of Yahoo News
Gavin O'Malley reports on some announcements by Yahoo News GM Neil Budde on the future of Yahoo News.
When discussing several planned upgrades to Yahoo News, Neil said that Yahoo News "will eventually have access to personalized news pages, on which stories are updated based on the viewer's recent viewing habits." It sounds like Yahoo will be doing implicit personalization of news based on reader behavior in addition to the explicit customization that is already part of My Yahoo.
Neil also said "community edited" news would be "the next wave" for Yahoo News.
Wednesday, February 01, 2006
Relevance rank and broad queries
I was playing with a few broad search queries recently and the results surprised me.
I was expecting any query that returns most of the Web (such as [the]) to yield links to the sites with the highest PageRank. Such an indiscriminate query would seem to provide very little basis for doing much else.
So, I would expect Google search results for the most popular English words ([the], [of], [to], [and]) to be the same, as well as the search results for [* *] (any two words separated by a space), [the * the] ("the" followed by some words followed by "the"), and [1..1000] (any number between 1 and 1000).
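For reference, PageRank is just the stationary distribution of a random surfer over the link graph, computable by power iteration. A toy sketch, on a made-up three-page graph:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each page to the list of pages it links to. With
    probability `damping` the surfer follows a random outlink;
    otherwise she jumps to a random page.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank over everyone.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

On an indiscriminate query, the intuition goes, this static, query-independent score would be almost the only signal left to order results by.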
As you can see by clicking on those links, all of those queries return most of the Web, between 5B pages (for [the * the]) and 20B pages (for [and] and [to]). But they return very different results.
The top result for [the] is "The Onion". The top result for [of] is "The Library of Congress". For [to], "Welcome to the White House". For [and], NASA.
The results for [1..1000] seem to be the closest to what I was expecting. They show links to Netscape, Mozilla, Microsoft, IE, Macromedia Flash, and Apple Quicktime, perhaps the most linked-to sites on the web?
But, no, even those do not appear to be in PageRank order. For example, the first search result, the Netscape site, only has a PageRank of 8/10 and the Macromedia Flash site a PageRank of 5/10. Huh, again, not quite what I was expecting.
Curious. Why would these results differ so wildly? All of these pages contain these words. Why the difference in what makes it to the top?
Perhaps we should look at what other sites do as well. What do the results look like for [the], [of], [to], and [and] on Yahoo Search? They also differ from each other and from Google's. In fact, these results seem even more strange, with Crate and Barrel making it to the top on [and] and some site called To-Done being the top for [to]. Hmm...
Only MSN Search behaves even remotely close to what I was expecting. Results for [of], [to], [the], and [and] are fairly similar.
Looking at the Google results again, it may be the case that page title and text in links count for a fair amount. "The Onion" may get its prime spot on a search for [the] because a lot of people link to it as The Onion. But, does that explain why the White House gets top billing for [to] when only the page title ("Welcome to the White House") has that word?
Perhaps this is just a spot where small weightings deep in the Google guts make a nonsensical difference. When there's so little information about searcher intent -- a search for [the] or [and] -- it matters little what you show. Little tinkers here and there that might make a difference when intentions are clearer are probably just revealing themselves in odd ways for these broad queries.
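To make that concrete, ranking functions are typically weighted sums of many signals; when the query itself carries no information, small weights on signals like title and anchor-text matches can dominate. The features and weights below are entirely made up, purely to illustrate the mechanism:

```python
def score(doc, query_terms, weights=None):
    """Toy ranking score: a weighted sum of a few signals.

    Entirely illustrative -- not Google's actual formula. The point
    is that when the query (e.g. [the]) matches nearly everything,
    the static score ties out and the minor signals decide the order.
    """
    w = weights or {'pagerank': 1.0, 'title': 0.3, 'anchor': 0.5}
    title_hits = sum(t in doc['title'].lower().split()
                     for t in query_terms)
    anchor_hits = sum(doc['anchor_counts'].get(t, 0)
                      for t in query_terms)
    return (w['pagerank'] * doc['pagerank']
            + w['title'] * title_hits
            + w['anchor'] * anchor_hits)
```

With a query like [the], the anchor-text term can swamp the static score, which is one plausible way a site heavily linked as "The Onion" ends up on top.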
Nevertheless, I thought it was curious. Not what I expected to see.
Marissa Mayer on innovation
Google VP Marissa Mayer has an interesting column in BusinessWeek about creativity and innovation at Google.
Some excerpts on development of the Google Toolbar:
In the case of the Toolbar Beta, several of the key features (custom buttons, shared bookmarks) were prototyped in less than a week.
Rapid prototyping. Experiment and learn. Fail quickly and move on.
In fact, during the brainstorming phase, we tried out about five times as many key features -- many of which we discarded after a week of prototyping. Since only 1 in every 5 to 10 ideas work out, the strategy of constraining how quickly ideas must be proven allows us to try out more ideas faster, increasing our odds of success.
Speed also lets you fail faster ... It's important to discover failure fast and abandon it quickly. A limited investment makes it easier to walk away and move on to something else that has a better chance of success.
Innovation is exploration of the unknown. No one knows the best path. It is important to try many things, learn what works and what doesn't, quickly reject your failures, and build on your successes. That is the key to innovation.
See also Google CEO Eric Schmidt's thoughts on encouraging innovation in my earlier post, "Making innovation run rampant".
See also Amazon CEO Jeff Bezos' thoughts on encouraging innovation in my earlier post, "Jeff Bezos, the explorer".
Amazon Plogs
It appears Amazon.com has launched "Plogs" to the main Amazon home page. What are plogs? From the help page:
Your Amazon.com Plog is a personalized web log that appears on your customer home page. Every person's Plog is different (hence the name). Posts in your Plog come from many sources, including authors of books you have purchased on Amazon.com.
Right now, my Amazon home page is completely taken over by "Greg's Plog". There are two articles in my Plog, one from Mark Frauenfelder announcing the Maker Faire (because I bought Make Magazine through Amazon) and one from Alan Schwartz about picking a mail server (because I bought "Practical Unix and Internet Security").
On the one hand, Plogs are an interesting attempt to allow readers closer contact with authors and artists. For example, I have read many O'Reilly books, and I read the blog O'Reilly Radar. That might indicate that I have some interest in what the authors and publishers of those books have to say.
On the other hand, the benefit of this isn't clear. If this works, people might come back to Amazon every day, like they come back to weblogs every day, to read the latest. But that has to be balanced with taking up the entire above the fold real estate on the Amazon home page, extremely valuable space, with content that is not directly related to finding and buying products.
In the end, I'm not sure this will work out. My wife's reaction when she saw her plog was, "Eww... I hate weblogs." Despite the delusions of self-importance in most of the blogosphere, I think this is the typical reaction to weblogs. Most people don't have time to read the poorly thought out, poorly written, stream of consciousness drivel that is slapped up on most weblogs every day. So they don't.
If that experience from weblogs transfers to Amazon Plogs, it means that the service will be popular among a smaller audience, fans and book fanatics, but not for the majority of Amazon shoppers who just want to shop and go.
See also my previous post, "The mainstream and saving people time".
Update: As if to drive the point home, the top "Greg's Plog" entry on my Amazon home page today actually said, "You'll find me writing about whatever thoughts are rolling around in my head at this particular moment in time ... without worrying whether or not what I'm writing might potentially aggravate, alienate or just plain confuse people who are simply wandering by." Mmm... crappy content. My favorite.
Update: Several days later, my plog content hasn't changed. Not only does this mean that the content is boring and irrelevant because it is stale, but also it seems to defeat the purpose of this feature, which I would assume would be to encourage me to return to the Amazon.com website every day. Geez, what is Amazon doing?