Thursday, December 28, 2006

Is desktop search over?

In his remarkably detailed review of Windows Vista, Paul Thurrott wrote:
One of the most impressive features in Windows Vista ... is instant search.

Anyone who's struggled with the lousy search functionality in Windows XP or previous Windows versions will be happy to hear that the Vista version is fantastic, delivering near-instantaneous search results while providing the types of advanced features that power users will simply drool over.

Throughout Windows Vista, you will see various search points, all of which are context sensitive.

[For example,] in the right side of the Start Menu ... you can search your ... documents and other data files. As you type a search query in the windows search box, search results begin appearing immediately. The speed at which this happens is pretty impressive ... You can [also] search for applications, ... IE Favorites, email, and other items directly from the Start Menu.
The opportunity for third party desktop search applications like Google Desktop Search only existed because Windows XP desktop search was so pitifully slow.

As I said before, the moment Microsoft corrects this flaw, this opportunity will evaporate, as will the numerous also-ran desktop search apps. It appears Microsoft finally has fixed desktop search in Windows.

Paul's review goes on to say that "Microsoft will work to make instant search more pervasive in [the] future". Integration of search into Windows has long been expected as part of the search war. From a NYT article:
Internet search, according to Microsoft, will increasingly become seamlessly integrated into the Windows desktop operating system, Office productivity software, cellphones powered by Windows, and Xbox video games.

"Search will not be a destination, but it will become a utility" that is more and more "woven into the fabric of all kinds of computing experiences," said Kevin Johnson, co-president of Microsoft's platforms and services division.
And, Bill Gates said something similar over two years ago:
Search is a very pervasive thing. You want to search the Web, you want to search your corporate network, you want to search your local machine, and sometimes you want search to work against multiples of those things.
Now that Microsoft has fixed desktop search, they will integrate search throughout Windows and Windows applications. The easiest and most obvious option for searching will be the search box sitting right in front of you. That box will be powered not by Google or Yahoo, but by Microsoft.

See also my earlier post, "Using the desktop to improve search", where I looked at how several Microsoft Research projects might be used to improve search.

[Paul Thurrott's review found via Joel Spolsky]

Update: Two months later, Mary Jo Foley writes:
One analyst's survey doesn't make a trend. But a Global Equities research analyst said this week that he found "many Vista owners that once used Google's desktop search feature have switched to Microsoft's" desktop search which is built into Windows Vista.

Top Geeking with Greg in 2006

The posts with the most unique views on this weblog in 2006 were:
  1. In a world with infinite storage, bandwidth, and CPU power
    Highlighted notes that Google accidentally left in a PowerPoint presentation. 50k unique views.
  2. A chance to play with big data
    Discussed a release of search log data from AOL Research. 7.5k uniques.
  3. Google's BigTable
    Described a talk by Google on their Bigtable database. 7k uniques. I am a little surprised this is still so popular now that the Bigtable paper has been published.
  4. Lowered uptime expectations?
    Bemoans the unreliability of websites. 5k uniques, mostly because it was featured on Reddit.
  5. Kill Google, Vol. 3
    A piece describing the strategy I would use to attack Google if I were at Microsoft. 5k uniques.
This weblog had 265k unique visits overall in 2006. 83k of those were through Google. The other search engines did not appear in the top five referrers, but Bloglines and Reddit did.

Top search terms that got people to this weblog were "bigtable", "2006 predictions", "geeking with greg", "greg linden", and "aol search data".

Wednesday, December 27, 2006

Google makes 20 cents per search?

Google apparently now makes $0.20 per search from advertising revenue, according to Caris & Co. analyst Tim Boyd as quoted in the BusinessWeek article, "Why Yahoo's Panama Won't Be Enough".
Using data on total search queries, released by comScore, Caris & Co. analyst Tim Boyd estimates that Yahoo made on average between 10 cents and 11 cents per search in 2006, bringing in a total of $1.61 billion for the first nine months of the year.

Google, meanwhile, makes between 19 cents and 21 cents per search. As a result, it made an estimated $4.99 billion during the same period.
Quite an increase over the dime per search of two years ago.
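As a rough sanity check on those figures, dividing each company's estimated revenue by its per-search estimate gives the number of searches the estimate implies. This is only a back-of-the-envelope sketch, not Tim Boyd's method; the midpoints of the quoted ranges (10.5 cents and 20 cents) are an assumption, as is the helper name `implied_searches`.

```python
# Back-of-the-envelope check of the per-search estimates quoted above.
# Revenue figures come from the BusinessWeek quote (first nine months of 2006).
# The per-search midpoints are an assumption: the article only gives ranges.

def implied_searches(revenue_usd: float, revenue_per_search_usd: float) -> float:
    """Number of searches implied by total revenue and revenue per search."""
    return revenue_usd / revenue_per_search_usd

yahoo_searches = implied_searches(1.61e9, 0.105)   # midpoint of 10-11 cents
google_searches = implied_searches(4.99e9, 0.20)   # midpoint of 19-21 cents

print(f"Yahoo:  ~{yahoo_searches / 1e9:.1f} billion searches")
print(f"Google: ~{google_searches / 1e9:.1f} billion searches")
```

Under those assumptions, the estimates imply roughly 15 billion searches for Yahoo and roughly 25 billion for Google over the period, so Google's revenue lead comes from both more searches and more money per search.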

The BusinessWeek article also has some interesting tidbits on Yahoo's Panama, the difficulty of monetizing non-search page views, and the potential of behavioral targeted advertising to improve targeting on non-search page views.

On the topic of early efforts at advertising targeted to past behavior, Barry Schwartz's post, "How Microsoft's Behavioral Targeting Works" at Search Engine Land has a nice excerpt from a recent WSJ article on how Microsoft's adCenter does coarse-grained behavioral targeting.

See also my previous posts, "Microsoft adLab and targeted ads", "Yahoo testing ads targeted to behavior", "AdSense will not do behavioral targeting?", and "Is personalized advertising evil?".

[BW article found via Don Dodge]

Saturday, December 23, 2006

Defining geek

I kind of like this weblog post, "Geek vs. Nerd vs. Dork", and their description of geeks and geeking.

[Found on Valleywag]

Update: See also the Wikipedia entry on "geek".

Five things about me

I have to say, this latest meme strikes me as an unpleasantly narcissistic version of a chain letter. But, I have been "tagged" by three people now -- Mark Fletcher, Rich Skrenta, and Jeremy Zawodny -- so I will play along.

Here are five things that you probably do not know about me:
I was a lifeguard at Rinconada Pool in Palo Alto when I was a teenager. No, not very geeky, but it was a long time ago. I assure you that any coolness I once had is now gone.

I occasionally brew beer, but I am not very good at it. I once tried to brew a batch of Russian Imperial Stout, a very heavy beer, and bottled too early. 44 of 50 bottles exploded with enough force to embed shards of glass in a nearby wall.

As an undergrad, I had an odd double major in Computer Science and Political Science. Geeking out on political economy and game theory is great, but there is precious little overlap between that and computers, so I spent most of college with my nose buried in books.

I used to mess around with artificial life. It was more fun than useful, but I did help develop two simulations that were used in undergrad classrooms, the LEE Project and an iterated prisoner's dilemma simulation (PDF).

When I was a grad student, I got a black lab and named her Pavlova. If you think that is funny, you, like me, probably are a geek.
And now, I am supposed to share the love. How about Andrej Gregov, Steve Yegge, Brian Dennis, Scott Gatz, and John Battelle? You folks want to play?

Thursday, December 21, 2006

Findory in the UK Guardian

Findory was listed in "The new 100 most useful sites", an article in the UK Guardian today.

More information can be found on the Findory Press page.

Monday, December 18, 2006

Talk on eBay architecture

Randy Shoup and Dan Pritchett gave a talk on scaling eBay, "The eBay Architecture", at SD Forum 2006. The slides are available (PDF).

The parallels with Amazon are remarkable. Like Amazon, eBay started with a two-tiered architecture. Like Amazon, they split the website into a cluster in the late 1990s, followed soon after by partitioning the databases.

Like Amazon, they soon encountered poor performance and difficulty compiling their massive, monolithic binary (150M for eBay, Randy and Dan say). Like Amazon, they started a major rewrite of their monolithic binary around 2001, eventually building a services architecture on top of partitioned databases.

They even built their own search engine because "no off-the-shelf search engine met [their] needs." Amazon did that as well.

It is interesting that their new architecture basically gives up on transactional databases. They say eBay has "absolutely no client side transactions", "no distributed transactions", and "auto-commit for [the] vast majority of DB writes". Instead, they apparently use "careful ordering of DB operations". It sounds like mistakes happen in this system, because they mention running "asynchronous recovery events" and "reconciliation batch" jobs, which, I assume, means asynchronous processes run over the database repairing inconsistencies.
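The slides do not spell out what "careful ordering of DB operations" looks like in practice. A minimal sketch of the general idea, under assumptions of my own: two independent auto-commit writes are ordered so that a crash between them leaves only a detectable, repairable gap, and a batch job later fills that gap. The names `items`, `seller_index`, `list_item`, and `reconcile` are hypothetical and are not taken from the talk.

```python
# Hypothetical sketch: two auto-commit writes with no transaction around them.
# "items" and "seller_index" stand in for tables on different database hosts.

items = {}          # item_id -> item record
seller_index = {}   # seller_id -> set of item_ids

def list_item(item_id, seller_id, title, fail_before_index=False):
    # Write the item record first. If we crash after this step, the worst
    # case is an item missing from the seller index, which a batch job can
    # detect and repair. The opposite order could leave an index entry
    # pointing at an item that does not exist.
    items[item_id] = {"seller_id": seller_id, "title": title}
    if fail_before_index:
        return  # simulate a crash between the two writes
    seller_index.setdefault(seller_id, set()).add(item_id)

def reconcile():
    """Batch job: re-add any item missing from its seller's index."""
    repaired = []
    for item_id, item in items.items():
        ids = seller_index.setdefault(item["seller_id"], set())
        if item_id not in ids:
            ids.add(item_id)
            repaired.append(item_id)
    return repaired

list_item(1, "alice", "lamp")
list_item(2, "alice", "chair", fail_before_index=True)
repaired = reconcile()   # item 2 is re-added to alice's index
print(repaired)
```

The trade-off is that readers can briefly see an inconsistent state (an item that exists but is not yet in its seller's listing) in exchange for never holding a lock across hosts.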

In all, a very interesting talk for anyone who is working or wants to work on big websites and big data. As Tim Bray said, "This ought to be required reading for everyone in this business whose title contains the words 'Web' or 'Architect'."

See also Dan Pritchett's weblog post, "You Scaled Your What?", where he mentions his talk and these slides at the end.

See also some other interesting commentary ([1] [2] [3] [4]) on this talk.

For more on the early work I did scaling Amazon's systems, see my older post, "Early Amazon: Splitting the website". If you liked that, you might also like the rest of my Early Amazon series.

For more on what big companies like eBay, Amazon, and Google need and are not getting from databases, see my previous posts, "C-store and Google BigTable" and "I want a big, virtual database".

[Slides found via Rich Skrenta]

Thursday, December 14, 2006

Property rights on your attention

I enjoyed watching the video of this Google Tech Talk, "An Economic Response To Unsolicited Communication" by Marshall Van Alstyne about e-mail spam.

I liked Marshall's framework for the spam problem. He talked about spam as an externality, like pollution, and proposed a solution based on the Coase theorem that attempts to give people "property rights over their attention."

From his paper:
We propose an "Attention Bond," allowing recipients to define a price that senders must risk to deliver the initial message.

Requiring attention bonds creates an attention market ... to price this scarce resource. In this market, screening mechanisms shift the burden of message classification from recipients to senders, who know message content ... In certain limited cases, this leads to greater welfare than use of even "perfect" filters.
I was mostly interested in the theory discussed in the talk, but Marshall did propose an application for trying to eliminate spam. The basic idea is a whitelist system where senders not on your whitelist have to post a micropayment bond ($.01 - $.05) for you to receive the message. If you determine the message is spam, you seize the bond.
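The whitelist-plus-bond flow described above can be sketched in a few lines. This is an illustration only, not code from the paper or the talk: the `Inbox` class, its method names, and the example addresses are all hypothetical, and the sketch takes the sender address at face value.

```python
from dataclasses import dataclass, field

@dataclass
class Inbox:
    bond_price: float                             # price the recipient sets, e.g. $0.05
    whitelist: set = field(default_factory=set)   # senders who never need a bond
    earnings: float = 0.0                         # total of seized bonds
    escrow: dict = field(default_factory=dict)    # message id -> bond held
    next_id: int = 0

    def deliver(self, sender, bond=0.0):
        """Return a message id if delivered, or None if rejected."""
        known = sender in self.whitelist
        if not known and bond < self.bond_price:
            return None                           # unknown sender, bond missing or too small
        msg_id = self.next_id
        self.next_id += 1
        if not known:
            self.escrow[msg_id] = bond            # hold the bond until the recipient decides
        return msg_id

    def judge(self, msg_id, is_spam):
        """Seize the bond if spam, otherwise release it. Returns the refund."""
        bond = self.escrow.pop(msg_id, 0.0)
        if is_spam:
            self.earnings += bond
            return 0.0
        return bond

inbox = Inbox(bond_price=0.05, whitelist={"friend@example.com"})
rejected = inbox.deliver("stranger@example.com")             # no bond posted
accepted = inbox.deliver("stranger@example.com", bond=0.05)  # bond posted
refund = inbox.judge(accepted, is_spam=True)                 # recipient seizes the bond
print(rejected, accepted, refund, inbox.earnings)
```

Note that everything in the sketch hinges on `sender` being trustworthy, which is exactly where the real-world difficulty lies.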

While I enjoyed the talk, the proposed solution has problems. The biggest I see is that there is a quiet assumption that e-mail senders can be identified.

Yes, if you implement a strong identification system over e-mail, you can implement all kinds of promising anti-spam solutions. However, as security guru Bruce Schneier said, "These solutions generally involve re-engineering the Internet, something that is not done lightly."

Marshall addressed other criticisms near the end of his talk, including how the system would deal with honeypots and botnets, but, I think, also may have oversimplified the challenges there.

For example, Marshall claimed that marketers would be careful who they send e-mail to, so someone who sets up a honeypot to seize "attention bonds" would not get much business. But, I suspect enterprising people would not just set up one honeypot, but billions of them, each of which has a forged identity behind it made to look as attractive as possible to marketers. True, we may not have much sympathy for e-mail marketers, but this may threaten to ruin e-mail marketing completely, which would create opposition from the business community to this system.

Slashdot has a post on Marshall Van Alstyne's work, including some snarky comments ([1] [2] [3]) in the discussion.

See also Bruce Schneier's Crypto-Gram post, "The Economics of Spam".

Thomas Claburn at InformationWeek also has an interesting article on Marshall's work with good comments from others in the field.

Wednesday, December 13, 2006

AlwaysOn panel: What is the data telling us?

There are some interesting tidbits on personalization in this July 2006 AlwaysOn panel, "What Is the Data Telling Us?", with Peter Norvig (Google), Jim Lanzone (Ask), Usama Fayyad (Yahoo), and Michael Yavonditte (Quigo).

The panel moderator, Bambi Francisco, focused on privacy issues at the beginning, and the panelists appeared a little reluctant to talk. Usama Fayyad started off early by saying:
Knowing what people do collectively or in segments of special interests gives you a lot of very interesting information and a lot of leverage in terms of product and making things more relevant, including making advertising more relevant, and makes a better service.
A bit generic, but it is a good framing of the problem. We are trying to use aggregate data to make search and advertising more relevant and useful.

Bambi continued poking at the privacy issue, sparking Peter Norvig to say that Google really does not need or want to know everything about you. As Peter explained, building up some uber profile of everything you have ever done is less important than focusing on your recent history:
What's important is not you as an individual, but it's the role you are playing at the moment. When you are looking for one particular piece of information, I don't want to know about you so much as I want to know about all the other people in the same situation and what they did then.

And I'd rather know about what is your history for the last five minutes as you try to solve this problem than know about your history for the last five years.
Exactly right. What matters is your current mission, what you are trying to do right now. We can help by paying attention to what you are doing right now and helping you get it done.

Jim Lanzone chimed in around here, both talking about how users will not do a lot of up-front work in search and expanding on Peter's point about helping people with the problem they are currently trying to solve:
Most users are actually very lazy. While some high end users might use products that require tagging, the vast majority of people won't.

The behavior they will use is to iterate on a search engine. That one white box is just so easy for them to put in whatever is in the top of their head ... then the average searcher will review a result page in 5 seconds or less ... they get clues and then they will iterate their search.

That's why the average search session will have 3 or 4 searches ... That is part of the game for them, is finding a clue, iterating their search, getting more specific, and then finding what they need.

It's not worth their time to sit there and toggle a bunch of things in advance of their query, to then hopefully get a better result. It just saves them time to start going.
At this point, Bambi seemed to shift focus a bit and ask a bunch of questions about personalization and recommendations. Again, Bambi was not getting a lot of answers, but most of the answers she did get were fairly negative toward the idea of personalized search.

For example, Usama said, "You really can't read the searcher's mind," a statement that reminded me of a quote from former A9 CEO Udi Manber: "People will learn to use search better but have to invest the thinking -- we are not in the mind reading business." I was surprised to see Peter echo this point, saying something to the effect that Google would have to be clairvoyant to guess user intent given a search of a couple keywords.

I think both of these statements miss the point of personalized search. The idea is not to do something with nothing. That would be magic, mind reading. No, the idea behind search personalization is to add data about what a searcher has done -- especially what a searcher just did -- to refine the current search.

If the couple keywords in a search are too vague, looking back at a searcher's history may help disambiguate it. If a searcher is iterating and not finding what they want, paying attention to what they just did and did not find can help us narrow down on what they might need.
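As a toy illustration of using what a searcher just did to disambiguate a vague query, the sketch below boosts results whose titles share terms with the searcher's recent queries. It is a deliberately naive assumption-laden example (simple term overlap, a made-up `rerank` function, made-up results and scores), not a description of how any search engine actually personalizes.

```python
def rerank(results, recent_queries, boost=1.0):
    """Re-rank (title, base_score) results using terms from recent queries."""
    history_terms = {t for q in recent_queries for t in q.lower().split()}

    def score(result):
        title, base = result
        overlap = len(history_terms & set(title.lower().split()))
        return base + boost * overlap

    # sorted() is stable, so results with equal scores keep their original order.
    return sorted(results, key=score, reverse=True)

# The query "jaguar" is ambiguous; the last few searches hint at the animal.
results = [("Jaguar cars official site", 0.9),
           ("Jaguar big cat facts", 0.8)]
recent = ["zoo animals", "big cat habitat"]

top_without_history = rerank(results, [])[0][0]
top_with_history = rerank(results, recent)[0][0]
print(top_without_history)
print(top_with_history)
```

With no history the original ranking is unchanged; with a few minutes of history the result matching the searcher's current mission moves to the top.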

The entire talk is good fun, worth watching. Usama is focused on Yahoo Answers and social search. Jim talks mostly about search experience and making search easy. Peter adds clarity on a few points and has a few amusing anecdotes. Do not miss Peter's joke around 53:23 in the video about a haiku he found of some searches in the logs, "a story of ... frustration and release", very funny.

Monday, December 11, 2006

35% of sales from recommendations

In an article about startup Aggregate Knowledge, Matt Marshall writes, "Amazon says 35 percent of product sales result from recommendations."

Friday, December 08, 2006

My 2006 predictions: The results

Last year, like many others, I made a bunch of predictions for what would happen in 2006. It is time to look back and see how many I got right.

The press will attack Google, GOOG will drop
False. The press has been more critical of Google, poking at it occasionally on management, privacy issues, and the YouTube deal. But, there has been no major disillusionment or scandal, and the stock price has only gone higher.

Yahoo bets on community, buys more community startups, gets little benefit
True. Yahoo has bet heavily on community and social search (Yahoo Answers, My Web, [...]), but success with these in the mainstream has been mixed. Yahoo acquired [...] and Jumpcut. Yahoo has disappointed investors with their performance, leading to a major reorg recently.

Microsoft launches unsuccessful AdSense competitor
Originally: True. Microsoft launched adCenter, which has not yet been successful at threatening Google AdSense.
Corrected: False. Microsoft launched adCenter, an AdWords competitor, but has not launched an AdSense competitor yet.

MSN Search will increase share
Very false. In fact, MSN Search share dropped substantially. I'll say it again, it really is remarkable how badly MSN Search is doing.

Microsoft will abandon Windows Live
False. What I meant by this prediction is that Microsoft could not maintain both the MSN and the Live brand, so they would choose MSN over an expensive effort to build a new Live brand. But that was wrong too. Microsoft is not abandoning the MSN brand or the Live brand; they are trying to build both brands, creating much confusion.

Mainstream will like tagging images and videos, but not documents
Mostly true. My Web 2.0, [...], and other apps for tagging documents do not seem to be attracting large audiences. Tagging images on Flickr and videos on YouTube seems reasonably popular, though, even for images and videos, it is not clear that large mainstream audiences widely have embraced the effort required to label things with tags.

Tagging sites will be assaulted by spam
False, at least at the level I was predicting. I thought Technorati, [...], and Flickr would be flooded with spammers labeling ads and other crap with arbitrary tags, hoping to attract clicks. Technorati and [...] show some spam, but are not "assaulted" by an "influx of crap".

A spam robot will attack Wikipedia
False, but the part of this prediction that said that Wikipedia will "shut off anonymous edits and place other controls on changes" was at least partially true. As Nick Carr said, "the administrators adopted an 'official policy' of what they called ... 'semi-protection' to prevent 'vandals' ... from messing with their open encyclopedia." Moreover, as Eric Goldman argues, major spam attacks on Wikipedia may just be a matter of time.

Yahoo and MSN launch blog search, Technorati and Feedster lose share, Google Blog Search dominates
Originally: Mostly false. Yahoo and MSN still do not have a separate blog search, but Ask did launch one. Feedster is suffering, but Technorati is doing surprisingly well against Google Blog Search.
Revised: Mostly true. Yahoo and MSN still do not have a separate blog search, but Ask did launch one. Feedster and Technorati are both suffering, and Google is dominating.

An impressive and more ambitious version of Google Q&A
False, at least not yet. I was expecting to see something really cool here, a product of the massive processing power of the Google cluster, but it did not happen. Though, I have to say, hints of good things to come seem to keep popping up in Peter Norvig's talks. Maybe this is just a matter of time.

A VC-fueled bubble around personalization
False. There has been interest and some funding for startups doing personalization and recommendations, but not at the absurd, frothy level I expected.

Google News adds recommendations, MSN/Yahoo experiment with personalization, all three expand in targeted advertising
Mostly true. Google News does have a widget that recommends news based on your reading history. AOL launched news recommendations in My AOL. Yahoo and MSN are both doing early experiments with behavioral targeted advertising, but have not done much elsewhere with implicit personalization.

Hype about mashups and APIs will fade
Originally: False. If anything, the hype seems to be increasing. I have not seen much evidence that people are disillusioned yet with the restrictions or lack of uptime guarantees on APIs. That may be a matter of time; a scandal like an extended downtime or sudden change to harsher terms on an API might be sufficient.
Revised: Mostly false, since there still is much hype, but there are signs of a growing backlash. See the updates at the bottom of this post.

eBay's business slows, eBay makes other acquisitions to acquire growth
Mostly true. eBay's growth has slowed. The expensive Skype deal seems to have tempered eBay's interests in additional acquisitions, but they did do a $2M acquisition of Meetup, $48M acquisition of Tradera, and a deal with Google.

Well, not such a good track record. About a third true or mostly true. Maybe I should put away my crystal ball?

Update: As John K pointed out in the comments, Microsoft adCenter is an AdWords competitor, not an AdSense competitor. Microsoft has not yet launched an AdSense competitor. Sorry, my mistake.

Update: I may have judged too soon on the lack of a backlash against APIs. Google very recently pulled their web search API, causing Dare Obasanjo to say:
One thing that is slowly becoming clear is that providers of data services would rather provide you their data in ways they can explicitly monetize (e.g. driving traffic to their social bookmarking site or showing their search ads) instead of letting you drain their resources for free no matter how much geek cred it gets them.
That is not too far from what I said back in Nov 2005:
I keep hearing people talk as if companies are creating web services because they just dream of setting all their data free. Sorry, folks, that isn't the reason.

Companies offer web services to get free ideas, exploit free R&D, and discover promising talent. That's why the APIs are crippled with restrictions like no more than N hits a day, no commercial use, and no uptime or quality guarantees. They offer the APIs so people can build clever toys, the best of which the company will grab -- thank you very much -- and develop further on their own.
Update: I guess I also judged too soon when I said Technorati is doing surprisingly well against Google Blog Search. According to a Dec 28 article from Hitwise, the combined traffic of [...] and [...] is now about twice that of Technorati.

Update: Maybe I was just too early on the VC frenzy around personalization. VC Fred Wilson predicts that "the implicit web is going to start taking off in 2007" where the "implicit web", as Fred defines it, is using clickstream and other implicit information about preferences to do recommendations and personalization. Perhaps the frenzy will be in 2007, not 2006.

The RSS beast

Matt Linderman at 37 Signals posts about "Taming the RSS beast":
There should be an alternative to one-size-fits-all RSS feeds for busy sites.

Too many high-volume sites assume everyone wants to read every post. That's wishful thinking. Some readers may want 5+ posts a day from your site, but what about moderate fans who only want 5 posts a week? Or casual fans who want a mere 5 posts a month? These people just want a glass of water yet sites insist on pointing a firehose at them.
Matt goes on to quote the frustration of Khoi Vinh at his feed reader:
I've collected so damn many RSS feeds that, when I sit down in front of the application, it's almost as difficult a challenge as having no feed reader whatsoever. With dozens and dozens of subscriptions, each filled with dozens of unread posts, I often don't even know where to start.
Matt also quotes an older Wired article that nicely states the problem:
I want to solve the question of "I don't have any time and I subscribe to 500 feeds. I just got off the plane. What do I need to read?"
Current RSS readers merely reformat XML for display. That isn't enough. Feed readers need to filter and prioritize. Show me what matters. Help me find what I need.

Matt's post focuses on issues for people with hundreds of feeds in their feed reader -- a common problem for us geeks -- but I think the problem is much broader than that.

Not only do most people not want to read every post from various feeds, but most people do not want to go through the hassle of tracking down and subscribing to individual feeds in the first place. XML is for geeks, not something that should be exposed to readers. Most people just want to read news. Next generation feed readers should hide the magic of locating content.

Overall, feed readers need to do a better job of focusing on scarce attention. Readers have limited time. Feed readers should be helping readers focus, filter, and prioritize. Feed readers should throw out the crap, surface the gems, and help people manage the flood of information coming at them.
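One very simple way a feed reader could filter and prioritize is to rank incoming posts by how often the reader has clicked on posts from each feed in the past, and show only the top few. The sketch below assumes the reader keeps per-feed click and impression counts; the function name, the smoothing constants, and the example data are all hypothetical.

```python
def prioritize(posts, clicks, shown, limit=5):
    """Rank (feed, title) posts by the reader's past interest in each feed.

    clicks/shown are per-feed counts from the reader's history.
    """
    def interest(feed):
        # Smoothed click-through rate; the +1/+2 puts never-seen feeds at 0.5
        # so new subscriptions are neither buried nor pushed to the top.
        return (clicks.get(feed, 0) + 1) / (shown.get(feed, 0) + 2)

    ranked = sorted(posts, key=lambda post: interest(post[0]), reverse=True)
    return ranked[:limit]

posts = [("firehose", "Post A"), ("favorite", "Post B"), ("new-feed", "Post C")]
clicks = {"firehose": 1, "favorite": 8}
shown = {"firehose": 100, "favorite": 10}

top_two = prioritize(posts, clicks, shown, limit=2)
print(top_two)
```

A real reader would want per-post signals as well (what other readers clicked, freshness, topic), but even this crude per-feed ranking turns a firehose into a short list.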

See also my Jan 2006 post, "RSS sucks and information overload".

See also my March 2005 post, "A relevance rank for news and weblogs".

See also my Sept 2005 post, "Findory RSS Reader, Part II".

Update: Got to love the title of this recent post by Nick Carr, "Lost in the shitstream".

Thursday, December 07, 2006

YouTube cries out for item authority

When I worked at Amazon, a lot of effort went into recognizing that two items in the catalog were actually the same item. That was called item authority.

I was recently browsing around in YouTube and I noticed how bad the site is about dealing with multiple copies of the same content. For example, on Weird Al's video, "White & Nerdy", look at the related videos.

The first four are all copies of the same video. They are not "related"; they are the same video.

Of the first ten videos in that list, only three are unique. The others are all duplicates.

This problem is not unique to YouTube. On Google Video, "White & Nerdy", eight of the top ten "related" videos are identical copies of the Weird Al music video.

The point of showing me related content is to help me discover new and interesting content. Showing identical copies of the same video I just watched is not useful to me.

What is useful is helping me find interesting other videos. At a minimum, you could screen out duplicates and then show other Weird Al videos; that would be useful, if a bit obvious. Alternatively, you could show videos that interest people who liked "White & Nerdy", using other customers' actions to help me find interesting content.
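A minimal sketch of both ideas together: collapse duplicate uploads into one canonical item, then recommend the items most often liked by people who also liked the current one. The hard part, deciding which uploads are the same video, is assumed to be already solved and handed in as the `canonical` mapping; all ids and data here are made up for illustration, and this is not a description of any site's actual system.

```python
from collections import Counter

def related(video_id, canonical, likes, k=3):
    """Recommend up to k distinct works liked by people who liked video_id.

    canonical: upload id -> canonical work id (duplicates share a work id)
    likes:     user -> set of upload ids that user liked
    """
    target = canonical[video_id]
    counts = Counter()
    for liked in likes.values():
        works = {canonical[v] for v in liked}   # collapse duplicate uploads
        if target in works:
            counts.update(works - {target})     # never recommend the same work
    return [work for work, _ in counts.most_common(k)]

canonical = {
    "upload-1": "white-and-nerdy",
    "upload-2": "white-and-nerdy",   # duplicate copy of the same video
    "upload-3": "other-video-a",
    "upload-4": "other-video-b",
}
likes = {
    "user-1": {"upload-1", "upload-3"},
    "user-2": {"upload-2", "upload-3", "upload-4"},
    "user-3": {"upload-4"},
}

recommendations = related("upload-1", canonical, likes)
print(recommendations)
```

Because likes on every copy count toward the same canonical work, the duplicates strengthen the recommendations instead of crowding them out.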

Crawling the world's information is not enough. You need to make that information useful. You must help people find relevant information, help people find the information they need.

Wednesday, December 06, 2006

Spam is ruining Digg

Elinor Mills at CNet writes about "The big Digg rig", saying:
Dubious Internet marketers are planting stories, paying people to promote items, and otherwise trying to manipulate rankings on Digg and other so-called social-media sites like Reddit and Delicious.

Some marketers offer "content generation services," where they sell stories to Web sites for the sole purpose of getting them submitted to Digg and other sites.

Companies charge as much as $15,000 to get content up on Digg, said [ACS CTO] Neil Patel ... If a story becomes popular on Digg and generates links back to a marketer's Web site, that site may rise in search engine results and will not have to spend money on search advertising, he said.

Another way to get Web links to a suspicious site is to get inside help from users at a social-media site. For instance, spammers have tried to infiltrate Digg to build up reputations and promote stories for marketers, experts say.

Other scammers are trying other ways to buy votes. A site dubbed "User/Submitter," purports to pay people 50 cents for digging three stories and charges $20 for each story submitted to the site, plus $1 for every vote it gets. The Spike the Vote Web site boasts that it is a "bulletproof way to cheat Digg" and offers a point system for Digg users to submit and dig stories. And Friendly Vote bills itself as an "online resource for Web masters" to improve their marketing on sites like Digg and Delicious.
See also Niall Kennedy's recent post, "The spam farms of the social web".

See also my Sept 2006 post, "Digg struggles with spam", where I said:
These problems with Digg were predictable. Getting to the top of Digg now guarantees a flood of traffic to the featured link. With that kind of reward on the table, people will fight to win placement by any means necessary.

It was not always this way. When Digg was just used by a small group of early adopters, there was little incentive to mess with the system. The gains from bad behavior were low, so everyone played nice.

Now that Digg is starting to attract a large mainstream audience, Digg will be fighting a long and probably losing battle against attempts to manipulate the system for personal gain.
See also my July 2006 post, "Combating web spam with personalization".

See also my March 2006 post, "Growth, crap, and spam", where I said:
There seems to be a repeating pattern with Web 2.0 sites. They start with great buzz and joy from an enthusiastic group of early adopters, then fill with crud and crap as they attract a wider, less idealistic, more mainstream audience.
See also my Jan 2006 post, "Digg, spam, and most popular lists".

[CNet article found via Matt McAlister]

Yahoo's reorg

Plenty of talk today on the Yahoo reorg. I particularly liked the coverage by Elise Ackerman at the SJ Mercury News:
"We plan to drive growth and profitability by leveraging our deep audience insights to create a full-fledged advertising network," [CEO Terry] Semel said.

[CTO] Farzad Nazem will be pressed to speed development of the company's delayed next-generation advertising software.

Analysts believe one of the chief reasons behind Yahoo's current woes is its failure to deploy software that could prioritize the placement of the most lucrative ads on its Internet properties.

During the past year, Yahoo's financial performance has repeatedly disappointed investors who have sent its stock price plunging more than 35 percent.
See also my earlier post, "Yahoo's troubles", where I said, "The business is advertising ... To fail to compete on advertising is to fail."

See also Om Malik's harsher comments on the reorg. By the way, I kind of like the mission Om gave for Yahoo at the end of his post, to organize all "relevant information". That captures the importance of helping people focus their attention on useful information rather than making all information accessible.

See also my earlier post, "Yahoo peanut butter memo".

Monday, December 04, 2006

Slides from my talk at Stanford

I had the great pleasure of giving a talk today on practical issues in personalization and recommendations for the Data Mining (CS345) class at Stanford taught by Anand Rajaraman and Jeff Ullman.

The slides from my talk are available in two versions. The first version is the talk I actually gave; make sure to read the notes pages for the slides, or it will be difficult to follow. The second version is done in a very different style and should be easier to follow without me blabbing away in front of you.

It was very fun giving this talk. The students were clever, thoughtful, and enthusiastic. I was pleased to get a chance to talk again with Anand, a sharp former colleague from Amazon.com. And, I was overjoyed to meet Jeff Ullman who, for many of us computer science geeks, is a legend because of his seminal work and books.

I hope the talk was as fun for those in the audience as it was from the podium. Thanks again, Anand, for inviting me to speak.

See also my earlier post about the excellent lecture notes that are publicly available for this data mining class. If you are working (or just dabbling) in this field, they are well worth your time to review.

Update: Matt Wyndowe, who was sitting in the class, posted some thoughts on my talk. Thanks, Matt!

Saturday, December 02, 2006

YouTube and the Google boodle

Jon Fine at BusinessWeek reports on the ongoing saga of dealing with copyright issues following the Google-YouTube deal.

From the article:
Google and YouTube are dangling nine-figure sums in front of major programming and network players -- that is, the Time Warners, News Corps, and NBC Universals of the world. Google calls these monies licensing fees, according to executives who've been involved in the discussions.

But some of them characterize the subtext like this: Don't sue us over copyrights. Take this (substantial) payment, and trust us to figure out how we'll all make serious money once we get advertising and revenue sharing worked out.

If you're a network, you can't ignore YouTube's reach ... But if you're a network ... your copyrights, and insisting on your programming's premium value, underpin the entire business model. To complicate matters, no publicly traded media company today is in a position simply to dismiss, say, $100 million. Such a sum far exceeds what any single broadcast network can extract from the online world.
I have to admit, I was surprised lawsuits were not filed immediately after the GooTube deal closed. This appears to explain why. It is not that Google thinks YouTube has no infringing content (as Eric Schmidt absurdly claimed), but that the studios are dazzled by shiny hoards of Google boodle.

It does not seem like a good position for Google. They essentially are saying: We know YouTube is illegal, but here's a huge bribe if you ignore it for now. Those are going to be expensive deals for Google; the studios know all the leverage is on their side.

And, perhaps it is naive of me to think Googlers actually believe in it, but it is hard for me to see how this fits into Google's "do no evil" philosophy. Even if you think current IP laws need to be changed, pushing that forward by brazenly violating copyright law seems to cede the high ground.

See also my previous post, "YouTube is not Googly". That post originally was written before the deal had been announced, but the updates at the bottom have additional links and comments on Google's efforts to buy off YouTube lawsuits.

See also Om Malik's post where he said, "It is the distraction factor ... The copyright issues and all those other problems are going to strain google where it is weakest - management and control."

[Found via TechDirt]

Update: Two months later, SeekingAlpha reports, "Despite months of negotiations, Google has been unable to secure a deal to post content from any major media company on YouTube."

Update: Three months later, Viacom sues GooTube for $1B. Apparently, the bribes Google offered were not big enough. More from Don Dodge and Mark Cuban.

Thursday, November 30, 2006

MSN Search and beating Google

Dare Obasanjo from Microsoft has a good post where he argues, "Competing with Google's search engine is no longer about search results quality, it is about brand and distribution."

Dare goes on to explain that Google has become the default search engine. To win, Microsoft needs to reacquire the channels and mindshare that would make them the default.

I agree pretty much with Dare on this one. Back in 2003, things may have been different, but, at this point, I think Microsoft needs to throw around their market power to win.

Microsoft should lock up channels with partnerships, cut out Google from the defaults, make exclusive advertising deals that suck revenue away from Google, and make Live (or MSN or whatever brand they finally pick) part of the general lexicon. It is not pretty. It is not nice. But it is what they must do to win.

See also my April 2006 post, "Kill Google, Vol. 3", where I said:
Microsoft should strangle Google's air supply, their revenue stream .... Microsoft should use its size to make deals .... Microsoft should use its market power to be the exclusive ad provider for large sites .... Microsoft should ... make being an advertising provider unprofitable for others.

If Microsoft wants to win, it should play to its strengths. It should not seek to change the game. It should seek to end the game.
See also my previous post, "Google dominates, MSN Search sinks".

Update: Four months later, Microsoft appears to be making new efforts to lock up channels with partnerships. John Battelle reports, "Microsoft is offering its large enterprise customers free service and product credits if those customers push Live search inside their enterprises."

Wednesday, November 29, 2006

Google Answers croaks

Andrew Fikes and Lexi Baugher post on the Official Google Blog that Google Answers will stop taking questions and effectively shut down.

Google Answers was a clever but unpopular site where you could ask any question and have it investigated by a small group of professional researchers. Fees were quite high, so the audience was fairly limited. Moreover, it always seemed out of place with Google's normal tendency to focus on automated solutions.

In light of this shutdown, I think it is worthwhile to compare Google Answers to some of the other question answering services out there.

Danny Sullivan has a thoughtful post comparing the now defunct Google Answers to the more successful and free Yahoo Answers. Like Danny, I have been surprised by the relative success of Yahoo Answers given the low quality of both the questions and the answers.

Another interesting comparison is with the fledgling Askville and NowNow question answering services from Amazon. Those services appear to be trying to blend Google Answers (tens of dollars for answers from experts) with Yahoo Answers (free for answers from idiots); Askville and NowNow use Mechanical Turk and will charge under a dollar for answers. I am curious to see if these Amazon Q&A services succeed, or if the lesson from Google Answers is that people are not willing to pay for answers regardless of quality.

Gary Price also reminds us that Ask Jeeves had a Q&A service called Answer Point that they shut down in 2002. It apparently was similar to Yahoo Answers and was free. That may suggest that we should not conclude that the free model of Yahoo Answers is better, but that none of these community-based question answering services, whether free or not free, have legs.

On a slight tangent, I think it is great that Google is shutting down some of its failed experiments to try to keep their focus. Just last week, I was lamenting the amount of failed and failing features on Google, Yahoo, and Amazon and wrote, "Old products never die, but they should. To innovate, it is not enough to love creation. We must also love destruction."

See also my previous posts ([1] [2]) about Yahoo Answers.

Update: See also Brady Forrest's thoughts over at O'Reilly Radar.

Update: See also Nick Carr's post, "The five Google products".

Monday, November 27, 2006

Item-to-item collaborative filtering

There appears to be a little confusion in some of the research literature on the earliest work on item-to-item collaborative filtering, a recommender algorithm that focuses on items rather than users.

The earliest work of which I am aware is:
G. Linden, J. Jacobi, and E. Benson, Collaborative Recommendations Using Item-to-Item Similarity Mappings, US Patent 6,266,649 (to Amazon.com), Patent and Trademark Office, Washington, D.C., 2001
That patent was filed in 1998 and issued in 2001.

A later academic paper
Greg Linden, Brent Smith, Jeremy York, Recommendations: Item-to-Item Collaborative Filtering, IEEE Internet Computing, v.7 n.1, p. 76-80, January 2003
is a more friendly description of the work in the 1998 patent. It cites the patent as previous work.

Another paper that appears to be frequently cited is:
Badrul Sarwar, George Karypis, Joseph Konstan, John Riedl, Item-based collaborative filtering recommendation algorithms, Proceedings of the 10th international conference on World Wide Web, p.285-295, May 01-05, 2001, Hong Kong, Hong Kong
Some publications mistakenly have written that Sarwar et al. first "introduced" or "proposed" item-to-item collaborative filtering. Even The Economist got this wrong, later issuing a correction.

This confusion may be because the Sarwar et al. paper did not reference the patent. The 1998 patent preceded Sarwar et al. by three years and was public information well before the Sarwar et al. paper was published. The 1998 work probably should have been cited by Sarwar et al., certainly if any of the authors had reviewed it, but probably even if they had not.

I realize the situation is complicated. The 1998 patent usually is referenced with its 2001 issue date, as is the convention, making it less clear that it preceded Sarwar et al. by three years. The Sarwar et al. paper, the first academic publication on the algorithm, does not reference the patent at all, further confusing the issue.

Nevertheless, please be careful about crediting the earliest work. Please note the earlier publications when writing about item-to-item collaborative filtering.
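
For readers unfamiliar with the technique, here is a minimal sketch of what "focuses on items rather than users" means in practice. It is an illustration only, not the method claimed in the patent or described in either paper. The cosine measure over co-purchase counts, the function names, and the toy purchase data are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(purchases):
    """Build an item-to-item similarity table.

    purchases: dict mapping a user id to the set of items that user bought.
    Returns a nested dict: sims[a][b] is the cosine similarity between
    items a and b, based on how often the same user bought both.
    """
    item_counts = defaultdict(int)   # number of buyers of each item
    pair_counts = defaultdict(int)   # number of buyers of both items in a pair
    for items in purchases.values():
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), both in pair_counts.items():
        score = both / sqrt(item_counts[a] * item_counts[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(sims, user_items, top_n=3):
    """Score unseen items by summing similarity to items the user already has."""
    scores = defaultdict(float)
    for item in user_items:
        for other, score in sims.get(item, {}).items():
            if other not in user_items:
                scores[other] += score
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda k: (-scores[k], k))[:top_n]


# Hypothetical toy data, invented for this example.
purchases = {
    "u1": {"book_a", "book_b"},
    "u2": {"book_a", "book_b", "dvd_c"},
    "u3": {"book_b", "dvd_c"},
    "u4": {"book_a"},
}
sims = item_similarities(purchases)
print(recommend(sims, {"book_a"}))
```

The point of the sketch is that the similarity table is keyed by item and can be built ahead of time, so recommending for one person only needs a lookup over the handful of items that person has touched, rather than a search for similar users.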

Saturday, November 25, 2006

Conquering small displays

Much of the UI effort in mobile focuses on the hard problem of picking what information to display on tiny screens. Many of the mobile search startups are focused on this problem exclusively, but the solutions are unsatisfactory.

When I look at this problem and the effort going into it, I wonder if we are just a couple years away from a hardware solution that makes much of it obsolete.

To see what I mean, let me dive back to a year ago when I was enjoying an excellent talk at UW CS by Patrick Baudisch from Microsoft Research. The talk, which is available for download, asked:
How can we display complex documents on displays the size of a stamp? How can users interact with such documents?
Pat proposed summarization and attention-focusing techniques as the solution:
"halo" helps users perform spatial reasoning on large maps; "summary thumbnails" and "collapse-to-zoom" allow users to make sense of web pages by compressing them to the size of the phone screen.
It is a fascinating subject, summarizing information and focusing attention on small devices. But, after watching this talk, I wondered how much of this problem is a real, long-term problem or a temporary one created by our current hardware.

For example, I could imagine a small, monocular-like device that I hold up to my eye. Looking through this, I could see what would appear to be a massive screen covering most or all of my field of vision, not that much different than sitting 12" away from a 20" flat screen display.

Even better, maybe the form factor could be sunglasses and the image could be drawn on the glass or projected directly on the retina.

I tried to get at this with an e-mail question to Pat after the lecture, asking:
Is the problem actually the small screen? Or is it really the low resolution of the small screen? If, for example, screens on cell phones had 1280 x 1024 resolution in a screen only a couple inches on each side, would this change the problem?

As you said in the talk, the problem seems to be centered around readability. If the resolution was high enough that the screens were readable if held close to the eyes, would that change the nature of the problem?
I may have failed to describe the idea well. Pat responded that he was concerned about people with poor eyesight being able to focus on and read a tiny but high resolution screen. However, I am fairly sure that, if the device is held up to the eye, the image could be displayed so that the eye should be focused at infinity, not on the device an inch away.

The idea here is fairly obvious. Small displays do not appear small if they are held close to the eye. A virtual display can appear massive even if coming from a small device.
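
A quick back-of-the-envelope check of that claim, as a sketch: the apparent size of a flat display is its visual angle, 2 * atan(size / (2 * distance)). The 20" and 12" figures come from the example above; the 2" screen held 1.2" from the eye is a made-up number for comparison, and the calculation ignores the optics needed to let the eye focus comfortably at that distance.

```python
from math import atan, degrees


def visual_angle_deg(size_inches, distance_inches):
    """Angle (in degrees) that a flat object of the given size subtends
    at the eye when viewed head-on from the given distance."""
    return degrees(2 * atan(size_inches / (2 * distance_inches)))


# The desktop case from the post: a 20" display viewed from 12" away.
desktop = visual_angle_deg(20, 12)

# Hypothetical tiny display: 2" across, held 1.2" from the eye.
tiny = visual_angle_deg(2, 1.2)

print(f"20 inch display at 12 inches: {desktop:.1f} degrees")
print(f"2 inch display at 1.2 inches: {tiny:.1f} degrees")
```

Because only the ratio of size to distance matters, the two cases cover the same portion of the field of vision. What differs is how many pixels the small screen can pack into that angle, which is why resolution is the sticking point.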

I suspect all we need is the ability to display at high resolution on a tiny screen.

So, is the problem of optimizing content for tiny screens a real, long-term problem? Or is it one that soon will disappear as hardware improves?

Update: About a year later, the NYT reviews the Myvu Universal, virtual display glasses where "the picture appears to float a few feet in front of you."

Update: Fourteen months later, the NYT reports on "the Pico Projector ... a card-sized device that connects to a cell phone or other gadget and uses a laser to project an image at the equivalent size of a 60-inch television screen."

Friday, November 24, 2006

My AOL launches news recommendations

Sam Sethi at TechCrunch UK reports that My AOL integrated "personalized content recommendations" into their beta feed reader.

There are two sections, "People Like Me Content" and "Recommended Content". The "People Like Me Content" help popup says:
As other people use My AOL, they occasionally click on stories that are similar to the items you have selected. Our system recognizes these similarities and provides additional content that might be of interest to you.
The "Recommended Content" help popup says:
These are personalized content recommendations. As you click on headlines within My AOL, we (well, our computers) “learn” what you like and suggest similar stories.
The difference is not apparent to me -- both say they use your reading history to find similar content -- but perhaps one group of recommendations is user-based and the other is content-based.

The service appears quite similar to Findory -- perhaps even inspired by Findory -- but the quality of the recommendations seems a bit off in my tests.

For example, I clicked on six articles on My AOL, a TechCrunch article about Google Blog Search and the five most recent articles from my weblog, Geeking with Greg, which happen to all be about Microsoft, Google, Yahoo, and Amazon. My "People Like Me Content" was:
  • RR of the Day: 1984 Alfa Romeo GTV6
  • Google's poo apparently doesn't smell
  • Big Brother Is Listening
  • Virtualization Disallowed For Vista Home
  • EPA to Regulate Nanoproducts Sold As Germ-Killing
  • Suspect Captured at Miami Herald
  • British Government Attacks Own Citizens
  • USA TODAY: Teacher's Space Goal Delayed 21 Years
Not good at all. However, the "Recommended Content" was quite a bit better, though very tightly focused on search:
  • Interesting Yahoo result in Google Search
  • My favorite blogger/blog of the moment...
  • Tracking a package through MSN Search
  • RSS mashup: Amazon, eBay, Yahoo! product results
  • Finding Search related Jobs
  • Community Powered Search
  • Become helps you search a little closer
  • Yahoo! vs. Google -- more or less peanut butter?
For comparison, here are the recommendations you get if you click the same articles on Findory:
  • Google's Kirkland Office
  • The Inefficiency of Feed Readers
  • Google AdSense Gift 2006: Digital Photo Frame
  • Search Engine Thanksgiving Logos 2006
  • Top Web Apps in Serbia
  • Canadian ISPs Launch Fight Against Child Porn
  • Yahoo & Peanut Butter Market Timing
  • AbbreviationZ : Acronyms & Abbreviations Search Engine
Certainly, My AOL's news recommendations are not bad for a first effort. I would expect them to improve in time as they gather more data and refine their algorithms.

I find it very interesting to see this new feature coming out of AOL. Other than Google Personalized Search, this effort from My AOL looks like the biggest use of recommendations and personalization from the search giants so far, bigger than the recommended stories in Google News and MSN Newsbot and the feed suggestions in Bloglines.

I wonder if we will soon see additional personalization and recommendation features launched by the search giants, not just in news and feeds, but also in podcasts, videos, search, and advertising.

Update: In the comments for this post, Jim Simmons (PM, Personalization, AOL) confirms that the difference between "Recommended content" and "People Like Me Content" is content-based vs. user-behavior-based recommendations.

Thursday, November 23, 2006

Amazon crashes itself with promotion

Amazon.com ran a special promotion today that offered an Xbox system for $100, about 1/3 of the normal price, starting at 11am.

Broadband Reports posts about what happened:
So many people were waiting for the promotion that the entire Amazon website - not just the promotion page - sank without a trace from just before 2pm, to at least 2:12pm. The home page, the product pages, everything, were unavailable.
Sounds familiar. When I was at Amazon, every year we in engineering would try to avoid spikes in traffic, especially around peak holiday loads, and every year marketing folks would want to run some promotion specifically designed to create a mad frenzy on the site. Usually, we convinced them to change the promotion, but apparently engineering lost (or was asleep at the switch) this year.

Broadband Reports goes on to point out that this reflects badly on Amazon:
We wonder how many amazon shoppers elsewhere in the site abandoned their purchases halfway through after they found their experience destroyed by the vote rush going on in the next room ... Some people got quite irate.

The poor performance of the amazon site during the giveaway also reflects badly on the Amazon "elastic compute cloud" offering (Amazon EC2) which is designed, supposedly, to offer instant capacity to companies which need to deal with exactly this kind of sudden rush.
I don't think it quite works that way. A DDoS attack, which this effectively was, can generate way over the 10x peak load for which the website would be designed. Even so, it still is pretty lame for Amazon to DDoS itself.

It appears the contest is running again next week with the same structure. I wonder if Amazon will crash itself again?

Update: It appears Amazon is looking at changing the structure of this promotion to prevent another brownout. Currently, there is a message up that says, "Due to the popularity of Amazon Customers Vote, we are extending the Week 2 voting period. Customers who cast a vote will be sent an e-mail notification of the new sale date."

Update: Mike at TechDirt reports that "Amazon Cries 'Uncle' On Promotion Traffic" by changing the rules to prevent another outage.

Wednesday, November 22, 2006

Google dominates, MSN Search sinks

Danny Sullivan posts detailed numbers on search market share, combining data from Comscore, NetRatings, and Hitwise.

To summarize it all, Google grabbed more share at Microsoft's and AOL's expense.

Particularly bad off was MSN Search, as Danny shows in this graph of Microsoft's market share in search:

Ouchie. As Danny says, "[Not] a pretty picture for Microsoft ... They haven't held share. It's drop, drop, drop."

It really is remarkable how badly Microsoft is doing against Google. I never would have thought that, nearly four years after they started their "Underdog" project to build a Google-killer, Microsoft would not only be badly behind in search, but also actually losing market share.

See also my earlier posts, "Yahoo and MSN cannot compete?" and "Kill Google, Vol. 3".

Update: Corrected the post to say that Google's gains came at the expense of Microsoft and AOL alone. Yahoo and Ask appear to have held share.

Update: For full disclosure, I should say that Chris Payne and I talked about Underdog back in 2003 (before I started Findory). Not to worry, I had no influence to speak of -- Chris and I disagreed on what was necessary to beat Google -- but I certainly am more critical of MSN Search's missed opportunities because of that history.

Update: Erik Selberg (creator of Metacrawler, now at MSN Search) takes issue with those who would criticize Microsoft's progress, saying "Well, what did anyone really expect?" and "It's not realistic to think that it can be done quickly." He also has some thoughts on the problems at Yahoo and upcoming stagnation at Google. Definitely worth reading his point of view.

Update: Coming full circle, Danny Sullivan comments on Erik's post.

Update: And a follow up post from Erik. Erik says, "Microsoft might beat Google. And Google might beat Microsoft .... Google is pressing ahead, and they've got a big lead ... Unless they do something monumentally stupid, which I doubt, it'll be a long, tough challenge to catch and beat them." He also defends the decision to move to the Live brand. Again, worth reading.

Update: There is also some discussion with Erik and others in the comments for this post.

Update: See also my follow-up post, "MSN Search and beating Google", that includes some good thoughts from Dare Obasanjo.

Update: A couple weeks later, Saul Hansell at the NYT writes:
There is a lot about the way Microsoft has run its Internet business that Steve Berkowitz wants to change. But he is finding that redirecting such a behemoth is slow going.

The pressure is on for Mr. Berkowitz to gain control of Microsoft's online unit, which by most measures has drifted dangerously off course.

Over the last year, its online properties have lost users in the United States. The billions of dollars the company has spent building its own search engine have yet to pay off. And amid a booming Internet market, Microsoft's online unit is losing money.

Google, meanwhile, is growing, prospering, and moving increasingly onto Microsoft's turf.

Microsoft lost its way, Mr. Berkowitz says.
The article goes on to show Microsoft's steep drop in market share, talk about brand confusion between MSN and Live, and discuss how far behind MSN Search appears to be in relevance.

Update: Two months later, Danny Sullivan reports that Microsoft's search market share is continuing to decline.

Tuesday, November 21, 2006

Innovation and learning to love destruction

Marissa Mayer says that Google operates "like small companies inside the large company" and feels "a lot like managing a VC firm."

Jeff Bezos "encourages experimentation ... as much of it as possible" in order to "maximize invention per unit of time", "invent as many things per day per week as you can manage", and get "faster innovation".

Innovation and experimentation, that is seen as the way to get ahead. Build, create, innovate.

For this strategy to work well, companies cannot only be quick to create. They need to be quick to destroy. If something does not work, the company needs to move on quickly. Failures need to be acknowledged, all possible learning extracted, and then the product should be eliminated.

This is not what happens. Instead, unsuccessful products are left up on the site to rot. Failed experiments become useless distractions, confusing customers who are trying to dig through the options to find what they need and frustrating any customer foolish enough to try them with the obvious lack of support.

Amazon.com, for example, has 63 links on their "all product categories" page, a confusing mess that paralyzes anyone looking for a book or DVD with irrelevant and useless choices. Why do all these continue to exist? Why do Auctions and zShops hang around for years after they failed to attract an audience? Why do detail pages accumulate more and more "exciting new features" until I cannot find the customer reviews anymore under the sea of crap?

Google has 36 products and another 20 in Google Labs. It is enough that an exasperated Sergey Brin said, "I was getting lost in the sheer volume of the products that we were releasing." Admitting that "myriad product releases were confusing their users", Google is pushing its teams to develop "features, not products."

Yahoo has so many services I cannot even count them, let alone find what I want. As a now infamous internal memo pointed out, many of these products overlap with each other, perform poorly, or both. The memo pleaded for the company to find focus, asking Yahoo's management to "definitively declare what we are and what we are not."

Innovation is the process of creative destruction. Improved products destroy the failed products. Innovation is a churning cauldron of life and death.

Google and Amazon claim to be like VC firms, creating little startups within their company, but they lack the process of destruction. At these companies, old products live forever. Failures become zombies, surviving with skeleton teams and few resources, but still managing to distract the company while confusing users.

Old products never die, but they should. To innovate, it is not enough to love creation. We must also love destruction.

Monday, November 20, 2006

Creating a smart Google

Jeffrey O'Brien at Fortune writes about "The race to create a 'smart' Google". Some excerpts:
Recommender systems ... are sprouting on the Web like mushrooms after a hard rain. Dozens of companies have unveiled recommenders recently to introduce consumers to Web sites, TV shows, other people - whatever they can think of.

The company that can decipher all that information ... will pinpoint your tastes and determine the likelihood that you'll buy any given product. In effect, it will have constructed the algorithm that is you.

There's a sense among the players in the recommendation business ... that now is the time to perfect such an algorithm.

The Web, they say, is leaving the era of search and entering one of discovery. What's the difference? Search is what you do when you're looking for something. Discovery is when something wonderful that you didn't know existed, or didn't know how to ask for, finds you.
I couldn't have said it better myself. You cannot search for something if you don't know it exists. Discovery helps surface interesting gems without any effort, without any explicit search, from a sea of information.

There is a good quote in the article from John Riedl, someone who has been working on recommender systems longer than just about anyone:
"The effect of recommender systems will be one of the most important changes in the next decade," says ... professor John Riedl ... "The social web is going to be driven by these systems."
A good friend, Brent Smith, is also quoted:
Amazon realized early on how powerful a recommender system could be and to this day remains the prime example. The company ... [compares] your purchasing patterns with everyone else's and thus narrow a vast inventory to just the stuff it predicts you'll buy.

"Personalized recommendations," says Brent Smith, Amazon's director of personalization, "are at the heart of why online shopping offers so much promise."
The article does focus on promise, taking a negative tone toward well established, lucrative systems at companies like Amazon and Netflix but giving startups, some of which have little more than vaporware, the benefit of the doubt.

It is a little unfortunate. The article leads with a sensationalistic title -- that Google sure ain't that smart, heh, heh, snark, snark -- but then fails to show anything that clearly represents progress toward a smarter Google. In fact, after name dropping Udi Manber and Peter Norvig, the article even holds up Google as the likely leader in the race to build a smarter Google.

But, overall, I agree that recommender systems are growing in importance, especially in terms of application to web search, advertisements, and video, and that future recommendation systems will be even more lucrative than they are now. As a good colleague of mine was fond of saying, "The future will be personalized."

Update: Mike at TechDirt doesn't like the hype either, and then goes a step further by slamming all recommender systems as "far from useful", "exceptionally limited", and "littered with failures". While I think it is going too far to condemn all recommender systems -- I am not sure Mike is aware of how much money personalization features generate for Amazon.com, for example -- his post is good for a contrarian view.

Saturday, November 18, 2006

Yahoo "peanut butter" memo

Paul Kedrosky posts a brutally critical leaked internal memo from Yahoo. Don't miss it.

Some selected excerpts:
We lack a focused, cohesive vision for our company .... We lack clarity of ownership and accountability.

We end up with competing (or redundant) initiatives ...
  • YME vs. Musicmatch
  • Flickr vs. Photos
  • YMG video vs. Search video
  • del.icio.us vs. myweb
  • Messenger and plug-ins vs. Sidebar and widgets
  • Social media vs. 360 and Groups
  • Front page vs. YMG
  • Global strategy from BUs vs. Global strategy from Int'l
We have lost our passion to win. Far too many employees are "phoning" it in.

We need to boldly and definitively declare what we are and what we are not .... Focus the vision .... Restore accountability .... Blow up the matrix .... Kill the redundancies ... [Stop] competing against each other.

Change is needed and it is needed soon. We can be a stronger and faster company.
See also my earlier post, "Yahoo's troubles", where I said, "The business is advertising ... To fail to compete on advertising is to fail."

Update: Dare Obasanjo points out:
Yahoo! executives are contemplating firing one in five Yahoo! employees ... Layoffs are a demoralizing affair and often don't eliminate the right people especially since the really smart people know to desert a sinking ship instead of hanging around to see if they draw the short straw.
Google is a mere 5.8 miles down the road. It may be hard for Yahoo to retain their best at this time of instability.

Tuesday, November 14, 2006

Excellent data mining lecture notes

I have been reading and enjoying the slides from the Stanford CS Data Mining class being taught by Anand Rajaraman and Jeff Ullman.

The talk on recommender systems (PDF) was particularly interesting, with a thorough and insightful look at different techniques (e.g. collaborative filtering, item-to-item, content-based, clustering) for personalization and recommendations. Note that one of the options for the class project is working on the Netflix contest.

The talks on association rules (PDF #1, PDF #2) were fun with some clever applications discussed (e.g. detecting plagiarism) and nice optimizations (e.g. sampling the data set at the limit of main memory multiple times to determine which data can be ignored in a full run).

The clustering talks are also worthwhile, focused on handling very large data sets and clearly explained. Finally, if you are working on web search (or are an evil SEO), it is worth reviewing the talks on page rank and web spam.

Looks like a great class. Impressive that this is all being covered at the undergraduate level.

Ruthless enough for a startup?

I have been reading about how several successful startups -- Facebook, MySpace, BitTorrent, YouTube, Skype, and HotOrNot -- fueled the early growth that led to their success.

In all the cases, these startups did things that I probably would not have been willing to do. It makes me wonder how ruthless you have to be to have a successful startup.

Facebook, for example, "had access to the e-mail addresses of Harvard students" and "blasted e-mails to Harvard students to let people know about the site." The site, which allows people to list information about themselves and meet other students, largely seems driven by social interaction, dating, and self-promotion.

Similarly, MySpace "had a database of ~100M e-mail addresses" which they spammed to announce their launch. MySpace is broader than Facebook, but also largely seems driven by social interaction, dating, and self-promotion.

BitTorrent, a P2P filesharing network, launched after "[Bram] Cohen collected a batch of free porn and used it to lure beta testers". The site soon collected "long lists of pirated content" including full length movies and pornographic material.

HotOrNot started as a way to "settle an argument" and soon turned into a popular and lucrative website that is "serving some basic human psychological needs around social validation, ego, and voyeurism" and allows people to "enjoy the voyeuristic aspect of checking people out."

The YouTube founders, when they first launched, "figured the best thing would do would be to get hot chicks involved". Later, they implemented a feature that allowed "one-click emailing to spam a friend about a video." I suspect it is also true that YouTube succeeded where many other video startups failed largely by being less vigilant about purging copyright content and soft porn, all easy to find on the site.

Skype's founders originally created Kazaa, a filesharing network that encouraged illegal trading of copyright content. Skype is a clever iteration from Kazaa. It follows a similar theme of giving away stuff that used to cost money for free, but this time it is legal. Like Kazaa, it uses other people's resources (especially those who are blessed with the privilege of being a supernode) to provide the service.

There seem to be some dismal lessons in these stories. It appears the ideal startup will give away something that used to cost money for free (preferably copyright material and porn), use other people's content and resources, appeal to the baser human instincts (especially vanity and sex), and spam massive e-mail lists at launch.

And this makes me wonder, am I ruthless enough? In Findory Video, for example, the system tries to automatically filter the soft porn that appears quite popular on both YouTube and Google Video. Is that a mistake? Findory has never spammed anyone. Findory keeps well within fair use for copyright material. Findory directs traffic to content providers to help them earn revenue from their work. Are those mistakes?

Is ruthlessness the key to success for Web 2.0 startups? Are you ruthless enough to succeed in the same way these others have done?

Update: Yahoo just acquired Bix, a website that features "hot or not and other contests" and "launched barely three months ago". Yet another example.

Sunday, November 12, 2006

AI and "Web 3.0"

When I saw John Markoff's article, "Entrepreneurs See a Web Guided by Common Sense", on the front page of the NYT today, I did not know whether to feel excited or dismayed.

On the one hand, the distant goals of many working on information retrieval were nicely laid out in the article:
Computer scientists and a growing collection of start-up companies are finding new ways to mine human intelligence. Their goal is to add a layer of meaning on top of the existing Web that would make it less of a catalog and more of a guide -- and even provide the foundation for systems that can reason in a human fashion.

In the future, more powerful systems could act as personal advisers in areas as diverse as financial planning, with an intelligent system mapping out a retirement plan for a couple, for instance, or educational consulting, with the Web helping a high school student identify the right college.

The Holy Grail ... is to build a system that can give a reasonable and complete response to a simple question like: "I'm looking for a warm place to vacation and I have a budget of $3,000. Oh, and I have an 11-year-old child."
But then, the article discredits this vision by attaching it to the buzzword "Web 3.0". Readers easily could ignore the caveats in the article, see the absurd claims that Flickr and Digg represent substantial progress in AI, and then come away with the impression that intelligent web applications are less than decades away.

Overpromising and underdelivering caused much disenchantment with artificial intelligence in the 1970's and 1980's. It would be a shame to see it happen again.

While I subscribe to the vision and goals laid out, I want to emphasize the words of the skeptics. From the article:
Artificial intelligence, with machines doing the thinking instead of simply following commands, has eluded researchers for more than half a century.

Referred to as Web 3.0, the effort is in its infancy, and the very idea has given rise to skeptics who have called it an unobtainable vision.

Researchers and entrepreneurs say that while it is unlikely that there will be complete artificial-intelligence systems any time soon, if ever.
It is true that we are building more intelligent Web applications. Some of these systems do simple learning and adaptation using the behavior of their users. For example, Amazon.com's website adapts to the interests of each shopper and improves the more it is used.

But it is a long way from this to the Holy Grail. These early applications work from detecting patterns in data. They have no understanding of language. They cannot reason about user goals. They have no base of knowledge that would allow them to make common sense connections.

There is no way in which these early systems can take a goal like "Plan for me a warm vacation appropriate for my 11 year old" and reason about it like a travel agent would. Building that application, while a noble and worthy challenge, is at least decades off.

AI researchers, do not overpromise and underdeliver again. Cut out the "Web 3.0" hype. Let's be realistic. Even without the chimerical Holy Grail of AI, we can help people find and discover what they need.

Saturday, November 11, 2006

Andrei Broder talk on information supply

Yahoo VP and Research Fellow Andrei Broder is giving an IEEE talk, "The next generation Web Search: From Information Retrieval to Information Supply" on Nov 16 at Stanford University.

The idea of "information supply" is very close to the idea that I spend my time working on and advocating, information personalization and recommendations. From the abstract for Andrei's talk:
The goal of Web IR will widen to include the supply of relevant information from multiple sources without requiring the user to make an explicit query.

[We can] supply relevant information specific to a given activity and a given user, while the activity is being performed. A prime example is the matching of ads to content being read, however the information supply paradigm is starting to appear in other contexts such as social networks, e-commerce, browsers, e-mail, and others.
A Yahoo Research page, "Toward the Next Generation of Search", elaborates on Andrei's thoughts on personalization of information, saying:
Andrei Broder ... foresees pushing search toward information supply: serving up answers to users' questions that they haven't even typed in a search box.

Worry not, users; this isn't mind reading. But with statistical analysis of people's surfing habits and creative algorithms, we ... hope to figure out users' intents and understand their context, so we can supply them with useful information.

It could get displayed in a variety of ways, such as recommended links or intelligent, personalized footnotes dynamically served up on the bottom of a Web page.

"It's a little bit of the 'push' paradigm," Broder says, but he says the way the information is presented to users is key, so that it is unobtrusive but useful.
Andrei has given what appear to be similar versions of this talk over the last year. For example, here are slides (PDF) from a May 2006 talk titled "From query based Information Retrieval to context driven Information Supply". I wrote about that talk back in June 2006.

See also my posts ([1] [2] [3]) about related work by Susan Dumais and Eric Horvitz at Microsoft Research.

Thursday, November 09, 2006

Hadoop on Amazon EC2

Hadoop, an open source clone of Google FS and MapReduce, can be run on top of Amazon EC2, a hosting service that allows leasing servers on an hourly basis.

The details of setting this up are available at the node "AmazonEC2" on the Lucene-Hadoop Wiki.

When looking for more about this, I noticed that the hyped-but-not-launched natural language search engine Powerset appears to be leading the charge on using Hadoop on EC2. From the Hadoop mailing list:
From: Gian Lorenzo Thione
Date: Fri, 25 Aug 2006 23:04:16 GMT

At Powerset we have used EC2 and Hadoop with a large number of nodes, successfully running Map/Reduce computations and HDFS. Pretty much like you describe, we use HDFS for intermediate results and caching, and periodically extract data to our local network. We are not really using S3 at the moment for persistent storage.

A nice feature of Hadoop as measured against our use of EC2 has been the capability of fluidly changing the number of instances that are part of the cluster. Our instances are set up to join the cluster and the DFS as soon as they are activated and when - for any reason - we lose those machines, the overall process doesn't suffer. We have been quite happy with this, even at significant number of instances.
That is an interesting detail on the recent announcement that Powerset is a heavy user of Amazon's EC2.

I am not sure I have an immediate use for Hadoop on EC2, but it is nice to see. Developers may now be able to rapidly bring up hundreds of servers, run a massive parallel computation on them using Hadoop's MapReduce implementation, and then shut down all the instances, all with low effort and at low cost. Very cool.
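
For anyone unfamiliar with the programming model, here is a minimal single-process sketch of the map, shuffle, and reduce phases using the usual word count example. It does not use Hadoop's API and says nothing about how Powerset runs its jobs; the function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for each word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in order over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is why being able to add and drop instances freely matters.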

[Wiki node found via John Krystynak]

Update: Eight months later, Tom White posts a tutorial, "Running Hadoop MapReduce on Amazon EC2 and Amazon S3". [Found via Todd Huff]

Marissa Mayer at Web 2.0

Google VP Marissa Mayer just spoke at the Web 2.0 Conference and offered tidbits on what Google has learned about speed, the user experience, and user satisfaction.

Marissa started with a story about a user test they did. They asked a group of Google searchers how many search results they wanted to see. Users asked for more, more than the ten results Google normally shows. More is more, they said.

So, Marissa ran an experiment where Google increased the number of search results to thirty. Traffic and revenue from Google searchers in the experimental group dropped by 20%.

Ouch. Why? Why, when users had asked for this, did they seem to hate it?

After a bit of looking, Marissa explained that they found an uncontrolled variable. The page with 10 results took .4 seconds to generate. The page with 30 results took .9 seconds.

Half a second delay caused a 20% drop in traffic. Half a second delay killed user satisfaction.

This conclusion may be surprising -- people notice a half second delay? -- but we had a similar experience at Amazon.com. In A/B tests, we tried delaying the page in increments of 100 milliseconds and found that even very small delays would result in substantial and costly drops in revenue.
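
A delay experiment of this kind can be sketched in a few lines. This is only an illustration under my own assumptions, not a description of any production system: the arm list, the experiment name, and both function names are hypothetical. Each user is hashed into one delay arm so they see a consistent experience, and revenue is then averaged per arm.

```python
import hashlib

DELAYS_MS = [0, 100, 200, 300]  # hypothetical treatment arms, in milliseconds


def assign_delay_ms(user_id, experiment="page-delay"):
    """Deterministically map a user to one arm so repeat visits get the same delay."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return DELAYS_MS[int(digest, 16) % len(DELAYS_MS)]


def revenue_per_visit(visits):
    """visits: iterable of (delay_ms, revenue). Returns mean revenue for each arm."""
    totals = {}
    for delay_ms, revenue in visits:
        count, total = totals.get(delay_ms, (0, 0.0))
        totals[delay_ms] = (count + 1, total + revenue)
    return {delay: total / count for delay, (count, total) in totals.items()}
```

Because latency is the only thing that differs between arms, this setup avoids the uncontrolled variable that confounded the thirty-results test.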

Being fast really matters. As Marissa said in her talk, "Users really respond to speed."

Marissa went on to describe how they rolled out a new version of Google Maps that was lighter (in page size) and rendered much faster. Google Maps immediately saw substantial boosts in traffic and usage.

The lesson, Marissa said, is that speed matters. People do not like to wait. Do not make them.

Tim O'Reilly on harnessing collective intelligence

Tim O'Reilly just ran a panel here at the Web 2.0 Conference on "Disruption: Harnessing the Collective Intelligence".

Tim's introduction to the panel reminded me of his speech six months ago at UC Berkeley where he said:
A true Web 2.0 application is one that gets better the more people use it ... The real heart of Web 2.0 is harnessing collective intelligence.

[In] the world of Web 2.0 ... we share our knowledge and insights, filter the news for each other, find out obscure facts, and make each other smarter and more responsive.
At the time, after reading Tim's speech, I wrote:
I like this new definition of Web 2.0, "harnessing collective intelligence." I like the idea we are building on the expertise and information of the vast community of the Web. I like the idea that web applications should automatically learn, adapt, and improve based on needs.

I also like the idea that "Web 2.0" should include many companies that people were trying to classify as "Web 1.0". Amazon.com, with its customer reviews and personalized pages, clearly is harnessing the collective wisdom of Amazon shoppers. Google also is constantly improving based on the behavior of searchers.

Web 2.0 applications get better and better the more people use them. Web 2.0 applications learn from the behavior of their users. Web 2.0 applications harness collective intelligence.
The definition of Web 2.0 remains vague in most people's minds. Some describe it as a new dot com boom. Some say it is about tagging or social networks. Some say it is about fancy AJAX widgets.

I think Tim has helped clarify it with his focus on harnessing collective intelligence. If the application does not improve from the contributions and knowledge of its users, if it does not get better and better as more people use it, it is not a Web 2.0 application.

Update: In response to a question about what Tim meant by "intelligence", Tim cited Sturgeon's Law, which he paraphrased as "95% of everything is crap". He said intelligence is surfacing the 5% that is not crap to the right people at the right time.

Update: Tim posted some additional thoughts.

Marten Mickos at Web 2.0

MySQL CEO Marten Mickos just gave a talk at the Web 2.0 Conference on "The Great Database in the Sky".

The idea is that "structured data should be open sourced", linked, and easily accessible. The idea is to do something like Google does for unstructured data (web documents) for structured data (database records).

This is not a new idea. People usually talk about this as querying heterogeneous distributed databases. The trick is matching up disparate data definitions and smoothing over bad data. And that is quite a trick.

One technique is to require people to publish their data in some format that is easier to merge and process, but that requires all databases to cooperate. Another technique is to wrap databases with some translation layer, but that requires custom (and often fragile) wrappers for each database.
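
The wrapper approach can be sketched as follows. This is a toy illustration and not anything Marten described: the two source schemas, the field names, and the functions are all hypothetical. Each wrapper translates one source's rows into a shared record shape, and the query runs only against that shared shape.

```python
def wrap_store_a(row):
    """Hypothetical source A: name under 'title', price in cents under 'price_cents'."""
    return {"name": row["title"].strip(), "price_usd": row["price_cents"] / 100}


def wrap_store_b(row):
    """Hypothetical source B: name under 'product', price as a string like '$12.50'."""
    return {"name": row["product"].strip(), "price_usd": float(row["cost"].lstrip("$"))}


def merged_query(sources, max_price_usd):
    """sources: list of (wrapper, rows). Returns matching records, cheapest first."""
    results = []
    for wrapper, rows in sources:
        for row in rows:
            record = wrapper(row)  # translate into the common schema
            if record["price_usd"] <= max_price_usd:
                results.append(record)
    return sorted(results, key=lambda record: record["price_usd"])
```

The fragility is easy to see even here: if source B starts quoting prices in another currency or renames a field, its wrapper silently breaks or raises, and every source needs its own hand-written wrapper.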

It's an interesting problem. I think there are good examples of doing some of this for specific domains -- metashopping searches like Shopzilla,, and, for example -- but Marten has said that MySQL will be leading a much broader push.

Wednesday, November 08, 2006

Jim Lanzone and Steve Berkowitz at Web 2.0

Ask.com CEO Jim Lanzone and Microsoft SVP Steve Berkowitz had an interview with John Battelle a couple hours ago here at the Web 2.0 Conference on "Beating Google at Their Own Game".

The most interesting part to me was near the end when there was a question from the audience asking for their thoughts on personalized search.

Both Jim's and Steve's answers struck me as odd. Steve entirely focused on privacy issues. He argued for giving users detailed and complete control of their data. Steve claimed this was being customer-focused, but I felt he was focusing on entirely the wrong customer. Most customers do not want to spend time twiddling configuration settings for their data; they just want to find what they need. Customer-focus for personalized search should mean helping people find and discover the information they need.

Jim also had an unusual focus, saying that "users don't customize", "users are lazy", and "the majority of people won't do it." The questioner followed up at this point, asking about implicit personalization of search, which works from behavior and requires no effort. Both Jim and Steve indicated that they thought this was a good idea, but offered nothing more.

It is surprising to me that Jim and Steve seem to have not thought much about personalized search. I was expecting to hear something deeper from them on this topic.

Personalized search is a potential way to beat Google. Paying attention to what each searcher has done allows individualized relevance ranks and should yield more relevant search results. Whoever can crack this nut could win the search war.
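
As a rough sketch of what implicit personalization could look like, here is one simple way to re-rank results using nothing but past clicks. It is an assumption-laden toy, not how any search engine does it: the topic labels, the `weight` parameter, and the `rerank` function are all hypothetical.

```python
from collections import Counter


def rerank(results, clicked_topics, weight=0.5):
    """Re-rank results by boosting topics the searcher has clicked on before.

    results: list of (url, base_score, topic) from the unpersonalized ranker.
    clicked_topics: topics of this searcher's past clicks (may be empty).
    """
    history = Counter(clicked_topics)
    total = sum(history.values()) or 1  # avoid dividing by zero with no history

    def personalized(item):
        _, base_score, topic = item
        # Boost is the share of past clicks on this topic, scaled by weight.
        return base_score + weight * history[topic] / total

    return [url for url, _, _ in sorted(results, key=personalized, reverse=True)]
```

With no click history the order is unchanged, and the searcher never has to configure anything, which is the point the questioner was making about implicit personalization.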

Update: There is some broader coverage of this talk and Ray Ozzie's talk, all put in good context, by David Needle at InternetNews.

Update: Another article, this one by Dan Farber at ZDNet, with broader coverage of this talk. [via John Battelle]