Friday, March 31, 2006

The audience you haven't reached yet

Michael Calore at Wired says:
There's still a fundamental disconnect between people who use the web and people who use the web 2.0.

The perceived importance and revolutionary aims of most web 2.0 apps are lost on the vast majority of web surfers. That doesn't make those people less smart, less hip, or less important than your more savvy users, it just makes them an audience you haven't reached yet.

Most people in web 1.0 land (a.k.a. the "Rest of the World") would expect your site to behave just like familiar destinations like Ebay and Amazon. People respond to familiar interfaces and functions. Throwing them a curveball will most likely result in confusion.
Excellent advice.

See also my previous post, "The mainstream and saving people time".

Removing registration and Topix.net traffic

Blake at Topix.net reports that removing mandatory registration in their forums substantially increased posting frequency and traffic.

Surprisingly, their "post kill-rate", apparently a measure of spam, dropped by a factor of two as well.

From Blake's post:
Back on December 12th, we released a site redesign that included user forums on each of our news pages .... One month after launch, we were still under 200 posts a day.

Could we take the registration down? Of course the volume would go up, but what would happen to the quality of the posts? ... Was this going to double or triple the amount of spam and profanity we needed to parse through? Would an army of trolls invade and set up a siege?

Since removing registration, our volume has exploded and just this morning we just passed a quarter-of-a-million aggregate posts on our system.

And the quality of posts? To our surprise, our post kill-rate has actually dropped -- hovering below 2%. This is less than half of the number incurred when registration was in place.

We think it's the "Ni-chan paradox" .... Registration keeps out good posters ... Registration lets in bad posters ... Registration attracts trolls ... Anonymity counters vanity.
If the intent is to control spam, requiring registration may do more harm than good.

Slashdot is another great example. Slashdot allows anonymous posts in their forums. To control spam and surface good posts, Slashdot relies on user moderation.
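
As a rough illustration (my own toy code, not Slashdot's actual system), the core idea is simple: let anonymous posts in, but rank and filter them by accumulated moderation points so readers only see posts above a score threshold.

# Toy sketch of score-based moderation, loosely inspired by Slashdot.
# The posts, scores, and threshold below are invented for illustration.

def visible_posts(posts, threshold=1):
    """Return posts whose moderation score meets the reader's threshold."""
    return [p for p in posts if p["score"] >= threshold]

posts = [
    {"author": "anonymous", "score": 3, "text": "Insightful comparison of the two systems."},
    {"author": "anonymous", "score": 0, "text": "First post!!!"},
    {"author": "registered_user", "score": -1, "text": "Buy cheap watches here"},
]

# Anonymous posts are allowed, but a reader browsing at threshold 1 only
# sees posts the community has moderated up.
for post in visible_posts(posts, threshold=1):
    print(post["author"], "--", post["text"])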

It might be possible to extend this lesson to other areas. For example, many newspapers require registration, repelling many visitors.

Mandatory registration at newspapers has been a heated topic of debate. Here is a sample of some of it: [1] [2] [3] [4] [5] [6] [7].

Some newspapers, most recently the Houston Chronicle and Toronto Star, decided to eliminate mandatory registration, citing their desire to improve usability and increase traffic as the major factors in their decision.

Eliminating registration does not mean giving up on targeting advertising and content. Findory, for example, does not require registration, but still helps readers find relevant content and advertising by carefully paying attention to which news stories interested each person.

Saturday, March 25, 2006

I want a big, virtual database

In my previous story about Amazon, I talked about a problem with an Oracle database.

You know, eight years later, I still don't see what I really want from a database. Hot standbys and replication remain the state of the art.

What I want is a robust, high-performance virtual relational database that runs transparently over a cluster, with nodes dropping in and out of service at will, and with read-write replication and data migration all handled automatically.

I want to be able to install a database on a server cloud and use it as if it were all running on one machine.
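
To make the wish a little more concrete, here is a minimal sketch (my own toy code, not any real product) of the kind of transparency I mean: the application talks to one logical store, while a thin routing layer spreads keys and replicas across whatever nodes happen to be in the cluster.

# Toy sketch of a "virtual" store spread over a cluster of nodes.
# Node names and the replication factor are invented; a real system would
# also handle node failures, rebalancing, consistency, and SQL, not just keys.
import hashlib

class VirtualStore:
    def __init__(self, nodes, replicas=2):
        self.nodes = sorted(nodes)                  # live nodes in the cluster
        self.replicas = replicas
        self.data = {node: {} for node in nodes}    # stand-in for remote servers

    def _owners(self, key):
        # Pick replica nodes by hashing the key. Real systems use consistent
        # hashing so little data moves when nodes drop in and out.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def put(self, key, value):
        for node in self._owners(key):
            self.data[node][key] = value            # write to every replica

    def get(self, key):
        for node in self._owners(key):
            if key in self.data[node]:
                return self.data[node][key]         # read from the first replica that has it
        return None

store = VirtualStore(["node1", "node2", "node3"])
store.put("customer:42", {"name": "Alice"})
print(store.get("customer:42"))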

Yes, I realize how hard this is to engineer, especially handling nasty issues like intermittent slowness or failures of nodes, major data inconsistencies after a network partition of the cluster, or real-time performance optimization based on access patterns.

I still want it. And I'm not the only one.

See also my previous posts, "Google's BigTable" and "Clustered databases".

Early Amazon: Oracle down

The load on Amazon's website was remarkable.

As an ex-Amazonian said on one of my earlier posts, "There wasn't anything [Amazon] could buy ... that would handle the situations Amazon was encountering ... it was all ... written in-house out of necessity."

Not everything was written in-house, of course. Amazon needed a database. It used Oracle.

Unfortunately, our use of Oracle was also unusual. The strange usage patterns and high loads we inflicted on our Oracle databases exposed bugs and race conditions they had never seen before.

At one point in early 1998, the database had a prolonged outage, taking the Amazon.com website with it. The database tables were corrupted. Attempts to restore the database failed, apparently tickling the same bug immediately. We were dead in the water.

My DBA skills are even weaker than my sysadmin skills. Though some like Bob impressively leapt at the problem with pure debugging fury, there was little others like myself could do that would be helpful. We sat on the sidelines, watched, and waited.

It was like having a loved one in the hospital. Amazon was down and stumbled every time she tried to rise.

Through the herculean efforts of Amazon's DBAs and the assistance of Oracle's engineers, the problem was debugged. Oracle sent us a patch. The helpless fear lifted. Amazon came back to us at last.

Friday, March 24, 2006

Query log data for ad targeting

A WWW2006 paper out of Microsoft Research, "Finding Advertising Keywords on Web Pages" (PDF), claims that query log data is particularly useful for ad targeting.

Specifically, the researchers extracted from MSN query logs the keywords some people used to find a given page. They tested using that as one of many features for ad targeting. In their results, it was one of the most effective features.

Very interesting. It has always been harder to target ads to content than to search results because intent is much less clear.

By using the query log data in this way, the researchers were effectively using the intent of the searchers that arrived at the page as a proxy for the intent of everyone who arrived at the page.
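
A rough sketch of the idea (my own simplification, not the paper's exact method): aggregate the queries that led searchers to click through to a page, then use the most frequent query terms as extra keywords when matching ads to that page.

# Toy sketch: mine query-log clicks for ad-targeting keywords on a page.
# The click log and ad inventory below are invented for illustration.
from collections import Counter

click_log = [
    ("digital camera reviews", "example.com/cameras"),
    ("best digital camera", "example.com/cameras"),
    ("camera battery life", "example.com/cameras"),
    ("cheap flights", "example.com/travel"),
]

def page_keywords(log, url, top_n=5):
    """Most frequent terms in queries that led searchers to this page."""
    terms = Counter()
    for query, clicked_url in log:
        if clicked_url == url:
            terms.update(query.lower().split())
    return [term for term, _ in terms.most_common(top_n)]

ads = {"camera": "Ad: 20% off digital cameras",
       "flights": "Ad: discount airfare"}

keywords = page_keywords(click_log, "example.com/cameras")
print(keywords)
print([ad for keyword, ad in ads.items() if keyword in keywords])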

See also my previous post, "When AdSense goes bad".

[WWW2006 paper via SearchFind.it and Antonio Gulli]

Search as a matching engine

Steve Smith at OMMA writes about the future of search. Some extended excerpts, focusing on personalization of advertising, news, web search, and mobile search:
In 2006, the focal point for the future of search has shifted dramatically. The search box is going beyond the desktop to evolve into a ubiquitous engine that matches both content and laser-targeted marketing to our desires ...

The longstanding hope of personalized search is starting to bear fruit. For example, Google's News page now recommends stories based on a searcher's click history.

"The writing is on the wall. Behavioral targeting and demographic profiling will be the next layer in search," [MoreVisibility EVP Danielle Leitch] says, pointing to MSN's plan to include both options in AdCenter.

Yahoo hopes to ... tweak and target results so they account for subjective qualities like trustworthiness and personal taste. As search becomes the interface for a wider range of content, especially video, merging it with a recommendation engine may be as important as tagging video.

Consumers want relevant answers, not "search results," adds Pankaj Shah, CEO, 4Info ... "The differentiators will be how personalized [mobile search] is." For example, "answer engines" that send relevant results and learn from user histories ...
Search should be a matching engine. Not everyone agrees on how relevant a particular page is to a particular search. Search should match each person's interests and desires to relevant content. Relevance rank should be specific to each searcher's perception of relevance.
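
As a toy illustration of what a searcher-specific relevance rank could mean (a sketch over invented data, not how any production engine works), imagine blending a global relevance score with how well each result matches an individual's interest profile.

# Toy sketch: personalize a ranked list by blending global relevance
# with a per-user interest profile. All scores and topics are made up.

def personalized_rank(results, user_interests, weight=0.4):
    """Re-rank: (1 - weight) * global score + weight * profile match."""
    def score(result):
        profile_match = user_interests.get(result["topic"], 0.0)
        return (1 - weight) * result["global_score"] + weight * profile_match
    return sorted(results, key=score, reverse=True)

results = [
    {"url": "jaguar-cars.example.com", "topic": "autos", "global_score": 0.9},
    {"url": "jaguar-cats.example.com", "topic": "wildlife", "global_score": 0.8},
]
user_interests = {"wildlife": 1.0, "autos": 0.1}    # learned from past clicks

for result in personalized_rank(results, user_interests):
    print(result["url"])                            # the wildlife page wins for this user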

See also my previous posts, "Perfect Search and the clickstream" and "My 2006 predictions".

Designing for geeks and missing the mainstream

Nicholas Carr posts a fun rant called "The 'Neat!' epidemic" that rips on the focus of Microsoft and others on flashy but complicated features that benefit few. An excerpt:
Microsoft ... Google, Amazon, and Yahoo, not to mention all the Web 2.0 mini businesses, seem intent on waging feature wars that mean a whole lot to a very few and nothing at all to everyone else.

At this point, the whole tired affair seems to point not to an overabundance of creativity but to a lack of imagination.
This reminds me of what I said in an earlier post comparing feature bloat in Microsoft Word with MSN's strategy for web search:
What happened to Microsoft Word as features were added for convenience?

It became a complicated mess, so feature rich that even a technogeek like me doesn't know or understand all the features. When I use MS Word, I spend most of my effort ignoring its features so I can get work done. The effort required to exploit its power exceeds the value received.
See also my previous posts, "Tyranny of choice and the long tail" and "The mainstream and saving people time".

The revenue model for Google WiFi

It might have been stupid of me, but I didn't quite grasp the details of how Google was going to pay for its free WiFi offerings until I saw this article by Kevin Newcomb. I think I understand it now.

The strategy is simple and elegant. Extend the IP geolocation Google already does for AdWords to include geolocation of WiFi hot spots, then target fine-grained localized advertising to the users.

For example, let's say I'm sitting in Dolores Park in San Francisco using Google's free WiFi. Google can recognize which WiFi node I'm using, pinpointing my location down to a couple hundred feet, and target ads on Google services to nearby businesses. If I'm browsing through GMail and getting a little hungry, perhaps one of the ads would be for the popular Dolores Park Cafe. If I do a Google search for "flowers", one of the ads might be for buying flowers at the nearby Pay N Save Grocery.
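
A back-of-the-envelope sketch of that pipeline (all access point names, advertisers, and coordinates here are invented): map the WiFi node to a location, then prefer advertisers matching the query within a short radius.

# Toy sketch: target ads to nearby businesses based on the WiFi node in use.
import math

ap_locations = {"ap-dolores-park-3": (37.7596, -122.4269)}   # invented AP id

advertisers = [
    {"name": "Dolores Park Cafe",  "lat": 37.7583, "lon": -122.4267, "keyword": "coffee"},
    {"name": "Pay N Save Grocery", "lat": 37.7512, "lon": -122.4180, "keyword": "flowers"},
    {"name": "Airport Flowers",    "lat": 37.6213, "lon": -122.3790, "keyword": "flowers"},
]

def distance_km(lat1, lon1, lat2, lon2):
    # Rough equirectangular approximation; fine at city scale.
    km_per_degree_lon = 111.32 * math.cos(math.radians(lat1))
    return math.hypot((lat2 - lat1) * 111.32, (lon2 - lon1) * km_per_degree_lon)

def local_ads(access_point, query_keyword, radius_km=2.0):
    lat, lon = ap_locations[access_point]
    return [a["name"] for a in advertisers
            if a["keyword"] == query_keyword
            and distance_km(lat, lon, a["lat"], a["lon"]) <= radius_km]

print(local_ads("ap-dolores-park-3", "flowers"))   # only the nearby flower seller shows up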

The ads don't have to be in-your-face popups. Instead, existing advertising on Google services would be better, more targeted, more useful, and more interesting. Clever.

If I were Google, I'd start doing this immediately, before I even have local advertisers lined up. I'd experiment with pulling popular, highly rated, useful business listings out of Google Local and placing them instead of (or in addition to) one of the AdWords ads. After gathering the metrics, making the case to local advertisers should be straightforward: here are the numbers, here's the value you're missing.

[Found on Findory]

Update: The Google SF WiFi proposal (PDF) does hint at this approach, though it focuses on local advertising using geolocation for a start page (when you first join the network) and is vague about whether advertising across Google would be impacted.

Thursday, March 23, 2006

Early Amazon: 1996 holiday party

Right before I started at Amazon in Feb 1997, I managed to catch Amazon's 1996 holiday party.

The party had a backyard feel to it. We were drinking beers in the scruffy Seattle warehouse. A couple kegs sat in a corner. There was a band made up of warehouse employees. It was deeply Seattle, a bit of mellow fun.

I had just joined Amazon after leaving a repugnant Eastside company. This was exactly what I wanted. A little online bookstore, full of promise and excitement. I felt like I was home.

At the time, I had no idea that Amazon would get so big so fast. The growth was remarkable. But, even now, I look back at those first days with a deep fondness.

Feed readers and diminishing returns

Yahoo's Jeremy Zawodny is one of many early adopter geeks now feeling overloaded by their RSS reader. Jeremy says:
I'd simply read [feeds] until I was done reading. I had to make sure all those folders and feeds were un-bolded before I was "done."

And every once in a while, the periodic update would run before I finished, and I'd end up with even more work to do. The horror!

[I] realized that it's a classic example of diminishing returns. Since I almost always read that stuff in my own "priority order" I get the most bang for my time in the first 15 minutes or so.

Someday someone will pull all this ranking, customization, personalization, recommendation, and other magic technology together and give me a great reason to throw out my RSS Aggregator once and for all.
This reminds me of what I said back in Dec 2004 during the early days of Findory:
The problem with current web feed readers is that they don't solve the information overload problem.

Sure, I can pick and choose which RSS feeds I subscribe to. But, once you have tens of subscribed feeds, reading them becomes this cumbersome process. Click on a feed, skim the articles. Anything interesting in that one? No. Click, skim. Click, skim. Click, skim. Ugh.

Current RSS readers merely reformat XML for display. That isn't enough. They need to filter and prioritize. Show me what matters. Help me find what I need.
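In code, "filter and prioritize" might look something like this toy sketch (my own illustration, not Findory's algorithm): score each incoming article against what the reader has clicked on before and surface only the best few.

# Toy sketch of prioritizing feed items by reader interest instead of
# showing every item in every feed. Articles and click history are invented.
from collections import Counter

click_history = ["personalization", "search", "recommendations", "search"]
interest = Counter(click_history)            # crude interest profile

articles = [
    {"title": "New search ranking tricks",       "tags": ["search"]},
    {"title": "Celebrity gossip roundup",        "tags": ["celebrities"]},
    {"title": "Recommender systems in practice", "tags": ["recommendations", "personalization"]},
]

def prioritize(articles, interest, top_n=2):
    def score(article):
        return sum(interest.get(tag, 0) for tag in article["tags"])
    return sorted(articles, key=score, reverse=True)[:top_n]

for article in prioritize(articles, interest):
    print(article["title"])                  # the gossip item never surfaces
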
See also my earlier post, "A relevance rank for news and weblogs".

Wednesday, March 22, 2006

Early Amazon: Similarities

Amazon is well known for their fun feature "Customers who bought this also bought". It is a great way to discover related books.

Internally, that feature is called similarities. Using the feature repeatedly, hopping from detail page to detail page, is called similarity surfing.

A very sharp and experienced developer named Eric wrote the first version of similarities that made it out to the Amazon website. It was great working with Eric. I learned much from him over the years.

The first version of similarities was quite popular. But it had a problem, the Harry Potter problem.

Oh, yes, Harry Potter. Harry Potter is a runaway bestseller. Kids buy it. Adults buy it. Everyone buys it.

So, take a book, any book. If you look at all the customers who bought that book, then look at what other books they bought, rest assured, most of them have bought Harry Potter.

This kind of similarity is not very useful. If I'm looking at the book "The Psychology of Computer Programming", telling me that customers are also interested in Harry Potter is not helpful. Recommending "Peopleware" and "The Mythical Man Month", that is pretty helpful.

Solving this problem is not as easy as it might appear. Some of the more obvious solutions create other problems, some of which are more serious than the original.
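
I won't describe what we actually did, but one standard way to damp the bestseller effect (a toy sketch of a common normalization trick, not the algorithm that shipped at Amazon) is to stop ranking related items by raw co-purchase counts and instead discount each candidate by its overall popularity, so an item only scores well if it is bought with this book more often than its popularity alone would predict.

# Toy sketch: co-purchase similarity normalized by overall popularity,
# so runaway bestsellers stop dominating every item's related list.
# The purchase baskets below are invented for illustration.

baskets = [
    {"psychology_of_programming", "peopleware", "harry_potter"},
    {"psychology_of_programming", "mythical_man_month", "harry_potter"},
    {"harry_potter", "cookbook"},
    {"harry_potter", "gardening"},
    {"peopleware", "mythical_man_month", "harry_potter"},
]

def similar_items(item, baskets):
    item_baskets = [b for b in baskets if item in b]
    scores = {}
    for other in {i for b in baskets for i in b} - {item}:
        together = sum(1 for b in item_baskets if other in b)
        popularity = sum(1 for b in baskets if other in b)
        if together:
            # Fraction of co-purchases, discounted by the candidate's overall
            # popularity; raw co-purchase counts alone would rank Harry Potter first.
            scores[other] = (together / len(item_baskets)) / popularity
    return sorted(scores, key=scores.get, reverse=True)

print(similar_items("psychology_of_programming", baskets))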

After much experimentation, I discovered a new version of similarities that worked quite well. The similarities were non-obvious, helpful, and useful. Heck, while I was at it, I threw in some performance improvements as well. Very fun stuff.

When this new version of similarities hit the website, Jeff Bezos walked into my office and literally bowed before me. On his knees, he chanted, "I am not worthy, I am not worthy."

I didn't know what to say then, and I don't know what to say now. But that memory will stick with me forever.

Google Page Creator fills with crap

Philipp Lenssen reports that, just a month after launch, Google Page Creator has filled with spam and crap.

See also my previous post, "Google Base and getting the crap out".

[via Nathan and John]

Tuesday, March 21, 2006

A guide to personalized news sites

Mark Glaser at PBS MediaShift posts about websites that try to make "the perfect news page with everything you want to know."

While Mark says that "no one site has totally perfected it quite yet", his post is a guide to many sites that are trying. He includes reviews of My Yahoo, Google Personalized News, Topix.net, Netvibes, Gixo, and Findory.

Mark is right that the idea of a "Daily Me", a newspaper personalized to your tastes, has been around for more than a decade. Yet all these years later, we are still striving to make the technology meet the vision.

Monday, March 20, 2006

Google Finance and personalization

John Battelle reports on the launch of Google Finance, a stock and mutual fund information website with many of the features of Yahoo Finance.

I noticed that, in addition to the normal My Yahoo-like customization feature of being able to list a portfolio of stocks and funds, Google Finance appears to have a few nifty personalization features.

The stock quotes page (e.g. AMZN) has a section with "related companies", recommendations of other stocks that might be of interest. It is also interesting to note that both blog and mainstream news stories about a stock are included on this page.

The main page, finance.google.com, appears to change and adapt based on your history of viewing stock quotes. Specifically, any stock quotes you have recently looked up are listed prominently on this page, along with a combined list of top news headlines for those stocks. Neat-o.

I have used My Yahoo and Yahoo Finance for several years. I have long wondered why they never have done implicit personalization. Looks like, once again, Yahoo has waited long enough that Google beat them to it.

Update: More on Google Finance from Matt Marshall and Danny Sullivan.

Update: AC Narendran and Katie Stanton have the post on the official Google Blog. They call it an "early-stage beta product."

Update: Reaction to Google Finance seems to be ranging from lukewarm to negative. Om Malik says, "Google Finance needs some muscle." Paul Kedrosky says, "All whiz, no bang."

Expectations are high. Google's "early beta product" doesn't have the wow yet.

Update: Jeremy Zawodny expresses frustration at the stagnant state of Yahoo Finance. Worth reading his thoughts.

Steve Yegge on interviews

Googler Steve Yegge has a fun blog rant on technical interviews. Well worth reading.

See also my previous post, "Early Amazon: Interviews".

[via Findory]

When AdSense goes bad

Eva Dominguez at Poynter Online writes about embarrassingly poorly targeted Google AdSense ads on some news stories:
Occasionally AdSense delivers unfortunate results when ads are inappropriate to the news.

Yesterday in Spain, the ombudsman of El Pais wrote of an online reader's complaint concerning the Google ads that ran alongside an article on immigrant trafficking, "The jump from the boat to the pirogue." This story reported how traffickers are increasing the size and power of their boats in order to earn more money.

Unfortunately, the adjacent Google ads touted mortgages for immigrants, European boat rentals and services for canoeists.

Online audiences are accustomed to the irony sometimes caused by automatic ad delivery -- but this case was especially annoying.
Uh, right, a news story on immigrant trafficking should not show ads that might be of interest to immigrant traffickers. Oopsie, wrong target audience there.

So, what ads should be shown?

We should try to show ads that target the type of people who read this kind of news story. The ads do not have to be about the news story, but they need to be likely to be interesting and useful to readers of this news story.

Contextual advertising does not do this. It is too literal. If a page is about immigrant trafficking, the ads will be about immigrant trafficking.

Findory has one approach at solving this problem. The personalized advertising engine at Findory targets AdSense ads not only to the content of the page, but also to what other Findory content has been interesting to the reader.

This allows Findory's advertising both to be deeper -- recognizing that one reader of a story about Yahoo might be interested in search and another reader more interested in advertising tools -- and less likely to be embarrassing -- avoiding ads for fire extinguishers when reading a story about a deadly inferno that killed 7.

Personalized advertising like Findory's is just one approach. We could do other things. Perhaps we might include learning which ads generally work best for articles about "immigrant trafficking", regardless of whether the ads are about that topic. Or, we might try to learn which ads interest the cluster of readers who read this article (or similar articles) on "immigrant trafficking".
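
To make that last idea a bit more concrete, here is a toy sketch (my own illustration, not Findory's engine): score each ad by blending its literal match against the page with how well it has historically performed among readers of similar stories.

# Toy sketch: blend contextual match with the ad's historical performance
# among readers of similar stories. All ads, terms, and numbers are invented.

def score_ad(ad, page_terms, historical_ctr, blend=0.8):
    contextual = 1.0 if ad["keyword"] in page_terms else 0.0
    audience = historical_ctr.get(ad["id"], 0.0) / 0.05    # scale CTR to roughly [0, 1]
    return (1 - blend) * contextual + blend * audience

page_terms = {"boat", "pirogue", "traffickers", "immigrants"}

ads = [
    {"id": "boat_rentals",   "keyword": "boat"},
    {"id": "legal_aid",      "keyword": "lawyer"},
    {"id": "language_class", "keyword": "spanish"},
]

# Click-through rates of each ad on similar stories, learned from past
# impressions by readers of this article cluster (values made up).
historical_ctr = {"boat_rentals": 0.001, "legal_aid": 0.03, "language_class": 0.02}

ranked = sorted(ads, key=lambda ad: score_ad(ad, page_terms, historical_ctr),
                reverse=True)
print([ad["id"] for ad in ranked])   # the boat rental ad drops to last despite matching the text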

Regardless of what path we pick, contextual advertising is not going to be sufficient for news. It works well for search because of the strong intent behind the search keywords. Interest in a news story shows much weaker intent, and we need to expand our search accordingly if we hope to find relevant, useful advertising.

Tuesday, March 14, 2006

Beyond the commons: Personalized web search

I probably started salivating when I bumped into this new Teevan et al. paper, "Beyond the Commons: Investigating the Value of Personalizing Web Search" (PDF).

Not only is it about personalized search, but also the authors include Susan Dumais and Eric Horvitz from Microsoft Research.

The paper describes a user study that showed wide variation for the same query in user intent and in user perception of the most relevant search result. That data is used to motivate personalized search as a way to better capture searcher intent.

Some excerpts:
Our analysis shows that rank and rating were not perfectly correlated.

While Web search engines do a good job of ranking results to maximize their users' global happiness, they do not do a very good job for specific individuals.

If everyone rated the same currently low-ranked documents as highly relevant, effort should be invested in improving the search engine’s algorithm to rank those results more highly, thus making everyone happier. However ... our study demonstrated a great deal of variation in their rating of results.

We found that people rated the same results differently because they had different information goals or intentions associated with the same queries.

Rather than improving the results to a particular query, we can obtain significant boosts by working to improve results to match the intentions behind it.
The paper then discusses providing users with explicit customization or more powerful search tools to explicitly specify their intent, but dismisses that approach as too much work, too difficult for searchers to do accurately, and too unlikely to be helpful:
One solution to ambiguity is to aid users in better specifying their interests and intents ... [Ask] users to build a profile of themselves ... [or] help users better express their informational goals through ... relevance feedback or query expansion.

While it appears people can learn to use these techniques ... they do not appear to improve overall success ... We agree with [Jakob] Nielsen, who cites the importance of not putting extra work on the users for personalization.

Even with additional work, it is not clear that users can be sufficiently expressive. Participants in our study had trouble fully expressing their intent even when asked ... In related work, people were found to prefer long search paths to expending the effort to fully specify their query.
At that point, the paper returns to implicit personalization as the way to improve search, saying that inferring "users' information goals automatically" is the most promising way to disambiguate the "range of intentions that people associate with queries."
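
One way to see why this matters: for a single query, even the best one-size-fits-all ranking can only put one result in the top slot, while a personalized ranking can put each person's favorite there. A tiny sketch of that gap (my own illustration with invented ratings, not the paper's methodology):

# Toy sketch: for one query, how often can a single ranking put each
# person's favorite result first, versus a personalized ranking?
from collections import Counter

ratings = {                                   # users x results, higher = more relevant
    "user_a": {"result1": 5, "result2": 2, "result3": 1},
    "user_b": {"result1": 1, "result2": 5, "result3": 2},
    "user_c": {"result1": 2, "result2": 1, "result3": 5},
    "user_d": {"result1": 5, "result2": 3, "result3": 1},
}

favorites = {user: max(r, key=r.get) for user, r in ratings.items()}

# Best a single, shared ranking can do: put the most common favorite on top.
best_result, count = Counter(favorites.values()).most_common(1)[0]
print("one shared ranking satisfies:", count / len(ratings))   # 0.5
print("personalized ranking satisfies:", 1.0)                  # everyone sees their own favorite first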

Great stuff. I am curious how much of this work at MSR is making it over to MSN.

See also my thoughts on the Teevan et al. SIGIR 2005 paper where the same authors describe a prototype of a keyword-based approach to personalizing web search.

See also my previous post, "Attention and life hacking", discussing a NYT article on some of Eric Horvitz's work on attention and linking to some more of Eric's papers.

See also my earlier posts ([1] [2]) about Google's Personalized Search.

See also the announcement of Findory's alpha test of personalized web search.

See also my May 2004 post, "Why do personalized search?"

Popularity, entrenchment, and randomization

The paper "Shuffling a Stacked Deck" by Pandey et al. has an interesting discussion of the "entrenchment problem" of ranking based on popularity and suggests partial randomization as a solution.

From the paper:
Most Web search engines assume that popularity is closely correlated with quality, and rank results according to popularity.

Unfortunately, the correlation between popularity and quality is very weak for newly-created pages that have few visits and/or in-links.

Worse, the process by which new, high-quality pages accumulate popularity is actually inhibited by search engines. Since search engines ... always [list] highly popular pages at the top, and because users usually focus their attention on the top few results, newly-created but high-quality pages are "shut out."
The paper goes on to claim that, based on their experimental results, search engines should fill 10% of the result slots with randomly selected items that would normally appear much lower in the search results. That, they say, provides enough exploration to eliminate the entrenchment problem and improves search quality.
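
As I understand the scheme (simplified from the paper, with invented numbers), it looks roughly like this: keep most of the result slots for the popularity-ranked pages, but fill a small fraction with pages promoted at random from further down the list so new pages get a chance to accumulate clicks.

# Toy sketch of partial randomization in a ranked result list.
import random

def rank_with_exploration(ranked_pages, slots=10, explore_fraction=0.1):
    n_explore = max(1, int(slots * explore_fraction))
    results = ranked_pages[:slots - n_explore]       # keep the head in rank order
    promoted = random.sample(ranked_pages[slots - n_explore:], n_explore)
    for page in promoted:                            # slip each promoted page into
        results.insert(random.randrange(len(results) + 1), page)   # a random slot
    return results

ranked_pages = ["page%d" % i for i in range(1, 101)]   # page1 = most popular
print(rank_with_exploration(ranked_pages))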

However, while the results make sense for small document collections, I'm not sure whether their results easily apply to the real world of web search. Web search engines index tens of billions of documents. The simulations in this paper used a data set six orders of magnitude smaller, 10k pages.

That's a big problem. On a more realistically sized collection, you would expect randomization to be a lot more difficult because there would be many, many more pages from which to randomly select. Most of those pages would be irrelevant. Big data makes this a much harder problem.

Nevertheless, I think the general idea of testing new items to determine their quality and value is a good one, particularly for systems ranking based on popularity or trying to make personalized recommendations.

On that note, I thought this tidbit in the paper was interesting:
The entrenchment problem may not be unique to the Web search engine context ... consider recommendation systems ...

Many users decide which items to view based on recommendations, but these systems make recommendations based on user evaluations of items they view. This circularity leads to the well-known cold-start problem, and is also likely to lead to entrenchment.

Indeed, Web search engines can be thought of as recommendation systems that recommend Web pages.
Exactly right. In fact, dealing with the cold start and entrenchment problems were some of the trickiest parts of the recommendation engine behind Findory. With news articles, there is a constant stream of new documents coming in to the system. Determining their quality, popularity, and relevance quickly is a challenging problem.

[Pandey paper via Paul Kedrosky]

Remote storage on Amazon S3

The Amazon web services team just launched Amazon S3, "a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web."

Michael Arrington at TechCrunch gushes about the new service, saying, "S3 changes the game entirely," and, "Entire classes of companies can be built on S3 that would not have been possible before."

Similarly, Mike at TechDirt says, "It's like Amazon just provided much of the database and middleware someone might need to develop a web-based app."

But there are two obvious problems with building anything on top of S3: latency and reliability.

On latency, despite Amazon's marketing goo, any time you go across the internet to a remote machine to get data, you're looking at 100ms+ of latency. Compare that to 1-3ms for local disk and effectively 0ms for local memory (or the memory of local machines on a LAN) and you see the problem. There's no way you can make more than a couple of data requests to S3 in the 0.5 - 1 seconds you have to serve your web page in real-time.
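
The back-of-the-envelope arithmetic behind that claim (the latencies are ballpark assumptions, not measurements of S3):

# Rough arithmetic: how many sequential data fetches fit in a page's time
# budget? The latency numbers are ballpark assumptions, not S3 measurements.
page_budget_ms = 500        # time available to build the page, everything included
remote_fetch_ms = 100       # one round trip to a remote web service
local_disk_ms = 2           # one local disk read
print("sequential remote fetches per page:", page_budget_ms // remote_fetch_ms)   # ~5
print("sequential local disk reads per page:", page_budget_ms // local_disk_ms)   # ~250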

Reliability is the second problem. Amazon says the system is reliable -- "99.99% availability ... All failures must be tolerated ... without any downtime" -- but you can see if they're willing to stand behind that by looking at the legal guarantees on uptime. There are none. The licensing agreement says the service is provided "as is" and "as available".

I don't think it would be wise to use this for a serious, real-time system. As others have pointed out (in the comments to the TechCrunch post), it might be usable for an asynchronous product like online backups.

Even for online backups, there are problems. You must be willing to tolerate the lack of a guarantee of being able to recover your data once stored. And, with Amazon's fee of $15/month for storing 100G, it would be hard to add a surcharge to support your little online backup startup on top of Amazon's fees and still have a price point on your service that is attractive to customers.

Amazon S3 is an interesting idea. I think we will see some cool things implemented on top of S3, smaller projects by hobbyists, I'd think. But this is not a game changer for startups.

Update: While I do see a little support ([1] [2]) for being skeptical about Amazon S3, it is clear that I am swimming against the tide ([1] [2] [3] [4] [5] [6] [7] [8] [9]) on this one. Definitely worth reading the more optimistic folks and coming to your own conclusion.

Update: About two weeks after launch, Amazon S3 had a seven hour outage. As I said, the two obvious problems with building anything serious on top of web services like S3 are latency and reliability.

Update: Five months later, the CEO of SmugMug raves about S3. Even though they are only using it as backup, it is a great counter-example to the arguments I made above. [via New Media Hack]

Update: Nine months later, Amazon S3 has an extended outage that causes some to question the reliability of the service and the wisdom of using it for real-time applications.

Update: A year later, Don McAskill posts a presentation (PDF) with plenty of great details about SmugMug's experience using Amazon S3. They are mostly positive, though reliability and speed are concerns.

Monday, March 13, 2006

Erik Selberg on the future of search

Erik Selberg (creator of Metacrawler, UW CS PhD, now one of the brains behind MSN Search) has an interesting post up about Google, Yahoo, and MSN's strategies toward web search.

Erik starts by taking issue with my earlier post, "Different visions of the future of search", where I summarized Google, Yahoo, and MSN's strategies by saying:
MSN wants to give you more powerful tools. Yahoo wants the community of users to help improve search. Google wants computers to do all the work to get you what you need.
Erik says I got it wrong. From his post:
Yahoo is going down the content ownership path. The idea is to own the content -- whether it be licensed ... [or] their customers ... create content ... However, you have a huge cold-start problem ... and a huge spam problem.

Google and Microsoft have generally the same idea, although Microsoft has been slow to coming around to it. The old saying is that a computer will give you what you ask for, not what you want. Both Google and Microsoft are trying to give you what you want, not what you ask for.

The difference is in approach. Google, like some other companies like Apple, are fans of making things easy. How do you make things easy? Remove choice ... The Google homepage is a model for simplicity ... There's a big search box, and not much else. Hard to do something besides enter a query ...

Microsoft wants to make its products useful. And how do you make a product useful? It's all about features. That's why Live.com is just chock-full of random features, such as RSS feeds and weather and all the other normal portal goodies.

Greg got things a bit wrong in his article ... [MSN Live.com] isn't about changing what users do, but providing them with what they need to get their job done. If a simple search box will suffice, great. But sometimes other things are better suited, and Microsoft is looking at how to provide those as well.

Great thing for you? Search is gonna get better... much, much, much better. And you're all going to benefit. Gotta love it.
Erik is saying that both Google and MSN want to make the computer do the work for you. The difference, Erik says, is that Google does this by taking away features and MSN will do it by adding features.

I see his point, but I can't help but think of other Microsoft products. What happened to Microsoft Word as features were added for convenience? It became a complicated mess, so feature rich that even a technogeek like me doesn't know or understand all the features. When I use MS Word, I spend most of my effort ignoring its features so I can get work done. The effort required to exploit its power exceeds the value received.

Erik is optimistic that MSN Search will not go this route. I hope he's right. As usability guru Jakob Nielsen said:
Simplicity is rule #1 for usability ... Fewer features means that those features that do remain in the design will automatically be easier to understand because there are fewer other features to compete for the user's attention.
It is not easy to add features and power without overwhelming your users.

Erik is a sharp guy. As one of the key people working on MSN Search, his thoughts on MSN and the future of search are well worth reading. I only have a few excerpts of his post here. Don't miss reading his whole post.

See also my previous posts, "Customization in Windows Live Search" and "Personalized search at PC Forum".

Update: If you enjoyed this and want to hear more, Robert Scoble did a one hour video interview with Erik Selberg back in October 2005.

Update: Erik adds some more thoughts in a second post.

Sunday, March 12, 2006

Credibility, authority, and Wikipedia

Randall Stross at the New York Times covers some of the issues with reputation, reliability, and credibility in community-generated content, focusing mainly on Wikipedia.

Some excerpts:
The egalitarian nature of a system that accords equal votes to everyone in the "community" -- middle-school student and Nobel laureate alike -- has difficulty resolving intellectual disagreements.

Lay readers rely upon "secondary epistemic criteria," clues to the credibility of information when they do not have the expertise to judge the content.

[But] Wikipedia ... provides almost no clues for the typical article by which reliability can be appraised. A list of edits provides only screen names or, in the case of the anonymous editors, numerical Internet Protocol addresses. Wasn't yesterday's practice of attaching "Albert Einstein" to an article on "Space-Time" a bit more helpful than today's "71.240.205.101"?
In fact, Wikipedia has had to back off on the idea of allowing anyone anywhere to edit. Anonymous users can no longer create new articles on Wikipedia. And some (especially controversial) articles are protected and only allowed to be edited by some in the community.

This is not necessarily a bad thing. Slashdot, for example, allows but discounts anonymous comments and has filtering mechanisms to promote useful information. Wikipedia eventually probably will be forced to develop some kind of reputation system and favor authoritative, reliable contributors.

See also my earlier post, "Yahoo Answers and wisdom of the crowd", where I said:
A popularity contest isn't the best way of getting to the truth.

People don't know what they don't know. Majority vote doesn't work if people don't have the information they need to have an informed opinion.
See also my earlier post, "Summing collective ignorance".

Saturday, March 11, 2006

Big hopes for Windows Live

Olga Kharif at BusinessWeek reports on an internal memo they acquired where Microsoft SVP David Cole makes aggressive predictions for Windows Live and the timeframe for the success of Windows Live:
"Make no mistake, Windows Live is our strategic bet to change the game and win, while we grow and drive revenue with MSN.com," writes Cole, head of MSN and the Personal Services Group.

Windows Live Mail, the new version of Microsoft's flagship Hotmail e-mail, is hosting 750,000 users, and the company hopes it will host 20 million by June, according to the memo.

Windows Live Local search, customized to the user's geographic location, "is surpassing our competition with industry-leading technology," the memo says.

"Over the next 3-6 months, we'll ship more innovative technology into the marketplace than during our entire 10-year history," writes Cole.
However, Microsoft has a poor track record with these kinds of boastful predictions. For example, Steve Ballmer said in June 2005:
In the next six months, we'll catch Google in terms of relevancy.
Six months came and went.

Recently, Neil Holloway (President of Microsoft Europe) made a nearly identical statement to Ballmer's. After Neil's prediction caught flak in the press, MSN Search GM Ken Moss tried to backpedal and said that Microsoft is "humble", will "under promise" and "over deliver", and that they "won't forecast when [they] might take the lead."

It seems that new humility lasted 8 days.

Friday, March 10, 2006

RSS was designed by geeks for geeks

Simon Waldman (Director of Digital Publishing at the UK Guardian) talked about RSS and use of RSS by the public at large at the FT Digital Media Conference. A choice excerpt:
There are two distinctive views on RSS.

The first is that it is a fantastic technology that will empower web users [and] transform the way we get our news and information ...

The second is that ... it has been designed by geeks for geeks; it is too fiddly for normal human beings; and after you've finally worked out how to set up your RSS reader you rapidly find yourself with 5,000 unread articles, 200 photos from Flickr, and a few dozen podcasts that you will never get round to listening to. The net result is complete information overload: the very thing it was designed to eliminate.

In my opinion, both of these are true. Despite the smart software and web services available ... [it] is still too clunky for many users to adopt. It is still very much a minority sport, favoured by those - shall we say - with natural technical aptitude.
Stepping back a second, why are we exposing things called RSS, Atom, and XML to readers at all? Do they care what these data formats are? No, only geeks like us care. Mainstream readers just want to read news.

I think next-generation RSS readers will have to get past exposing RSS feeds. Readers just want to read news. All the magic of locating the content needs to be hidden. It all needs to just work.

See also my earlier posts, "Blog readers and RSS", "RSS for the mainstream", and "RSS sucks and information overload".

Google to open Seattle Fremont office

Todd Bishop and John Cook from the Seattle PI report that Google will soon open an office in Seattle's Fremont area (map), right across the street from the Red Door Alehouse.

This is a pretty big deal. For those of you who don't know Seattle well, commutes on the 520 bridge across Lake Washington are long and unpleasant. Microsoft's campus and the new Google Kirkland office are both across the 520. Many Seattle area software engineers I know are unwilling to move to the Eastside or do that commute.

While this is sure to cause poaching losses for Seattle-side software firms -- including Amazon.com, Real Networks, and startups -- it will also pull people from Microsoft who can't stand living on the Eastside and no longer want to endure that nasty commute.

For many years, I've wondered why Microsoft has resisted opening a development office in Seattle proper. They don't even run dedicated commuter buses between Seattle and Redmond (like Google and Yahoo do for San Francisco to Mountain View/Sunnyvale). It will be interesting to see if this move by Google forces Microsoft's hand.

See also my earlier posts, "Google's Kirkland office" and "Microsoft cuts benefits".

Update: John Cook at the Seattle PI reports ([1] [2]) that, at least for now, Google's Fremont office is mostly for sales, not software engineers.

Update: Fourteen months later, John Cook reports that Google finally will be opening an engineering office in Fremont in Seattle, leasing space from Getty Images.

Update: Eighteen months later, Brady Forrest at O'Reilly Radar posts a good discussion of the Seattle-Eastside split and how it impacts tech companies in the region.

Thursday, March 09, 2006

Growth, crap, and spam

There seems to be a repeating pattern with Web 2.0 sites. They start with great buzz and joy from an enthusiastic group of early adopters, then fill with crud and crap as they attract a wider, less idealistic, more mainstream audience.

Memeorandum appears to be the latest example of this. After gushing about Memeorandum in hundreds of posts, sending huge numbers of people to check out the site, Robert Scoble suddenly said he is signing off of Memeorandum, explaining that he's tired of what he sees there now: little useful information and a lot of snarky articles from people seeking traffic.

Similarly, in the early days of Digg, it attracted praise as more interesting and useful than Slashdot. Traffic grew and, only a short time later, Digg started to fill with spam and crap. Russell Beattie had a good post about this where he said that Digg is "really full of crap" and that "posts that have more quality" are "getting lost in a continual din of rumor mongering [and] grandstanding."

How many times does this cycle have to repeat before people start building systems designed from the start to deal with bad behavior, crap, and spam?

Sure, you can get away without it while you're small. Spammers don't care about you when you're small; there's no profit motive. But, if you ever hope to build anything that works for the mainstream, you can't assume everyone will play nice.

If your product tries to help people find stuff they need, you need to design from the start to surface the good stuff and filter out the crap.

See also Xeni Jardin's article in Wired, "Web 2.0 Cracks Start to Show", where she says, "When you invite the whole world to your party, inevitably someone pees in the beer."

See also my earlier posts, "Getting the crap out of user-generated content" and "Digg, spam, and most popular lists".

Update: Gabe Rivera, founder of Memeorandum, takes issue with my claims and says I mischaracterized Scoble's issues in the comments to this post. He makes good points, and it is well worth reading his thoughts.

Update: Six months later, the problem with spam on Digg gets worse.

Update: Nine months later, it appears my prediction about Memeorandum and spam did not come true. TechMeme appears to be focusing mainly on high-quality, more popular weblogs. These weblogs already have a lot of traffic and little incentive to manipulate TechMeme. This approach may exclude some of the long tail of blogs, but is effective at increasing quality and reducing spam.

Reuters CEO on attention

Reuters CEO Tom Glocer has some interesting tidbits on attention and personalization in his recent speech (MS Word) at the Online Publishing Association. A brief excerpt:
Consumers place value in others making decisions about what is good and what is not .... the skill to spot the gold in the pan of water and dirt.

None of us has the time to go searching through hundreds of sources to pull out what is interesting and what is relevant. Choice means letting the professionals do that for us, and sometimes letting the wisdom of crowds have a go.
For more of Tom's thoughts, see my earlier post, "Reuters CEO on personalized news", with excerpts from his March 2005 Financial Times conference speech.

[OPA speech found via Amy Gahran]

Andrei Broder on personalized information

Andrei Broder (former VP of Research at Altavista, former CTO of IBM Research, now VP of Emerging Search Technology at Yahoo) has some interesting thoughts on personalized information in an interview on the Yahoo Search Blog:
I believe that we are now entering an entirely new phase [in search]. I call this next phase "search without a box" ... information to come in a context without actively searching ... information on an "as needed, when needed" basis without explicitly asking for it.

Essentially we're going from 2.7 words per query to 0.
Andrei is talking about personalized information streams. He says we should be showing people the relevant news, interesting new products, and useful new documents they need to see. We need to surface the information that matters automatically and implicitly.

See also my previous post, "Organizing chaos and information overload".

Making effort fun

Amy Jo Kim gave a talk at eTech called "Putting the Fun in Functional". Bruce Stewart wrote up notes on the talk.

Great tips stolen from game design. Amy convincingly argues that you can get users to do things that otherwise might be considered work by making the tasks fun, by making effort feel like play.

It's a good contrast and counterargument to one of my common themes, that any attempt to get users to do work is doomed to failure.

Wednesday, March 08, 2006

Customization in Windows Live Search

Chris Sherman at SEW posts a detailed review of some new features and enhancements at Windows Live Search.

Continuing the brand confusion between Windows Live and MSN, it appears that Windows Live Search is not intended as a replacement for MSN Search. This is despite the fact that, as Chris says, "the differences between Windows Live Search and MSN Search appear to be largely cosmetic," and that the two services use the same underlying technology.

The cosmetic changes in Windows Live Search appear to include new widgets to control your search and the display of your search. From Chris:
Sliders allow you to resize search results, adding or removing information such as descriptions, or in the case of images, making thumbnails larger or smaller.

"Macros" [are] a feature that lets you customize and save searches to be run again ... easy to create and share.
As I described in my earlier post, "Different visions of the future of search", it appears Microsoft's web search efforts are going down the path of trying to improve search by giving users lots of knobs and controls and then hoping they will use them.

Back in October 2004, MSN Search was about to launch. They were promoting their nifty sliders as a differentiating feature. At the time, I said:
I suspect sliders would be of interest only to power users, the kind of people who already use advanced search all the time. Most users want search to "just work". This kind of control is just a novelty, not something most would actually bother to use.
Search engines typically see less than 1% of searches using their advanced search interface. I wonder if MSN Search has seen much higher usage rates of their sliders, data that might cause them to believe that they can get the majority of users to bother with these knobs and dials.

Tuesday, March 07, 2006

Seller reviews in Google Base

Scot Wingo notices that Google allows seller ratings and reviews, an early reputation system, as part of their experiments with allowing buying and selling through Google Base.

See also my previous posts, "Froogle adds merchant reviews", "Google and eBay pursue sellers", and "Google vs. eBay?"

[Found on Findory]

Saturday, March 04, 2006

Scalable hierarchical models of the neocortex

Googler Corinna Cortes just posted a bunch of links on the Google Research Blog, including a link to talks various researchers have done at Google.

Following that, I discovered a talk that Brown University Professor Tom Dean did at Google in Jan 2006 called "Scalable Learning and Inference in Hierarchical Models of the Neocortex".

Eyes glazing over? Okay, right, that might be an intimidating title.

But, really, if you have any interest in this kind of thing, check out the talk. It is a fascinating discussion of how parts of the brain work and techniques for pattern recognition inspired by neurological processes. Tom does a great job with the talk, so I think it should be reasonably accessible even if you have no background in this stuff.

Even if this kind of thing is old hat for you, you still should check out the talk starting at around 44:00. That's when Tom Dean starts talking about how to parallelize a hierarchical Bayesian network across a cluster of computers.

Of particular interest to me was that he presented MapReduce code for the computation and seemed to be arguing he could do very large scale Bayesian networks in parallel on the Google cluster. I was surprised by this -- I would have thought that the communication overhead would be a serious issue -- but Tom was claiming that the computation supported coarse-grained parallelism due to the hierarchical structure of the models.
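
I'm going from memory here, but the coarse-grained structure is roughly the following (a toy sketch of parallelizing inference in a tree-structured model, not Tom Dean's actual code): each subtree of the hierarchy independently computes how likely its evidence is under each state of the node above it (the map step), and the parent multiplies the subtree messages together and normalizes (the reduce step). The probabilities below are invented.

# Toy sketch of coarse-grained parallelism for a tree-structured model:
# subtrees compute local evidence independently (map), the root combines (reduce).

root_prior = {"state0": 0.5, "state1": 0.5}

def subtree_message(subtree_evidence):
    # P(evidence seen in this subtree | root state); in a real system each
    # subtree's message would be computed on a different machine.
    likelihood = {
        "edge":   {"state0": 0.9, "state1": 0.2},
        "corner": {"state0": 0.3, "state1": 0.8},
    }
    return likelihood[subtree_evidence]

def map_step(evidence_per_subtree):
    return [subtree_message(e) for e in evidence_per_subtree]    # embarrassingly parallel

def reduce_step(messages, prior):
    posterior = dict(prior)
    for message in messages:
        for state in posterior:
            posterior[state] *= message[state]
    total = sum(posterior.values())
    return {state: p / total for state, p in posterior.items()}

messages = map_step(["edge", "edge", "corner"])
print(reduce_step(messages, root_prior))   # belief over the root's state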

If this really is true, it would be a fascinating application of the massive computational power of the Google cluster. Maybe it is my inner geek talking, but I'm drooling just thinking about it.

If you want to dig in more, Tom Dean also has three recent papers on this work. I haven't gotten to his new papers yet, but I will soon.

Update: I read the three papers. I'm quite a bit less excited now.

It seems like this work is further away from demonstrating interesting results at massive scale than I first thought. Experimental results focused on toy problems, like the handwriting recognition described in the talk, and only had modest success on that problem. Communication overhead in large networks appears to be significant -- as I suspected at first -- and it is not clear to me that this could run effectively at scale on the Google cluster.

It appears I may have been too hasty in getting so excited about this work.

Lighthouse and GDrive projects at Google?

Anyone know what the Lighthouse and GDrive projects at Google might be?

The PowerPoint slides from Google's recent analyst day mentioned Lighthouse and GDrive in the notes to slide 19:
With infinite storage, we can house all user files, including: emails, web history, pictures, bookmarks, etc and make it accessible from anywhere (any device, any platform, etc).

We already have efforts in this direction in terms of GDrive, GDS [Google Desktop Search], Lighthouse, but all of them face bandwidth and storage constraints today.

Another important implication of this theme is that storing 100% of a user's data makes each piece of data more valuable because it can be access across applications. For example: a user's Orkut profile has more value when it's accessible from Gmail (as addressbook), Lighthouse (as access list), etc.
Derrick highlighted these previously undisclosed projects in the comments to my earlier post.

Garett Rogers at ZDNet writes about the GDrive rumors today, saying:
The GDrive service will provide anyone (who trusts Google with their data) a universally accessible network share that spans across computers, operating systems and even devices.
Interesting. Back in August 2004, Richard Jones came up with a hack on top of GMail to use it for online storage. He called it GMailFS. Much speculation followed about whether Google would do a product for online storage.

I was skeptical about it at the time, but it sounds like Google may have at least done an internal project around the idea.

So, does anyone have any idea what Lighthouse might be?

Update: Richard MacManus at ZDNet also asks, "What is Google's Lighthouse?" He goes on to speculate that it might be "a security function that controls access to documents and folders ... a next-generation file search solution that 'shines a light' inside documents on your desktop."

Sounds like a pretty good guess. The latest version of Google Desktop Search (GDS) already allows you to store the search index for your documents on Google's servers. See http://desktop.google.com/features.html#searchremote

Perhaps Lighthouse is an extension to Google Desktop Search that allows you to open and control access of other people to the remote copy of the GDS index and your desktop documents.

That would mean that, if you wanted access to your desktop files remotely, you would have two options: (1) Use GDS and Lighthouse (leaving the master copy on your desktop) or (2) Use GDrive to upload your files to Google (putting the master copy remote on GDrive).

That sounds about right. It's what I would build if I were at Google.

Update: Several months later, Philipp Lenssen posts screen shots from a leaked copy of GDrive (codenamed Platypus). It apparently replicates and synchronizes all your files across multiple machines. It is only available internally inside of Google.

Update: Nearly two years later, the WSJ reports that "a service that would let users store on its computers essentially all of the files they might keep ... could be released as early as a few months from now."

Thursday, March 02, 2006

In a world with infinite storage, bandwidth, and CPU power

Google is hosting an analyst day today. I found skimming the 94 slide presentation (PPT, PDF alternative) to be interesting and worthwhile.

In particular, I liked slides 19, 20, and 31, all of which make it clear that Google isn't losing its wide-eyed optimism.

Slide 31 says that Google's philosophy to new product development is "no constraints" and that they initially ignore "CPU power, storage, bandwidth, and monetization."

Slide 20 says (in the notes) that Google plans to "get all the worlds information, not just some."

And slide 19 (in the notes) talks about how their work is inspired by the idea of "a world with infinite storage, bandwidth, and CPU power." They say that "the experience should really be instantaneous". They say that they should be able to "house all user files, including: emails, web history, pictures, bookmarks, etc and make it accessible from anywhere (any device, any platform, etc)" which leads to a world where "the online copy of your data will become your Golden Copy and your local-machine copy serves more like a cache". And, they say that they want "transparent personalization" that uses user "data to transparently optimize the user's experience ... implicitly."

Google also recommits to a future with personalized search. They say in the notes on slide 12 that they will "introduce new personalization elements" and that they view that as one of two major directions for their efforts to improve relevance rank.

Some might be inclined to dismiss all this talk as the wild fantasies of engineers with too much caffeine, but I think Google does see their ability to build out their massive cluster as one of their primary competitive advantages. I think they do intend to continue extending their computing infrastructure until everyone everywhere really does feel that they have near infinite CPU power and storage at their fingertips.

[link to presentation via Paul Kedrosky]

Update: It appears Google suddenly removed the PPT file. Ugh. Well, sorry, but, unless you moved quickly, looks like there's no way to see it anymore.

Update: Google just made a PDF version of the slides available.

Unfortunately, this new PDF version of the slides no longer has the notes attached to each slide, so you can't see some of what I was referring to in my comments above.

However, I did download the original PPT presentation. Though I didn't keep a copy, I recently discovered that my Google Desktop cache does contain a text-only copy of notes for slide 12 and most of slide 19. The cached copy ends in the middle of the notes for slide 19.

Here are the notes from slide 12 with the reference to using personalized search to improve relevance rank:
Lead in Search
As the market leader, we need to ensure search doesn't become a commodity. Our focus on search is nothing new. We built our brand on being the best search engine, with the best results, and as our competitors have caught up to us, it's become even more important for us to focus on:
1) Speed
Solve international speed issues and bring international users to US performance
2) Comprehensiveness and freshness
"All webpages included in the Google index and searched all the time" -- Teragoogle makes this possible
Expand to other sources of data
Become the leader in geo search (any search with a geographic component).
New forms of content -- video, audio, offline printed materials
3) Relevance
Leverage implicit and explicit user feedback to improve popular and nav queries
Introduce new personalization elements
4) User Interface
Experiment with several new UI features to make the user experience better
And here are part of the notes from slide 19. Unfortunately, my cached copy ends right before the discussion of "transparent personalization" that I mentioned above:
In a world with infinite storage, bandwidth, and CPU power, here's what we could do with consumer products --
Theme 1: Speed
Seems simple, but should not be overlooked because impact is huge. Users don't realize how slow things are until they get something faster.
Users assume it takes time for a webpage to load, but the experience should really be instantaneous.
Gmail started to do this for webmail, but that's just a small first step. Infinite bandwidth will make this a reality for all applications.
Theme 2: Store 100% of User Data
With infinite storage, we can house all user files, including: emails, web history, pictures, bookmarks, etc and make it accessible from anywhere (any device, any platform, etc).
We already have efforts in this direction in terms of GDrive, GDS, Lighthouse, but all of them face bandwidth and storage constraints today. For example: Firefox team is working on server side stored state but they want to store only URLs rather than complete web pages for storage reasons. This theme will help us make the client less important (thin client, thick server model) which suits our strength vis-a-vis Microsoft and is also of great value to the user.
As we move toward the "Store 100%" reality, the online copy of your data will become your Golden Copy and your local-machine copy serves more like a cache. An important implication of this theme is that we can make your online copy more secure than it would be on your own machine.
Another important implication of this theme is that storing 100% of a user's data makes each piece of data more valuable because it can be access across applications. For example: a user's Orkut profile has more value when it's accessible from Gmail (as addressbook), Lighthouse (as access lis... [...TRUNCATED...]
Update: Derrick made the full notes for slide 19 available in the comments to this post.

Update: The full story about why the PPT version of these slides disappeared is now clear.

When I first posted a few excerpts from the notes to the slides, I had assumed that the notes were intended for the speakers of the presentation. I was annoyed and even a bit angry when the PPT was pulled, not fully comprehending why Google wouldn't want to make the notes generally available.

It now appears that many of the notes in the slides were cut-and-pasted from other presentations, never intended for Google Analyst Day. As mb points out in the comments to this post, the notes for slide 10 contain an odd reference to CBS, something I didn't notice when I originally was reviewing the slide deck.

Even worse, the notes to slide 14 contain revenue projections for next year, also something I didn't notice previously. Because Google published these projections to their website, even briefly, they were forced to file an 8-K with the SEC. In that filing, they say that the notes were "not speaker notes prepared for the Analyst Day presentation."

All very unfortunate.

Google's mission may be "to organize the world's information and make it universally accessible," but some information is not intended to be accessed by all.

Update: After waiting for the press storm to fade, Paul Kedrosky posts the original PPT file with the troublesome notes included.

Update: Nearly two years later, the WSJ reports that "a service that would let users store on its computers essentially all of the files they might keep ... could be released as early as a few months from now."

Personalization and Google's mobile strategy

A BusinessWeek article explains "Why Google's Going Mobile", including this tidbit on Google's "overarching mobile strategy":
The phone is not the PC. It's about creating the right experience for the mobile user, so they can find exactly what they want, quickly and efficiently.

People search differently on mobile phones; they don't browse as much, as PC users do, for example.

We also focus on personalization. The phone is a very personal device you don't share with your spouse or your children like you do the PC at home. How can we make this interaction even more personal?
See also my previous post, "More on Google Personalized Search".

[Found on Findory]

200k+ servers at Google and growing

As reported in the LA Times, Morgan Stanley analyst Mark Edelstone claims Google now has at least 200,000 servers in its cluster. Mark also says that Google will be switching to servers with AMD Opteron processors for most new purchases.

Looking at the AMD product line, I see that AMD has a new line of dual-core (two cores per chip), low-power (55W) processors, the HE chips. Looks like a nice blend of reduced power and high performance there.

For more on why that's important to Google, see Googler Luiz Andre Barroso's discussion of heat, power, and performance issues in the Google cluster in his ACM paper, "The Price of Performance", and my discussion of that paper in my earlier post, "Power, performance, and Google".

[Found on Findory]

Update: See also my other post, "100k+ new servers per quarter at Google?"

Update: Three months later, John Markoff and Saul Hansell at the New York Times report that "the best guess is that Google now has more than 450,000 servers spread over at least 25 locations around the world." [via Don Dodge]

Update: Six months later, a paper on Google Bigtable mentions that Google is using machines with two dual-core Opteron 2 GHz chips. Sounds like Google is using the new AMD 2 GHz dual core Opteron HE.

Wednesday, March 01, 2006

Honeyd, the virtual honeypot

A fun paper by Googler Niels Provos, "A Virtual Honeypot Framework", has the clever idea of creating simulated networks and network hosts to detect hacking attempts:
One way to get early warnings of new vulnerabilities is to install and monitor computer systems on a network that we expect to be broken into. Every attempt to contact these systems via the network is suspect.

We call such a system a honeypot. If a honeypot is compromised, we study the vulnerability that was used to compromise it.

A physical honeypot is a real machine with its own IP address. A virtual honeypot is simulated by another machine that responds to network traffic ... Virtual honeypots are attractive because they require fewer computer systems, which reduces maintenance costs.
Honeyd is pretty clever, cheaply simulating network structures and the network stacks of various versions of operating systems.

Of course, you could do something like this with virtual machines -- running several operating systems and sandboxes on the same physical hardware -- but Honeyd is lightweight, allowing a single machine to present a complicated nest of many thousands of virtual targets to attackers.
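
For flavor, here is the simplest possible version of the honeypot idea (a few lines of my own Python, nothing like Honeyd's simulated network stacks): listen on an unused port that nothing legitimate should ever contact, and log every connection attempt as suspect.

# Minimal toy honeypot: any connection to this otherwise-unused port is suspect.
# Honeyd goes much further, simulating whole networks and OS fingerprints.
import datetime
import socket

def run_honeypot(host="0.0.0.0", port=2222):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(5)
    print("honeypot listening on %s:%d" % (host, port))
    while True:
        conn, addr = server.accept()
        # No legitimate service runs here, so every contact is worth logging.
        print("%s suspicious connection from %s:%d"
              % (datetime.datetime.now().isoformat(), addr[0], addr[1]))
        conn.close()

if __name__ == "__main__":
    run_honeypot()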

By the way, if you like this kind of security goo as much as I do, you should check out Bruce Schneier's weblog, "Schneier on Security". It's a great resource.