Friday, April 30, 2004

Findory launches news by e-mail and RSS feeds

Findory News is doing a live beta test of daily e-mail delivery of Findory News and personalized RSS news feeds. Fun stuff.

Findory News by E-mail is like getting a personalized newspaper delivered to your front door daily, a newspaper with just the news you need to see.

With the personalized RSS feed, you can read Findory News on My Yahoo! or in any other RSS reader. Using services like FeedRoll or Jawfish, it's easy to post your personalized Findory News headlines on your blog or home page. Rather than put generic top headlines on your site, you can post a news feed that shows your readers your personalized headlines, the news that most interests you!

Google Cluster Architecture

Another good paper describing Google's cluster in a fair amount of detail.

One fascinating part of the paper describes the "docservers" that provide the snippet of the search result that contains your keyword. These docservers require "access to an online, low-latency copy of the entire Web. In fact, because of replication ..., Google stores dozens of copies of the Web on its servers."

While the docservers probably only need to store text copies of the page (no images, flash, html, etc.) and can compress the data, it's still a staggeringly large amount of data. Google quite literally stores a copy of the entire web to implement their snippet feature.
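
To make the snippet idea concrete, here's a minimal sketch of pulling a keyword-in-context excerpt out of a stored text copy of a page. This is only an illustration, not how Google's docservers actually work; the function name make_snippet and the width parameter are made up for the example.

```python
def make_snippet(text, keyword, width=40):
    """Return a short excerpt of text around the first match of keyword.

    Hypothetical helper for illustration only; a real docserver would
    also handle compression, multiple keywords, and ranking of excerpts.
    """
    pos = text.lower().find(keyword.lower())
    if pos == -1:
        return None  # keyword does not appear in the stored page text
    start = max(0, pos - width)
    end = min(len(text), pos + len(keyword) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


page = "Google stores dozens of copies of the Web on its servers to build result snippets."
print(make_snippet(page, "copies", width=10))
```

Even this toy version needs the full page text on hand at query time, which is why the feature implies keeping a copy of everything.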

More on Google IPO

Of the many articles on Google's IPO today, Battelle's analysis is particularly insightful. The New York Times points out that the exact offering is e * $1B or $2,718,281,828. Definitely an unconventional company.
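
The arithmetic is easy to check. A quick sketch (the variable name is just for illustration):

```python
import math

# e times one billion dollars, truncated to whole dollars
offering = int(math.e * 1_000_000_000)
print(f"${offering:,}")  # prints $2,718,281,828
```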

Thursday, April 29, 2004

A real personalized search from Google?

A Reuters article claims that, "Google is already working on a personalized search engine that learns what individuals like, find relevant or prefer to avoid." That's clearly not Google's current personalized prototype. Could they be building something more?

Latest Microsoft rant

Microsoft is truly remarkable sometimes. If you apply Service Pack 1 to WinXP (which you had better do if you want to patch critical security holes), the patch includes a fix for a buffer overflow in zip folders. It turns out that fix introduces a new bug that causes zip folders to fail to extract any file with a comma in its name. This bug was discovered over a year ago and Microsoft still hasn't provided a fix. If your system is a current, patched version of WinXP, you have this bug on your machine.

Zip folders are commonly used for backups and data transfers. Hundreds of thousands of people must have been impacted by this bug, but it goes unrepaired. I discovered this bug when I restored some of my files from a zipped backup folder and later noticed several critical ones were missing.

It's truly remarkable that these kinds of problems go uncorrected by Microsoft. But an internal Microsoft memo sent to Bill Gates makes it clear why: "There is a huge switching cost to using a different operating system. It is this switching cost that has given customers the patience to stick with Windows through all our mistakes, our buggy drivers, our high TCO, our lack of a sexy version at times... It would be so much work to move over that they hope we just improve Windows rather than force them to move."

Craigslist: More with less

John Battelle has posted an interesting chart that shows that Craigslist is now in the top 25 for web traffic, all with only 14 employees. Impressive.

Wednesday, April 28, 2004

Media frenzy on Google IPO

964 news articles speculating on Google's IPO and counting. Isn't this a bit much?

Update: Google has announced their IPO. Interesting that they're using a non-traditional auction designed to allow the general public access to the IPO shares. This method should be better for almost everyone by generating a real market in the IPO shares instead of the more traditional technique of allocating the IPO shares to insiders.

Update: Excellent quote from Lawrence Ausubel of the University of Maryland on the Google IPO Dutch auction (from a NYT article).
    You should be relatively indifferent about winning or losing the I.P.O. auction, because if Google does what you expect and selects the I.P.O. price to be the true clearing price, you will have the option to buy at essentially the same price the next day.
Exactly right. The traditional method of pricing IPO shares underprices the shares, so investors are not indifferent to acquiring the IPO shares. If the shares are correctly priced, you shouldn't care if you get the shares at IPO or a few days afterwards.
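
For the curious, here's a rough sketch of how a clearing price falls out of a set of bids in a uniform-price auction. It assumes the simplest possible rules (every winning bidder pays the same clearing price) and made-up bids; the details of Google's actual auction may well differ, and the function name clearing_price is my own.

```python
def clearing_price(bids, shares_offered):
    """Find the uniform clearing price for a list of (price, quantity) bids.

    Walk down from the highest bid until the cumulative quantity demanded
    covers the shares offered; that bid's price is the clearing price.
    Returns None if total demand never covers the offering.
    """
    demanded = 0
    for price, quantity in sorted(bids, reverse=True):
        demanded += quantity
        if demanded >= shares_offered:
            return price
    return None


# Made-up bids: (price per share, shares wanted)
bids = [(120, 100), (110, 200), (100, 300), (90, 400)]
print(clearing_price(bids, shares_offered=500))  # prints 100
```

If the IPO price is set at that clearing price, there's no built-in first-day pop to chase, which is the point of the quote.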

Tuesday, April 27, 2004

Personalized search results from Google Alert

Google Alert is now offering "personalized search results":
    SightPoint Personalized Search Results

    SightPoint automatically learns which search results are most relevant to you. Your new search results are rated based on results you have clicked on in the past. These relevance ratings, out of five stars, become more accurate over time.

No idea how well it works, but it's an interesting development. Since Google is now offering Web Alerts itself, Google Alert does need some way to differentiate itself if it wants to survive.
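
One simple way click-based ratings like this might be computed, purely as a guess since the service doesn't describe its method: count past clicks per site and scale the counts to five stars. The function name star_ratings and the example URLs are invented for the sketch.

```python
from collections import Counter
from urllib.parse import urlparse


def star_ratings(clicked_urls, new_results):
    """Rate new results from 0 to 5 stars by how often their site was clicked before."""
    clicks = Counter(urlparse(url).netloc for url in clicked_urls)
    most = max(clicks.values(), default=0)
    ratings = {}
    for url in new_results:
        count = clicks[urlparse(url).netloc]
        # Scale relative to the most-clicked site; no history means 0 stars
        ratings[url] = round(5 * count / most) if most else 0
    return ratings


history = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/x",
]
results = ["http://example.com/c", "http://example.org/y", "http://example.net/z"]
print(star_ratings(history, results))
```

With more click history the counts get less noisy, which would fit the claim that the ratings become more accurate over time.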

Monday, April 26, 2004

Kill Google, Vol. 1

It was easy to speculate on how to kill eBay since I have no love for that company. But I do love Google. This blog has been a Google Love Fest ([1] [2] [3] [4] [5] [6] [7]). But even that which we love must die eventually. So the time has come to ask, what could kill Google?

Despite claims that Google is a monopoly, in fact it has less than 40% of the search market, nearly tied with MSN and Yahoo. While there is some evidence that customers are loyal to search engines, switching costs in this business are trivially low. To maintain or grow market share, Google's product needs to be superior to the other offerings in the marketplace. Google needs to innovate, creating a moving target for Yahoo and MSN and keeping ahead of the hordes of startups nipping at Google's heels. Unlike Google, MSN and Yahoo do not have to innovate to win. They can simply duplicate Google's current product, then use their considerable marketing power to pull customers away from Google. If Google slows down, Microsoft will turn them into another Netscape.

The first threat to Google is internal. Google needs to maintain a culture that produces and delivers innovative new products. So far, Google has done this by hiring some of the brightest and most creative researchers in the world. But, as Google grows, having incredible people isn't enough. Communication becomes difficult in a large organization. Accountability drops, free riding increases. Great prototypes are developed, but never get out the door. People don't know who to contact and how to get things done. Google is well known for having nearly no management -- the controlled chaos of a research lab -- but, unless Google can adjust its organizational structure to its new size, the firm may find its innovation crushed under its own growth.

The second threat to Google is external. Google thrives on innovation, but another's innovation may out-Google Google. Clustering search engines such as Vivisimo may be one threat, but the technique hasn't proved extremely compelling yet. Many are working on question answering systems (e.g. Start from MIT), but Google, with some of the world's experts in natural language processing, is well prepared to compete here. Personalization may be one area where Google has been slow to compete. Google Labs has a prototype of personalized search, but it's been widely criticized as ineffective, most likely because of the technique used. A9 is rumored to be developing personalized search. Yahoo's CEO has said that they intend to focus their innovation on personalization. And, while Microsoft has no offering in personalized search, it is taking some steps toward personalized news.

I love Google and would hate to see it fall. But change comes rapidly in internet technology. AltaVista also seemed invincible at its height, as did those before it, yet these companies faded as new techniques eclipsed the old. Google is not invincible. It too can die.

Google research papers

I've posted a few ([1] [2] [3]) papers from researchers now at Google. It's worth noting that Google has a list of some of the papers by researchers at Google, though it's somewhat out of date.

The IR section is particularly interesting, with papers on Craig Silverstein's Scatter-Gather clustering, one of several techniques for finding similar or related web pages, and even an attempt at a personalized newspaper (though it's more customization than personalization, more like My Yahoo than Findory News).

Friday, April 23, 2004

Diebold disaster

The controversy with electronic voting heats up. A California panel has recommended decertifying Diebold and investigating civil and criminal charges against the firm. Update: A compromise has been reached.

Ethanol from waste

More progress in getting energy from waste, this time turning waste straw, wood, and paper into ethanol. Interesting since current ethanol production from corn uses about as much energy (in growing and processing the corn) as it outputs, so using waste material is an attractive alternative.

Thursday, April 22, 2004

Kill eBay, Vol. 1

eBay has announced another good quarter. Is this flea market turned shopping superstore really invincible?

Those who argue that a natural monopoly exists in online auctions claim that the auction site with the most users is always better. Buyers want selection from a large number of sellers and competition between sellers to drive down prices. Sellers want a large number of buyers to get the best price and drive high sales.

Others have argued that any site that has a critical mass of buyers and sellers is competitive. Growing the user base hits a point of diminishing returns. Sufficient sales volumes are sufficient sales volumes. The bar is much lower with this model. You don't have to be bigger than eBay, you just need critical mass.

Whether you believe the natural monopoly, bigger-is-always-better theory or the critical mass theory, there is a way to attack eBay. It is not necessary to be the biggest auction site overall. When a buyer is looking for a used copy of a Harry Potter DVD, all that matters is the number of sellers of the Harry Potter DVD. A competing auction site could focus on a particular category, perhaps computer hardware, CDs, or DVDs, and build scale in this category. Acquire more buyers and sellers in this niche, dominate the category, then move on to the next, chipping away at eBay piece by piece. Selling liquidation merchandise might be one way to bootstrap, attracting buyers to a relatively low volume site with low prices. Amazon Auctions failed because it arrogantly attempted to attack eBay head on. Had Amazon focused on dominating specific categories such as books, music, and video, we might be looking at a very different situation today.

What is GMail?

Most discussions talk about GMail as just another web-based free e-mail address that happens to have 1G of storage. But that isn't it at all.

GMail asks and answers the question, "What e-mail client would you build if you never had to delete any of your old e-mail?" GMail is designed to organize your information for easy access later. Messages are threaded, part of a conversation on a topic. Searching your mail is emphasized. And, because it's web-based, you can access your mail and any information in your mail from any computer. The 1G of storage is just a means to the end.

Tuesday, April 20, 2004

Spurl web site recommendations

Spurl is one of the more interesting web page recommendation systems I've seen. Some of the novel features: No download required. No login required. Doesn't require rating tens of sites before seeing recommendations.

Using it is quite easy. Just drag-and-drop the Javascript bookmark icon into the bookmark bar of your browser. Go to a few of your favorite web sites and click the "Spurl!" bookmark link on each of them. Then, go back to Spurl to see your web site recommendations.

The user base appears to be very small right now. The site appears to try to generate related sites even when only a couple users have seen a site, so the quality is sometimes laughably bad, but the quality is higher on more popular URLs. It'll be interesting to watch this site as the user base grows.

The Future of Google Directory

One of the changes in Google's recent redesign was removing Google Directory from the front page. Google Directory is based on DMOZ ODP, but Google has been very slow to update their data over the last couple years. Despite rumors that Google was going to abandon ODP, it appears they're sticking with it for now.

DMOZ and Yahoo both use human editors, a solution that has obvious scalability problems. With Google's deemphasis of DMOZ, one might guess that they're working on their own automated classification of web pages, especially given their in-house expertise in this area. Topix.net has taken a similar approach and done an excellent job automating the categorization of news.

Saturday, April 17, 2004

Innovation == Google

The I, Cringely column this week is on competing with Microsoft. When it comes to talking about Google, he says:

    Google has a business plan that includes the almost constant introduction of new products. They are not afraid to launch 10 services. They are not afraid if a few turn out to be flops. Finding great new product ideas for Google is a statistical process. Google is investing in good people and is letting them be creative. They are letting them think and act on their ideas. This is scary for Microsoft, which finds itself continually in reaction mode and never quite getting enough up to speed to be a real player before Google makes another change.

Have you noticed that most of the innovation these days seems to be coming from Google? 1G of free e-mail. Personalized search. Shopping metasearch. News and News Alerts. Web Alerts. Localized web search. Social networking. Where is Microsoft? Yahoo? Amazon?

Clustered databases

MySQL is offering a clustered database.

The slow progress in clustered databases is a bit surprising to me. The holy grail of clustered databases is to be able to have a virtual database that appears to be on one machine but actually is distributed across a cluster of servers. Failover should be immediate to another machine in the same cluster. Data should be replicated on demand. Performance should be high. Yet no one provides this solution, at least not at a level where the cluster is truly invisible to clients. True, the data consistency issues are much more challenging for a relational database than the Google File System, but this system needs to be built.
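
To illustrate what "invisible to clients" means, here's a toy sketch of a key-value store that looks like a single database but spreads its data over several in-memory nodes, replicates each write, and fails over on reads. Everything here (the ToyCluster class, the hashing scheme, the replica count) is an assumption made for illustration. It sidesteps exactly the hard parts a relational cluster has to solve: transactions, consistency, and resynchronizing a failed node.

```python
import zlib


class ToyCluster:
    """A toy key-value 'virtual database' spread over several in-memory nodes."""

    def __init__(self, node_count=3, replicas=2):
        self.nodes = [{} for _ in range(node_count)]
        self.alive = [True] * node_count
        self.replicas = replicas

    def _owners(self, key):
        # Hash the key to a primary node, then use the following nodes as replicas
        first = zlib.crc32(key.encode()) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write to every live node responsible for this key
        for n in self._owners(key):
            if self.alive[n]:
                self.nodes[n][key] = value

    def get(self, key):
        # Fail over to the next replica if a node is down
        for n in self._owners(key):
            if self.alive[n]:
                return self.nodes[n].get(key)
        raise RuntimeError("all replicas for this key are down")


db = ToyCluster()
db.put("user:42", "alice")
primary = db._owners("user:42")[0]
db.alive[primary] = False      # simulate a machine failure
print(db.get("user:42"))       # still answered by the replica: alice
```

The client only ever calls put and get; which machine answers is hidden behind the facade.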

Friday, April 16, 2004

Tim O'Reilly on GMail

Tim O'Reilly has an excellent column on GMail and the privacy flap over GMail.

Amazon's collaborative filtering applied to web search

Interesting comment on Battelle Media (excerpted from Rex's blog):

    If A9 incorporates the collaborative filtering algorithms that power Amazon's predictive recommendations to customers, it will (and I know this from very, very expensive first-hand experience) produce search results that will astound the user. Just think about it: Your search results will be filtered first by Google algorithms and then through Amazon's collaborative filtering algorithms.

This oversimplifies the difficulties with personalizing web search, but it's definitely on the right track.

I have little doubt that A9 will try to do personalized search. But I have my doubts about whether they will succeed. The team at A9 has expertise in search, not personalization. The people who developed Amazon's collaborative filtering algorithms are not at A9.
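
As a rough picture of what "search results filtered through collaborative filtering" could mean, here's a minimal sketch: count which pages tend to be clicked by the same users, then move results up if they are often co-clicked with pages you've clicked before. This is a deliberately simplified stand-in, not Amazon's or A9's actual algorithm, and the function names and click data are invented.

```python
from collections import defaultdict
from itertools import combinations


def co_click_counts(user_clicks):
    """Count how often each pair of pages was clicked by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for pages in user_clicks.values():
        for a, b in combinations(sorted(set(pages)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def rerank(results, my_clicks, counts):
    """Re-rank results so pages co-clicked with my past clicks move up.

    The sort is stable, so ties keep the search engine's original order.
    """
    def score(page):
        return sum(counts[mine][page] for mine in my_clicks)
    return sorted(results, key=score, reverse=True)


# Made-up click logs for three users
user_clicks = {
    "ann": ["python.org", "numpy.org"],
    "bob": ["python.org", "numpy.org", "perl.org"],
    "cat": ["perl.org", "cpan.org"],
}
counts = co_click_counts(user_clicks)
print(rerank(["cpan.org", "perl.org", "numpy.org"], ["python.org"], counts))
# prints ['numpy.org', 'perl.org', 'cpan.org']
```

The hard parts for web search are what this leaves out: sparse data for most pages, scale, and blending the personalized score with query relevance.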

Thursday, April 15, 2004

A9's "personalized search"

A9's new search does allow you to view your search history and previously viewed search results. But it is not personalized. Personalized search means that the search results differ depending on who you are.

A9's search could be personalized using this history data. For example, if Google used search history instead of an explicitly specified profile for their personalized search, they would have a search engine that personalizes based on your history. Of course, it's not quite as simple as that. Since Google's personalized search technology relies on a coarse-grained technique, the quality of the personalized results would remain low. It'll be interesting to see if A9 can find a better approach.

Wednesday, April 14, 2004


A9's new search engine has some interesting features. It tracks your history, among other things. Could be the first step toward personalized search.

Google File System

Interesting paper on GFS, Google's scalable, replicated, distributed file system. The paper describes a lot of clever optimizations for their environment and workload. Unfortunately, it doesn't include performance comparisons with alternative approaches, so it's a bit difficult to evaluate.

Tuesday, April 13, 2004

Topix.net interview

Interesting interview of Topix.net CEO Rich Skrenta. He discusses the "Robo-Editor" categorization technology at a high level.

Monday, April 12, 2004

My Yahoo and recommended RSS feeds

My Yahoo appears to have taken a first step toward personalized news. If you add the RSS feed beta to your My Yahoo content (use the drop down menu at the bottom of the page), then add a couple feeds (they have a convenient keyword search to help find them), they'll give you an option to see other feeds that might be of interest based on the feeds you selected. The recommendations seem fairly good.

Not much, but it's a first step. The question remains, why bother exposing the RSS feeds at all? Do people really want to pick RSS feeds for tens or hundreds of sources? Or do they just want to read the news that is most relevant to their interests?

Academic papers search

Google appears to be preparing to offer search of scholarly papers. For CS articles, CiteSeer is the best academic paper search around. Steve Lawrence (creator of CiteSeer) is now at Google.

More on GMail

A UI designer at Google has posted some screenshots and other information about GMail. Also, an interesting comment from a GMail beta tester. And Forbes just released a review of GMail.

Seattle Times on Findory

More press coverage of Findory News.

Sunday, April 11, 2004

Beyond personalized search

Humorous Onion article on Yahoo's "Soul Search" engine.

[Update] Apparently, folks at Yahoo read The Onion. Check out what now happens if you do a search for "What is my destiny" on Yahoo.

Saturday, April 10, 2004

The Human Equation

If you read one book on management this year, make it The Human Equation; I can't recommend it highly enough. It convincingly argues for investing heavily in your employees, an approach that seems out of vogue these days. The evidence seems to show that the cost of this investment is returned many times over in increased productivity and reduced turnover.

Investing in your employees means treating your employees as partners in the company, not contractors to be fired at will. Hiring should be selective. Training and cross-training should be routine. Teams should be self-managed, information widely and freely shared, and decisions decentralized. Wages should be well above average, tied to organizational performance, and relatively flat across the organization.

It's a compelling thesis. While I've seen others claim that some business models don't require investing in people, Pfeffer skillfully argues the advantages of this approach across all industries.

Friday, April 09, 2004

Thursday, April 08, 2004

Ads in RSS

As RSS becomes increasingly popular, advertisers are considering putting ads in the feeds.

This is a poor move. RSS is a method of driving traffic (and advertising dollars) to a website. RSS feeds contain only the title and a short blurb for an article; you have to click through to read the article.

But, when you embed ads in the RSS feed, you reduce the value of that feed and reduce its ability to drive traffic to your site. Any small advertising revenue you make from the RSS ads will be swamped by the ad revenue lost from reducing traffic to your site.
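
For reference, a typical feed item carries just a title, a link, and a short description. Here's a minimal sketch of reading one with Python's standard library; the feed content below is made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed with a single item
FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Made-up feed for illustration</description>
    <item>
      <title>Example headline</title>
      <link>http://example.com/story</link>
      <description>A short blurb; click through for the full article.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)

# Collect (title, link, blurb) for each item in the feed
items = [
    (item.findtext("title"), item.findtext("link"), item.findtext("description"))
    for item in root.iter("item")
]

for title, link, blurb in items:
    print(title)
    print("  " + blurb)
    print("  " + link)
```

Everything of value to the publisher sits behind that link, which is why anything that makes the feed less appealing costs traffic.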

Monday, April 05, 2004

Google's platform

Interesting speculation on the architecture behind Google's GMail.

"It's a distributed computing platform that can manage web-scale datasets on 100,000 node server clusters. It includes a petabyte, distributed, fault tolerant filesystem, distributed RPC code, probably network shared memory and process migration. And a datacenter management system which lets a handful of ops engineers effectively run 100,000 servers."

Can it be true? It's the holy grail of distributed computing. And Google has already built it?

Sunday, April 04, 2004

Google e-mail

A supposed screenshot of Google's new free e-mail. With 1G of storage, all for free, this is sure to be brutal competition for Yahoo Mail and MSN's HotMail.

As Google is increasingly attacked by Yahoo and MSN on search, it's a great strategy for them to counterattack in other areas like news (Google News), shopping (Froogle), and e-mail (GMail). It puts MSN and Yahoo on the defensive, distracts them, and forces them to divert resources. Good move, Google.

Thursday, April 01, 2004

Google Copernicus Center is hiring

Amusing April Fool's Joke from Google. "Interviewing candidates for engineering positions at our lunar hosting and research center, opening late in the spring of 2007."