Wednesday, December 28, 2005

Attensa funded with $12M

Attensa, a company that is "developing an end-to-end RSS Network that automatically and intelligently delivers prioritized, relevant RSS information," recently received a second round of funding, bringing its total backing to $12M. Wowsers.

Here's what Attensa says it will be doing with all that cash:
Getting to Less is More with Article Level Intelligence

By intelligently analyzing information about RSS articles and how readers are interacting with the articles, the Attensa RSS network can deliver more relevant, timely information ...

Using Attensa network attention streams that accommodate the Attention.xml standard, metadata is ... triangulated through collaborative filtering to deliver the most relevant information.

By sharing, aggregating and triangulating the attention streams (anonymously and in near real-time) generated by the millions of people using RSS feeds ... [Attensa will] create privacy protected anonymous user profiles, based on permission, that can recommend content, refine blog and Website searching, and enhance the experience of tracking the news that matters ...
Perhaps the personalization bubble has already started.

But, from what I can tell, this company seems to have cast its net far and wide, saying they'll be doing RSS for enterprise (like Newsgator), an Outlook-based feed reader (like Newsgator and soon Microsoft), metrics for publishers (like FeedBurner), popular posts (like Digg), article clustering (like Memeorandum), tagging (like del.icio.us), and recommended articles (like Findory). Who knows, with $12M in funding, perhaps they'll succeed in tackling this laundry list.

If you're interested, Attensa's VP of R&D, Eric Hayes, has a blog. A couple months ago, Eric and I had a short thread about the difficulty of building scalable recommendation systems.

Monday, December 26, 2005

A political lens on information

Mark Cuban posts a truly frightening prediction, a world where people use different tools to find information because they don't want to see any information that conflicts with their preconceived opinions:
I have zero doubt that in the future there will be sliders or some equivalent that represent "the [political] flavor" of search that users will look for.

Looking for information about the war in Iraq... push the slide rule to the right till you reach Bill O'Reilly flavored search, or slide it to the left for the Al Franken flavor. The results are then influenced by the brand you prefer to associate with.

The news is no longer just the news ... A search result will no longer just be a search result.

The Web 3.0 - You stay on your side of the web and I will stay on mine.
Can this possibly be true? Are people so afraid of being wrong that they will ignore conflicting information?

Unfortunately, I've seen some of this myself at Findory. Especially around the 2004 elections, Findory received a few pretty remarkable hate mails. This one, from someone clearly deep in the bowels of the right wing, is among the most extreme:
To: corporate@findory.com
Subject: lefty

YOUR TOO LEFT WING!!!!!!!!!!!!!!!!!!!!!!!!!!

DROP DEAD AND THEN BECOME AN AMERICAN AGAIN. OR JUST MOVE AND LIVE IN THE GLORIOUS WORLD OF MAKE BELIEVE IN EUROPE.

GOOD-BYE ENJOY YOUR VOYAGE.
We also get accusations of bias coming from the left:
To: suggestions@findory.com
Subject: Foo...

Articles history gets arbitrarily flushed from time to time.
No clustering, articles related to the very same topic are repeated and clutter the page space "real estate".
Categorisation is sometimes hapazard, articles showing under the wrong heading and "personalized" topics gathering disconnected subjects.

Yet, you manage to introduce a right-wing bias!

I suggest you use this effort and cleverness to improve the basic product instead...
Since Findory crawls thousands of sources around the world -- some considered to be conservative, some considered to be liberal, most considered to be moderate -- I've been a bit surprised by these comments.

There is a temptation to dismiss these as ravings from the lunatic fringe, but I've been curious about where these people see bias. Even with the most hateful of these e-mails, a calm response usually works well, and I've often been able to discover why a few customers feel so strongly that there is bias one way or the other.

The answer is disturbing. Findory is specifically designed to ignore political biases when recommending articles. If you read a right or a left-leaning opinion article on the Iraq War, you will be recommended other articles on the war and issues surrounding the war, some right-leaning, some left-leaning.

The idea is to avoid pigeonholing, to show people views from across the spectrum, to give people the information they need to make an informed judgment.

For some, that is exactly the problem. They don't want to see both sides. They want a filter, a political lens. As they see it, reading an opinion article on the left should only give them other opinion articles on the left (or vice versa), reinforcing the opinion they already have.

They don't want discovery. They don't want new information. They don't want to learn. They want to be pigeonholed.

And this is why I find Mark Cuban's post so frightening. If he is correct, what I've seen as a radical fringe, a few people way outside the mainstream, is actually the majority view. Mark sees a world where information is not true or false, but left or right:
This process of continuous alteration was applied not only to newspapers, but to books, periodicals, pamphlets, posters, leaflets, films, sound-tracks, cartoons, photographs -- to every kind of literature or documentation which might conceivably hold any political or ideological significance. Day by day and almost minute by minute the past was brought up to date.

In this way every prediction made by the Party could be shown by documentary evidence to have been correct; nor was any item of news, or any expression of opinion, which conflicted with the needs of the moment, ever allowed to remain on record. All history was a palimpsest, scraped clean and reinscribed exactly as often as was necessary.


- 1984 by George Orwell
That world must not be allowed to come to pass. Information must be free.

Saturday, December 24, 2005

Recommended research papers

I keep getting requests to recommend papers in personalization, recommendations, and information retrieval, mostly from students.

I've been responding to these individually, but that seems inefficient, so I decided to go ahead and post a list of a few of my favorites.

The focus in this list is on breadth, mostly surveys that provide a good introduction, mostly work that used very large data sets. Follow citations on Citeseer if you want to explore in more depth. Enjoy! I hope it makes for interesting reading over the holidays.

Friday, December 23, 2005

My 2006 predictions

I've seen several good posts ([1] [2] [3] [4]) already with predictions for 2006. I thought I'd throw my thoughts out there too.

My predictions for 2006:

After putting Google on a pedestal, the press will start knocking it down. A firestorm of bad press will undermine the pillars of hype that support Google's lofty stock price, but the negativity will not be justified by any noticeable weakness in Google's business.

Yahoo will double down on their bets in community and social networking, including buying at least two more startups working in the area. Results of their efforts will be mixed, popular among early adopters, but largely a dud for the mainstream.

Microsoft will launch an AdSense-like advertising product in the hopes of undermining Google's business, but the product will fail to attract a large network in 2006 due to relatively weak ad targeting and low clickthrough rates.

MSN Search will increase market share, but only modestly in 2006. Other search engines will not move noticeably. Searchers will continue to view Google as having the best search results, whether or not that perception is accurate.

Microsoft will abandon Windows Live.

Tagging documents (My Web 2.0, del.icio.us, tag search of documents) will fail to attract mainstream interest. Tagging will continue to be popular for photos, videos, and other items with poor metadata.

Flickr, Technorati, del.icio.us, and other popular tagging sites will find themselves under assault by spammers. Like with splogs, efforts to battle the influx of crap will be only partially successful.

Wikipedia will be sabotaged by a spam robot coming over a botnet. The spam robot will make millions of subtle, small changes to the articles, many of which will go undetected for long periods of time. Unable to keep up, Wikipedia will be forced to shut off anonymous edits and place other controls on changes.

Yahoo and MSN finally will launch blog search. Google Blog Search will grab majority market share anyway. Technorati, Feedster, and other blog search pure play startups will struggle.

The massive power of Google's cluster will be demonstrated in a much more ambitious version of Google Q&A (currently a modest experiment with automated knowledge extraction of answers from the Web). It will be well received. The launch will send the other search giants, who have been favoring simpler canned shortcuts instead, into a panic.

Interest in attention and personalization of information will grow as searchers become increasingly desperate for an easy way to surface the good stuff from all the crap out there. We'll see many new startups offering personalization products, most of which will be peddling junk. The hype will attract VCs. They will follow each other on in, bleating joyfully as they shower investment capital indiscriminately on good and bad alike.

Google will add an experiment with personalized news to Google News and expand on their personalized search. MSN and Yahoo will experiment with personalization and recommendations in news, search, and shopping. All three will experiment with highly targeted advertising using your search and browsing clickstream.

The hype about mashups and APIs will fade as more and more developers are frustrated by crippled APIs, lack of service quality guarantees, and lack of bargaining power in negotiations for commercial use of the APIs.

As their own business slows, eBay will make other large acquisitions in an effort to buy growth.

Update: Some good discussion in the comments on this post, especially on the Windows Live prediction.

Wednesday, December 21, 2005

Making the impossible possible

Google Earth CTO Michael Jones spoke at UCSD recently. A massive 150MB video of the talk is available.

The talk is mostly a demo of Google Earth, focused on showing how all kinds of user-contributed geographically tagged data can be integrated into Google Earth.

But one part of the talk I found particularly insightful was when Michael mentioned Nobel prize winner Tjalling Koopmans and commented on Tjalling's view that new tools enable new problems to be solved:
Your perception of a thing that is a viable problem to think about is shaped by the tool you can use.

If I wanted to build a swimming pool and I had a spoon, I wouldn't think about doing it. If I had a backhoe...

If we look at tools, we discover they have a life of their own. People are shaped by their tools.

Sometimes the solution to important problems ... [is] just waiting for the tool. Once this tool comes, everyone just flips in their head.
Michael was applying this to Google Earth -- that Google Earth is a tool that enables things that were not easily possible to do before -- but I think this is an insightful point about a lot of Google's work.

The goal is to build tools that enable people to find and analyze information orders of magnitude faster than before. This opens the door to attacking problems that before were prohibitively difficult to solve.

This is true of the Google search itself, the first tool many people turn to when they have a question about anything. This is true of the Google, Yahoo, MSN, and Amazon APIs, which allow people to rapidly prototype clever mashups demonstrating new ways to solve problems. This is true of Google's internal tools Sawzall and MapReduce, tools that are "major force multipliers" by allowing parallel data processing at an unprecedented scale on the Google cluster.

Problems that were difficult or impossible to solve before are becoming practical as new tools are created for processing information. It is an exciting time. Vast opportunities lie before us.

[video via Paul Kedrosky]

Monday, December 19, 2005

The folly of ignoring scaling

David Heinemeier Hansson of 37signals (and creator of Ruby on Rails) wrote what I thought was a pretty extreme post two weeks ago, "Don't scale", arguing that startups should ignore scaling and performance.

Ironically, in the following two weeks, many popular Web 2.0 startups have had problems, including a multi-day outage at del.icio.us, an 18+ hour outage at SixApart's blogging service Typepad, performance that has "sucked eggs" at Bloglines, and, as GrabPerf reports, slowness and outages at Technorati, Feedster, BlogPulse, BlogDigger, and Digg.

Stepping back for a second, a toned down version of David's argument is clearly correct. A company should focus on users first and infrastructure second. The architecture, the software, the hardware cluster, these are just tools. They serve a purpose, to help users, and have little value on their own.

But this extreme argument that scaling and performance don't matter is clearly wrong. People don't like to wait and they don't like outages. Getting what people need quickly and reliably is an important part of the user experience. Scaling does matter.

See also Om Malik's post, "The Web 2.0 hit by outages".

Update: Several months later, one of the first blog search engines, Daypop, goes offline because of scaling issues. The site says, "Daypop no longer has enough memory to calculate the Top 40 and other Top pages ... Daypop won't be back up until a new search/analysis engine is in place." Daypop has been down for a few months since this message was posted.

Update: Sixteen months later, in an interview, Twitter Developer Alex Payne says:
Twitter is the biggest Rails site on the net right now. Running on Rails has forced us to deal with scaling issues - issues that any growing site eventually contends with - far sooner than I think we would on another framework.

The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there's no facility in Rails to talk to more than one database at a time.

The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it's not just cost, it's time, and time is that much more precious when people can['t] reach your site.

None of these scaling approaches are as fun and easy as developing for Rails. All the convenience methods and syntactical sugar that makes Rails such a pleasure for coders ends up being absolutely punishing, performance-wise. Once you hit a certain threshold of traffic, either you need to strip out all the costly neat stuff that Rails does for you (RJS, ActiveRecord, ActiveSupport, etc.) or move the slow parts of your application out of Rails, or both.

Thursday, December 15, 2005

People are lazy

I love Paul Kedrosky's recent post about the three reasons trying to "change the world on the back of altered user behavior" will fail:
1. People are lazy
2. People are lazy
3. People are lazy
Paul goes on to say that "intelligence belongs in the network and in the algorithms" and "relying on users to do the heavy lifting -- however intellectually appealing -- is not going to work in the real world of lazy users who see little in it for them."

People are lazy, appropriately so. If you ask them to do work, most of them won't do it. From their point of view, you're only of value to them if you save them time.

If any work is going to be done, it's going to have to be done by a computer, not a person. People expect you to just make the right thing happen.

This is why Findory works the way that it does. No login, no configuration. Just read articles. The site learns from the articles you read and recommends other articles. The computer does all the work. It is simple, easy, and helpful.

See also my previous post, "Personalized search at PC Forum", where I describe the debate between A9 CEO Udi Manber, who claims searchers need to learn how to use more powerful tools, and Google's Marissa Mayer, who says people just want to quickly and easily get the information they need.

Wednesday, December 14, 2005

The money in the long tail

David Hornik at VentureBlog posts his conclusions about where to find value in the long tail.

Some excerpts:
There are essentially two general classes of technology that will benefit economically from the Long Tail -- aggregators and filterers.

The aggregators are those web businesses that seek to collect up as much of the Long Tail content as is possible, so as to make their "stores" a one stop shop for content no matter how popular or obscure.

The filterers are those businesses that make it easier to find the content in which we are interested ... The beneficiary of the filtering is the end user and the filterer, not the content owner per se.

I believe that it is difficult to be an aggregator without also being a filterer ... Aggregators ... [must] come up with their own clever filtering mechanisms to help consumers fully appreciate and navigate the breadth of the content they have to offer.

I think it is helpful for venture capitalists and entrepreneurs alike to focus on where the money is in the Tail. The real money is in aggregation and filtering and those will continue to be interesting businesses for the foreseeable future.
Gather up the long tail content, then filter. Help people find what they need.

Massive selection isn't enough. To make the long tail accessible, irrelevant items should be hidden. Interesting items should be emphasized. Millions of poor choices should be reduced to tens of good ones. The value is in surfacing the gems from the sea of noise.

See also my earlier posts, "Personalization and the long tail" and "Profiting from the long tail".

Tuesday, December 13, 2005

Kill Google, Vol. 2

According to a NYT article, Bill Gates was asked last month, "Will you do to Google what you did to Netscape?":
Mr. Gates, the Microsoft co-founder and chairman, paused, looked down at his folded hands and smiled broadly, as if enjoying a private joke. "Nah," he replied, "we'll do something different."
And what would that something different be? The article goes on to suggest that it will be web services, but I think it will be going after Google's lifeblood, advertising.

Geeks like me think of Google as a search company, but most biz folks I talk to view Google as an advertising company. It is the ads that generate the revenue. It is the ads that allow everything else to happen.

The AdSense revenues -- revenues from ads placed on other sites -- may be particularly vulnerable to attack. This was 43% of Google's revenue in Q3 2005. With these ads, the owner of the site gets roughly 70% of the revenue from the ad. Google takes the other 30%.

It seems like Microsoft could do a fair amount of damage here by trying to drive the share the advertising engine takes in this deal to near zero. To do that, it just needs to launch its own AdSense-like product and be willing to set its take to its breakeven point.

There are some indications that Microsoft may be planning to do this. Bill Gates pointed out that Google makes a lot of money from advertising and then scolded Google for keeping all of the advertising money for itself. Nicholas Carr recently wrote that "the wide profit margins Google enjoys on internet advertising are unsustainable" and "competition, from Yahoo and Microsoft as well as others, can be expected to reduce the profits." And MSN just announced a pilot of an AdSense-like product called AdCenter.

However, there is a big assumption here, that other advertising engines can generate the same revenue as AdSense. As long as Google's clickthrough rates are roughly 30% higher, it will be impossible for anyone else to drive Google's share of the revenue to zero.
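
Here's the back-of-the-envelope version of that argument, with purely illustrative numbers (the only input from above is the rough 30% clickthrough advantage):

# Back-of-the-envelope sketch of the argument (illustrative numbers only).
# Suppose Google's ads earn ~1.3x the revenue per impression because of
# higher clickthrough rates, and a rival pays publishers 100% of its revenue.
google_revenue_multiple = 1.3   # Google gross revenue relative to the rival's
rival_payout_fraction = 1.0     # rival gives publishers everything

rival_pays_publisher = 1.0 * rival_payout_fraction                     # = 1.0
google_payout_needed = rival_pays_publisher / google_revenue_multiple  # ~0.77

print(f"To match the rival, Google needs to pay out {google_payout_needed:.0%} "
      f"of its own revenue, keeping {1 - google_payout_needed:.0%} for itself.")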

And that is Google's defense. Focus on relevance. If Google can maintain its lead on relevance, if it can maintain higher clickthrough rates, if it can continue to generate more revenue for sites using AdSense, it is not vulnerable.

See also "Kill Google, Vol. 1" where I focus on the dangers of from growing too fast and from failing to innovate quickly.

See also "Kill eBay, Vol. 1" and "Kill eBay, Vol. 2".

Update: After the March 2006 analyst day, Google posted slides that said in the notes:
AdSense margins will be squeezed in 2006 and beyond. Y! and MSN will do un-economic things to grow share.
Google expects Yahoo and Microsoft to attack them using this strategy.

AWSP offers shell access at Alexa?

I thought Amazon Mechanical Turk was one of the strangest things I've seen in a while, but Amazon is weirding me out again with their new Amazon Web Search Platform (AWSP).

AWSP is supposed to be a developer framework for innovating on top of the crawl and index data available from Alexa. As part of this package, it appears that AWSP offers ssh access to the Alexa cluster, where you can write arbitrary C code.

This is either incredibly bold or absurdly foolish. On the one hand, this could be a useful platform for some developers, a utility computing server farm where you can rent machines by the CPU hour and access the incredible Web data available from Alexa. On the other hand, arbitrary C code can do arbitrary things, nicely accessing the data it is supposed to or evilly cracking the machine, fondling other people's data, and launching attacks on other servers.

You have to hand it to Amazon. They've been doing an amazing job thinking outside the box lately. But, sometimes, the box is there for a reason.

Update: In the comments, a couple people are arguing that these accounts appear to be isolated in virtual machines and that I may be overstating the risk. They might be right, perhaps I am being too paranoid, especially given that there are easier targets out there.

Friday, December 09, 2005

Yahoo buys del.icio.us

Just eight months after taking funding, the popular social bookmark website del.icio.us gets acquired by Yahoo.

Jeremy Zawodny has the announcement on the Yahoo Search blog and Joshua Schachter announces on the del.icio.us blog.

Yahoo seems to be making quite a push on tagging and social software. It will be interesting to see how this plays out.

See also my previous posts, "Questioning tags" and "Yahoo gets social with MyWeb".

Update: Greg Yardley claims the deal is rumored to be for $30M and computes that that works out to roughly $100 per user. Yowsers. Greg goes on to say, "Yahoo didn't buy del.icio.us' technology; it bought our bookmarks and tags - and for quite a price."

Update: John Battelle says his sources also put the deal at around $30M.

Update: Paul Kedrosky's sno.oker.ed post is pretty amusing. Some good points from Paul and in the comments on the post. [via Om Malik]

Thursday, December 08, 2005

Survey paper "Deeper Inside PageRank"

Ho John Lee pointed to a long but truly excellent survey paper on PageRank, "Deeper Inside PageRank" (PDF) by Langville and Meyer.

The 46-page paper not only describes PageRank and its many twiddles in detail, but it also covers research on optimizing the PageRank computation and on generating personalized versions of PageRank. It's a thick, dense paper, a lot of work to plow through, but I found a lot of juicy food for thought in there.

I ended the paper buzzing with questions, primarily around the probabilities of link transitions and the personalization (aka teleportation) vector. If you've got a good understanding of PageRank, I'd appreciate it if you could comment on my thoughts below and let me know if I've gone astray.

On the probabilities of transitioning across a link in the link graph, the paper's example on p. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that "any suitable probability distribution" can be used instead, including one derived from "web usage logs".

Similarly, section 6.2 describes the personalization vector -- the probabilities of jumping to an unconnected page in the graph rather than following a link -- and briefly suggests that this personalization vector could be determined from actual usage data.

In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these -- the probability of following a link and the personalization vector's probability of jumping to a page -- to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.

But, if I have enough usage data to do this, can't I calculate the equivalent PageRank directly? Let's say I have a toolbar (like Alexa, Yahoo, or Google) or ISP logs (like MSN or AOL) that gives me data on everything people visit. Instead of weighting the links in the link graph using the usage data, can I ignore the link graph and rank pages by their likelihood of being visited?

What's the difference between these two calculations? In one, I'm summing over the probability that surfers come over inbound links to find the probability that people will visit the page. In the other, I'm computing that probability directly from who actually visited the page. The link graph would seem worth using only if you don't have the usage data; it is an indirect estimate of the relevance of a page that you could otherwise calculate directly.
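
To make the comparison concrete, here's a toy sketch with invented numbers (nothing from the paper, and a uniform teleportation vector for simplicity): PageRank computed over a link graph whose transition probabilities come from observed clicks, next to a ranking computed directly from observed visit counts.

# Minimal sketch (hypothetical data, not from the paper): compare PageRank
# computed over a usage-weighted link graph with a ranking taken directly
# from observed visit counts.
import numpy as np

pages = ["A", "B", "C"]

# Observed clicks on each link, e.g. clicks["A"]["B"] = times surfers went A -> B.
clicks = {
    "A": {"B": 90, "C": 10},
    "B": {"A": 30, "C": 70},
    "C": {"A": 50, "B": 50},
}

# Observed total visits to each page from the same (hypothetical) usage logs.
visits = {"A": 400, "B": 350, "C": 250}

alpha = 0.85  # damping factor
n = len(pages)

# Row-stochastic transition matrix weighted by click counts instead of 1/outdegree.
P = np.zeros((n, n))
for i, src in enumerate(pages):
    total = sum(clicks[src].values())
    for j, dst in enumerate(pages):
        P[i, j] = clicks[src].get(dst, 0) / total

# Uniform teleportation (personalization) vector; the paper suggests this too
# could come from usage data.
v = np.full(n, 1.0 / n)

# Power iteration for PageRank.
pr = np.full(n, 1.0 / n)
for _ in range(100):
    pr = alpha * (pr @ P) + (1 - alpha) * v

direct = np.array([visits[p] for p in pages], dtype=float)
direct /= direct.sum()

print("usage-weighted PageRank:", dict(zip(pages, pr.round(3))))
print("direct visit frequency: ", dict(zip(pages, direct.round(3))))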

Now, I am assuming here that the value of a link is entirely determined by how much that link is used. If an unused link on a page does have meaning and should influence the relevance of the linked page, then this all falls apart.

But is this otherwise accurate? With enough data on what pages people visit, could the calculation over the link graph be eliminated or at least reduced?

By the way, don't miss the interesting discussion in the paper on using the personalization vector for personalized search by using a different vector for different groups of people. There are severe scaling issues with this method of search personalization, which can be partially addressed by some of the ideas from the Google Kaltix folks. For more on that, see my previous post, "More on Google personalized search".

[Ho John Lee post via Brian Dennis]

Update: Ho John Lee responded with some comments. Definitely worth reading.

Yahoo Answers and wisdom of the crowd

Jeremy Zawodny has the post on the Yahoo Search blog announcing a new product, Yahoo Answers.

Jeremy describes it as "a place to tap the collective wisdom of the Internet for advice, recommendations, theories, jokes, ... whatever."

Both Gary Price and Michael Bazeley talk about the similarity of the new Yahoo Answers with existing forums and message boards. I think this comparison is pretty accurate.

There are already moderated discussion forums where people can rate the quality of posts. Yahoo Answers would appear to be essentially the same thing, a user-moderated forum for people to talk about whatever.

Both Gary and Michael also contrast Yahoo's offering with Google Answers. Yahoo Answers is a free service where anyone can answer a question. Google Answers is a paid service where expert researchers answer questions. This difference is important.

Google Answers keeps quality high by charging a fee and restricting who can answer a question. Yahoo Answers hopes to keep quality high with a rating and reputation system.

Unless Yahoo Answers' reputation system includes something novel that does a better job of ferreting out experts, the site will have the same problem all user-moderated forums have. A popularity contest isn't the best way of getting to the truth.

People don't know what they don't know. Majority vote doesn't work if people don't have the information they need to have an informed opinion.

There was a case in the news a couple years ago of a legal advice site that had user-moderated forums. The idea was that lawyers would come on to the site, give short opinions, and use the goodwill gained to drum up future business.

A teenage kid with no legal training whatsoever hopped on to the system and started answering hundreds of legal questions with common sense answers. Despite the fact that some of his advice was wrong, badly wrong in some cases, he had the highest ratings on the site.

There is wisdom in the crowd. There is also a lot of noise. Separating the wisdom from the noise is the challenge.

Update: Looking at the Yahoo Answers point system, it appears to me that there is an incentive to answer as many questions as possible as quickly as possible without worrying about accuracy. I think that's going to need some tuning.

Update: Gary Price points out that Ask Jeeves had a very similar system to Yahoo Answers called AnswerPoint that they shut down in 2002. Why did they shut it down? Ask Jeeves SVP Jim Lanzone told Gary that the user base was very small, that "as a free service, there was little incentive for people to answer other people's questions," and that "it was usually just faster and easier for people to search normally ... than to submit a question to the community and wait for an answer."

Update: Nine months later, Philipp Lenssen posts an interview with a frequent contributor to Yahoo Answers named Michael. Michael said that, on Yahoo Answers, "the signal-to-noise ratio is astounding ... it's very difficult to sort the wheat from the chaff." He also disliked the Yahoo Answers point system, saying that they "encourage people to just give one-liner spammed responses to questions instead of actually putting in some thought."

Wednesday, December 07, 2005

Organizing chaos and information overload

In his recent post "Organizing Chaos", Peter Rip talks about the value of targeting content:
Targeting equates to value. Targeting specificity increases as volume increases, lifting the value of the entire inventory. It is a virtuous cycle .... More users generate and attract more content. Content expansion increases the value of targeting. Value is extracted by making the content more searchable, and ultimately, reusable.
This reminds me of what Bill Joy said about information overload:
Our lives are overwhelmed by all the information coming at us in a very disorganized way. We're going to hunger for something that will make sense of all the chaos--that will look at all the things happening in the world and filter and order them in a way that's personalized to us. That will be the next great revolution--that is something that doesn't take an index of the dead information on the Net, but the live information of things as they are occurring and as they are relevant to us.
Or what John Doerr said:
Maybe we'll get to 3 billion people on the web and say that what matters to all of us is information, and products, and more. Which is we live in time and we're assaulted by events. And, so, let's just say there's 3 billion events going on at any given time. And if you wanted to compute the cross product of the 3 billion people and the 3 billion events -- 'cause you need to filter very carefully the information that's going to get to this device -- I don't want to be assaulted by anything but the most relevant information ...
Or what Bill Gates said:
Workers are increasingly deluged with ... scads of information ... But finding just what they need when they need it is tough. "The software challenges that lie ahead are less about getting access to the information people need, and more about making sense of the information they have."
Or what John Battelle said:
Through the actions we take in the digital world, we leave traces of our intent, and the more those traces become trails, the more strongly an engine might infer our intent given any particular query ... I expect those trails ... to turn into relevance gold .... Clickstreams are the seeds that will grow into our culture's own memex -- a new ecology of potential knowledge -- and search will be the spade that turns the Internet's soil.
Or what I have said ([1] [2] [3]):
The urgent scaling problem for our users ... is scaling attention. Readers have limited time ... It will become harder and harder to find and discover the gems buried in all the noise. We need to help readers focus, filter, and prioritize.

There is tremendous potential in this flood of data, an opportunity to extract knowledge from the noise. ... There is wisdom in that crowd. All we need to do is find it.

Show me what matters. Help me find what I need .... Where before there was an undifferentiated glut of information, now there is focus. Where before there was noise, now there is knowledge.

Monday, December 05, 2005

Google's rules of management

In a Newsweek article, Google CEO Eric Schmidt says that Google's management philosophy gives them a competitive advantage over other firms.

A few highlights from the article:
Cater to their every need ... The goal is to "strip away everything that gets in their way." We provide a standard package of fringe benefits, but on top of that are first-class dining facilities, gyms, laundry rooms, massage rooms, haircuts, carwashes, dry cleaning, commuting buses -- just about anything a hardworking engineer might want. Let's face it: programmers want to program, they don't want to do their laundry. So we make it easy for them.

Data drive decisions. At Google, almost every decision is based on quantitative analysis. We've built systems to manage information, not only on the Internet at large, but also internally ... We have a raft of online "dashboards" for every business we work in that provide up-to-the-minute snapshots of where we are.

We adhere to the view that the "many are smarter than the few" ... At Google, the role of the manager is that of an aggregator of viewpoints, not the dictator of decisions. Building a consensus ... always produces a more committed team and better decisions.

Hire by committee. Virtually every person who interviews at Google talks to at least half-a-dozen interviewers ... Everyone's opinion counts, making the hiring process more fair and pushing standards higher ... If you hire great people and involve them intensively in the hiring process, you'll get more great people ... [a] positive feedback loop ... [with] a huge payoff.

A trusted work force is a loyal work force.
See also my previous posts ([1] [2] [3] [4]) on Google's exceptional benefits and the advantages it gives them.

See also my previous post, "The Human Equation", where I discuss a book that argues that investing heavily in your people pays off not just for knowledge workers, but for all workers.

[Newsweek article via Niall Kennedy]

Update: Some additional insight in a Business 2.0 interview of Eric Schmidt by John Battelle. [via Gary Price]

Sunday, December 04, 2005

Advanced search, PostScript, and improving search

Last night, I was trying to find something pretty specific, a PostScript program that generates random mazes when you send it down to your printer. I was having a hard time finding it with quick searches for "postscript maze" and the like, so I switched to advanced search.

What I decided to do was a search limited by filetype to Postscript (.ps) files with the word "maze" in the filename (or URL).

I was surprised to find that only Google supports this query (e.g. [allinurl: maze filetype:ps]). I think AltaVista used to be able to do it, but can't now that it is owned by Yahoo. Yahoo, MSN Search, Ask, none of the other engines can do this particular query.

There is a debate right now about whether search can be improved by giving people more powerful tools (advanced search, MSN Search's "Search Builder", Clusty's clustering, A9's "columns") or whether search just needs to do the right thing (question answering, personalized search).

While I'm not a huge believer in improving search with more powerful tools -- I don't think the mainstream will bother with them -- I'm surprised that advanced search isn't getting more attention. I was amazed that only Google supported this particular search.

By the way, PostScript is a full programming language, though a rather bizarre one, so it really is possible to write very short programs that generate mazes, fractals, and other goodies when you send them down to your printer.

I did a few of these back when I was in undergrad, but I've misplaced the files, so I went searching to see what other people have done. If you want to check out what I found, here are two ([1] [2]) of my favorites. They're PostScript files, so you'll need a PostScript printer or GSView to see them.

How is Yahoo My Web 2.0 doing?

It's been several months since Yahoo My Web 2.0 launched. How is it doing now?

Yahoo My Web 2.0 was announced as a way to overcome the "limits of web search" using "social search". Yahoo My Web 2.0 allows tagging of bookmarked web pages (like del.icio.us) and throws in some nice search and social networking features.

At the time it launched, I criticized Yahoo My Web 2.0 as being too much work for too little gain and doubted it would get mainstream adoption. Danny Sullivan at Search Engine Watch also was skeptical.

It's nearly six months later now, so let's go back and see how Yahoo My Web 2.0 is doing. Unfortunately, there's no easy way to get traffic numbers for the site, but Yahoo does post some metrics for My Web 2.0 on their home page. They say they have 407,819 pages and 99,800 tags.

This doesn't seem like a lot to me. "My Community" of 20 contacts has 2,821 saved pages and 1,411 tags. That means the entire system has only about 145 times as many pages as my little community of 20 people; if my community is roughly typical, that works out to just a few thousand people using Yahoo My Web 2.0.

What do you think? Are you using Yahoo My Web 2.0? If so, what do you like about it? If not, why not?

Update: Yahoo just acquired del.icio.us. My Web 2.0 was viewed by some as a del.icio.us knock-off. I wonder if Yahoo decided 8-9 months ago to build My Web 2.0 instead of buying del.icio.us, failed to get traction, and is now going ahead with the buy.

So, was the Orkut code really stolen?

About a year and a half ago, Wired reported that Google was being sued by a small company called Affinity Engines because of the code used for Orkut.

The suit alleged that Orkut Buyukkokten -- the engineer who wrote Google's Orkut social networking site and named it after himself -- reused Affinity Engines code when developing the Orkut site. Orkut Buyukkokten used to work at Affinity Engines.

What ever happened with this? Did Orkut steal the source code? I've heard rumors that Google is utterly in the wrong here, but the court case still is slowly grinding its way through. If there is evil to be found here, it seems to be buried under enough legal slime that it may be years before it is fully exposed.

But, looking at how Google's social networking site has languished, I can't help but think that this controversy is at least somewhat responsible. If Google was concerned about liability, they'd have a big incentive to discontinue development of Orkut.

What do you think? Was the code stolen? If so, is it a violation of Google's "do no evil" mantra? If not, why does Orkut seem to have been abandoned by Google?

Update: A few months later, the court case appears to have ended. When I inquired about it, I got the following statement from Michael Kwun at Google: "The parties have resolved their differences in this matter and have agreed not to share the terms of the agreement. We are very pleased with the outcome."

Friday, December 02, 2005

E-mail overload, social sorting, and EmailRank

Many of us know what it is like to be overwhelmed by e-mail. We fear our inboxes as a never-ending, poorly differentiated barrage, requiring laborious effort to manually skim, sort, and prioritize.

I look at this mess and think to myself, why? Why do I have to do this myself? Can't the computer help me here?

What I would like to see is a TrustRank-like system of propagating importance and reputation through a network of my e-mail contacts.

Here's how it would work:
Analyze who I e-mail, giving each person an implicit rank of importance based on my e-mail history.

Add in any people I explicitly indicate are important.

Propagate this importance through the network. That is, for each person I think is important, look at the people that important person thinks are important, and say those people must be at least somewhat important to me, then rinse and repeat.
So, now I have a large list of which contacts are important, important to me and to the community that surrounds me.

On any incoming mail, combine the importance of the contact with attributes of the specific mail message to mark the importance of the mail. Call it EmailRank, a relevance rank for incoming e-mail.

Seems like this wouldn't be too bad to implement at a large web-based e-mail site like GMail, Yahoo Mail, or Hotmail. They already have all the contact data right there. Build the graph, propagate, add analysis of the e-mail. I'm surprised it hasn't happened already.
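
To make this concrete, here's a rough sketch of the kind of propagation I have in mind. The contacts, edge weights, and scoring details are all made up; the point is the rinse-and-repeat spreading of importance through the contact graph, then combining contact importance with attributes of a specific message.

# A rough sketch of the EmailRank idea (hypothetical contacts and weights):
# seed importance from my own e-mail history, then propagate it a few hops
# through the contact graph, TrustRank-style.

# Edge weights: how often each person e-mails each other person.
contact_graph = {
    "me":    {"alice": 50, "bob": 20, "carol": 5},
    "alice": {"me": 40, "dave": 30},
    "bob":   {"me": 10, "eve": 15},
    "carol": {"me": 3},
    "dave":  {"alice": 25},
    "eve":   {"bob": 12},
}

# Step 1 + 2: implicit rank from my e-mail history, plus explicit boosts.
importance = {p: w for p, w in contact_graph["me"].items()}
importance["carol"] = importance.get("carol", 0) + 25  # I flagged carol as important

# Step 3: propagate importance through the network, rinse and repeat.
damping = 0.5
for _ in range(3):
    updated = dict(importance)
    for person, score in importance.items():
        out = contact_graph.get(person, {})
        total = sum(out.values()) or 1
        for other, weight in out.items():
            if other == "me":
                continue
            updated[other] = updated.get(other, 0) + damping * score * weight / total
    importance = updated

# Combine contact importance with attributes of the specific mail message.
def email_rank(sender, is_reply_to_me, i_am_on_to_line):
    score = importance.get(sender, 0)
    if is_reply_to_me:
        score *= 1.5
    if i_am_on_to_line:
        score *= 1.2
    return score

print(sorted(importance.items(), key=lambda kv: -kv[1]))
print("rank of mail from dave:", email_rank("dave", is_reply_to_me=False, i_am_on_to_line=True))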

See also a couple interesting Microsoft Research papers around this idea, "Attention-Sensitive Alerting" (PDF) and "The Social Network and Relationship Finder" (PDF). Neither discusses propagation of importance through a contact network, but both have a lot of ideas on methods for ranking e-mails using other data.

See also "Better e-mail prioritization going mainstream?" on TechDirt.

Update: There are also recent articles in PCWorld, CNet, and other sites on SNARF, the second of the two Microsoft Research projects I mentioned.

Microsoft Fremont is impressive

Kurt Weber (GPM, Microsoft Fremont) and Brady Forrest (PM, MSN Search) were kind enough to give me a demo this morning of Microsoft Fremont, an upcoming online, community-driven marketplace roughly similar to Craigslist.

Fremont emphasizes small scale selling to friends and acquaintances. These are easy transactions with people who have to see you again, so it makes for a friendly exchange with less risk of problems. I think it is quite likely I would prefer it over eBay, Craigslist, or selling used items at Amazon.

The goal is selling, not building up a network of friends. The social network is built implicitly from your list of IM buddies or from e-mail address groups, no fuss, no effort. Listing items is straightforward. The UI is clean and easy. It's a system you could see your grandma using.

It is impressive. They definitely have got the idea of social networking with a purpose.

I have to say, I'm surprised to see this from Microsoft. I would have expected to see this first from Yahoo as an outgrowth of Yahoo 360. Or from Amazon as some clever combination of Amazon's community features and Amazon's selling features. Or maybe from Google as an attempt to actually make Orkut useful.

Instead, Microsoft steps up to the plate. From what I've seen so far, it looks like they'll hit it out of the park.

See also my previous post, "Microsoft Fremont vs. Google Base".

Update: Three months later, Microsoft renamed Fremont to Windows Live Expo and launched it. It has no payment mechanism, so it basically feels like Craigslist with a couple social networking features thrown in. Microsoft Passport registration is required for use, something that I found to be an annoying hurdle.

In short, not as impressive as I had expected.

VCs and investing in "me-too" companies

John Cook at the Seattle PI talks about how VCs seem to trend follow, throwing money at startups with many existing venture-backed competitors:
VCs are investing in startup companies that already have four or five venture-backed competitors -- something I saw during the dot com boom of the late 90s and something that is occurring once again ...

Rustic Canyon Partners' Jon Staenberg said he is concerned about the "me-too" companies being formed in certain sectors. He said the world didn't need five or six online pet food stores during the late 1990s and it probably doesn't need five or six social networking companies today.

Some carnage will occur.
From my experience talking with VCs over the last couple years, I think VCs invest in "me-too" companies because there is less personal risk for them to do that than find and invest in new ideas.

Look at it from a VC's point of view. You have one company entering a market space that several other VC firms have already evaluated and blessed. Another company has a new product with no market history, requiring a lot of effort to evaluate the potential.

Arguing for investing in the first company is easy. Just point at all the other interest and the due diligence the other firms presumably already did. If the company fails in the end, you can say, "Well, everyone thought this was a good idea. Not my fault."

Championing the second company is personally risky and hard. Due diligence on a new technology requires a lot of expertise, time, and work. In the end, a couple people at the firm will have to stick their necks out on the investment to get the other partners to go along, something that is personally risky for them if the investment fails.

While this may not be the best thing for the investors in the fund, the VCs are just behaving as rational actors, minimizing their personal risk. "Me-too" trend following is the result.

[I first posted a version of this as a comment on John's post. I later decided to cross-post it here.]

Wednesday, November 30, 2005

Is Web 2.0 nothing more than mashups?

Glenn Fleishman was on NPR's The Works yesterday talking about Web 2.0. Glenn defined Web 2.0 as mashups, accessing and combining web APIs. Mashups and nothing but mashups.

When asked about business models for these mashups, Glenn talked about how they could start small with low costs, but said nothing about how they might generate revenue or how far they could grow.

Similarly, I talked to Rael Dornfest a couple weeks ago. He also made it clear that he thought mashups are the next big thing.

When I asked Rael about some basic problems with mashups as a business (no service guarantees, limits on API queries, limits on commercial use of the APIs, numbingly slow performance, no barriers to entry), he had no answer.

I keep hearing people talk as if companies are creating web services because they just dream of setting all their data free. Sorry, folks, that isn't the reason.

Companies offer web services to get free ideas, exploit free R&D, and discover promising talent. That's why the APIs are crippled with restrictions like no more than N hits a day, no commercial use, and no uptime or quality guarantees. They offer the APIs so people can build clever toys, the best of which the company will grab -- thank you very much -- and develop further on their own.

There is no business model for mashups. If Web 2.0 really is just mashups, this is going to be one short revolution.

See also my previous post, "Can Web 2.0 mashups be startups?"

Update: Richard MacManus has some good thoughts on this in his post, "Mashups: who's really in control?"

Tuesday, November 29, 2005

Microsoft Fremont vs. Google Base

Ben Charny at eWeek reports that Microsoft will soon be offering a competitor to Craigslist and Google Base:
Microsoft Corp. said it is readying an online marketplace, code-named Fremont, which is apparently in response to a similar feature that rival Google Inc. introduced a few weeks ago.

Fremont is a free service in which people contribute listings, whether it's about a couch for sale or someone looking for a commuting partner.
See also my previous post, "Google Base and getting the crap out".

[Found on Findory]

Update: The news about Microsoft Fremont appears to have been first reported by Michael Arrington at TechCrunch. Congrats on the scoop, Mike.

Update: Todd Bishop at the Seattle PI gives us a longer article about Microsoft Fremont. It sounds like Microsoft intends to do some interesting social network stuff with it, allowing selling to your network of friends.

Update: A little fun trivia, Todd Bishop got confirmation that Microsoft Fremont is named after the Fremont community in Seattle, the "center of the universe."

Update: Charlene Li saw a demo of Microsoft Fremont and describes in some detail why she thinks "Microsoft's classifieds service will be better than Google Base." Most of her criticism centers on poor usability of Google Base for mainstream users.

However, at this point, Google Base is more of a database than an end-user product, literally a base on which to build. Evaluating it directly against Craigslist or other classified sites in its current form probably underestimates its potential. We'll likely soon see new products launched with a more mainstream look-and-feel and better usability layered on top of Google Base.

Update: Danny Sullivan makes a similar point, that Microsoft Fremont/Craigslist should be compared to a later Google Classifieds product, not to Google Base.

Update: Four months later, Microsoft renamed Fremont to Windows Live Expo and launched it. It has no payment mechanism, requires MS Passport to use, and generally has the feel of free classifieds with some social networking goo thrown on top.

It is interesting to note that Google went the other direction with this, integrating a payment mechanism into Google Base. This makes Windows Live Expo (aka Fremont) look more like a competitor to Craigslist and Google Base more like a competitor to eBay, Amazon, and Yahoo Stores.

Is personalized search a dead end?

Raul Valdes-Perez, CEO of the excellent clustering search engine Vivisimo, wrote a one page paper called "Why Search Personalization is a Dead End" (PDF).

He lists five reasons why personalized search is doomed:
People are not static; they have many fleeting and seasonal interests.

The surfing data used for personalizing search is weak [compared to purchase data].

The user's decision to visit a page is based on the title and brief excerpt (snippet) shown in the search results, not the whole page.

Home computers are often shared among family members.

Queries tend to be short.
The criticisms might be summarized as a claim that clickstream data is too dynamic, noisy, and sparse to support personalization.

There are two problems with this argument. First, Amazon.com's personalization works just fine from similar clickstream data. Sure, it's true that the data is dynamic, noisy, and sparse, but Amazon deals with that by using algorithms that adapt rapidly, are tolerant to errors, and work from very little data.

Second, personalization doesn't have to be perfect. It just has to be better than the alternative. In Amazon's case, the alternative to a personalized front page is a generic front page with a top sellers list or a bunch of marketing goo. It's easy to be more useful to shoppers than that. Mistakes are just fine. The guesses just need to be right more often than the alternative.

Personalized search is no different. The algorithms need to adapt rapidly, be tolerant to noise, and work from little data. Mistakes are fine. Personalized search just needs to be more useful than unpersonalized search.
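
As a toy illustration of working from very little data -- invented clickstreams, and not Amazon's or Findory's actual algorithms -- even a simple item-to-item approach can make reasonable guesses from a single click:

# Toy sketch of item-to-item recommendation from sparse clickstream data
# (invented data; not Amazon's or Findory's actual algorithm).
from collections import defaultdict
from itertools import combinations
import math

# Each session is just the set of articles (or products) someone clicked.
sessions = [
    {"iraq-op-ed", "election-analysis", "baghdad-report"},
    {"iraq-op-ed", "baghdad-report", "oil-prices"},
    {"election-analysis", "polling-data"},
    {"oil-prices", "opec-meeting"},
]

# Co-occurrence counts and per-item counts.
cooc = defaultdict(lambda: defaultdict(int))
count = defaultdict(int)
for session in sessions:
    for item in session:
        count[item] += 1
    for a, b in combinations(session, 2):
        cooc[a][b] += 1
        cooc[b][a] += 1

def similar(item):
    """Cosine-style similarity between `item` and every co-clicked item."""
    return {other: c / math.sqrt(count[item] * count[other])
            for other, c in cooc[item].items()}

def recommend(history, k=3):
    """Recommend items given only the few items a visitor has clicked."""
    scores = defaultdict(float)
    for item in history:
        for other, s in similar(item).items():
            if other not in history:
                scores[other] += s
    return sorted(scores, key=scores.get, reverse=True)[:k]

# A brand-new visitor with a single click still gets useful recommendations.
print(recommend({"iraq-op-ed"}))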

See also my earlier posts, "Perfect Search and the clickstream" and "Personalized search vs. clustering".

[Valdes-Perez paper via John Battelle]

Monday, November 28, 2005

Is personalized advertising evil?

In a long post about unethical behavior at software firms, Alex Bosworth has an interesting rant against personalized advertising:
Until privacy advocates raised a ruckus, Google engineers had big plans for mining a user's email trove to offer them precisely targeted advertisements ...

A real life equivalent to Google's personalized advertising dream would be a store where an anonymous greeter already knew not only your name but everything about you including the contents of your most intimate communications, and attempted to direct you to things they thought you might buy.

This is not only extremely off-putting and an abuse of trust to many, it's also potentially disruptive to the decisions a person might normally take.
On the one hand, personalized advertising could open the door to new forms of invasive annoyance. I am reminded of the scene in Minority Report where sales avatars on screens call out to you by name as you walk past. Or the disturbing vision of the future in the Golden Age series, where flying banner advertisements swarm through the air, something so obnoxious that everyone uses cybernetic implants that alter their perception of the world and filter out the ads.

On the other hand, personalization may offer the key to relevance. We are all bombarded by advertising in our daily lives. Junk mail, ads in magazines, it is all ineffective mass market noise pummeling us with things we don't want. But, companies with a helpful product need some way for interested people to find out about it. What I would like is a way for the ads to be limited to only interested people. Don't waste my time, tell me about something that actually might interest me. I'd like the advertising to be useful.

I think that being obnoxious always fails in the long run. Spam e-mail is now filtered. Pop up ads are blocked. Obnoxious personalized advertising would be no different. People hate obnoxious.

But being relevant and useful always pays off. Personalization can help people find and discover relevant information they wouldn't have found on their own. Personalized advertising can be useful. And people like useful.

Saturday, November 26, 2005

Amazon adds product wikis, tagging

Amazon has added a "ProductWiki" to some product pages, following a move a couple weeks ago to test tagging of products.

For as long as I can remember (at least 1996), Amazon has allowed customers to review and comment on products on their site. These ProductWiki and tagging experiments seem to be attempting to build off the success of customer reviews by gathering additional user contributed content.

I think these experiments are pretty interesting, but I'm not sure these particular efforts are likely to bear fruit. Wikipedia fights off spam and crap by having a couple thousand dedicated volunteer editors who track recent changes closely and revert bad content changes quickly. Amazon will not have that for their ProductWikis. Tagging in Flickr works well because metadata isn't available for photos unless users provide it explicitly. Products already have metadata, including keywords extracted from the descriptions, so the value of tagging isn't as obvious.

Regardless, it's great to see this kind of experimentation from Amazon. From customer reviews to user pages to friends pages, Amazon was very early with community and social networking features. There's much opportunity here.

[Found on Findory]

Wednesday, November 23, 2005

Web 2.0 bingo

Steven Cohen points to Web2.0bingo.com, a site that lets you quickly print random bingo boards with Web 2.0 buzzwords.

Looking at this with Findory... Bingo! 13 out of 24. We're highly buzzword compliant, yippie.

Tuesday, November 22, 2005

Perfect Search and the clickstream

John Battelle generously sent me a signed copy of his new book, "The Search". Thanks, John!

The book is a fun read, a great overview of the history of search companies with some interesting thoughts on the future of search.

The last chapter, Perfect Search, looks forward to the next-generation of search engines. It has several pages on clickstream personalization. An excerpt:
Perfect search ... means nothing if the engine does not understand you -- your likes and dislikes, your tendencies and tics.

A solution to this problem lies in the domain of your clickstream. Through the actions we take in the digital world, we leave traces of our intent, and the more those traces become trails, the more strongly an engine might infer our intent given any particular query ... I expect those trails ... to turn into relevance gold ....

Clickstreams can provide a level of intelligence about how people use the Web that will be an order of magnitude more nuanced than mere links, which formed the basis for Google's PageRank revolution ....

Clickstreams are the seeds that will grow into our culture's own memex -- a new ecology of potential knowledge -- and search will be the spade that turns the Internet's soil.

Engines that leverage clickstreams will make link analysis-based search (nearly all of commercial search today) look like something out of the Precambrian era ... We have yet to aggregate the critical mass of clickstreams upon which a next-generation engine might be built ... [but] we're already pouring its foundations.
Current search engines treat each search independently, ignoring the valuable information about what you just did, about what you just found or failed to find. Paying attention to that clickstream will allow search to become more relevant, more useful, and more helpful, all with no effort from searchers.
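
To make that concrete, here's a toy sketch of clickstream-aware re-ranking where results that share terms with recently clicked documents get a small boost. The weights and data structures are invented, not anyone's actual ranker.

  # Toy sketch of clickstream-aware re-ranking: results sharing terms with
  # recently clicked documents get a small boost. Weights and data
  # structures are invented for illustration.
  def rerank(results, recent_click_terms, boost=0.2):
      # results: list of (doc_id, base_score, terms); recent_click_terms: list of term sets
      clicked = set().union(*recent_click_terms) if recent_click_terms else set()
      scored = [(base + boost * len(set(terms) & clicked), doc_id)
                for doc_id, base, terms in results]
      return [doc_id for _, doc_id in sorted(scored, reverse=True)]

  # The searcher just clicked documents about fly fishing, so a later
  # query for "bass" leans toward the fishing result.
  results = [("bass-guitar-lessons", 1.0, ["bass", "guitar", "music"]),
             ("bass-fishing-tips", 0.9, ["bass", "fishing", "lures"])]
  print(rerank(results, [{"fly", "fishing", "river"}]))
  # ['bass-fishing-tips', 'bass-guitar-lessons']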

Mixing local into Google's Froogle

It's being widely reported that Google's metashopping site Froogle now lets shoppers see what items are available at physical stores nearby.

Coverage seems quite weak at the moment. Here's a search for "earrings 94301" that returns almost no results in the Palo Alto area. I would hope that would improve quickly since the service is nearly useless as is.

However, looking forward, this brings Google much closer to the disintermediation threat that has struck fear into retailing giants. Improve the coverage, hook this up to Google SMS, and you've got a service that might "be able to tell Wal-Mart shoppers if better bargains are available nearby."

Monday, November 21, 2005

Food, perks, and competing with Google

Marc Ramirez at the Seattle Times has a fun article on the food at prominent Seattle companies.

At a couple points, Marc compares the offerings to the "Google Nirvana," saying that "Google single-handedly has redefined the meaning of corporate cafeterias."

I am quoted in the article as saying that Google is "in a class of its own." The article also quotes from this blog post where I said that "investing in your people pays for itself."

It is amazing to me that Microsoft cut its perks at a time when it is trying so hard to compete with Google. That is not helpful if you want to retain your best people.

Sunday, November 20, 2005

How Google tamed advertising

Randall Stross at the New York Times has an interesting column today, "How Google Tamed Ads".

An excerpt:
Five years ago, Web advertisers were engaged in an ever-escalating competition to grab our attention. Monkeys that asked to be punched, pop-ups that spawned still more pop-ups, strobe effects that imparted temporary blindness - these were legal forms of assault.

The most brazen advertiser of all, hands down, was X10, a little company hawking security cameras, whose ubiquitous "pop under" ads were the nasty surprise discovered only when you closed a browser window in preparation for doing something else.

Today, Web advertisers by and large have put down their weapons and sworn off violence. They use indoor voices now. This is a remarkable change.

Thank you, Google.

Without intending to do so, the company set in motion multilateral disarmament by telling its first advertisers in 2000: text only, please. No banner ads, no images, no animation. Just simple words.
Exactly right. Be relevant, not annoying.

If the ads are well targeted and interesting, people will stop ignoring them. When I search for products on Google, I often skim over the ads as well as the web search results. The Google search ads are useful, often a link to exactly what I need, as relevant as much of the other content on the page.

But most ads are still targeted using only the content on the page. This works okay for search, when I'm on a focused mission, but less well for content sites like news or weblogs. We can do better.

The next step is to personalize the advertising. Sites need to learn about me. Pay attention to what I'm doing and what I like. They shouldn't waste my time with things they know will be irrelevant to me. Content sites should show me ads for things I might actually want.
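
As a rough, hypothetical sketch of what that might look like, here's a toy ad scorer that blends a match against the page's topics with a match against the reader's reading history. The topics and weights are made up; no real ad system is this simple.

  # Toy sketch of personalized ad selection: score each ad by a blend of
  # how well it matches the page and how well it matches the reader's
  # reading history. Topics and weights are invented.
  def score_ad(ad_topics, page_topics, reader_topics, w_page=0.4, w_reader=0.6):
      return (w_page * len(ad_topics & page_topics)
              + w_reader * len(ad_topics & reader_topics))

  ads = {"running-shoes": {"running", "fitness", "shoes"},
         "mortgage-refi": {"finance", "mortgage", "loans"}}
  page_topics = {"marathon", "running", "training"}    # the article being read
  reader_topics = {"running", "fitness", "travel"}     # inferred from reading history

  best = max(ads, key=lambda name: score_ad(ads[name], page_topics, reader_topics))
  print(best)  # running-shoes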

See also my previous posts, "Make advertising useful" and "The content should find you".

Friday, November 18, 2005

A data center in a trailer

The I, Cringely article this week, "Google-Mart", has an interesting rumor about Google's prototype of a data center stuffed into a truck trailer:
In one of Google's underground parking garages in Mountain View ... in a secret area off-limits even to regular GoogleFolk, is a shipping container. But it isn't just any shipping container. This shipping container is a prototype data center.

Google hired a pair of very bright industrial designers to figure out how to cram the greatest number of CPUs, the most storage, memory and power support into a 20- or 40-foot box. We're talking about 5000 Opteron processors and 3.5 petabytes of disk storage that can be dropped-off overnight by a tractor-trailer rig.

The idea is to plant one of these puppies anywhere Google owns access to fiber, basically turning the entire Internet into a giant processing and storage grid.
The article goes on to claim that Google will take over the internet, crush Yahoo and Microsoft, enslave all of humanity to the new Google hive mind, blah, blah, blah. Not so sure about that part.

But this trailer rumor, if true, is an interesting step in mass production of data centers. Google already just wheels in racks, 40-80 computers in each, plugs 'em in, and sets up their data centers very quickly. Now, maybe they'll just drop off a couple trailers, plug them in, and -- poof! -- instant data center.

[via Don Dodge]

Update: In his latest column, Cringely extends this vision with Google Cubes in every house, on every TV, on every phone, everywhere. Interesting mind trip, though a little too reminiscent of the Borg for my taste.

Security hole in Google Sitemaps

Danny Sullivan reports that David Naylor and others discovered a security hole in the way Google Sitemaps grants access to site statistics.

To prove you own the website you want to access, Google Sitemaps asks you to drop a file with a long code in the filename at the root level of your website (e.g. 1029392729387.html). It then checks to make sure this file exists and, if it does, it gives you access.

The problem is that it only checks whether the file exists. As David and Danny point out, many websites -- including huge ones like eBay, AOL, and Google's own Orkut -- display a friendly error message on invalid pages that says something like, "Hey, this page doesn't exist!" Google Sitemaps sees that the error page comes back with a 200 status code rather than a 404 and concludes, "Huh, look, the page exists!"

Oopsie. Because of this error, Danny and David managed to access the Google Sitemap stats for eBay, AOL, and other websites.

On the one hand, Google is right that websites really should return an HTTP "not found" code (404) for pages that are not found. On the other hand, many, many sites don't.

This reminds me of the problems with caching and prefetching when Google Web Accelerator launched. Google assumed all websites strictly obeyed the HTTP spec, but they don't, so the tool didn't work properly. You need to work with reality, Google, not the way things should be.

This really is pretty lame of Google. Many other sites have to deal with this same kind of "claim your site" problem. They often do it by requiring you to put a code in a comment in one of your webpages, not by creating a new file, but there's any number of other ways to do it that work just dandy and don't open huge security holes.
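
For example, here's a minimal sketch of a sturdier check, one that requires the verification file to actually contain the secret token rather than merely exist, so a soft-404 error page served with a 200 doesn't pass. It illustrates the general technique, not Google's implementation.

  # Minimal sketch of a sturdier ownership check: the verification file
  # must contain the secret token, not merely exist, so a soft-404 error
  # page served with a 200 status does not pass. Illustrative only.
  import urllib.request

  def verify_site(site_url, token):
      url = f"{site_url.rstrip('/')}/{token}.html"
      try:
          with urllib.request.urlopen(url, timeout=10) as response:
              body = response.read().decode("utf-8", errors="replace")
      except Exception:
          return False
      return token in body  # the file must echo the token back

  print(verify_site("http://example.com", "1029392729387"))  # False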

C'mon, Google, you're supposed to be better than this.

Update: About 8 hours later, Stefanie Olsen says that Google has fixed the issue. Quick response, excellent.

Update: Another security problem at Google, a cross-site scripting vulnerability in Google Base. Apparently, the problem already has been fixed. [via Nathan Weinberg and Danny Sullivan]

Update: When it rains, it pours. Another recent security issue, this one in the Google Mini, that could have allowed arbitrary command execution. It already has been patched. [via Danny Sullivan]

Thursday, November 17, 2005

Netflix and personalization

Davis Freeberg has an interesting writeup of an investor presentation by Netflix CFO Barry McCarthy.

An excerpt on Netflix personalization and recommendations:
The personalization of their site is really what makes their service so unique. At this point Netflix has now collected over 1 billion ratings for movies. They use these ratings to make recommendations of long tail content for their consumers.

[McCarthy said,] "We help you find movies you've never heard of ... There were 554 movies released theatrically last year and I bet most of us can't name 20. A lot of those movies you would enjoy, if you knew that they existed. If you don't know that they existed then they might as well have not been invented."

He later goes on to compare this approach with Blockbuster.

"Historically Blockbuster has reported that about 90% of the movies they rent are new theatrical releases ... They have a slightly different mix online ... 70% of what they rent online is new releases and about 30% is back catalog."

"About 30% of what we rent is new releases and about 70% is back catalog and it's not because we have a different subscriber. It's because we create demand for content and we help you find great movies that you'll really like, we do it algorithmically and we do it with recommendations and ratings."
Personalization aids discovery in the long tail, pushing consumer demand into the back catalog.

Personalization surfaces interesting items you didn't know about and wouldn't have found on your own. It's a complement to search. Search helps when you know what you want. Personalization helps when you don't know what's out there.
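
For the curious, here's a toy sketch of the general flavor of ratings-based recommendations, item-to-item collaborative filtering, where two movies are related if the same people rated both highly. The data is invented, and this is not Netflix's actual algorithm.

  # Toy item-to-item collaborative filtering over ratings: two movies are
  # related if the same people rated both highly. Invented data; not
  # Netflix's actual algorithm.
  from itertools import combinations
  from collections import defaultdict

  ratings = {"ann":  {"Primer": 5, "Memento": 5, "Big Blockbuster": 2},
             "bob":  {"Primer": 4, "Memento": 5},
             "cara": {"Memento": 4, "Big Blockbuster": 5}}

  def related_movies(movie, ratings, liked=4):
      co_liked = defaultdict(int)
      for user_ratings in ratings.values():
          liked_movies = [m for m, stars in user_ratings.items() if stars >= liked]
          for a, b in combinations(sorted(liked_movies), 2):
              co_liked[(a, b)] += 1
              co_liked[(b, a)] += 1
      counts = [(n, other) for (m, other), n in co_liked.items() if m == movie]
      return [other for n, other in sorted(counts, reverse=True)]

  print(related_movies("Memento", ratings))  # ['Primer', 'Big Blockbuster']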

[via TechDirt]

Free Google WiFi in Mountain View

Matt Marshall at SiliconBeat reports that Google will be providing free wireless internet access to everyone in the city of Mountain View.

Google also has an offer out to provide free WiFi to all of San Francisco.

So, Google, when are you coming to Seattle? Pretty please?

Google gobbling up Riya?

Om Malik posts a rumor that Riya may be getting acquired by Google.

Riya does automatic people and object recognition in photos. The technology supposedly can automatically tag photos with descriptions of the content of the image. It has obvious applicability to Google Image Search and Picasa.

Interesting if it turns out to be true. Yahoo's focus seems to be on community and user-generated content (like Flickr tagging). Google focuses on automation using clever algorithms. Google acquiring Riya would fit that pattern.

[via Niall Kennedy]

Update: Michael Arrington has a fun sneak peek at Riya's technology.

Update: The rumor was bogus.

Update: A year later, Riya gives up on face recognition as too hard and switches to recognizing characteristics of products. Very lame given all the Riya hype. Riya appears to have been yet another company with vaporware. They made strong claims about solving hard problems that they never could actually solve.

Wednesday, November 16, 2005

Google Base and getting the crap out

Bindu Reddy has the post on the Google Blog announcing the launch of Google Base.

Widely hyped as a possible Craigslist and eBay killer, Google Base looks to me a lot more like a slightly more structured version of a wiki. You can add nearly any content you want defined by nearly any fields you want.

There isn't all that much content yet but, as content is added, the trick will be keeping spam and crap out. I expect Google Base to be treated like a free version of Google AdWords by many. I doubt it will take long for people to upload your usual assortment of credit card offers, domain name services, get-rich-quick schemes, and exciting new ways to increase the size of your willy to elephantine proportions.

So, how will they help people find the relevant stuff and filter out the crap? At this point, it isn't clear. We'll have to wait and watch.

See also my previous post, "Getting the crap out of user-generated content".

See also good comments on Google Base by Nathan Weinberg, John Battelle, Tara Calishain, Gary Price, Danny Sullivan, and TechDirt.

Update: Google Base does appear to be using a few techniques to reduce crap, including automated detection of naughty or spammy words, community reporting of bad items, and, when searching, suggestions of categories and tag terms to refine the search and improve relevance.

It will be interesting to watch and see how well these techniques scale over time.
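
As a toy illustration of that kind of filtering, here's a sketch that flags a listing when it contains spammy phrases or racks up enough community reports. The phrase list and threshold are invented; Google hasn't described its implementation at this level of detail.

  # Toy sketch of crap detection: flag a listing if it contains spammy
  # phrases or accumulates enough community reports. Phrase list and
  # threshold are invented.
  SPAMMY_PHRASES = ["get rich quick", "free credit card", "work from home"]

  def looks_like_crap(listing_text, report_count, report_threshold=3):
      text = listing_text.lower()
      if any(phrase in text for phrase in SPAMMY_PHRASES):
          return True
      return report_count >= report_threshold

  print(looks_like_crap("Get rich quick with this amazing domain offer!!!", 0))  # True
  print(looks_like_crap("Gently used mountain bike, $150, pickup in Seattle", 1))  # False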

Update: Wow, it sure is easy to upload crap to Google Base. It takes RSS feeds of 1k items at a time.

I think we're going to see some heavy abuse of this, especially if Google starts including Google Base search results in Google web search (like they do for Google News and Froogle search results). That would create quite the profit motive.

Here's one crazy idea for abusing the system. Someone should try uploading all of Amazon.com's RSS feeds but inserting an associate tag into the URL. Same thing is probably possible with eBay and many other places that offer referral kickbacks. Woo hoo, go wild, script kiddies.

Update: A week later, John Leyden at The Register reports "Google Base awash with smut".

Update: Three weeks later, Nathan Weinberg reports that "product listings on Google Base have become almost entirely filled with [affiliate] redirect URLs to Amazon." Woo hoo, looks like the script kiddies went wild.

Monday, November 14, 2005

Alex takes the red pill

Alex Edelman is leaving Findory and joining IMDb today.

If you have never checked out the Internet Movie Database (IMDb), you definitely should. It's filled with detailed information and reviews about movies. Don't miss their remarkable power search that lets you do elaborate searches for things like "highly rated Sci Fi movies made after 2001 with more than 500 votes".

The problem with being a frugal, self-funded startup is that you're a frugal, self-funded startup. After 14 months without a salary, Alex felt he needed a steady source of income, and Findory is not able to provide that for him.

I am proud of what Alex and I built together at Findory. The last 14 months have been extraordinary:
  • Traffic growth: In the last 14 months, Findory traffic grew nearly 1800% from 250k hits/month to 4.4M hits/month.

  • Redesigns: We had two major redesigns (original, first, current).

  • Servers: Findory deployed four additional servers as we grew, bringing our total to six.

  • Press: We enjoyed seeing press coverage in Time Magazine, Spiegel, The Times, eWeek, Seattle PI, Seattle Times, Puget Sound Business Journal, Searcher Magazine, Forbes.com, Search Engine Watch, InsideGoogle, Online Journalism Review, and many other places.

  • Millions of feeds: We launched millions of different RSS feeds, helping people read and consume information from Findory any way they like it. Our unusual personalized versions of our feeds learn and adapt as you read articles from the feeds.

  • Inline Findory: Inline Findory lets bloggers put a snippet of their Findory front page on their weblog.

  • Findory API: The Findory API lets other sites remix Findory data for fun and profit.

  • Personalized web search: The alpha of our personalized web search modifies web search results using each person's search and clickstream history.

  • Personalized news and blogs searches: Our personalized news and blogs searches highlight articles in your search results that Findory recommends based on your reading history.

  • Search history: Search history makes it easy to find things you found once before.

  • Source pages: For any news site or weblog in our database, source pages show recent articles, related news sites and weblogs, and related articles.

  • Findory feed reader: In September 2005, we launched our personalized feed reader. Unlike other feed readers, Findory's feed reader recommends articles that are particularly likely to be interesting to you.

  • Personalized advertising: Recently, we launched our personalized advertising engine. It picks advertisements based on not only the content of the page, but also which articles each reader has read in the past.
I wish Alex the best of fortune at IMDb and in the future. We achieved a lot together at Findory, but there is much left to do. I look forward to continuing to innovate and build on Findory.com.

Thursday, November 10, 2005

Personalized news search from Google?

Chris Sherman reports that Google will be launching personalized news search soon:
Google plans to integrate personalized search with Google News ... You'll be able to see the history of past news searches and the articles that you clicked on.

Since Google only maintains links to news stories up to a maximum of 30 days after publication, you may not be able to retrieve the article from your history. However, both the title and URL of stories are preserved, and you will be directed to news site to search for the article using the news service's own site search or archive tools.

Google says that the integration of Google News into personalized search will be coming "soon."
Excellent. Findory has had this for a long time, but it is great to see it from Google. It doesn't appear that your Google News reading history will personalize the Google News front page like Findory, but that may be coming at some point as well.

The beginning part of Chris' article talks about Google Personalized Web Search. If you haven't tried that yet, you should. It's the only example from any of the search giants of showing different search results to different people based on what each person has done in the past.

There are smaller folks exploring personalized web search, including Findory. And it's interesting to look at the differences between Findory's alpha personalized web search and Google's personalized web search. Google's technique is to bias all your search results toward your long-term profile (e.g. read an article on fly fishing, then a future search on "bass" is biased toward fishing, as is a future search on "computer").

Findory's personalized search (which admittedly is much less mature) tries to change your search results based on what you just did. If you do a search, don't find what you want, then twiddle your keywords and search again, there's valuable information there. What you did or didn't find in your first search should influence what you see in your second search.

That's the big difference. Findory's technique tries to make fine-grained changes to your search results to help you with whatever you're trying to do right now. Google's technique makes coarse-grained changes (e.g. a bias toward fishing) using a long-term profile.
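
To make the contrast concrete, here's a toy sketch of the two flavors: a coarse-grained nudge toward a long-term profile versus a fine-grained adjustment driven by the previous query and what was just shown and skipped. Neither is the real Google or Findory implementation; the scores and topics are invented.

  # Toy contrast of the two approaches. Neither is the real Google or
  # Findory implementation; scores and topics are invented.
  def profile_bias(results, long_term_profile, weight=0.2):
      # Coarse-grained: nudge every result toward long-standing interests.
      return sorted(results, reverse=True,
                    key=lambda r: r["score"] + weight * len(r["topics"] & long_term_profile))

  def session_adjust(results, last_query_terms, skipped_urls, weight=0.3):
      # Fine-grained: react to what the searcher just tried and just skipped.
      def adjusted(r):
          refine_boost = weight * len(r["topics"] & last_query_terms)  # same hunt, new words
          skip_penalty = weight if r["url"] in skipped_urls else 0.0   # shown before, not clicked
          return r["score"] + refine_boost - skip_penalty
      return sorted(results, key=adjusted, reverse=True)

  results = [{"url": "bass-guitar.example", "score": 1.0, "topics": {"bass", "music"}},
             {"url": "bass-fishing.example", "score": 0.9, "topics": {"bass", "fishing"}}]
  print([r["url"] for r in profile_bias(results, {"fishing"})])
  print([r["url"] for r in session_adjust(results, {"fishing", "lake"}, {"bass-guitar.example"})])
  # Both print the fishing result first, but for different reasons.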

Fun stuff. It's great to see Google pushing hard on personalization of information.

Sunday, November 06, 2005

Topix.net adds blogs

Rich Skrenta posts that Topix.net just added thousands of blogs to their site, crawling them and categorizing them along with articles from thousands of mainstream sources.

Topix.net says they added the 15k "top weblogs" to their search results mixed in with mainstream news sources.

For example, a search for "Amazon Mechanical Turk" on Topix.net brings up articles from news sites and a few dozen blogs including my post.

It doesn't dive as deep as Technorati, Feedster, or Google Blog Search, but it does give a good number of high-quality results. I saw no spam in my search results for the examples I tried; spam is a serious problem on those other blog search engines.

Rich did say that the 15k blogs are just a start and that they hope to expand their coverage up to 1M weblogs. It'll be interesting to see if they can maintain their spam-free goodness as they broaden their coverage.

Speaking of spam, Rich said:
What we're seeing is that 85-90% of the daily posts hitting ping services such as weblogs.com are spam (take a look for yourself). Of well-ranked non-spam blogs that we've discovered, we've found about half haven't been updated in the past 60 days. Our filters sift through what's left, which even after discarding 95%, is still a great deal of good material.
Lots of crap out there, isn't there?

For more on blog spam, see my earlier post, "How many feeds matter?", where I said that, based on Findory's experience and data from Bloglines, 95% or more of the supposed 20M+ weblogs out there appear to be fake, not useful, or spam.

Saturday, November 05, 2005

Just Google it and disintermediation

Steve Lohr at the New York Times reports that fear of disintermediation by Google is hitting retailers as powerful as Wal-Mart:
In Google, Wal-Mart sees both a technology pioneer and the seed of a threat ... The worry is that by making information available everywhere, Google might soon be able to tell Wal-Mart shoppers if better bargains are available nearby.
Google is pretty close to that already. In a retail store, when you're looking at something on the store shelf, try using your cell phone to send a text message to Google SMS with the word "price" and the UPC code or a brief description. Google will get back to you in a few seconds with what online retailers are charging for that item. Fun stuff.

[Found on Findory]

Friday, November 04, 2005

Amazon Mechanical Turk?

This has to be the strangest thing I've seen in a while.

Amazon is apparently behind the site mturk.com which calls itself "Amazon Mechanical Turk: Artificial Artificial Intelligence".

According to part of their FAQ:
Amazon Mechanical Turk provides a web services API for computers to integrate "artificial, artificial intelligence" directly into their processing by making requests of humans.

A network of humans fuels this artificial, artificial intelligence by coming to the web site, searching for and completing tasks, and receiving payment for their work.

For software developers, the Amazon Mechanical Turk web service solves the problem of building applications that until now have not worked well because they lack human intelligence. Humans are much more effective than computers at solving some types of problems, like finding specific objects in pictures, evaluating beauty, or translating text.

For businesses and entrepreneurs who want tasks completed, the Amazon Mechanical Turk web service solves the problem of getting work done in a cost-effective manner by people who have the skill to do the work.

For people who want to earn money in their spare time, the Amazon Mechanical Turk web site solves the problem of finding work that they can do wherever and whenever they want.
For those that doubt that Amazon would do something this... umm... innovative, a quick view of the page source shows that many of the images and links are served from Amazon.com. This really does appear to be Amazon.

I really don't know what to say. I have a hard time seeing how this idea can succeed.

Google Answers works because the fees are high, answers quite complex, and experts well vetted. The core idea behind Amazon's Mechanical Turk seems to be to take the success of Google Answers and try to scale it up by a few orders of magnitude.

But there are problems with that. If I scale up by doing cheaper answers, I won't be able to filter experts as carefully, and the quality of the answers will be low. Many of the answers will be utter crap, just made up, quick bluffs in an attempt to earn money from little or no work. How will they deal with this?

It seems to me that Amazon has just changed the problem from finding the answer in the available data to digging the correct answer out of all the crappy answers provided. Filtering crap out of user-generated content at large scale is a difficult problem too.
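
One standard mitigation, sketched below, is redundancy: ask several workers the same question and only trust an answer when enough of them agree, which of course multiplies the cost per answer. It's a generic technique, not anything Amazon has announced.

  # Toy sketch of redundancy-based quality control: ask several workers
  # the same question and only trust an answer when enough of them agree.
  # Generic technique, invented data; not anything Amazon has announced.
  from collections import Counter

  def consensus_answer(answers, min_agreement=0.6):
      if not answers:
          return None
      answer, votes = Counter(a.strip().lower() for a in answers).most_common(1)[0]
      return answer if votes / len(answers) >= min_agreement else None

  print(consensus_answer(["blue", "Blue", "blue", "free ringtones!!!"]))  # blue
  print(consensus_answer(["blue", "red", "green"]))                       # None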

More comments and discussion at Metafilter, TechDirt, Google Blogoscoped, Rob Hof, Jason Fried, Greg Yardley, and Slashdot.

Update: Don't miss the ongoing discussion in the comments to this post.

Update: I have a short quote in a Seattle PI article by Kristen Bolt on Amazon Mechanical Turk.