Monday, August 29, 2005

The frontier of search

Terry McCarthy at Time magazine reports "On the Frontier of Search", covering innovations in search and the future of search.

The article starts with an example of ubiquitous personalized search:
You land late in the evening in a city where you know nobody. You did not have time to book a hotel...

You search for a hotel room; the screen of your cell [phone] shows you pictures of several hotels in your price bracket, with views from individual room windows. Your search engine ... tells you that your favorite blues band will be playing at a festival in the city's park over the weekend. The engine can search your desktop back home, and it reminds you that a college friend e-mailed you a year ago to say he and his wife were moving to this city (you had forgotten). You decide to invite them to the festival.

What you have just tasted is the future of search.
Gary Flake has a nice quote that motivates personalized search:
"Search will ultimately be as good as having 1,000 human experts who know your tastes scanning billions of documents within a split second."
Gary Flake used to be at Yahoo Research, but was recently poached by Microsoft.

Findory has a nice mention in the article:
One of the hottest and most controversial new areas is designing software that will get to know individuals' interests, mostly through their search history--the clickstream. Findory, a Seattle-based news-search site launched in January 2004, provides access to news stories and blogs. As you start searching for certain types of stories, the site gradually learns about your preferences, and the home page evolves to mirror your interests.
A9's storefront photos, Yahoo and Flickr's tagging ("better search through people"), and Oren Etzioni's KnowItAll research project also were mentioned.

[Full disclosure: Terry interviewed me for this article.]

See also comments on the article from Gary Price at Search Engine Watch.

Your automated Web 2.0 business plan

Nat Torkington wrote an amusing custom Web 2.0 business plan generator.

Hit reload a few times to see it come up with a few random examples. Some of the best I saw:
Incentivized revolutionary emergent social network infrastructure that leverages the long tail.

Remixable hybridized distributed calendaring desktop app that leverages network effects.
Heh, heh. Very funny.

Sunday, August 28, 2005

Findory and the geek detective

Findory pages us if she's feeling unhappy. This happens surprisingly infrequently, every 2-3 months or so, which I think is pretty remarkable for a website this complex.

But it does happen. Unfortunately, last night the alarm bells went off.

When the website is ill, the cause isn't always immediately obvious. Sometimes, you have to play detective, gather clues, find evidence, and build a case that fits all the facts.

Last night was one of those times. Findory was intermittently responsive. Some requests would hang until they timed out, others would get a web page back.

Huh. What's going on? The first step is to find a reproducible test case, something that reliably hangs and can be debugged. I found one, a page that, at the time, seemed to hang every time I loaded it.

This first piece of evidence would seem to point to a problem with our web servers. If other pages load but this page does not, something that happens in the act of generating this page must be responsible.

But what is it? At this point, I pulled out the debugger and walked through what Findory does when it generates that particular page. After a while, I isolated the problem to a line where Findory connects to a remote database.

Now, this is a little strange. We have two pieces of evidence that are somewhat contradictory. Hanging on a specific page would seem to indicate a problem with something particular to that page. But debugging the issue seems to point to the database connection, something that happens on most or all pages. Hmm...

Still suspicious that the problem had something to do with that page and the data on that page, I started looking at the data used on that page. Any errors in the database logs? No. Any evidence of database corruption? No, the tables and files checked fine.

Could it be network trouble? Ping times were good, no packet loss. DNS lookups seemed to be succeeding just fine in a couple quick tests. Hmm...

I need more clues. I look in the error logs for the website. Are the errors occurring just on some pages or across many pages? The errors seem to be happening on many pages, more than I realized at first. This piece of evidence combined with the debugging trace would seem to point firmly at the database.

Now, I try connecting to the database from several remote boxes that have access. Huh, they're hanging. This is completely outside of Findory code now, just database client to database server, and the connection is hanging. It is definitely the connection to the database.

So, not Findory code, not database corruption, but new connections to the database are hanging. Why?

Could it be running out of connections? No, changing that didn't help and, I realized belatedly, running out of connections would return an error, not hang.

What could it be? Maybe I should check network and DNS again. Network checks out. DNS... oh wait! The first DNS server in /etc/resolv.conf is unpingable! I didn't notice that during my first quick check of DNS since dig lookups worked.

So, we have another clue. Could a bad DNS server cause our database to hang? Where and why is our database doing DNS lookups?

Turns out that our database does reverse DNS lookups for access control. While we weren't using this feature, we also neglected to disable it.

So, the solution turned out to be to remove the ill-behaved DNS server from /etc/resolv.conf, disable reverse DNS lookups in our database, and add some additional error handling to the Findory database layer to make any similar problem easier to debug in the future. An hour of downtime, but we're back.

In retrospect, there were a couple things I could have done to debug this faster. First, I should have suspected network and DNS earlier; 4 of the 6 outages we've ever had have been due to network or DNS problems in our datacenter, not a problem isolated to our code. Second, a netstat on the database server might have shown hanging DNS connections, pointing to the problem much earlier.
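A check along these lines could flag an unreachable resolver before the database trips over it. This is only a minimal sketch, not Findory's actual tooling: the function names are made up for illustration, it assumes IPv4 nameservers, and it probes every server listed in /etc/resolv.conf individually, because a lookup that succeeds overall can hide a dead first entry.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID used to match a reply to our query


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they are listed."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(hostname):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", QUERY_ID, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    name = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = Internet.
    return header + name + struct.pack("!HH", 1, 1)


def nameserver_responds(server, hostname="example.com", timeout=2.0):
    """Return True if this one server answers a query within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    return reply[:2] == struct.pack("!H", QUERY_ID)


def unresponsive_nameservers(path="/etc/resolv.conf"):
    """List the configured nameservers that do not answer on their own."""
    with open(path) as handle:
        servers = parse_nameservers(handle.read())
    return [server for server in servers if not nameserver_responds(server)]
```

Calling `unresponsive_nameservers()` from a monitoring script and paging on a non-empty result would test each resolver separately, rather than trusting that some server eventually answered.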

Good to have the problem resolved. Frankly, if we didn't feel such pressure to get Findory back up for our readers quickly, this kind of detective work almost would be fun. But that's just the inner geek talking.

Wednesday, August 24, 2005

Personalized search paper from MSR

An interesting SIGIR 2005 paper by Teevan et al. called "Personalized Search via Automated Analysis of Interests and Activities" investigates "67 different combinations of how the corpus, users and documents could be represented" for personalized search.
We found that there is an opportunity to achieve significant improvement by custom-tailoring search results to individuals, and were thus motivated to pursue search algorithms that return personalized results instead of treating all users the same.
I like this excerpt that argues for implicit personalization:
People are typically unwilling to spend extra effort on specifying their intentions .... A promising approach to personalizing search results is to develop algorithms that infer intentions implicitly rather than requiring that the user's intentions be explicitly specified.
Unfortunately, the paper only explores keyword-based approaches that build a profile of keyword interests from past behavior and favor search results with those keywords. The paper doesn't cover collaborative filtering or social networking approaches that share data among users or the subject-based personalization used by sites like Google Personalized Search.
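To make the keyword-based idea concrete, here is a rough sketch of building a keyword profile from past reading and favoring results that contain those keywords. It is not the method from the Teevan et al. paper; the function names, the simple term counting, and the sample data are all assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_profile(past_documents):
    """Count how often each term appears in documents the user has read."""
    profile = Counter()
    for document in past_documents:
        profile.update(tokenize(document))
    return profile


def personalize(results, profile):
    """Re-rank results so those sharing terms with the profile come first.

    Each distinct term in a result contributes its profile count to the score.
    The sort is stable, so ties keep the search engine's original order.
    """
    def score(result):
        return sum(profile[term] for term in set(tokenize(result)))

    return sorted(results, key=score, reverse=True)


# Hypothetical history and results for the ambiguous query "jaguar".
history = ["jaguar habitat rainforest", "rainforest conservation"]
results = ["Jaguar cars dealership", "Jaguar rainforest habitat facts"]
reranked = personalize(results, build_profile(history))
```

With that history, the wildlife page moves ahead of the car dealership. The sketch also shows the limitation noted above: it only sees keywords from one person's past behavior and shares nothing among users.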

Susan Dumais and Eric Horvitz from Microsoft Research are co-authors on this paper. Susan and Eric have had some fun research work lately, including the Memex-inspired Stuff I've Seen project and the personalized news prototype NewsJunkie.

Thanks, Gary Price, for pointing out the recent SIGIR 2005 conference. A bunch of good papers this year, well worth a peek.

Update: I didn't realize it at first, but I think this Teevan et al. paper is about the same work as a project that was demoed at Microsoft TechFest earlier this year. The TechFest prototype uses the files on your desktop (Word files, Excel files, etc.) to personalize your web search. I'm still as skeptical now as I was then about that idea, but it's interesting to see more details in this SIGIR paper.

Update: Jon Gordon on NPR Future Tense interviewed Jaime Teevan today about her research work. [via Lee Odden]

Google Talk and VoIP over WiFi

David Bau has the post on the Google Blog about Google Talk, the new instant messaging and voice-over-IP application from Google.

This is being widely covered by many people who seem to use IM and VoIP much more than I do. Don't miss the detailed comments and reviews from Joe Beda, Nathan Weinberg, Danny Sullivan, John Battelle, Nat Torkington, John Markoff, and Matt Marshall.

Rather than duplicate their fine work, I'm going to depart from analysis and into wild and irresponsible speculation. Put on your galoshes and try not to get any on you.

There's been rumors that Google is thinking about offering free WiFi nationwide in the US. Most of the discussion on this has focused on the impact on broadband ISPs. But what about cell phones?

I'm amazed that people put up with the poor quality of cell phone connections. I find it nearly impossible to carry on a long conversation cell phone to cell phone because at least one of the participants usually has a weak connection. Even cell phone to land line conversations can be difficult. This is aggravated by the fact that I often want to call from inside my house, a restaurant, or a cafe, areas where cell phone coverage is usually weakest.

So, okay, what if I had a phone that works over WiFi? I have WiFi in my house and many businesses I frequent have WiFi. There are some WiFi phones starting to be available. That might fix some of the coverage problems indoors.

But wait, what if there was city-wide WiFi coverage? Or WiFi coverage equivalent to cell phone networks (covering cities and major highways)? Huh, my WiFi phone would work everywhere.

So, Google just launched Google Talk, a VoIP application. Google is rumored to be thinking about a nationwide free WiFi network. Combine these two, add a WiFi phone to the mix, and am I about to get free mobile calling nationwide?

Okay, coming back to reality, probably not. No business model, expensive, need infrastructure, need bandwidth. Lots of problems. But it would be glorious, wouldn't it?

Update: MIT Tech Review reports a startup called Fon is trying to launch a product with exactly this model, ubiquitous VoIP over WiFi as an alternative to cellular service.

Update: Eleven months later, the NYT writes about the growth of WiFi phones and the threat to cellular networks.

Update: Twelve months later, Katie Fehrenbacher at GigaOM tests using the Google WiFi in Mountain View for making calls and says, "If you find a spot where the signal is pretty strong, the calls can be as good as cellular calls. It might be just me, but the prospect of cheap or free phone calls over a free network, is something to get excited about."

Monday, August 22, 2005

Google Sidebar and personalization

Google just launched Google Desktop Search V2 which includes a feature called Sidebar. Sidebar puts a column up on your desktop that shows news, RSS feeds, previously viewed web pages, stock quotes, weather, and photos.

It's got some interesting personalization features. The News widget in Sidebar personalizes the news you see based on your reading behavior. From the "About" for the News widget:
View news that is personalized based on the articles you read. For example, if you read lots of sports news, you'll see more sports articles. If you read technology news less often, you'll see fewer of those articles.
It sounds like this is subject-based personalization (e.g. showing more sports news if you read sports news), not the fine-grained personalization of Findory.

So, for example, reading a BusinessWeek article on innovation in Silicon Valley versus India in the news section of Google Sidebar changed the featured headlines to show me a few top business stories (I got an article on Vioxx lawsuits). Reading the same article on Findory surfaced articles on IBM outsourcing to Russia and China, innovation at Motorola, and growth of the startup Akimbo. Findory is more focused, more fine-grained, and more interesting.

If Google is just using subject classifications for the personalization, that might make it similar to the recommended stories at MSNBC and MSN Newsbot. Nevertheless, I am sure this will get more sophisticated with time. With both Google and Microsoft taking steps toward personalized news, this area is likely to heat up fast.

There's other personalization in the Sidebar. For example, the feed reader (called "Web Clips") automatically adds the feeds for the weblogs you visit and shows you a combined view of all your feeds.

This makes it easy to set up and configure this feed reader since, well, it requires no set up or configuration at all. It learns what feeds you might like from your behavior and does all the work for you. Very nice.

The stock quotes also learn from your actions. If you search for "GOOG" on Google and then click on the quote link in the search results, a quote for GOOG will be added to the Stocks widget in Sidebar and constantly update.

The weather came up as Seattle, WA for me automatically. Not clear if this was from geolocation (using my IP address) or from a zip code I entered somewhere at some point.

The Quick View section learns what websites you visit frequently and provides links to them, essentially automatically creating bookmarks for you.

In all, an impressive effort, full of easy-to-use, convenient personalization features. It is a strong move toward personalization from Google and an aggressive attempt to get a constant Google presence on the desktop.

See also reviews from Gary Price and Nathan Weinberg.

Update: After using Google Desktop Sidebar for a few hours, I'm loving it. Definitely worth trying if you haven't already. I gotta say, I wish My Google worked more like this.

Update: In an AFP article, Google's Marissa Mayer says, "Google Desktop is a new, easier way to get information -- even without searching. You can think of it as a personal web assistant that learns about your habits and interests to identify and present web pages, news stories, and photos that it thinks you will be interested in."

Exactly right, Marissa. Personalization complements search. Search helps you find things when you know what you want. Personalization helps when you don't know what's out there. Personalization helps surface interesting things you didn't even know existed.

Update: In another article, Nikhil Bhatla at Google has a good quote about the Google Desktop Sidebar: "We've built a platform that lets people sit back and watch the Web come to them. Sidebar is automatically personalized based on what you do on the computer ... We wanted to make it work well for all types of people, both novices and advanced users. Novices won't go in and configure the Sidebar, so automatic personalization was important for them."

Friday, August 19, 2005

Froogle Mobile

Josh Redstone has the post on the Google Blog about Froogle Mobile, a new interface to Froogle for mobile devices.

If your cell phone doesn't support WML (mine doesn't), you can do price checks using text messages through Google SMS.

Josh Redstone is an old friend of mine from graduate school at UW CS. Congrats on the launch, Josh!

Wednesday, August 17, 2005

Yahoo Local and neighborhood pages

Brian Gil has the post on the Yahoo Search blog about the latest enhancements to Yahoo Local.

The city and neighborhood pages are pretty cool. Here's the pages for Seattle and for Capitol Hill in Seattle. The pages list some upcoming events, recommended restaurants, and other goodies.

For the moment, the reviews seem a little sparse, hurting the quality of the pages and limiting their credibility, but I'm sure that'll change with time.

Getting high quality, authoritative, reputable reviews is the key for local search. The Yellow Pages lacks any easy way to differentiate between the businesses listed. Whoever can provide trustworthy, useful, and comprehensive information and recommendations on top of business listings will win local search.

See also good and detailed writeups from Danny Sullivan and John Battelle.

Update: Yahoo also seems to be doing clickstream history and recommendations now in Yahoo Local. Look at a couple restaurants or other business listings, then look at the bottom of the city page. There's a section for Recently Viewed, Recent Searches, and "Based on your recent activity, you might like". The Yahoo recommendations seemed poor in the 3-4 Seattle restaurant examples I tried, but I'd hope that improves as they get more data.

Blog readers and RSS

Gary Price points to a Nielsen/NetRatings study (PDF) that says that "11 percent of Weblog readers, blog site visitors who claim to read blogs regularly or occasionally, use RSS."

The study goes on to break down the numbers and show that the majority of blog readers have never heard of RSS or don't know what RSS is.

If so many people who read blogs are uncomfortable with RSS, why do most feed readers focus on people cut-and-pasting RSS URLs?

See also my earlier posts, "XML is for geeks" and "Getting your grandmother to use RSS".

Monday, August 15, 2005

The walls on Yahoo's garden

The Economist for August 13 has an article titled "Yahoo's personality crisis". The article is subscription only, but here's a few excerpts:
The question is what Yahoo is ultimately planning to become that is different than its rivals .... Yahoo sees itself as a media firm ... [and] believes content will no longer be generated by a few large wealthy firms, but by users themselves.

There is a huge problem with this vision, however ... With so much content owned by Yahoo or generated within its site by users, the quandary for the firm will be: "Do you point people to your own stuff or to the most relevant stuff?" ... If Yahoo's users get the feeling they are being ushered to sites purely because they belong to Yahoo ... they will [leave].
This is basically a question of whether Yahoo goes for a closed, "walled-garden" experience -- as they and AOL largely have done in the past -- or embraces content and services available on non-Yahoo sites.

I'm more optimistic than the Economist that Yahoo will be open. Yahoo's recent features seem to trend toward embracing content from non-Yahoo sites. I suspect Yahoo will seek to have the best content on Yahoo, but will pull in good content from wherever on the Web it resides.

Thursday, August 11, 2005

A Findory feed reader

Like many others, we've been frustrated by the shortcomings of current RSS readers. They're slow. They're hard to set up. Once you finally set them up, they're a burden to use, painful clicking and skimming repeated again and again.

We wanted a better feed reader. It should be fast. It should be easy. It should help you find interesting articles.

So we decided to build it.

Try it out! Go to the Findory Favorites page. If you don't have any favorite feeds yet, add a few by clicking on source names (e.g. "BBC") and click the "Add Favorite" button.

It's fast, easy to use, and helps you find interesting news and weblog articles. It's what a feed reader should be.

On the left, you'll find your favorite feeds, with those that have new articles in bold and other suggested feeds listed below. In the middle, you'll see articles from the feed with recommended and new articles marked. On the right, you can read recommended articles from other feeds, an easy way to discover new news sources and articles.

But, wait, there's more. If you have at least three Favorites, you can see recommended stories from all your feeds combined by clicking "My Top Stories". Perfect for quickly surfacing the most interesting articles from your favorite sources.

I love this feed reader. It's fast, easy to use, and full of the discovery features I know and love at Findory. No one else has anything like it.

Try it! And please let us know what you think.

Update: We've had a couple requests for adding arbitrary feeds (by RSS URL) and importing OPML. Yep, absolutely! In fact, that's the main thing we're working on for the next few weeks. We're huge believers in launching early and often. We knew the current version would be exciting and useful to readers. So, we wanted to get it out there. More coming soon!

Update: We just added OPML import. Details at "Findory RSS reader, part II".

Wednesday, August 10, 2005

On the state of the blogosphere

Technorati CEO Dave Sifry has been posting a series of metrics about the "State of the Blogosphere". The biggest wow number is the 14M weblogs shown in this graph.

14M weblogs is a big number. But does it matter?

The important question is not how many weblogs there are, but how many useful and interesting weblogs there are. Many weblogs are spam, fake, or inactive. Readers don't care about these. Readers want useful news content. So, how many useful weblogs are there?

Dave did write about spam in his series, but he didn't provide hard numbers, so we have to turn elsewhere. Feedster folks have said that "at times we see upwards of 90% of the traffic from Blogspot being spam." Our experience at Findory is similar; the majority of new weblogs we see are spam or fake blogs.

Jim Lanzone from Ask was kind enough to post metrics from Bloglines, one of the most popular feed readers, on their view of the state of the blogosphere. And the Bloglines numbers tell a very different story from Technorati's numbers. Jim says there are about 1M interesting and useful weblogs.

In fact, the number is probably even lower. Since the 1M number Jim reports is the number of weblogs in Bloglines that have at least one subscription, the number of weblogs that are interesting enough to attract several subscribers is likely much lower, perhaps as low as 100k.

Nevertheless, the growth of the blogosphere is still impressive. While the number of useful weblogs may be one or two orders of magnitude lower than the 14M number from Technorati, the Bloglines data graph still shows a clear exponential trend. Dave is right to talk about the importance of scaling in the face of this exponential growth.
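The "one or two orders of magnitude" comparison is easy to check. A quick sketch using only the figures quoted above (the 100k value is the rough guess from this post, not a measured number):

```python
import math

TECHNORATI_TOTAL = 14_000_000      # weblogs tracked, per Technorati
BLOGLINES_SUBSCRIBED = 1_000_000   # at least one subscriber, per Bloglines
SEVERAL_SUBSCRIBERS_GUESS = 100_000  # rough guess from this post


def orders_of_magnitude_below(total, subset):
    """How many powers of ten smaller the subset is than the total."""
    return math.log10(total / subset)


gap_bloglines = orders_of_magnitude_below(TECHNORATI_TOTAL, BLOGLINES_SUBSCRIBED)
gap_guess = orders_of_magnitude_below(TECHNORATI_TOTAL, SEVERAL_SUBSCRIBERS_GUESS)
print(f"{gap_bloglines:.1f} to {gap_guess:.1f} orders of magnitude")
```

The two gaps come out at a little over one and a little over two powers of ten, which matches the range in the text.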

But I'm not sure I agree with Dave on where the urgency is in scaling. Dave mostly talks about scaling Technorati to deal with this influx of data. Of course, users expect our services to scale to exponential growth in the blogosphere, and we need to do that. But that's not the urgent scaling problem for our users.

The real problem is scaling attention. Readers have limited time. As more and more blogs are available, it will become harder and harder to find and discover the gems buried in all the noise. We need to help readers focus, filter, and prioritize.

The real problem of scaling for growth of the blogosphere is not scaling the tools, but scaling the readers.

Mining the peanut gallery

I flew back from SES yesterday, so that means I got a chance to catch up on more of my reading on the plane.

Of the papers I plowed through, one of them is particularly fun, "Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews" by Kushal Dave, Steve Lawrence, and David Pennock.

The goal of the paper is pretty ambitious: Take all the reviews out there for each product and summarize them. Tough problem. Nasty natural language issues here.

But the payoff is big. This is something that would be quite useful, especially if some method of determining the credibility or the authority of each review was part of the process. People need help differentiating between the vast number of products out there. Summarizing reviews could be a way of providing useful information quickly, much more easily than reading each individual review.

One thing that's great about this paper is that they detail their search through many different approaches to the problem, some simple, some more complicated. It is interesting that some of the most effective methods turned out to be fairly simple.

Another fun thing about this paper is the authors. Steve Lawrence was one of the authors of Citeseer. Kushal Dave and Steve Lawrence are now both at Google. David Pennock was at Overture and now is at Yahoo Research.

By the way, this summarizing reviews idea reminds me a bit of Newsblaster, the research project at Columbia that tries to automatically summarize news articles from many sources. If you haven't seen that yet, it's worth checking out.

Update: Gary Price wrote me to let me know about NewsInEssence, a news clustering and summarization research project out of U of Michigan.

Google's wildcard operator

Hiyan Alshawi posts on the Google Blog about Google's wildcard operator. It lets you do queries with missing terms like "lions weigh * pounds" or "Findory is about *".

Or, hey, appease your inner geek and use Google to calculate Fibonacci numbers.

[via Gary Price]

Tuesday, August 09, 2005

Google News publishes feeds

Google News finally publishes RSS feeds.

The Google feeds are similar to Yahoo News feeds, which are also available for news categories and news searches.

Findory remains the only place to go to get individualized, personalized RSS feeds, feeds that learn from the articles you read through the feed and adapt to your interests.

[via Chris DiBona and Charlene Li]

Saturday, August 06, 2005

Flickr and incremental development

Jesse Garrett has an interview with Flickr's Eric Costello.

Remarkable that an informal, agile, incremental development process gradually turned what was once intended to be a massively multiplayer online game into the Flickr photo sharing service we know and love today.

[via O'Reilly Radar]

Update: A year later, another interview on the same topic, this one in Inc Magazine with Stewart Butterfield and Caterina Fake.

Personalization for TV

Mark Cuban writes about the need for recommendations and personalization for TV:
We don't know what we want to watch as often, if not more often than we do know.

When we get to a point that there are thousands of on demand TV choices, we won't approach TV programming guides like we do a search engine, looking for a specific target. That's too much work.

The smart on demand providers will present their programming guides more like or Both of which do a great job of "suggestive programming."

We will get a personalized page with options that it thinks we might like based on our previous viewing decisions. Then different categories of shows, within each we will see best rated, most viewed and newest added, along with "play lists" suggested by branded guides who make recommendations.

All of these simple options will make it easy for us to make a choice with some level of confidence.
To search, you have to know what you want. Personalization helps when you don't know what's out there. Personalization surfaces interesting things you didn't even know existed, all with no effort, all with no work.

See also Chris Anderson's criticism of Mark Cuban's post. I'm not sure I understand Chris' counterargument, but he seems to be claiming that tyranny of choice easily can be overcome with more information. I don't agree, but Chris is a sharp guy, and it's always worth reading his viewpoints.

Update: Chris Anderson let me know that he updated and clarified his post. The last two paragraphs now argue that "infinite choice doesn't have to mean the tyranny of choice" and that people are generally happier with many choices. Still not sure I agree. At a minimum, infinite choice has a strong tendency toward tyranny of choice. As Mark said, people need help differentiating between large numbers of options and focusing in on just a few good ones.

Update: Chris wrote me again to say, "My broad point is that more choice along with ways to order that choice well is a good thing. The first without the second can indeed become oppressive." So, perhaps we do agree.

Thursday, August 04, 2005

ZDNet on Prabhakar Raghavan

Dan Farber at ZDNet has a lengthy post about his interview with Prabhakar Raghavan, the new head of Yahoo R&D.

A couple excerpts on personalization:
Search and creating more personalized user experiences that take advantage of underlying data and relationships is still in an infant phase. Yahoo, Google, Microsoft, Amazon and other major players understand that the spoils will go to those who provide answers, rather than links .... "Most people are not interested in search -- they want to get things done."

"Personalizing is a loaded word, and it sometimes gets trivialized. It's not about customizing the colors on the MyYahoo page," Raghavan said .... "You have to decide what content to show that users will find valuable, and not irritate users with too much content."
Google's Marissa Mayer said something similar a few months ago:
"We need to get better not at doing searches, but at providing answers people are looking for. There will be a day when ten HTML links regardless of who you are is not the answer any more."
Giving you the information you need, that's what personalized information is all about.

Findory in OSCON keynote

Tim O'Reilly reportedly ([1] [2]) mentioned Findory in his OSCON 2005 keynote. Thanks, Tim!

Free food at Google

Susan Wojcicki posts on the Google blog about the free food at the Google cafe. Sounds yummy.

I've often wondered why more tech companies don't have free food at their cafes. It seems to me that the math works out easily.

For example, when I was at, I remember being frustrated by waits as long as 10-15 minutes to pay for food at the Cafe. That means it costs the company $17-25 (assuming tech geeks have fully loaded costs of ~$200k/year, so $100/hour) to wait in line to pay for a $5-10 lunch. At least when the lines are long, it would have been cheaper for the company to give the food away for free than to pay for that lost time.

But, lost time aside, this benefit just isn't that expensive. If it costs $5-10 to provide lunch, that's $1250-2500 per year (5 days/week * 50 weeks/year * $5-10). Compared to costs (compensation, hiring, etc.) for tech geeks, this is a modest expense, equivalent to perhaps a 1-2% raise or a modest improvement in retention or productivity.
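The back-of-the-envelope math above can be reproduced in a few lines. This is just a sketch of the arithmetic: the 2,000 working hours per year is an assumption implied by going from ~$200k/year to $100/hour, and the function names are made up for illustration.

```python
LOADED_COST_PER_YEAR = 200_000  # from the post: ~$200k/year fully loaded
WORK_HOURS_PER_YEAR = 2_000     # assumption implied by "$100/hour"
LUNCH_DAYS_PER_YEAR = 5 * 50    # from the post: 5 days/week * 50 weeks/year

HOURLY_COST = LOADED_COST_PER_YEAR / WORK_HOURS_PER_YEAR


def cost_of_waiting(minutes):
    """Cost to the company of one person standing in line for this long."""
    return HOURLY_COST * minutes / 60


def yearly_lunch_cost(price_per_lunch):
    """Cost of providing one lunch per working day for a year."""
    return LUNCH_DAYS_PER_YEAR * price_per_lunch


def percent_of_loaded_cost(amount):
    """Express a yearly amount as a percentage of fully loaded cost."""
    return 100 * amount / LOADED_COST_PER_YEAR


for minutes in (10, 15):
    print(f"{minutes} min in line: ${cost_of_waiting(minutes):.2f}")
for price in (5, 10):
    yearly = yearly_lunch_cost(price)
    share = percent_of_loaded_cost(yearly)
    print(f"${price} lunch: ${yearly}/year, {share:.3f}% of loaded cost")
```

Measured against fully loaded cost, the lunch comes out near or a bit above 1% at the high end. A raise is measured against salary, which is smaller than loaded cost, so the equivalent raise percentage would be somewhat higher.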

So why not just give people a raise? As Baron & Kreps point out in their book Strategic Human Resources, perks can be seen as a gift exchange, having an impact on morale and motivation disproportionate to their cost. Perks work better than cash.

Having the best perks in the industry magnifies this effect and likely is part of why Google has been so successful in poaching from other firms while avoiding losses themselves.

Investing in your people pays for itself. There are places to be frugal with money, but this is not one of them.

Update: Sixteen months later, Fortune Magazine names Google "the number one best place to work".

Wednesday, August 03, 2005

Topix interview at OJR

Mark Glaser at OJR interviews Rich Skrenta and Chris Tolles of Topix.

It's a long and detailed interview, well worth reading. Here's some excerpts on Topix's advertising:
We started to pour in our categorization technology to the Google [AdSense] ads that were on our site, and it started working a lot better. We doubled the clickthrough rate on them. But beyond that, it made our site look better. Improving the quality of the advertising improved the quality of the entire product.

Google [ads] works great on a static Web page, but it's a disaster on most news sites and blogs ... [On] a story about a fire, the ad would be for a fire extinguisher or something like that. The famous case was when the New York Times' site had a story on a suitcase of body parts that washed ashore in New Jersey, and Google was showing luggage ads beside it.

Google will see 'Janet Jackson' 25 times in a story and put up an ad to buy a Janet Jackson CD next to the story. Our technology will see the same story and figure out it's actually a story about the FCC and indecency issues on the airwaves, and it's not actually an entertainment story. When you get that right, the 'buy her latest CD' ads go away, and ads for [FCC compliance] go up, and the clickthrough rate goes up, and the money you make goes up.

We have a bunch of artificial intelligence algorithms and a big knowledgebase, which basically is reading the words in every story. It's trying to figure out what the story is about ... it's all 100 percent automated.
It also appears Topix is doing some simple targeting of advertising based on reader behavior:
[We] look at the cookie and see what pages they visited on Topix. And we could serve ads from that page rather than the worst-performing default ad that might be on the home page. It worked pretty well ... [Now] half or a third of the ads you see are relevant to something else you've looked at and not to what you're looking at.
Very interesting.

Yahoo targets Google AdSense

Yahoo just launched a closed beta of the Yahoo Publisher Network.

It is an advertising network that pays small and medium-sized websites to put an advertising block on their web pages, similar to Google AdSense.

JenSense posts a lengthy review.

[via Gary Price]

Tuesday, August 02, 2005

Flickr, image search, and PhotoRank

Photo search can be a challenge. Text documents contain a lot of easily analyzable information about the content of the document. Not so with photos.

From what I've seen, there are three approaches to dealing with this problem. One is to use the text around the images, such as with an HTML document that has photos embedded in it. This is the technique used by the image searches provided by Google, Yahoo, and the other search giants.

Another is to have users provide labels for all the pictures. Flickr (which was recently acquired by Yahoo) is the best known and most successful of these. Users at Flickr "tag" photos with short keywords. But there are others, including the cute little ESP game by Luis von Ahn and other researchers at CMU.

A third approach is to analyze the images themselves to identify characteristics of the images and transfer descriptions and labels from similar images. To my knowledge, most of this work is still in the research stage.

Once you have data about the images, you can search, but there's still the question of how to order the search results, what you do for relevance rank. For images embedded in web documents, you might be able to use the PageRank of the document, but Flickr and other photo services have no web documents associated with the photos.

Stewart Butterfield at Flickr just announced a new feature he calls "interestingness" and John Battelle calls "PhotoRank". It sounds like it cleverly uses the data from the Flickr community to do relevance rank:
Interestingness is a ranking algorithm based on user behavior around the photos taking into account some obvious things like how many users add the photo to their favorites and some subtle things like the relationship between the person who uploaded the photo and the people who are commenting (plus a whole bunch of secret sauce).
Mmm... Secret sauce. Seriously, I'd love to hear the details behind this. Doing relevance rank for photos like this is a hard problem. It sounds like Flickr has a great idea for how to help people search for interesting photos.

Update: Brian Dennis posts some thoughts on "interestingness" and links to several discussion threads on the feature.

Monday, August 01, 2005

Topix was valued at $64M

When Topix was acquired back in March, there was some confusion about their valuation.

Bambi Francisco at CBS Marketwatch first said that "the funding was less than $5M" but later reported that the valuation was between $50M and $100M.

Rafat Ali looks up the details in the latest SEC filings and concludes that the three newspapers paid $48-50M for their 75% stake.
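As a rough check, the price paid for a stake implies a valuation for the whole company: divide the price by the fraction acquired. The sketch below just does that division; the function name is my own, and it ignores any deal terms beyond the two numbers quoted here.

```python
def implied_valuation(price_paid, stake_fraction):
    """Whole-company valuation implied by paying `price_paid` for `stake_fraction`."""
    return price_paid / stake_fraction


# $48-50M for a 75% stake, per the SEC filing figures quoted above.
for price in (48e6, 50e6):
    valuation = implied_valuation(price, 0.75)
    print(f"${price / 1e6:.0f}M for 75% implies about ${valuation / 1e6:.1f}M")
```

So $48M for 75% implies a $64M valuation, and $50M implies roughly $66.7M.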

Jeff Clavier summarizes that Topix was valued at "around $64M" for a company "that closed 2004 with about $1M in revenues".

Topix offers a clever service with automated, fine-grained categorization of news articles crawled from sources worldwide. They are particularly strong in local news, with categories even for small cities like Sequim, WA. Topix also has some useful experience with targeting advertising to news.

Congratulations to Rich and the team!