Friday, June 30, 2006

Findory at SIGIR AIRWeb

I consider myself very fortunate to be on the panel at the Adversarial Information Retrieval on the Web workshop at SIGIR on August 10.

Others on the panel include Jeremy Hylton (Google/Python), Andrew Tomkins (Yahoo Research), Natalie Glance (BlogPulse/Nielsen BuzzMetrics), and Dennis Fetterly (Microsoft Research). It should be interesting!

Thursday, June 29, 2006

Google Checkout, Amazon, and eBay

It is being widely reported that Google launched their payment system, Google Checkout.

Google Checkout is a system for handling credit card processing between small merchants and consumers. At launch, it has been integrated with over a hundred online stores.

A couple reviews ([1] [2]) see this as a bigger threat to Amazon than eBay's PayPal. For now, I do not agree.

Amazon's value to small merchants is in taking over the online store. Amazon lists third-party merchandise, handles the entire online shopping experience including search, checkout, and payment, and provides inventory management and accounting tools to the merchants. For some merchants, Amazon even warehouses the inventory and does the shipping.

Google will get there at some point. If they integrate Froogle with Google Checkout and build some additional merchant tools for accounting and inventory management, that would have a good start. But, they are not there yet.

For now, I think eBay's PayPal has the most to lose from Google Checkout. Small merchants who turned over their credit card processing to PayPal in the past will now consider switching to Google Checkout. PayPal's person-to-person business segment will not be at risk, but their business-to-consumer will.

Google Checkout appears to be the same as the rumored GBuy and Google Wallet that we heard about a while back. Charlene Li says that GBuy was the internal code name for Google Checkout.

See also my May 2004 post, "Google vs. eBay?", where I talk about what it would take for Google to get into e-commerce, and my Feb 2005 post, "Google and eBay pursue sellers".

As for how Amazon and eBay should respond to Google Checkout, see my previous post, "Yahoo and eBay, Amazon and Microsoft".

Update: An insightful post by Om Malik looks at Google Checkout as a step toward cost-per-action advertising (instead of cost-per-click).

Tuesday, June 27, 2006

Four petabytes in memory

One way to look at how the Google cluster changes the game is to look at how data access speeds change with a cluster of this size.

Google reportedly had an estimated 450k machines two months ago and adds machines at roughly 100k per quarter. In 2004, each of these machines had 2-4G of memory, and, two years later, likely are up to 8G standard.

That means that Google can store roughly 500k * 8G = 4 petabytes of data in memory on their cluster.

Four petabytes. How much is that?

It is twice the size of the entire Internet Archive, a historical copy of much of the Web. It is the same as the estimated size of all the data Google had in 2003 on disk.

It is a staggering amount of data. And it is all accessible at speeds orders of magnitude faster than those with punier clusters.

That is a game changer.

Valdes-Perez on the future of search

Raul Valdes-Perez (CEO of Vivisimo) is interviewed by Mark Roth of the Pittsburgh Post-Gazette. Some selected excerpts:
Where will search go in the future?

Many of the major companies are now investing in "personalized search," [Dr. Valdes-Perez] said, using a record of which sites a user has visited to help tailor their future searches .. [and offer] direct answers to people's questions ...

A big challenge there, he said, is to design a system that will understand what kind of information a person wants about a particular subject.

In general, an ongoing goal of search is to make it "smarter" so we can be "dumber" -- to develop software that will do the hard thinking for us.

"There's a quote I used to have on my home page at CMU that says 'Society advances by reducing the number of tasks that require thought.'

"You can accomplish much more by applying your thinking time to higher level stuff."
When I was at Amazon working on personalization, we used to joke that the ideal online bookstore would be very simple. It would display a giant picture of just one book, the next book you want to buy.

Yes, that is a bit too much. But, I think the point Raul makes is clear. People do not want to spend time thinking about searching. They want to get what they need and go.

The prize in search will go to those that help people get what they need quickly, effectively, and effortlessly.

Monday, June 26, 2006

Foundations of Statistical NLP

Every once and a while, I come across a book so good that I feel foolish for not reading it sooner.

"Foundations of Statistical Natural Language Processing" by Chris Manning and Hinrich Schutze is a remarkable survey, not only in breadth, but also in its deep, critical analysis of each technique's strength and weaknesses.

I was particularly excited by the focus on practical approaches with massive data sets.

For example, when discussing clustering, the authors warn of efficiency issues with hierarchical clustering or EM algorithms and say that K-means "should probably be used first ... because its results are often sufficient."

On text categorization, they talk about other techniques, but then point out that k nearest neighbor (kNN) is a "simple method that often performs well."

When discussing latent semantic indexing (LSI) and other forms of dimensionality reduction, they mention that pseudo feedback -- query expansion by adding terms from the top results for a search for the original query -- can be cheaper and more effective, depending on your needs.

They criticize hidden Markov models (HMMs) because of the large state spaces required to model many real-world problems, but discuss variants that try to mitigate this issue. They then follow up when talking about part-of-speech tagging by offering transformational (rule-based) approaches as a fast and effective alternative to HMMs.

The book also is full of examples of the ambiguity of language, especially in the sections on disambiguation, parsing, and machine translation. The authors tease you with what looks like a problem with a simple solution, then offer examples that tear apart your naive attempts at cleverness.

Though they focus on very large data sets, Manning and Schutze do not see massive data as the entire solution. At one point, when talking about N-gram models, they say:
One might hope that by collecting much more data that the problem of data sparseness would simply go away ... In practice it is never a general solution to the problem. While there are a limited number of frequent events in language, there is a seemingly never ending tail to the probability distribution of rarer and rarer events, and we can never collect enough data to get to the end of the tail.
Despite the huge amount of text out there, not everything that can be said already has been said.

A great book. I only regret that I took so long to read it.

Update: Some good discussion in the comments for this post. Don't miss the link to Peter Norvig's book reviews.

Friday, June 23, 2006

Findory in PC World

Findory is reviewed and compared favorably to Google News in a PC World article, "Web News Wranglers", by Ryan Singel. Some selected excerpts:
So much news on the Web, so little time to read it all.

The torrents of blog posts and news feeds on today's Internet hold way too much data to keep up with if you just browse the Web normally. Fortunately, help is here in the form of sites that filter the news for you with ever-increasing efficiency ....

[Some] sites take a more subtle approach to tuning news to your preferences by watching the stories you read. Google News may be the best known of these ... the site will monitor your news habits and Web searches, and tweak the articles it displays on its algorithmically generated Google News page. The personalized changes are slow to come and aren't explicitly marked, but the site does a remarkable job of highlighting the latest, most important news stories without human editors.

Findory uses a similar tactic ... [and] in surprisingly little time the site replaces its default, very scattershot array of stories with a lineup tailored to your interests. Findory gets better as you continue to use it, adding little symbols next to recommended stories.
In a breakout box a little further in the article, Ryan compares Findory directly to Google News. Ryan says, "[Google's] recommendation engine seems less intelligent and transparent than Findory's" and "Findory's recommendation system works rather nicely." Then, at least in terms of the quality of the personalization, Ryan awards the "edge" to Findory over Google News.

The article also discusses many other news sites, including Netvibes, My Yahoo, Rojo, Spotback, Bloglines, Digg, Memeorandum, TailRank, and Newsgator.

Thursday, June 22, 2006

Recommending Orkut communities

An interesting paper, "Evaluating Similarity Measures: A Large-Scale Study in the Orkut Social Network" (PDF), explores generating recommendations in Google's large Orkut social networking site for "related communities" using collective past behavior.

The paper does a great job of motivating the use of recommendations:
Online information services have grown too large for users to navigate without the help of automated tools such as collaborative filtering, which makes recommendations to users based on their collective past behavior.
They go on to say search and browse "were labor-intensive" in a sea of 1.5M Orkut communities. The Googlers looked to recommendations to help people discover new communities.

There is an interesting and detailed evaluation of several similarity metrics. L2 (aka cosine distance) had the best performance in their tests:
We were surprised that a total order emerged among the similarity measures and that L2 vector normalization showed the best empirical results despite other measures, such as log-odds and pointwise mutual information, which we found more intuitive.
Unfortunately, the paper leaves generating truly personalized recommendations -- different recommendations for each user -- to future work. They say:
All members of a given community ... see the same recommendations when visiting that community's page .... [But] just as we can estimate communities' similarity through common users, we [could] estimate users' similarity through common community memberships: i.e., user A might be similar to user B because they belong to n of the same communities.
This would be where things get really interesting, building a different page for every user.

On a personal note, Ellen Spertus, the first author on the paper, was a visiting scholar at U of Washington in the AI group when I was there. Small world. It is fun to bump into her work again.

Update: The video of a June 21, 2006 talk given at Google Kirkland by Ellen Spertus on this work is now available.

MSN losing market share

The latest comScore numbers on MSN show the Redmond giant is losing ground, down to 12.9% of online searches from 15.3% a year ago.

Paul Kedrosky comments that it is "remarkable watching how darn poorly MSN is doing" and that this may be "the real reason for continuing turmoil at MSN."

Tuesday, June 20, 2006

Trust, distrust, and web spam

An interesting WWW 2006 workshop paper, "Propagating Trust and Distrust to Demote Web Spam", surveys and expands on the ground covered by TrustRank.

The paper starts by describing TrustRank and some of the related work. It then talks about a similar mechanism for propagating distrust -- working from a blacklist of known spam sites instead of a whitelist of known good sites -- which they refer to as "BadRank". They end by experimenting with a few alternative methods of propagating both trust and distrust through the link graph that may have better performance than TrustRank.

A worthwhile read if you have any interest PageRank, TrustRank, and the efforts to fight off web spam.

Sunday, June 18, 2006

Web acceleration without prefetching

I really like this paper, "Prefetching the Means for Document Transfer", by Edith Cohen and Haim Kaplan.

The paper discusses ways to accelerate web browsing doing everything short of actually prefetching the page. I thought it was particularly cool that significant performance wins could be found just by preemptively doing DNS lookups for all the probable transitions from a given page.

Given all the problems with prefetching that overly clever applications like Google Web Accelerator discovered, partial prefetching -- doing the DNS lookups and connecting to the webserver, but not prefetching the content -- is a clever idea.

[via a paper by Brian Davison]

Thursday, June 15, 2006

Netscape's news scrape

It is being widely reported that AOL just released a beta version of an clever news site at

I like how the new site seems to have applied some of the best lessons from many other efforts. It has fully automated content and placement like Google News. It has voting like Digg. It shows most popular and most highly ranked stories like Yahoo News and CNN. It shows related stores like BBC. It includes tags like It has user moderated comments like Slashdot. It uses human editors to influence story selection and reduce spam, like Slashdot, Yahoo News, and all newspapers.

One lesson it did not seem yet to learn is one that Findory could have taught it. Most popular lists tend toward the sensationalistic and frivolous, as you can see from the most popular headlines on Yahoo News. As the site grows, this will become an increasing problem.

What is interesting to me is not the same as what is interesting to the hordes of teenagers out there, but it is similar to what interests some other tech and business geeks. Most popular doesn't cut it. What I want is most popular and most interesting for people like me.

Overall, I am impressed by what AOL has produced on It is a blend of clever features created by people who were obviously paying close attention to the lessons and mistakes of other news sites.

See also Danny Sullivan's review on Search Engine Watch. Danny is much more negative on this one than I am, calling it a "brave attempt" but also reviewing parts of it as "lame" and "confusing".

Update: A couple weeks later, Nicholas Carr savages the new, saying, "Normal people seem to think the entire concept is ludicrous," and "The best way to prove that a niche product is a niche product is to toss it into the mainstream and let it sink."

Tuesday, June 13, 2006

Findory launches video and podcasts

Findory added two new product lines this evening, video and podcasts.

The new product lines work just like personalized news and blogs. Watch videos you see on Findory and the site will surface other interesting videos. Listen to podcasts and Findory will help you discover other interesting podcasts.

There are vast quantities of videos on the Web, clips from the big studios, indie shorts from small producers, little animations by college students, and random snippets from someone's camcorder. There are huge numbers of podcasts, from recordings of the radio programs of the BBC to individuals talking into a microphone in front of their computer at home.

With everything that is out there, it is hard to find the good stuff. There are inspired pieces that have never found an audience. There are university lectures that might teach you. There are parodies that might make you laugh.

Findory surfaces the rare gems that otherwise would be lost in the noise. Findory helps people discover things they would not have found on their own.

Findory Video and Findory Podcasts should be considered early beta. They will expand, develop, and grow with time.

Sunday, June 11, 2006

WSJ on recommendation engines

Jason Fry at the Wall Street Journal looks "Under Recommendation Engines' Hood". I am briefly quoted in the article.

I particularly like this excerpt on whether recommendations should target what you think you want or what you really want:
The goal of Amazon's recommendations is simply to generate more sales than whatever would have been at the top of a customer's page without personalized information.

It's bizarre to expect Amazon to ignore the trainloads of "Star Wars" stuff I've bought from it, or to somehow detect that I thought all those Boynton books were insipid even as I kept buying them. When it comes to describing us as customers and consumers, recommendation engines may do the job better than we would.

In other words, we lie -- and never more effectively than when we're lying to ourselves ... I fancy myself a reader of contemporary literature and history books, but I mostly buy "Star Wars" novels and "Curious George" books for my kid.
Implicit data like purchases may be noisy, but it also can be more accurate. You may say you want to watch academy award winners, but you really want to watch South Park. You may say you want to read Hemingway, but you really want to read User Friendly.

The best guide to what you will want is what you wanted.

Saturday, June 10, 2006

GFS, MapReduce, and Hadoop

If you are interested in Google's GFS and MapReduce, don't miss Hadoop, a GFS and MapReduce clone being developed by the same folks who wrote the open source text search engines Lucene and Nutch.

Update: Doug Cutting gave a talk, "Scalable Computing with MapReduce" (PDF) at OSCON 2005 that discusses Hadoop (called NDFS in the talk) and how it helped Nutch scale its crawl to billions of pages. [via Sergey Chernyshev]

Talk on large systems at Google

The audio and video quality is horrible, but this Google talk, "Building Large Systems at Google", by Googler Narayanan Shivakumar (aka Shiva) has some interesting tidbits about GFS, MapReduce, BigTable, and Google's infrastructure.

Shiva starting by talking a bit about the Google Kirkland office and some of the projects launched out of that office.

Around 13:00, after talking a bit about GFS, some questions from the audience pick at the vulnerability of the system to failures of the master nodes. Interesting point there.

Around 26:00, Shiva talks about the motivation behind BigTable. Shiva says they tried commercial databases and found them insufficient for their needs. The system they built, BigTable, is a giant, persistent map (huge database of key->value pairs) running across a large cluster.

Around 40:00, Shiva makes the interesting statement that, in addition to lower cost for higher performance, a benefit of building a cluster on unreliable hardware is that it keeps programmers from "being lazy." He didn't go into this in depth, but I assume the argument here is that more expensive, high availability hardware fails less often, but still fails sometime, so it is better to just accept that hardware will fail and design your systems to survive failures. This advice conflicts somewhat with what Shiva said earlier in the talk when he claimed that a good way to work around the issue of the vulnerability of GFS master nodes is to use higher availability hardware.

Around 43:00, Shiva touched on the point that Google is most interested in the performance per watt characteristics of its hardware since that seems to be the primary constraint in their data centers. For more on that, see my earlier post, "Power, performance, and Google".

On a personal note, the camera never turned to the audience, but I swear that was UW Professor Ed Lazowska who asked several questions during the talk. Either that, or someone who could be his voice double.

Tuesday, June 06, 2006

Standing interests, search, and recommendations

Googlers Beverly Yang and Gleh Jeh presented a paper, "Retroactive Answering of Search Queries" (PDF), at WWW 2006 that discusses an interesting idea around recommendations for web search.

The authors built a prototype that "retroactively answers queries from a user's history as new results arise." It is a little like automatically constructing Google Alerts for searchers and informing each person of any interesting, new results. Yang and Jeh say that they want to "focus on and address known, specific user needs" by "making web page recommendations corresponding to specific past queries."

The system itself is pretty straightforward. "Standing interests" are determined by queries that have a lot of activity (e.g. > 8 clicks on search results, > 3 refinements of the query, and repeated searches on the query). For each standing interest, the prototype tries to find new results with high PageRank for the query, then presents those results as recommendations.

One interesting result from the quick user study described in the paper was that rank -- how high the search result appears on the search result page -- was inversely correlated with perceived recommendation quality.

I think this may be best explained by people wanting the recommendations to help them discover things they would not have found on their own. The best recommendations are new, surprising, and useful. If I can find it easily myself because it shows up high on search results, that is not new, surprising, or useful.

The paper does not explore using data about what other people have found for the recommendations. That is unfortunate. In my experience, the most surprising and useful recommendations come from other people.

The paper does focus quite a bit on the problem of identifying these "standing interests", queries that a particular user does again and again. This seems similar to building a user subject interest profile (e.g. I like geeky computer and business stuff), but much more specific. As the authors briefly point out, the "automatic identification of standing interests in the form of specific queries can be especially valuable in ads targeting."

Beyond PageRank, RankNet, and fRank

Matthew Richardson, Amit Prakash, and Eric Brill from Microsoft Research presented a paper, "Beyond PageRank: Machine Learning for Static Ranking" (PDF), at WWW 2006.

The paper describes a neural network-based ranking function, fRank, that they trained to determine the importance of various features for relevance rank of web search results.

The authors show that naive PageRank is no longer effective for relevance rank, producing the worst results in their tests and offering ranking no better than ranking solely based on features of the domain of the page.

That result is not surprising. Naive PageRank has been under assault by spammers for many years and is no longer effective.

This is not to say that all information extracted from the link graph is useless, but that it must be adapted to the deluge of spam using techniques such as those explored in TrustRank.

Unfortunately, this paper does not explore that issue, which makes it hard to interpret their findings. I am left uncertain whether they have found a significant improvement in relevance rank, as they seem to claim, or if they merely knocked over a strawman.

By the way, despite having no authors in common, this paper is closely related to the MSR RankNet work described in a 2005 paper "Learning to Rank using Gradient Descent" (PDF). I talked about that paper in painful detail in a previous post.

Survey talk on personalizing web search

Paul-Alexandru Chirita gave a nice survey talk, "Current Approaches to Personalized Web Search" (PDF), at the Future of Web Search Workshop.

[via Gary Price]

Andrei Broder on 4th generation search

Andrei Broder from Yahoo Research gave the keynote talk, "From query based Information Retrieval to context driven Information Supply" (PDF), at the Future of Web Search Workshop.

I particularly like how Andrei describes some of the features of the current, "third generation" of web search engines as advancing over the previous generating by focusing on the "need behind the query." Examples include the Yahoo Shortcuts that appear on a search for [Britney Spears] and the inlined local results for a search for [dentists bronx].

Andrei then goes on to describe a future "fourth generation" of web search where search engines will no longer require explicit queries, but do "active information supply driven by user activity and context". As examples, he gives Yahoo Alerts and personalization and recommendations. He also says advertising will stop being irrelevant and annoying and start focusing on supplying interesting and relevant information targeted to each person's needs.

[via Gary Price]

Friday, June 02, 2006

Cyc talk at Google

I have been following the Cyc project, an attempt to build and use a massive database of common sense knowledge, off and on for a decade or so.

So, I was excited when I saw a video of Douglas Lenat's recent talk on Cyc at Google, "Computers versus Common Sense". It is long, but it is full of interesting examples, definitely worthwhile if you have any love for geeky AI stuff.

If you are already familiar with the Cyc project, you still might want to check out the talk at 31:30 and 41:00.

At 31:30, Douglas talks about how they deal with things being partially true or conflicting assertions in the database. They do not use probabilities, but instead allow statements to be consistent in local contexts while inconsistent globally. They also almost never use a formal theorem prover, instead preferring a large set of much faster heuristics for reasoning.

At 41:00, Douglas talks about automatically learning knowledge from the Web. Douglas argues that understanding the natural language text on the Web requires starting with a large handcoded database of common sense knowledge. After building that seed database manually, it can be used to automatically extract additional knowledge from the Web.

On needing a manually constructed seed database of knowledge, I suspect fans of statistical NLP might be quick to disagree. But, Douglas did have some compelling cases that would trip up statistical techniques. For example, no one on the Web ever writes that water runs downhill, but they do write that water runs uphill (as a metaphor).

If you are interested in more, you might check out OpenCyc and the Cyc publications.

See also the Verbosity project (PDF) that I mentioned in an earlier post.

See also my previous post, "AI and the future of search".

Broken sorts

Googler Joshua Bloch reports that "Nearly All Binary Searches and Mergesorts are Broken" due to an integer overflow bug. Awesomely geeky.

By the way, "Programming Pearls" really is a great book. It's full of fun little algorithm problems. Some might consider it old school due to the focus on C and low-level performance optimizations, but, hey, when optimization is important, it is important.

Thursday, June 01, 2006 blog search and Bloglines

Gary Price has a good post about the new blog search from

The search has also has been integrated into Bloglines. The Bloglines search has been suffering from timeouts and infrequent updates for months, so this new version is very welcome.

Ask is doing something clever with their blog search, using user behavior data from the most popular feed reader, Bloglines, in the relevance rank of the blog search results. This allows them to figure out which feeds are most popular, exclude spammy and useless blogs, and focus attention on popular posts.

Filtering out crap is particularly important in the blogosphere. As Jim Lanzone said, as of October 2005, only 37k of the supposed 20M weblogs out there had substantial interest (more than 20 subscribers) on Bloglines. Interesting blogs are fair outweighed by the uninteresting. Focusing attention on the useful and worthwhile is critical.

By the way, do not miss the advanced search on Bloglines. It allows goodies like limiting searches to only your Bloglines subscriptions. It appeared a little buggy in my usage -- oddities like the search options changing inexplicably or sticking inappropriately on future quick searches -- but it could be really nice when they get those problems ironed out.

See also Chris Sherman's review.

Yahoo Video, metasearch, and grandma

Ethan Fassett has the post on the Yahoo Search blog announcing that Yahoo Video now allows uploading of videos, like YouTube and Google Video.

The new product is an extension of their existing Video search that crawls videos on other sites. As a nice side effect of this, Yahoo Video has content right now, even before they get a lot of videos uploaded to their system. Yahoo took a similar approach in Yahoo Local; restaurant reviews entered on Yahoo are supplemented by restaurant reviews crawled from the Web.

A big problem with Yahoo Video is that it does not convert all the videos to Flash. As Paul Boutin said:
The guys behind YouTube hit the sweet spot ... They made it head-slappingly easy to publish and play video clips by handling the tricky parts automatically.

YouTube's servers convert your vid to a standardized format, but you don't need to know what that format is. If you send the URL to your aunt, it'll play in her browser without spraying the screen with pop-ups and errors.

There's no software to install, no settings to muck with. The video auto-plays as soon as you load the page, without launching more windows.
Simple enough that grandma can use it. That's the sweet spot.

See also Michael Arrington's review.