Geeking with Greg: 10/01/2007

Wednesday, October 31, 2007

Personalized search for movies

Seung-Taek Park and David Pennock from Yahoo Research had a good paper on personalized search at KDD 2007, "Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing" (PDF).

While the paper focuses on personalized search for movies, the techniques discussed are applicable to other types of search.

The authors start with some motivation that probably sounds familiar to readers of this blog:

Recommender systems are widely used ... to overcome information overload ... Information retrieval systems list relevant items ... only if a user asks for it ... Recommender systems predict the needs of a user ... and recommend items ... even though the user does not specifically request it.

We build a prototype personalized movie search engine called MAD6 ... MAD6 combines both information retrieval and collaborative filtering techniques for better search and navigation.

I like the technique MAD6 uses for personalized search. They use an "item-based collaborative filtering algorithm to calculate a user's expected ratings" on items in search results, fill in any gaps with average ratings from the general population, then re-rank the items.

For example, if a searcher rated Terminator and Terminator 2 highly, the personalized search results would first order the search results by relevance to the search terms and popularity, then re-rank Terminator, Terminator 2, and anything related to those two movies higher in the search results. In the example in the paper, this resulted in the top 5 search results for a query for [arnold action] being Terminator 2, Commando, True Lies, Last Action Hero, and Terminator.
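
A minimal sketch of this kind of re-ranking, in Python, may make the idea more concrete. This is not the MAD6 code: the toy ratings, the choice of cosine similarity, and every name below (RATINGS, predict, rerank, and so on) are assumptions made purely for illustration. The searcher's own ratings are used where they exist, an item-based prediction is used where there is overlapping evidence, and the population average fills any remaining gaps.

```python
from math import sqrt

# Hypothetical toy data: user -> {movie: rating on a 1-5 scale}.
RATINGS = {
    "alice": {"Terminator": 5, "Terminator 2": 5, "Commando": 4, "Twins": 2},
    "bob": {"Terminator": 4, "Terminator 2": 5, "True Lies": 4, "Twins": 1},
    "carol": {"Terminator": 2, "Twins": 5, "True Lies": 3, "Commando": 2},
    "dave": {"Junior": 3},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    items = {}
    for user, rated in ratings.items():
        for item, rating in rated.items():
            items.setdefault(item, {})[user] = rating
    return items

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict(user_ratings, item, items):
    """Similarity-weighted average of the user's own ratings, or None if no evidence."""
    num = den = 0.0
    for rated_item, rating in user_ratings.items():
        if rated_item not in items:
            continue
        sim = cosine(items[item], items[rated_item])
        num += sim * rating
        den += sim
    return num / den if den > 0 else None

def rerank(results, user_ratings, ratings):
    """Re-order search results by expected rating; ties keep their original order."""
    items = item_vectors(ratings)

    def score(item):
        if item in user_ratings:          # the searcher already rated it
            return user_ratings[item]
        if item not in items:             # nobody has rated it
            return 0.0
        predicted = predict(user_ratings, item, items)
        if predicted is not None:
            return predicted
        # Fill the gap with the average rating from the general population.
        return sum(items[item].values()) / len(items[item])

    return sorted(results, key=score, reverse=True)

# Results as they might arrive from the relevance/popularity ranking.
RESULTS = ["Junior", "Twins", "True Lies", "Commando", "Terminator 2"]
SEARCHER = {"Terminator": 5, "Terminator 2": 5, "Twins": 1}
print(rerank(RESULTS, SEARCHER, RATINGS))
# ['Terminator 2', 'Commando', 'True Lies', 'Junior', 'Twins']
```

Note that Junior, which shares no raters with anything the searcher has rated, lands in the middle on its population average rather than being pushed up or down by personalization.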

As the authors report, this order was significantly different than the norm for a search for [arnold action] on other search engines. In their tests, they found their personalized rank performed very well on navigational queries -- when people already know what they are looking for -- but not as well on less directed informational queries.

The paper explains why (where GRank is a general rank that orders by popularity, PRank is the personalized search, and Web is a Web search):

When navigational queries are submitted, participants are more satisfied with PRank and Web than GRank. However, when informational queries are submitted, participants prefer GRank rather than PRank and Web.

One possible explanation is that, when participants submit navigational queries, they may have very clear target movies in their minds. These movies may be their favorites and are more likely rated before the test.

However, when informational queries are submitted, participants may not have clear target movies and [fewer] returned items ... [may] be rated ... Then ... the item-based algorithm may be inaccurate due to the lack of user information ... The item-based algorithm suffers from a cold start problem. We believe users' satisfaction of PRank will increase as users provide more ratings.

This result is unfortunate. A goal of recommender systems is to enhance discovery of unfamiliar items. If PRank is performing poorly on informational queries, it is failing at this task.

This result is surprising to me though. It should be possible to tune PRank to only modify the rankings when it has sufficient evidence that the change would be an improvement, otherwise falling back to GRank. First do no harm. PRank should only make a change when the majority of people will see the change as an improvement.
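
One way such a "do no harm" rule could look is sketched below in Python. It is only one possible reading of the suggestion, not anything from the paper: the function name, the coverage measure, and the 0.5 threshold are all hypothetical. The idea is to leave the general ranking alone unless enough of the results have personalized evidence, and even then to reorder only the items that have it.

```python
def do_no_harm_rank(general_rank, personal_scores, min_coverage=0.5):
    """Personalize only when enough results have personalized evidence.

    general_rank:    items in general (GRank-style) order.
    personal_scores: item -> predicted rating, only for items with evidence.
    min_coverage:    hypothetical threshold on the fraction of covered items.
    """
    if not general_rank:
        return []
    covered = [item for item in general_rank if item in personal_scores]
    if len(covered) / len(general_rank) < min_coverage:
        # Too little evidence: fall back to the general ranking unchanged.
        return list(general_rank)
    # Reorder the covered items among the slots they already occupy, so
    # items without evidence keep their general-rank positions.
    ordered = iter(sorted(covered, key=personal_scores.get, reverse=True))
    return [next(ordered) if item in personal_scores else item
            for item in general_rank]

general = ["A", "B", "C", "D"]
print(do_no_harm_rank(general, {"D": 5.0}))                      # too sparse, unchanged
print(do_no_harm_rank(general, {"B": 2.0, "C": 4.0, "D": 5.0}))  # personalized
```

A real system would want a better evidence test than raw coverage, for example one based on how confident each prediction is, but the fallback structure would be the same.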

More generally, it should be possible to tune the recommender to favor serendipity and enhance discovery in informational queries while also supporting re-finding in navigational queries. Serendipity largely reflects the amount of surprise in the recommendations -- pushing away from the popular and toward the unusual -- while re-finding is merely surfacing or annotating items seen before. It should be possible to do both.

A very interesting paper and a worthwhile read. I love the approach of layering an item-based recommender on top of search results to create a form of personalized search (Findory made an attempt ([1] [2]) at doing something similar in web search). By looking at specific actions, not only can the personalized search act at a finer level of detail than Google Personalized Search ([1] [2]), but also it can adapt immediately to short-term trends, what you are searching for right now.

By the way, let me note that the PDF for this paper used to be publicly available, but appears to have been pulled. It now only is available with KDD or ACM Digital Library membership. I often do not write about papers that cannot easily be downloaded, but this one is sufficiently interesting that I wanted to make sure people knew about it.

Also, this paper, as many others, cites Sarwar et al., 2001 as the first work on item-based collaborative filtering. As I have said before, that may not be accurate.

Update: Seung-Taek Park in the comments gave an alternative location (PDF) for downloading the paper. I changed the link at the beginning of this post to point directly to the PDF file. Thanks, Seung-Taek!

Monday, October 29, 2007

Google Tech Talk on similarities

Yury Lifshits gave a Google Tech talk, "Similarity Search: A Web Perspective", surveying various algorithms for finding similar items. Slides (PDF) from the talk are available.

In the talk, Yury mentions his tutorial, "Algorithms for Nearest Neighbor Search", which should be of interest to those who want to dive in deeper.

Thursday, October 25, 2007

Facebook and the value of blocking Google

In his post, "The $15 billion nonsense", Nick Carr nails it:

Extrapolating Facebook's true worth from Microsoft's investment is a ridiculous exercise ... The investment ... was a price Microsoft had to pay to nail down the partnership.

Partnering ... is far more about gaining future strategic options and blocking the advance of ... Google ... than about making a financial gain.

This was $240M to gain share for Microsoft's advertising platform at the expense of Google and Yahoo. It says nothing about the market value of Facebook.

Update: Mike Masnick reports on rumors that some crazy hedge funds might have gone in at the same $15B valuation for Facebook. As Mike says, "Those hedge funds don't get any of those additional benefits that Microsoft gets." If true, this news would support a $15B valuation for Facebook, at least if you believe those hedge funds are responsible stewards of their investors' money.

Update: John Battelle says, "I think no one ... has truly grokked what Facebook has a shot at doing - Adsense driven not by search queries, but by personal profile." Call me a skeptic on this one. I think personalized advertising will produce much higher returns by reaching back to the last action with purchase intent than by targeting a coarse personal profile.

Tuesday, October 23, 2007

Searchers say, please read my mind

Greg Sterling at Search Engine Land reports on a Kelton Research poll that found considerable frustration among search engine users and a desire for a search engine that can "read their minds":

72.3 percent of Americans experience "search engine fatigue" when researching a topic on the Internet ... More than three out of four (75.1 percent) of those who experience search engine fatigue report getting up and physically leaving their computer without the information they were seeking.

Kelton asked survey respondents whether they wished that search engines like Google could, in effect, read their minds, delivering the results they were actually looking for. . . That capability is something that 78 percent of all survey-takers "wished" for.

As Greg Sterling says, this sounds like a common hunger for personalized search.

See also a March 2005 post where, in response to Udi Manber's statement that "we are not in the mind reading business," I said, "If you need to read minds ... well then you better read minds. [Searchers will] think it's your fault, not theirs, if you don't give them what they need."

Monday, October 22, 2007

The Web titans arrive in Seattle

Todd Bishop at the Seattle PI writes about the branch offices Yahoo and Google have opened in Seattle. Some excerpts:

Yahoo's plan to put a large engineering outpost in Bellevue -- down the road from Google's Kirkland branch and Microsoft's Redmond campus -- means that all three giants of Internet search and Web services will have large offices in the region.

Added to other existing players -- such as Amazon.com, RealNetworks, InfoSpace and Adobe's Fremont branch -- they promise to turn the region into more of a center for Web-based technology.

I am quoted in the article with some thoughts on the impact to technology firms in Seattle and computer science at the University of Washington.

Friday, October 19, 2007

Google News, Krishna Bharat, and RecSys 2007

Google researcher and creator of Google News Krishna Bharat just gave the keynote talk at the Recommender Systems 2007 conference.

The talk had a pleasantly idealistic focus on increasing access to knowledge. Krishna clearly sees helping people find news information as a noble and important mission.

Krishna devoted most of the early part of the talk to discussing the history of writing and information broadcast, ending with the claim that the Web was creating a change in news consumption and access equally revolutionary to radio and television.

This revolution comes from universal and easy access to news and lower costs of producing news. Krishna saw this as having a very broad impact, saying, "The Internet can (and will) do a lot for democracy."

Even so, Krishna warned of challenges, saying, "Technology enables free speech but doesn't guarantee it," and expressing concerns about censorship. He said the goal of Google News was to ensure the spread of knowledge, multiple perspectives, and differing opinions.

To provide multiple perspectives, Google News crawls a broad list of sources, ranks and clusters them, then explicitly exposes the clusters to readers. That makes it easy for people to see the difference in, for example, how a hostage crisis is covered in South Korea and Pakistan.

The clustering attempts to group stories on the same event together. Krishna made the interesting comment that the clusters will change with time, with old and new stories shifting clusters as follow-up stories on an event appear. They use a technique Krishna only broadly described as an agglomerative hierarchical clustering algorithm.
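
Since the talk gave no details, the following is only a textbook sketch of agglomerative clustering in Python, not Google's method. The bag-of-words vectors, average linkage, cosine similarity, the 0.3 stopping threshold, and the toy headlines are all assumptions chosen for illustration. Each story starts in its own cluster, and the two most similar clusters are merged repeatedly until no pair is similar enough.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words term counts for a headline."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_similarity(c1, c2, vectors):
    """Average linkage: mean pairwise similarity between cluster members."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def agglomerate(headlines, threshold=0.3):
    """Merge the most similar clusters until none exceeds the threshold."""
    vectors = [vectorize(h) for h in headlines]
    clusters = [[i] for i in range(len(headlines))]  # one cluster per story
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = cluster_similarity(clusters[x], clusters[y], vectors)
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break  # remaining clusters are about different events
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

HEADLINES = [
    "hostages released after talks in afghanistan",
    "korean hostages released in afghanistan",
    "red sox win world series",
    "sox sweep rockies to win series",
]
print(agglomerate(HEADLINES))  # [[0, 1], [2, 3]]
```

Because the clusters are rebuilt from pairwise similarities, adding follow-up stories and re-running can move an older story into a different cluster, which is one plausible reason clusters would shift over time.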

Krishna provided more details on how Google determines the relevance of stories and authority of sources. He started by describing how human editors determine the relevance of stories, a long list that included scope/impact, urgency, lack of negativity, unexpectedness, lack of ambiguity, the "human element", ability of the audience to identify with the story, elite (e.g. celebrity) references, consonance, continuity, market forces, local bias, and ideological bias.

Krishna then said that Google determines article relevance by looking at the authority of the source, timeliness of the article, whether it is an original piece, placement by the editors on the source page, the apparent scope and impact, and the popularity of the article.

A big piece of determining the relevance of a story is determining the authority of the source. Google estimates that by looking at the characteristics of all the articles produced by the source (including number of non-duplicate stories, length of the articles, breadth of the articles, number of important/breaking stories, click rate by Google News readers, and the average quality of the writing), PageRank of the news website, and real world data on the news company (e.g. number of employees).

Krishna did see personalization and recommendations for news as a long term goal, saying we want to "get the right news to the right audience." And, Krishna has been interested in this for a long time, all the way back to his 1995 work on the Krakatoa Chronicle. As for more recent work, Krishna summarized the WWW 2007 paper, "Google News Personalization".

Overall, Krishna focused on Google's mission of making information universally accessible and useful. He clearly wants to help people find news and be informed about world events, using whatever tools, personalization or otherwise, serve that mission.

Predictive accuracy is not enough

Sun Labs researcher Paul Lamere posts a list of properties of a good music recommender system:

A good recommendation has three aspects:

familiarity - to help us gain trust in the recommender
novelty - without new music, the recommendation is pointless
relevance - the recommended music has to match my taste

The emphasis on credibility is insightful. Recommendations that are too obscure may be perceived as low quality because the user cannot easily and quickly evaluate them.

This is just one of many examples of the yawning gap between predictive accuracy -- accurately predicting what people want to buy next -- and perceived quality -- the usefulness of the recommendations to users.

Saturday, October 13, 2007

Advances driven by the future of computing

UW CS Professor extraordinaire Ed Lazowska gave a talk recently, "Computer Science: Past, Present and Future", that covered some of the large new opportunities that will be created from advances in computing in the next ten years.

What I found most insightful in the talk was Ed's discussion of how many fields will go from data poor to data rich, a process driven by cheaper sensors and the availability of massive parallel computing power to analyze, summarize, and understand the incoming flood of sensor data. He specifically held up oceanography, medicine, astronomy, and biology as fields that will be forever changed by this shift.

If you are short on time, the slides (PDF) are available from a similar talk Ed gave at FCRC 2007. Slide 37 summarizes the major opportunities and some details on those opportunities start on slide 46.

Update: Oops, sorry, the video from Ed's talk is not available yet. It should be available in just a few days.

Update: The video from the talk is now available.

Attending RecSys 2007

I will be at the ACM Recommender Systems 2007 conference Oct 19-20 next week in Minneapolis. If you will be there, please say hello if you see me.

Wednesday, October 10, 2007

Highly available distributed hash storage from Amazon

Amazon CTO Werner Vogels announces an awesome SOSP 2007 paper he co-authored with nine other people at Amazon, "Dynamo: Amazon's Highly Available Key-value Store" (PDF).

Like the Google File System and Google Bigtable papers, this Amazon paper describes a critical part of Amazon's infrastructure, a reliable, fast, and scalable storage system.

Some excerpts from the paper:

There are many services on Amazon's platform that only need primary-key access to a data store ... [and] using a relational database would lead to inefficiencies and limit scale and availability.

Dynamo uses a synthesis of well known techniques to achieve scalability and availability: Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. Dynamo employs a gossip based distributed failure detection and membership protocol ... Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution.

In the past year, Dynamo has been the underlying storage technology for a number of the core services in Amazon's ecommerce platform.
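
The consistent hashing mentioned in the excerpt above can be sketched in a few lines of Python. This is a generic illustration, not Dynamo's implementation: the HashRing class, the number of virtual nodes, the use of MD5 as a convenient stable hash, and the node names are all assumptions. Each node owns several points on a ring, a key is stored on the first few distinct nodes found clockwise from its hash, and adding or removing a node only affects the keys adjacent to that node's points.

```python
import bisect
import hashlib

def ring_hash(value):
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), vnodes=8):
        self.vnodes = vnodes  # hypothetical number of ring points per node
        self._ring = []       # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place the node at several positions to spread its load."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def preference_list(self, key, replicas=3):
        """First `replicas` distinct nodes clockwise from the key's position."""
        if not self._ring:
            return []
        start = bisect.bisect(self._ring, (ring_hash(key), ""))
        chosen = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replicas:
                    break
        return chosen

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"cart:{i}" for i in range(100)]
before = {k: ring.preference_list(k, 1)[0] for k in keys}
ring.add("node-e")
after = {k: ring.preference_list(k, 1)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-e")
```

The point of the usage example is that no key moves between the existing nodes when node-e joins; the only keys that change owner are the ones node-e takes over, which is why no manual repartitioning is needed.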

The paper overall is fascinating, not only for the discussion of Amazon's needs and how the authors thought about them when building Dynamo, but also for critical discussion of other distributed databases. The related work section (Section 3) is worth its weight in gold.

A paper on an Amazon distributed database immediately invites comparison with the Google Filesystem and Bigtable. Unlike Google FS (and also unlike distributed databases like C-store), Amazon's Dynamo oddly is optimized for writes. From the paper:

Many traditional data stores execute conflict resolution during writes and keep the read complexity simple.

Dynamo targets the design space of ... a data store that is highly available for writes ... We push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected.

This choice surprises me and does appear to lead to some problems later. The 15ms average read latency seems high if any service needs to make more than a dozen database queries when serving a web page; latencies like that make this a pretty heavyweight data store service.

As they described in the paper, at least a few groups at Amazon needed a much lighter weight service that could be hit thousands of times, so they had to use rather extreme parameters (requiring all replicas be available for writes, and only one for reads) to force Dynamo to work for them. With those parameters, they effectively turned off the high availability for writes and pushed the complexity of conflict resolution back away from reads, which makes me wonder if write optimization really should have been part of Dynamo's design goals in the first place.
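
The tradeoff in those parameters can be made concrete with a small Python sketch. It models a strict quorum over N replicas, which is a simplification of the "quorum-like" technique the paper describes, and the function names and the choice of N = 3 are assumptions for illustration. A read of R replicas is guaranteed to see the latest write to W replicas only if every possible read set overlaps every possible write set, which holds exactly when R + W > N.

```python
from itertools import combinations

def quorums_overlap(n, r, w):
    """True if every read set of size r shares a replica with every write set of size w."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(replicas, r)
               for write_set in combinations(replicas, w))

def tolerated_failures(n, r, w):
    """How many replicas can be down while reads and writes still succeed."""
    return {"reads": n - r, "writes": n - w}

N = 3  # hypothetical replication factor
for r, w in [(2, 2), (1, 3), (1, 1)]:
    print((N, r, w), quorums_overlap(N, r, w), tolerated_failures(N, r, w))
```

With R = 1 and W = N, reads are as cheap as they can be and still overlap every write, but a write can no longer tolerate even a single unavailable replica. That is the sense in which those settings give up the high availability for writes.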

Another difference with Google FS and Bigtable is that Google's systems organize data as a map, supporting high performance range scans over the data. On the one hand, Google may have more need for that, with its building of search indexes and analyzing massive text and log data. On the other hand, Amazon has massive text and log data too, and Dynamo seems like it may not be able to help with large scale data indexing and analysis tasks.

Both the lack of range scans and the optimization for writes over reads appear to stem from the authors' focus on the needs of the shopping cart. They repeatedly return to that as a motivating example. It is not clear to me why they chose to focus on that task over their other needs.

I had a couple other surprises. Dynamo relies on random, uniform distribution of the keys for load balancing -- something that seems likely to run into problems with highly skewed access patterns -- rather than supporting additional replication of frequently accessed data. More serious, Dynamo is limited to a few hundred nodes because they punted on some of the hard problems of ensuring consistency in metadata (like their routing table) at larger scale.

Overall, a very interesting paper and system from Amazon. I love how Amazon has adapted the motivation of P2P distributed hash tables like Chord and Pastry to an environment with all trusted machines like an Amazon data center, taking advantage of that to reduce latency and improve reliability. I also am impressed by how remarkably configurable Dynamo is -- from the underlying storage to the number of replicas to the means of conflict resolution -- so that it can adapt to the widely varying needs of different groups at Amazon.

By the way, unlike Google, Yahoo, and Microsoft, Amazon publishes academic papers only rarely. They deserve kudos for doing so here. With this paper, Amazon is revealing some of the remarkable challenges in large scale computing they face. As people are attracted to those challenges, perhaps this will be the start of more openness from Amazon.

Saturday, October 06, 2007

Starting Findory: Hardware go boom

Computers go down, a lot more often than we might like.

For most of Findory's four years, it ran on six servers. In that time, those servers had one drive failure, one bad memory chip, and four power supplies fail.

Of these, only two of the power supply failures caused outages on the site, one of one hour and one a painful eight hours. There were a few other short outages due to network problems in the data center, typically problems that took the entire data center offline temporarily.

Findory's six servers were all cheap commodity Linux boxes, typically with a single-core low-end AMD processor, 1G of RAM, and a single IDE disk. Findory was cheap, cheap, cheap.

The reliability of Findory over its lifetime was perfectly acceptable, even quite good compared to other little startups with similar levels of traffic, but I think it is interesting to think about what may have been able to prevent the outages without considerable expense.

Surprisingly, RAID disks would not have helped much, though they would have made it easier to recover from the one drive failure that did occur on a backend machine. Better redundancy on the network may have helped, but would have been expensive. Splitting servers and replicating across data centers may have helped, but would have been both expensive and complicated for a site of Findory's size and resources.

Probably the biggest issue was that Findory did not have a hot standby running on its small database. A hot standby database would have avoided both the one hour outage and the eight hour outage. Those outages were caused by losing the first, then a second power supply on the database machine.

Looking back at all of these, I think it was particularly silly not to have the database hot standby. The cost of that would have been minimal and, not only would it have avoided the outage, but it would have reduced the risk of data loss by having a constant database backup. I almost added the hot standby many times, but kept holding off on it. While I may have mostly gotten away with it, it clearly was a mistake.

Other thoughts? Anyone think it was foolish not to run in RAID and not to be split across data centers? Or, maybe you have the opposite opinion, that small startups should not worry much about uptime and small outages are just fine? How about the hardware failure rates, are they typical in your experience?

Please see also my other posts in the "Starting Findory" series.

Thick, thin, and Office Live

The new Office Live is out and seems to be getting widely panned.

Mike Arrington says, "Microsoft has failed to understand the real power of Google Docs - easy, no hassle document creation, collaboration and access from the browser."

Richard MacManus writes, "Is this what Microsoft's answer to the Web Office is - tacked on features to its all-powerful desktop suite?"

But, the ever contrarian Nick Carr responds to Richard:

Well, yes, that's precisely what Microsoft's answer is. And while MacManus is right that Microsoft's offering is "messy" and even "muddled," one should not underestimate the company's ability to shape the market by tacking features onto its "all-powerful desktop suite." It's a strategy, after all, that has served the company well many times in the past.

Microsoft's online offering does not have to be better than, say, Google's; it just has to be (a) more convenient for typical business users with (b) good enough functionality.

Why should Microsoft do more than bolt features on to Office if those features are sufficient to undermine the appeal of switching elsewhere?

If you are collaborating a lot on documents, you could switch to Google Docs. Or, you could just get by using the beast with which you and all your colleagues already are familiar, Microsoft Office, with all its additional features. Office Live appears to be an attempt to add just enough collaboration to reduce the appeal of switching to Google Docs. Not a bad strategy.

As for Mike's point, well, yes, but Microsoft does not want everything to move to the browser. Their entire business depends on a thick client -- a PC loaded with code and data -- having more value than a thin client.

I am reminded of Upton Sinclair's words: "It is difficult to get a man to understand something when his salary depends on his not understanding it." Likewise, it will be difficult to get Microsoft to promote apps running in the browser when their business depends on maintaining the value of a heavyweight desktop machine.

Update: I want to be clear that there is a difference between analyzing Microsoft's strategy and endorsing it. I am not rooting for Microsoft or anyone else here, just trying to understand why they do what they do.

Thursday, October 04, 2007

Web 2.0 is dead and spammy, long live Web 3.0?

While I am no more a fan of the name Web 3.0 than Web 2.0, Jason Calacanis has an entertaining rant on where the Web is going:

Web 3.0 throttles the "wisdom of the crowds" from turning into the "madness of the mobs" we've seen all to often, by balancing it with a respect of experts. Web 3.0 leaves behind the cowardly anonymous contributors and the selfish blackhat SEOs that have polluted and diminished so many communities.

This reminds me of what Xeni Jardin wrote back in Oct 2005:

Web 2.0 is very open, but all that openness has its downside: When you invite the whole world to your party, inevitably someone pees in the beer.

These days, peed-in beer is everywhere. Blogs begat splogs -- junk diaries filled with keyword-rich text to lure traffic for ad revenue.

See also my previous post, "Growth, crap, and spam", where I said, "There seems to be a repeating pattern with Web 2.0 sites. They start with great buzz and joy from an enthusiastic group of early adopters, then fill with crud and crap as they attract a wider, less idealistic, more mainstream audience."

See also my previous posts, "Community, content, and the lessons of the Web" and "Getting the crap out of user-generated content".

[Calacanis post found via Nick Carr]

Recommender systems and diversity

Knowledge@Wharton recently published an article, "Reinforcing the Blockbuster Nature of Media: The Impact of Online Recommenders". The article discusses research work by Kartik Hosanagar and Dan Fleder at Wharton on whether recommender systems improve diversity of sales and help people discover items that otherwise might be buried in the long tail.

An excerpt from the article:

Recommenders -- perhaps the best known is Amazon's -- tend to drive consumers to concentrate their purchases among popular items rather than allow them to explore and buy whatever piques their curiosity, the two scholars suggest.

Hosanagar and Fleder argue that online recommenders "reinforce the blockbuster nature of media." And they warn that, by deploying standard designs, online retailers may be recreating the very phenomenon -- circumscribed media purchasing choices -- that some of them have bragged about helping consumers escape.

I am briefly quoted in the article arguing for a somewhat milder conclusion, saying:

Linden, reached via email, declares himself untroubled by Hosanagar and Fleder's findings. "Recommendation algorithms easily can be tuned to favor the back catalog -- the long tail -- as Netflix does," he argues. Netflix, the online DVD purveyor, consciously highlights obscure titles in designing its recommender.

Linden also argues that, in the absence of online recommenders, consumers would turn to even cruder tools, like traditional bestseller lists. "You have to ask what content would otherwise be in place of the recommendations and whether that content would have greater diversity," he says.

Hosanagar and Fleder have two papers detailing their work, a very long paper titled "Blockbuster Culture's Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity" and a shorter ACM article, "Recommender systems and their impact on sales diversity".

The papers are an interesting read. What I found most surprising about their work was that, in their simulations, a recommender algorithm that did compensate for bestseller bias (called r4 in their paper) still reduced diversity. Although I had questions about their simulation model (which I already have discussed with Dan), I think their work should serve as an additional caution to those working on recommender systems to be concerned about the impact choices in the algorithm can have on the level of diversity, especially if one of the business goals of the recommendations is to drive movement in the back catalog.

Please also see comments on this research work from recommender researcher and U of Michigan Professor Paul Resnick, particularly his thoughts on the simulation framework used.

Wednesday, October 03, 2007

Revisiting Yahoo Answers

Looking at Yahoo Answers again two years after its launch, I think I was wrong to think of it as a question answering website.

Despite the name, Yahoo Answers is a discussion forum. People are using it like a newsgroup, chatting about various topics. They are not using Yahoo Answers as much to generate authoritative answers to questions.

For example, looking at the "popular" answers right now, I see these questions: "Rudy Giuliani vs. Hillary Clinton?" "What's the last CD that you started out LOVING, but you played it so much you 'retired' it indefinitely?" "Egypt: when did I feel sad in Egypt?" "Name 5 one hit wonders of the eighties?"

These are not questions that have any objective answer. Rather, they are a discussion started with a subject line in the form of a question. People are "answering" a question not to provide an answer, but to engage in a conversation and chat with other people.

If Yahoo Answers is a discussion forum, it has a few implications. First, it should be compared not with the now-defunct Google Answers or Google's NLP question answering work, but with Google Groups, Slashdot, and other popular forums.

Second, those seeking to emulate Yahoo Answers probably would be mistaken to focus on question answering over community. People using Yahoo Answers are seeking conversations, not truth. The site is successful because it creates fun discussions and entertains people, not because it yields knowledge.

Third, Yahoo Answers itself may want to emphasize features that favor discussion and deemphasize features focused on generating high quality answers. In particular, while the "best answer" feature creates a fun, winner-takes-all type of contest, ending the discussions after a short period of time may be undesirable if the discussion is still attracting attention. In addition, if promoting discussion is the goal, features that help people find discussions they want to join would be beneficial; search and browse features may want to favor that over finding direct answers to questions.

I have to admit, I am late to the party in understanding the true purpose of Yahoo Answers. Two years ago, when Yahoo Answers launched, Gary Price said, "Will Yahoo Answers simply be the next generation of an online bulletin board?" Looks like he was right on.

See also my Dec 2005 post, "Yahoo Answers and the wisdom of the crowd", where I, mistakenly it appears, focused on Yahoo Answers as a question answering tool.
