Monday, December 24, 2007

Papers from WSDM 2008 on click position bias and social bookmark data

The WSDM conference is being held Feb 11-12 at Stanford University. I am not sure I will make it down from Seattle for it, but, if you are in the SF Bay Area and are interested in search and data mining on the Web, it is an easy one to attend.

Most of the papers for the conference do not appear to be publicly available yet, but I wanted to highlight two of the ones I could find.

Microsoft researchers Nick Craswell, Onno Zoeter, Michael Taylor and Bill Ramsey wrote "An Experimental Comparison of Click Position-Bias Models" (PDF) for WSDM 2008. The work looks at models for how "the probability of click is influenced by ... [the] position in the results page".

The basic problem here is that just putting a search result high on the page tends to get it more clicks even if that search result is less relevant than ones below it. If you are trying to learn which results are relevant by looking at which ones get the most clicks, you need to model and then attempt to remove the position bias.

The authors conclude that a "cascade model" -- which assumes "that the user views search results from top to bottom, deciding whether to click each result before moving to the next" -- most closely fits searcher click behavior near the top of the search results. However, their "baseline model" -- which assumes "users look at all results and consider each on its merits, then decide which results to click" (that is, position does not matter) -- seemed most accurate for items lower in the search results.

The authors say this suggests there may be "two modes of results viewing": in one, searchers click the first thing that looks relevant in the top results; if they fail to find anything good there, they shift to scanning all the results before clicking anything.
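To make the cascade model concrete, here is a minimal sketch of the click probabilities it predicts at each position. The relevance (attractiveness) numbers are invented for illustration; this is just the model's core assumption in code, not the authors' implementation.

    # Minimal sketch of the cascade model: the user scans results top to
    # bottom, clicks result i with probability r_i, and stops at the first
    # click. The relevance values below are invented for illustration.
    def cascade_click_probs(relevance):
        """P(click at position i) = r_i * product over j < i of (1 - r_j)."""
        probs = []
        p_reach = 1.0  # probability the user examines this position at all
        for r in relevance:
            probs.append(p_reach * r)
            p_reach *= 1.0 - r  # the user continues only if there was no click
        return probs

    # A mediocre result placed first still soaks up clicks, which is the
    # position bias the paper tries to model out.
    print(cascade_click_probs([0.3, 0.6, 0.2]))  # -> [0.3, 0.42, 0.056]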

By the way, if you like this paper, don't miss Radlinski & Joachims' work on learning relevance rank from clickstream data. It not only discusses positional bias in click behavior in search results, but also attempts the next and much more ambitious step of optimizing relevance rank by learning from click behavior. The Craswell et al. WSDM 2008 paper does cite some older work by Joachims and Radlinski, but not this fun and more recent KDD 2007 paper.

The second paper I wanted to point out is by Paul Heymann, Georgia Koutrika and Hector Garcia-Molina at Stanford, "Can Social Bookmarks Improve Web Search?" I was not able to find that paper, but a slightly older tech report (PDF) with the same title is available (found via ResourceShelf). This paper looks at whether data from social bookmark sites like del.icio.us can help us improve Web search.

This is a question that has been subject to much speculation over the last few years. On the one hand, social bookmark data may provide high quality labels on web pages because the bookmarks tend to be created by people for their own use (to help with re-finding). On the other hand, manually labeling the Web is a gargantuan task and it is unclear if the data is substantially different from what we can extract automatically.

Unfortunately, as promising as social bookmarking data might seem, the authors conclude that it is not likely to be useful for Web search. While they generally find the data to be of high quality, they say the data covers only about 0.1% of the Web, only a small fraction of those pages are not already crawled by search engines, and the tags in social bookmarking data are almost always "obvious in context" and "would be discovered by a search engine." Because of this, the social bookmarking data "are unlikely to be numerous enough to impact the crawl ordering of a major search engine, and the tags produced are unlikely to be much more useful than a full text search emphasizing page titles."
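Their "obvious in context" finding suggests a simple check anyone could run on a bookmark corpus: how often does a tag already appear in the page's own title or text? A rough sketch, with made-up data and naive tokenization; this is not the authors' code.

    # Rough check of whether a bookmark tag is "obvious in context",
    # i.e. already present in the page's title or text. Not the authors'
    # code; the example pages and tags are made up.
    import re

    def tag_is_obvious(tag, title, body):
        words = set(re.findall(r"\w+", (title + " " + body).lower()))
        return tag.lower() in words

    pages = [
        ("python", "The Python Tutorial", "An introduction to the Python language."),
        ("toread", "My reading list", "Papers I want to get to eventually."),
    ]
    obvious = sum(tag_is_obvious(tag, title, body) for tag, title, body in pages)
    print(f"{obvious} of {len(pages)} tags already appear on the page")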

Update: It looks like I will be attending the WSDM 2008 conference after all. If you are going, please say hello if you see me!

Friday, December 21, 2007

Interactive machine learning talk

Dan Olsen at BYU gave a talk, "Interactive Machine Learning", at UW CS a couple months back.

Dan's group is doing some clever work that combines machine learning and HCI. The UW CS talk is good but long. If you are short on time, first take a look at the fun short demo videos Dan's group produced.

I particularly recommend seeing the clever "Screen Crayons" (WMV) application for annotating documents and the "Teaching Robots to Drive" (WMV) demo of an intuitive interface for training a robot car. The "Image Processing with Crayons" (WMV) demo is also good for getting a quick introduction to the core idea.

Tuesday, December 18, 2007

Findory turns off the lights

Findory turned off its last webserver today. Sadness.

Previous posts ([1] [2] [3] [4] [5] [6]) have more details on the shutdown and Findory's history.

Monday, December 17, 2007

Microsoft and intelligent agents on the desktop

John Markoff quotes Microsoft Chief Research Officer Craig Mundie on the opportunity that multicore processors on our desktops create for AI and personalized software agents:
In the future, Mr. Mundie said, parallel software will take on tasks that make the computer increasingly act as an intelligent personal assistant.

"My machine overnight could process my in-box, analyze which ones were probably the most important, but it could go a step further," he said. "It could interpret some of them, it could look at whether I've ever corresponded with these people, it could determine the semantic context, it could draft three possible replies. And when I came in in the morning, it would say, hey, I looked at these messages, these are the ones you probably care about, you probably want to do this for these guys, and just click yes and I'll finish the appointment."
Craig had more extensive thoughts at the July 2007 Microsoft Analyst meeting on personalized assistants running on your desktop PC.

Eric Enge interviews Sep Kamvar

Eric Enge posted an interview with Google personalization guru Sep Kamvar.

Some highlights of Sep's answers below:
The two signals that we use right now are the search history and the location. We constantly experiment with other signals, but the two signals that have worked best for us are location and search history.

Some signals that you expect would be good signals, turn out not to be that good. So for example, we did one experiment with Orkut, and we tried to personalize search results based on the community that users had joined. It turns out that while people were interested in the Orkut communities, they didn't necessarily search in line with those Orkut communities.

It actually harkened back to another experiment that we did, where in our first data launch of personalized search we allowed everybody to just check off categories that were of interest to them. People did that, and people would check off categories like literature. Well, they were interested in literature, but they actually didn't do any searching in literature. So, what we thought would be a very clean signal, actually turned out to be a noisy signal.

When I think about what I am interested in, I don't necessarily think about what I am interested in that I search for and what I am interested in that I don't search for. That's something that we found was better learned algorithmically rather than directly.

A signal should be very closely aligned with search and what you are searching for in order for it to be useful to personalizing search ... In addition, we've found that your more recent searches are much more important than searches from a long time ago.
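Sep does not say how Google weights recent searches, but the general idea is easy to sketch: discount each past search by its age, for example with exponential decay. The half-life and the topics below are arbitrary choices for illustration, not anything Google has described.

    # Sketch of a recency-weighted search history profile using
    # exponential decay. The half-life and topics are arbitrary;
    # this is an illustration, not Google's actual method.
    from collections import defaultdict

    HALF_LIFE_DAYS = 14.0

    def interest_profile(history):
        """history: list of (days_ago, topic) pairs from past searches."""
        weights = defaultdict(float)
        for days_ago, topic in history:
            weights[topic] += 0.5 ** (days_ago / HALF_LIFE_DAYS)
        return dict(weights)

    # A search from yesterday counts for far more than one from three months ago.
    print(interest_profile([(1, "seattle"), (90, "seattle"), (2, "python")]))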
For more on the problems with explicitly extracting preferences -- as Sep found when asking each user directly for category interests in an early version of Google Personalized Search -- please see my post, "Explicit vs. implicit data for news personalization", and the links from that post.

For more on trying to use signals not closely aligned with search, please see my earlier post, "Personalizing search using your desktop files".

For more on the importance of focusing on recent searches for personalized search, please see also my past posts, "The effectiveness of personalized search" and "The many paths of personalization".

Thursday, December 13, 2007

Geoffrey Hinton on the next generation of NNets

AI guru Geoffrey Hinton recently gave a brilliant Google engEdu talk, "The Next Generation of Neural Networks".

If you have any interest in neural networks (or, like me, got frustrated and lost all interest in the mid-1990s), set aside an hour and watch the talk. It is well worth it.

The talk starts with a short description of the history of neural networks, focusing on the frustrations encountered, and then presents Boltzmann machines as a solution.

Geoffrey clearly is motivated by trying to imitate the "model the brain could be using." For example, after fixing the output of a model to ask it to "think" of the digit 2, he enthusiastically describes the model as his "baby", the internal activity of one of the models as its "brain state", and the output of different forms of digits it recognizes as a 2 as "what is going on in its mind."

The talk is also full of enjoyably opinionated lines, such as when Geoffrey introduces alternating Gibbs sampling as a learning method and says, "I figured out how to make this algorithm go 100,000 times faster," adding, with a wink, "The way you do it is instead of running for [many] steps, you run for one step." Or when he dismissively calls support vector machines "a very clever type of perceptron." Or when he criticizes locality sensitive hashing as being "50 times slower [with] ... worse precision-recall curves" than a model he built for finding similar documents. Or when he says, "I have a very good Dutch student who has the property that he doesn't believe a word I say," talking about how his group is verifying his claim about the number of hidden unit layers that works best.
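The "run for one step" joke is about contrastive divergence (CD-1): rather than running the alternating Gibbs chain to equilibrium, you take a single up-down-up pass and use the difference between the data statistics and the one-step reconstruction statistics as the gradient. Here is a toy NumPy sketch for a binary restricted Boltzmann machine; this is the general recipe, not Hinton's code.

    # Toy sketch of one contrastive divergence (CD-1) update for a binary
    # restricted Boltzmann machine. The general recipe, not Hinton's code.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(W, a, b, v0, lr=0.1):
        """One CD-1 step on a batch of binary visible vectors v0."""
        # Up: hidden probabilities and a sample, given the data.
        h0_p = sigmoid(v0 @ W + b)
        h0 = (rng.random(h0_p.shape) < h0_p).astype(float)
        # Down and up again: one step of alternating Gibbs sampling.
        v1_p = sigmoid(h0 @ W.T + a)
        h1_p = sigmoid(v1_p @ W + b)
        # Gradient: data correlations minus one-step reconstruction correlations.
        W += lr * (v0.T @ h0_p - v1_p.T @ h1_p) / len(v0)
        a += lr * (v0 - v1_p).mean(axis=0)
        b += lr * (h0_p - h1_p).mean(axis=0)

    # Tiny example: 6 visible units, 4 hidden units, a batch of 2 vectors.
    W = 0.01 * rng.standard_normal((6, 4))
    a, b = np.zeros(6), np.zeros(4)
    cd1_update(W, a, b, np.array([[1., 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1]]))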

You really should set aside an hour and watch the whole thing but, if you can't spend that much time or can't take the level of detail, don't miss the history of NNets in the first couple minutes, the description and demo of one of the digit recognition models starting at 18:00 (slide 19), and the discussion of finding related documents using these models starting at 31:40 (slide 28).

On document similarity, it was interesting that a Googler asked a question about finding similar news articles in the Q&A at the end of the talk. The question was about dealing with substantial changes in the types of documents you see -- big news events, presumably, that cause a bunch of new kinds of articles to enter the system -- and Geoffrey addressed it by saying that small drifts could be handled incrementally, but very large changes would require regenerating the model.
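For the related-documents part, the trained model maps each document down to a short binary code, and similar documents are the ones with nearby codes, so the lookup step is just Hamming distance. A sketch of that lookup; the codes below are fabricated, and training the network that produces them is omitted.

    # Sketch of the lookup step for document similarity with short binary
    # codes: nearest neighbors by Hamming distance. The codes are
    # fabricated; training the model that produces them is omitted.
    def hamming(a, b):
        return bin(a ^ b).count("1")

    codes = {"doc1": 0b10110010, "doc2": 0b10110110, "doc3": 0b01001101}

    def most_similar(query_code, k=2):
        return sorted(codes, key=lambda d: hamming(query_code, codes[d]))[:k]

    print(most_similar(0b10110011))  # doc1 (1 bit away), then doc2 (2 bits)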

In addition to handwriting recognition and document similarity, some in Hinton's group have done quite well using these models in the Netflix contest for movie recommendations (PDF of ICML 2007 paper).

On a lighter note, that is the back of Peter Norvig's head that we see at the bottom of the screen for most of the video. We get two AI gurus for the price of one in this talk.

BellKor ensemble for recommendations

The BellKor team won the progress prize in the Netflix recommender contest. They have published a few papers ([1] [2] [3]) on their ensemble approach that won the prize.

The first of those papers is particularly interesting for the quick feel it gives of the do-whatever-it-takes method they used. Their solution consisted of tuning and "blending 107 individual results .... [using] linear regression."
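To make that concrete, here is a toy sketch of blending: treat each predictor's output as a feature and fit combination weights by least squares on held-out ratings. Three fake predictors and made-up ratings stand in for BellKor's 107; this is an illustration, not their pipeline.

    # Toy sketch of blending predictors with linear regression. Three fake
    # predictors and made-up ratings stand in for BellKor's 107; an
    # illustration, not their actual pipeline.
    import numpy as np

    # Each column is one predictor's ratings for the same (user, movie) pairs.
    predictions = np.array([
        [3.1, 3.4, 2.9],
        [4.2, 3.9, 4.5],
        [1.8, 2.2, 2.0],
        [4.9, 4.6, 4.8],
    ])
    actual = np.array([3.0, 4.0, 2.0, 5.0])

    # Least-squares weights for combining the predictors, then the blend.
    weights, *_ = np.linalg.lstsq(predictions, actual, rcond=None)
    print(weights, predictions @ weights)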

This work is impressive and BellKor deserves kudos for winning the prize, but I have to say that I feel a little queasy reading this paper. It strikes me that this type of ensemble method is difficult to explain, makes it hard to understand why it works, and likely will be subject to overfitting.

I suspect not only will it be difficult to know how to apply the results of this work to different recommendation problems, but also that it may require redoing most of the tuning effort put in so far by the team if we merely swap in a different sample of the Netflix rating data for our training set. That seems unsatisfying to me.

It probably is unsatisfying to Netflix as well. Participants may be overfitting to the strict letter of this contest. Netflix may find that the winning algorithm actually is quite poor at the task at hand -- recommending movies to Netflix customers -- because it is overoptimized to this particular contest data and the particular success metric of this contest.

In any case, you have to admire the BellKor team's tenacity. The papers are worth a read, both for seeing what they did and for their thoughts on all the different techniques they tried.

Sunday, December 09, 2007

Facebook Beacon attracts disdain, not dollars

I have been watching the uproar over Facebook Beacon over the last couple weeks with some amusement.

The system was intended to aggregate purchase histories from some online retailers, a poorly thought-out attempt to deal with the lack of purchase intent that makes it difficult for Facebook to generate much revenue from advertising.

Om Malik calls ([1] [2] [3]) Facebook Beacon a "privacy nightmare", a "fiasco", and a "major PR disaster". He goes on to write that, even after the latest changes, "I don't think it's easy to trust Facebook to do the right thing."

Dare Obasanjo says "Facebook Beacon is unfixable" because "affiliate sites are pretty much dumping their entire customer database into Facebook ... without their customers permission" and accuses Facebook of "violations of user privacy to make a quick buck."

Eric Eldon at VentureBeat writes that Facebook is dealing with "a revolt against the feature, because it sends messages to your friends about your purchase and other online behavior .... whether or not you are logged in to Facebook and whether or not you have approved any data sharing."

The NYT quotes one Facebook user as saying, "Just because I belong to Facebook, do I now have to be careful about everything else I do on the Internet?" and quotes another as saying, "I feel like my trust in Facebook has been violated."

Facebook faces a hard challenge here. Facebook users are not coming to Facebook thinking of buying things. Because of this lack of commercial intent, most advertisements are likely to be perceived as irrelevant and useless to Facebook users. That will lead to low value to advertisers and low revenues for Facebook.

So, Facebook will struggle desperately to get the revenues promised by their $15B valuation, doing things that almost certainly will annoy and anger their users.

However, Facebook's success is largely due to a fad. People flock to whatever social networking site all their friends are using. It wasn't always Facebook. MySpace was once considered by some crowds to be the place to be.

I wonder if all the annoying things that Facebook starts to do with its advertising will be what makes Facebook become uncool. It may be what makes Facebook's fickle audience want to find some new place to hang, somewhere that doesn't suck, and makes Facebook become yesterday's news.

One Wikipedia to rule them all

John Battelle notes a study that reports:
In December 2005 ... 2% of the [top] links proposed by Google and 4% of those proposed by Yahoo came from Wikipedia.

Today 27% of Google's results on the first link alone come from Wikipedia, as do 31% of Yahoo's.
Nick Carr once wrote of this trend, saying:
Could it be that, counter to our expectations, the natural dynamic of the web will lead to less diversity in information sources rather than more?

And could it be that Wikipedia will end up being Google's most formidable competitor? After all, if Google simply points you to Wikipedia, why bother with the middleman?
Update: A week later, the NYT reports that Google started testing a "Wikipedia competitor" called Knol. Google VP Udi Manber "said the goal of Knol was to cover all topics, from science to medicine to history, and for the articles to become 'the first thing someone who searches for this topic for the first time will want to read.'"

Google shared storage and GDrive

The WSJ reports that "a service that would let users store on its computers essentially all of the files they might keep .... could be released [by Google] as early as a few months from now."

Many others have launched products that provide limited storage on the cloud, including AOL, Microsoft, and Yahoo, but Google appears to have an unusual take on it, backing up all of the users' files and providing search over them.

One way to view this might be as an extension of the "search across computers" feature of Google Desktop Search. That feature already allows "you to search your home computer from your work computer" by copying an index of the "files that you've been working with recently" and "your web history" to Google, but is limited to only a fraction of your desktop files.

From the sound of the WSJ article -- "a service that would let users store on its computers essentially all of [their] files" -- Google not only will let Google Desktop Search index all of your files, but also will copy the original files to the Google cloud.

Please see also my older posts ([1] [2]) on Google's GDrive.

Please see also Philipp Lenssen's post of screen shots from a leaked copy of a version of GDrive (codenamed Platypus) that is only available inside of Google. It apparently replicates and synchronizes all your files across multiple machines.

Friday, December 07, 2007

Google Reader feed recommendations

Googler Steve Goldberg announces that Google Reader has launched feed recommendations based on "what other feeds you subscribe to, as well as your Web History data."

My Google Reader recommendations are quite good. My top recommendations are Seattlest, Slog (from The Stranger, a Seattle free newspaper), Natural Language Processing Blog, Machine Learning etc, and Daniel Lemire's blog.

Pretty hard to complain about those. Nice work, Nitin Shantharam and Olga Stroilova.

It is worth noting that Bloglines has had recommendations for some time, but their recommendations suffer badly from a popularity bias. For example, my top Bloglines recommendations are Slashdot, Dilbert, NYT Technology, Gizmodo, and BBC World News.
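One common fix for popularity bias, sketched below, is to rank candidates by lift rather than raw co-subscription counts: how much more often a feed co-occurs with your subscriptions than its overall popularity alone would predict. The counts are invented, and I do not know what either Bloglines or Google Reader actually does.

    # Sketch of damping popularity bias with lift: the observed rate of
    # co-subscription divided by the rate expected if subscriptions were
    # independent. The counts are invented; not what Bloglines or Google
    # Reader actually does.
    def lift(co_subs, feed_subs, my_feed_subs, total_users):
        expected = (feed_subs / total_users) * (my_feed_subs / total_users)
        return (co_subs / total_users) / expected

    total = 1_000_000
    # A hugely popular feed co-occurs with everything: modest lift.
    print(lift(co_subs=900, feed_subs=100_000, my_feed_subs=1_000, total_users=total))
    # A niche blog, rare overall but common among my feed's readers: high lift.
    print(lift(co_subs=300, feed_subs=2_000, my_feed_subs=1_000, total_users=total))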

[Found via Sriram Krishnan]