Monday, November 27, 2006

Item-to-item collaborative filtering

There appears to be a little confusion in some of the research literature on the earliest work on item-to-item collaborative filtering, a recommender algorithm that focus on items rather than users.

The earliest work of which I am aware is:
G. Linden J. Jacobi and E. Benson, Collaborative Recommendations Using Item-to-Item Similarity Mappings, US Patent 6,266,649 (to, Patent and Trademark Office, Washington, D.C., 2001
That patent was filed in 1998 and issued in 2001.

A later academic paper
Greg Linden, Brent Smith, Jeremy York, Recommendations: Item-to-Item Collaborative Filtering, IEEE Internet Computing, v.7 n.1, p. 76-80, January 2003
is a more friendly description of the work in the 1998 patent. It cites the patent as previous work.

Another paper that appears to be frequently cited is:
Badrul Sarwar, George Karypis, Joseph Konstan, John Reidl, Item-based collaborative filtering recommendation algorithms, Proceedings of the 10th international conference on World Wide Web, p.285-295, May 01-05, 2001, Hong Kong, Hong Kong
Some publications mistakenly have written that Sarwar et al. first "introduced" or "proposed" item-to-item collaborative filtering. Even The Economist got this wrong, later issuing a correction.

This confusion may be because the Sarwar et al. paper did not reference the patent. The 1998 patent preceded Sarwar et al. by three years and was public information well before the Sarwar et al. paper was published. The 1998 work probably should have been cited by Sarwar et al., certainly if any of the authors had reviewed it, but probably even if they had not.

I realize the situation is complicated. The 1998 patent usually is referenced with its 2001 issue date, as is the convention, making it less clear that it preceded Sarwar et al. by three years. The Sarwar et al., the first academic publication on the algorithm, does not reference the patent at all, further confusing the issue.

Nevertheless, please be careful about crediting the earliest work. Please note the earlier publications when writing about item-to-item collaborative filtering.


Anonymous said...

You are perfectly quoted on my table. But I will continue to evangelize ... findory is so unknown in France but I am doing the work for you around...

Anonymous said...

Wow. It's surprising that item-to-item collaborative filtering is such a recent discovery!


Anonymous said...

I'm not completely clear on what you mean by item-to-item collaborative filtering (CF). This is not my primary research area, so I beg your indulgence.

But if I understand correctly, traditional CF works by searching for similar users, and then recommending things that they also like, right? Similarity is defined by the ratings they give to the same items. I.e. "user similarity through items".

So is item-to-item CF just an inversion of this idea? Does it work by computing the similarity between items by measuring the number of users that have deemed them similar, in that the users have applied the same action or rating to both items?

If so, then how is this, generally speaking, different from market basket analysis? Grocery stores pore over user purchase data, computing the item-to-item similarity of any two products by examining the same-time purchase behavior of a large number of people. It is the collaboration of all those market baskets that let retailers know which items to place together on the shelves (i.e. "recommend" to their shoppers), such as putting the bean dip near the tortilla chips. Even though bean dip and tortilla chips are very dissimilar items, collaborative behavior shows one to be a good recommendation when purchasing the other.

So how is item-to-item CF different from market basket analysis? Maybe not in the mathemetical specifics, but in the general notion of determining the similarity of two items via their user usage patterns. The latter has certainly been around longer than 1998, so there is obviously something I'm not quite understanding here. What am I missing?

srowen said...

You're right that one incarnation of item-based CF is almost just user-based CF on the "transpose" of the user-item preference matrix (it's not perfectly symmetric since you're still recommending items to users, not users to items. :) )

But I think item-based CF opens up a different approach based on two observations. One, items are more fixed than users, so item-item similarities shouldn't change too much in theory (bean dip will probably always turn up as similar to salsa or chips for any group of shoppers.)

Second, in many cases you have a priori knowledge of how similar items are. For example, you might define an metric on movie items based on genre, director, star, etc. If you have this static similarity metric, instead of calculating similarity dynamically based on correlation (as in user-based CF), well, you potentially have a much faster, more scalable CF system. Those correlation calculations are the slow part.

(Plug: I work on an open-source CF system called Taste which implements all this stuff)

Anonymous said...

srowen: Actually, it is probably a little more symmetric than you say; you really are recommending items to items. Given one particular item, you want to recommend all the items that are most similar to that item. It is just that (1) the particular item really is attached to a user, because he just viewed it, purchased it, or whatever, and (2) you go about that similarity "through" the amalgamation ("collaboration") of mass quantities of user behavior such as purchases and/or ratings.

But that aside, regarding your second point about the static similarity metric, I'm totally with you on that. I worked for 6 years on a music similarity system. But rather than base it on which users were listening to which songs, I based it on actual acoustic/audio/musicological properties of the song itself. Chord progressions, tempo, beat structure, etc. In other words, I based it on the content of the item.

I think that is what you are saying with the movie example, by basing your static metric on genre, director, actors, etc.

However, at that point, I think you no longer have a collaborative filtering system. You have a content-based filtering system. To me, "collaborative" implies some sort of user basis or behavioral aggregation mechanism. CF was the original Web 2.0. But content-based similarity is the anti-Web 2.0, the anti-CF. It is similarity based not on how many "diggs" a song or a movie gets, but on actual, objective, musicological properties of the song itself.

I think the term "Web 3.0" is a bit of a joke, but if there ever was a candidate for Web 3.0, it would be content-based similarity, rather than collaborative judgement-based similarity. But Greg certainly was there, 8 years ago, doing ur-Web 2.0 things with Amazon recommendations.

I'm rambling now. Sorry.

Back to Greg, then: I really am curious about the difference between market-basket analysis and some of your 1998 insights. What was it about that work that was fundamentally different than MBA? Even with srowen's helpful explanations, I still don't think I get it.

Greg Linden said...

Hi, Jeremy. I hope you understand that I cannot get into a public discussion about prior work for an Amazon patent on my blog.

Very briefly, I think I can say that the differences with MBA include at least how the item rules are calculated -- not merely conditional probability, for one example -- and then are combined.

I apologize that I cannot go into more detail than that here. Anything more and I fear I may risk a horde of IP lawyers descending upon my head.

Anonymous said...

An interesting page I have found with some early work is - AAAI-98 Workshop on Recommender Systems is as early as I got.

Anonymous said...

heya Greg: No, I am of course not asking for any detail not already published in the patent. But once the patent is issued, you are free to talk about any aspect of the work already covered by the disclosure, are you not? I could actually go in and read the patent, in full right now, right? And I don't know how it is at Amazon, but at my company, the moment we submit a patent we can publish the work at an academic conference (like SIGIR). And that is a whole lot of public disclosure and discussion.

In any case, I have no wish for any lawyers to descend on your head, so of course I am not asking you for anything you are not comfortable with. It's just that, if you make the claim that yours was the "first" work on "item to item collaborative filtering", I'd like to know a little bit more how you exactly define these terms, i.e. the unique conceptual framework you've set up, for which you are claiming "firstness".

Because when I think about item-to-item filtering, in general, I think about Gerry Salton and SMART, the whole idea of comparing the similarity between two document "items" via their cosine similarity in a term-based vector space.

But that is of course not collaborative, i.e. no humans involved in the loop, save the authors of the documents, and that is so indirect as to not count.

And so when I think about the collaborative aspect, I think "oh, you could just replace term frequency vectors with purchase frequency vectors (or rating vectors or whatever), and you'd have a vector space instance of an item-to-item CF.

Or you could calculate item purchase co-occurrences rather than vector space cosine similarities, and you'd have another instance of an item-to-item CF: Market basket analysis.

In the little you do disclose above, you say that differences between your work and MBA "include at least how the item rules are calculated -- not merely conditional probability, for one example -- and then are combined."

So that says to me that the mathemtics are slightly different -- maybe you model joint probabilities rather than conditional probabilities? -- but the basic notion of connecting items through user usage patterns is not different. And that basic notion is, to my limited understanding, what defines and frames the whole notion of "item to item collaborative filtering".

I have no doubt you created a much better item-to-item collaborative filter than existing MBA techniques. I'm just taking issue with the claim of being the "first" to come up with the notion of item-to-item CF.

But let me admit, once again, that this could be due to my limited understanding of the domain. So that's what I am seeking clarification on. I of course do not want you to publicly disclose any proprietary Amazon mathematics. But at a general, conceptual level, what makes your work the "first" in item-to-item CF?

Let me put it another way: Why do you not consider MBA to be item-to-item CF? (You would have to consider it not to be item-to-item CF, if you are to claim being the first work in this area.) That might be the easier question to answer, and might not bring any lawyers down on your head.

That is all I am after, here. I'm not after any disclosure of secrets. I just want to know why, if I am going to cite something as the "first" or the "ur" publication in a certain area, why it really is the first.

Anonymous said...

By the way, let me say that I am very sympathetic to your desire to be correctly attributed. In this discussion, I am in no way saying you shouldn't be. I am just seeking to understand why you think you are indeed the first -- and MBA is not an earlier instance of an item-to-item CF system.

But whether or not MBA is an earlier instance of item-to-item CF, I still agree with you that, given the choice between your work and the Sarwar work, yours should probably be cited. What you say in that regard makes sense.

Anonymous said...

Greg, not even a comment on why MBA is not item-to-item collaborative filtering? That would clear up the entire matter (regarding "firstness") immediately, and would not have to involve any discussion of any work done at Amazon. I'm not asking for much.. you could even point me to a link or reference explaining why the two things are not the same.

Anonymous said...

Thanks for this great post which clarifies who invented what. Let us agree that you invented item-based collaborative filtering.

However, you seem to imply that we cite the issue date "out of convention". No we don't.

What if we both invent something, but I post it on my Web site first. Who will be credited? Me. What counts? The day I invent the thing? The day I share it with my boss? The day I send it to the patent office?

As far as academia is concerned, the day where the invention is published is what matters. Date of publication is what is important.