Tuesday, July 10, 2007

Attacking recommender systems

A good (but very long) paper by Mobasher et al., "Towards trustworthy recommender systems: An analysis of attack models and algorithm robustness" (PDF), explores a variety of ways of spamming or otherwise manipulating recommendation systems.

Some excerpts:
An attack against a collaborative filtering recommender system consists of a set of attack profiles, each contain[ing] biased rating data associated with a fictitious user identity, and including a target item, the item that the attacker wishes the system to recommend more highly (a push attack), or wishes to prevent the system from recommending (a nuke attack).
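
To make the idea of an attack profile concrete, here is a minimal sketch in Python. It is not taken from the paper: the function name, the 1-5 rating scale, the noise level, and the assumption that the attacker knows each item's average rating are all illustrative choices of mine.

```python
import random

def make_push_profiles(item_means, target_item, n_profiles, filler_size,
                       r_max=5, seed=0):
    """Build fake user profiles that push `target_item` (hypothetical helper).

    Each profile rates a random set of filler items close to that item's
    average rating, so the fake user looks ordinary, and gives the target
    item the highest possible rating.
    """
    rng = random.Random(seed)
    fillers = [item for item in item_means if item != target_item]
    profiles = []
    for _ in range(n_profiles):
        profile = {}
        for item in rng.sample(fillers, filler_size):
            noisy = rng.gauss(item_means[item], 0.5)  # assumed noise level
            profile[item] = min(r_max, max(1, round(noisy)))
        # A nuke attack would assign the minimum rating here instead.
        profile[target_item] = r_max
        profiles.append(profile)
    return profiles

# Made-up average ratings for a tiny catalog.
item_means = {"A": 3.9, "B": 2.4, "C": 4.2, "D": 3.1, "target": 2.0}
attack = make_push_profiles(item_means, "target", n_profiles=3, filler_size=2)
```

Every profile in `attack` is a dict of item to rating that could be injected as if it came from a new user. The filler ratings exist only to make the fake user resemble real ones.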

Previous work had suggested that item-based collaborative filtering might provide significant robustness compared to the user-based algorithm, but, as this paper shows, the item-based algorithm also is still vulnerable in the face of some of the attacks we introduced.

The paper lists several types of attacks, suggests several ways to detect attacks, tests several attacks using the GroupLens movie data set, and concludes that "item-based proved far more robust overall" but that "a knowledge-based / collaborative hybrid recommendation algorithm .... [that] extends item-based similarity by combining it with content based similarity .... [seems] likely to provide defensive advantages for recommender systems."
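
As a rough illustration of why mixing in content-based similarity might help, here is a small Python sketch. It is my own simplification, not the paper's algorithm: the cosine and Jaccard measures, the weight `alpha`, and the example data are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def jaccard(a, b):
    """Overlap between two sets of content features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hybrid_similarity(ratings_i, ratings_j, features_i, features_j, alpha=0.5):
    """Weighted mix of rating-based and content-based item similarity."""
    collaborative = cosine(ratings_i, ratings_j)
    content = jaccard(features_i, features_j)
    return alpha * collaborative + (1 - alpha) * content

# Two unrelated items that fake users have rated identically.
ratings_x = {"fake1": 5, "fake2": 5}
ratings_y = {"fake1": 5, "fake2": 5}
features_x = {"comedy"}
features_y = {"horror"}

collab_only = cosine(ratings_x, ratings_y)
mixed = hybrid_similarity(ratings_x, ratings_y, features_x, features_y)
```

Here the fake ratings make the two items look nearly identical on ratings alone, but because the attacker cannot change what the items actually are, the content term roughly halves the combined score.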

It is worth noting that spam is a much worse problem with winner-take-all systems that show the most popular or most highly rated articles (like Digg). In those systems, spamming gets you seen by everyone.

In recommender systems, spamming only impacts the fraction of the users who are in the immediate neighborhood and see the spammy recommendations. The payoff is much reduced and so is the incentive to spam.
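
A toy example of that difference in reach, sketched in Python. The users, the ratings, the neighborhood size `k`, and the rule that only neighbors with some overlap count are all made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def neighbors(profile, others, k):
    """Names of the k most similar profiles, ignoring those with no overlap."""
    scored = [(cosine(profile, p), name) for name, p in others.items()]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

real_users = {
    "u1": {"scifi1": 5, "scifi2": 4},
    "u2": {"scifi1": 4, "scifi2": 5},
    "u3": {"rom1": 5, "rom2": 4},
    "u4": {"rom1": 4, "rom2": 5},
    "u5": {"doc1": 5, "doc2": 5},
    "u6": {"doc1": 4, "doc2": 5},
}
# One fake profile that mimics science fiction fans and pushes "target".
fake = {"scifi1": 5, "scifi2": 5, "target": 5}

affected = []
for name, profile in real_users.items():
    others = {n: p for n, p in real_users.items() if n != name}
    others["fake"] = fake
    if "fake" in neighbors(profile, others, k=2):
        affected.append(name)

reach = len(affected) / len(real_users)
```

In this toy data the fake profile lands in the neighborhoods of only `u1` and `u2`, so at most a third of the users could ever see the pushed item. A most-popular list gamed the same way would show it to all six.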

For more on that, please see my Jul 2006 post, "Combating web spam with personalization" and my Jan 2007 post, "SEO and personalized search".

The same authors published a similar but much shorter article, "Attacks and Remedies in Collaborative Recommendation", in the May/June IEEE Intelligent Systems, but there is no full text copy of that article easily available online.

Thanks, Gary Price, for pinging me about the IEEE Intelligent Systems May/June issue and pointing out that it has several articles on recommender systems.


Anonymous said...

To be more specific, the current spate of gaming techniques is afflicting those recommender systems whose recommendations are based primarily on tracking user interaction or what friends/community members are doing.

Systems that recommend primarily on similarity in the data itself (be it music data or blog text, for example) are not prone to these attacks, as the data is generally "mapped" out of reach of the users.

From there, the recommendations can of course be personalized by the user for their own benefit, which means that any "gaming" at this point likely only serves to game the user themselves. Not a particularly useful idea.

Anonymous said...

"a knowledge-based / collaborative hybrid recommendation algorithm .... [that] extends item-based similarity by combining it with content based similarity .... [seems] likely to provide defensive advantages for recommender systems."

Essentially, that was Google's main insight, wasn't it? One part content-based similarity (term frequencies, etc.), and one part collaborative filtering (linking recommendation, or PageRank). That's what gave the big boost, way back when.

Greg Linden said...

Interesting point, Jeremy. I hadn't thought of Google's relevance rank that way, but, yes, I suppose you could frame it as a recommender system that is using a combination of content data (term frequencies, etc.) and user data (links, clicks, etc.).

That example breaks down a bit because users all get the same search results (with the exception of the relatively new personalized search), but it is an interesting way to look at Google's relevance rank.

A closer example might be Findory's recommendation engine. It does use both content-based similarity and user behavior-based similarity for its recommendations.

Anonymous said...

No matter how you look at it, a mixture of implicit and explicit statements of interest and value is always preferable. The fact is that some enterprises and sites will be damaged in the arms race between the relevancy innovators and the denial-of-insight warriors. One just doesn't want to be the one harmed.

Anonymous said...

Another key factor in gaming recommender situations is the cost of faking data. Amazon is relatively immune to recommendation spam because its recommendations are based primarily on purchase data, not just because suggestions are largely per-product (or, as in Ian's comment, because the recommendations rely on properties of the data itself). Search engines, although their results are per-query, are more susceptible because much of the data used to rank pages is not that costly to spoof. (The difficulty there lies in generating realistic data.)