Tuesday, January 17, 2006

Recommender Systems Research at Yahoo

At the Beyond Personalization 2005 workshop early last year, several people from Yahoo presented a short paper, "Recommender Systems Research at Yahoo! Research Labs" (PDF).

The paper describes "some of the ongoing projects at Yahoo! Research Labs that involve recommender systems ... and solutions relevant to Yahoo's business."

The paper is too short to have much detail, but it does give some examples, including a content-based movie recommendation prototype that sounds vaguely similar to some stuff at IMDb.com and a more unusual project attempting to make music recommendations based on similarities in the raw audio streams. It also briefly discusses some more general questions, including the cold start problem and using content versus user behavior data.

The paper says that Yahoo plans to integrate recommendation technology into Yahoo Search and Overture, among other places. Personalized search and personalized advertising, coming to you soon from Yahoo.


Jeremy said...

Greg, you talk about a more unusual project attempting to make music recommendations based on similarities in the raw audio streams.

I would hardly call this unusual. Researchers have been doing this since at least 1997, if not earlier. Many of us organized the first music information retrieval conference back in 2000 (ISMIR), and it's been going strong, with hundreds of attendees, for the past 6 years.

Here is a list of papers through 2004, most dealing with content-based retrieval of music (rather than retrieval based on user recommendations or user tagging):


And here are the papers from the 2005 conference:


There is lots of interesting work out there that Yahoo and Google and others are only beginning to become aware of, even though it has been around for quite a while.

Greg Linden said...

Thanks, Jeremy, for the link to the conference papers. As you said, most of those papers are not about doing recommendations based on similarities in raw audio streams.

I don't think I claimed that Yahoo is the only one doing this. I did say that the technique is more unusual. In particular, I am not aware of any commercial systems using this technique. Do you know of any?

There's a difference between interesting work in a research lab and a deployed commercial system that works for millions of users. If you are suggesting that Yahoo and Google merely need to be aware of the existing research literature to have a viable product, I think you're underestimating the amount of work left to be done.

Jeremy said...


No, no, what I said was most of the papers are about doing recommendations based on similarities in raw audio streams. Most of the work is about automatically extracting semantic information from the raw audio, such as tempo/beat, rhythm and rhythmic structure, melody, harmony and harmonic structure, timbre (loosely: instrumentation), and so on, and using that information to develop robust content-based music similarity measures.

And similarity measures are at the heart of any content-based recommendation system.

I know of one company that runs a commercial system of this nature: Pandora.

Over the past 6 years they've paid dozens of trained musicians to go through hundreds of thousands of songs, manually labeling the musical attributes of each song, everything from tempo and rhythm to the amount of "breathiness" and "vibrato" in a singer's voice.

The recommendation system itself is quite simple, and scalable: nearest-neighbor in this semantic feature space.

They do not count the popularity of the song when calculating the similarity. They do not see how many of your "buddies" have it on their playlist when calculating the similarity. They base their recommendation purely on the musical content of the song.
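To make the idea concrete, here is a rough sketch of nearest-neighbor lookup in a feature space like the one described above. It is only an illustration, not Pandora's actual code: the song names, the four attributes, their values, and the function names are all made up, and Euclidean distance is just one plausible choice of similarity measure.

```python
import math

# Hypothetical hand-labeled attributes, each scaled to [0, 1]:
# (tempo, rhythmic complexity, breathiness, vibrato)
SONGS = {
    "song_a": (0.80, 0.60, 0.10, 0.20),
    "song_b": (0.75, 0.65, 0.15, 0.25),
    "song_c": (0.20, 0.30, 0.90, 0.70),
    "song_d": (0.25, 0.20, 0.85, 0.60),
}


def distance(u, v):
    """Euclidean distance between two attribute vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbors(seed, songs, k=2):
    """Return the k songs whose attributes are closest to the seed song.

    Only the musical attributes are compared; popularity and playlist
    counts play no part in the ranking.
    """
    seed_vec = songs[seed]
    ranked = sorted(
        (distance(seed_vec, vec), name)
        for name, vec in songs.items()
        if name != seed
    )
    return [name for _, name in ranked[:k]]


if __name__ == "__main__":
    print(nearest_neighbors("song_a", SONGS, k=1))
    print(nearest_neighbors("song_c", SONGS, k=2))
```

A linear scan like this is fine for a toy catalog. For hundreds of thousands of songs, one would presumably want some kind of index over the feature space, but the ranking principle stays the same.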

Yes, at this point, their features are extracted manually rather than automatically (the latter approach tends to be favored in academia). But the system is still a working system, which streams hundreds of thousands of songs in real time, to however many users they currently have.

Even better, Pandora is 100% DMCA compliant. No Google Library Scan issues here. And it's either ad-supported or $36/year subscription based. Either way, quite cheap to me, the user.

Another recently introduced system comes from Gracenote (CDDB). I've not yet had a chance to play around with their DSP (digital signal processing, i.e. "content") based recommendations. But Gracenote is definitely a company with millions of users.

I know of three other startups from the ISMIR community that are currently in stealth mode. So stealthy that I do not know their exact approaches. But by looking at what they've published over the past few years, one gets the idea that content-based similarity is central to the scheme. So, like I said, there is a lot of interesting work out there that the major search engines are only beginning to become aware of.

I didn't mean to imply that you'd said Yahoo was the only one working on this problem. I guess I just feel like the majors are moving reaaally slowly on this.