Monday, October 19, 2009

Using the content of music for search

I don't know much about analyzing music streams to find similar music, which is part of why I much enjoyed reading "Content-Based Music Information Retrieval" (PDF). It is a great survey of the techniques used, helpfully points to a few available tools, and gives several examples of interesting research projects and commercial applications.

Some extended excerpts:
At present, the most common method of accessing music is through textual metadata .... [such as] artist, album ... track title ... mood ... genre ... [and] style .... but are not able to easily provide their users with search capabilities for finding music they do not already know about, or do not know how to search for.

For example ... Shazam ... can identify a particular recording from a sample taken on a mobile phone in a dance club or crowded bar ... Nayio ... allows one to sing a query and attempts to identify the work .... [In] Musicream ... icons representing pieces flow one after another ... [and] by dragging a disc in the flow, the user can easily pick out other similar pieces .... MusicRainbow ... [determines] similarity between artists ... computed from the audio-based similarity between music pieces ... [and] the artists are then summarized with word labels extracted from web pages related to the artists .... SoundBite ... uses a structural segmentation [of music tracks] to generate representative thumbnails for [recommendations] and search.

An intuitive starting point for content-based music information retrieval is to use musical concepts such as melody or harmony to describe the content of music .... Surprisingly, it is not only difficult to extract melody from audio but also from symbolic representations such as MIDI files. The same is true of many other high-level music concepts such as rhythm, timbre, and harmony .... [Instead] low-level audio features and their aggregate representations [often] are used as the first stage ... to obtain a high-level representation of music.

Low-level audio features [include] frame-based segmentations (periodic sampling at 10ms - 1000ms intervals), beat-synchronous segmentations (features aligned to musical beat boundaries), and statistical measures that construct probability distributions out of features (bag of features models).
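
To make the frame-based and bag-of-features ideas concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function names, the choice of RMS energy and zero-crossing rate as features, and the synthetic 440 Hz test tone are all assumptions made for illustration.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into fixed-length, possibly overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Two simple low-level features: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return rms, crossings / (len(frame) - 1)

def summarize(features):
    """Bag-of-features style summary: per-feature mean and variance,
    ignoring the order in which the frames occurred."""
    n = len(features)
    dims = list(zip(*features))
    means = [sum(d) / n for d in dims]
    variances = [sum((x - m) ** 2 for x in d) / n for d, m in zip(dims, means)]
    return means, variances

# Hypothetical input: one second of a 440 Hz sine sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate)
          for t in range(sample_rate)]

# 100 ms frames with a 50 ms hop, inside the 10ms - 1000ms range quoted above.
frames = frame_signal(signal, frame_len=800, hop=400)
features = [frame_features(f) for f in frames]
means, variances = summarize(features)
```

A real system would use richer features (spectral ones, for instance), but the shape is the same: cut the signal into frames, compute a vector per frame, then aggregate.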

Estimation of the temporal structure of music, such as musical beat, tempo, rhythm, and meter ... [lets us] find musical pieces having similar tempo without using any metadata .... The basic approach ... is to detect onset times and use them as cues ... [and] maintain multiple hypotheses ... [in] ambiguous situations.
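
The onset-based approach might look roughly like the sketch below. This is an assumption-laden toy, not the paper's algorithm: the energy envelope is synthetic, the threshold is arbitrary, and `detect_onsets` and `tempo_hypotheses` are made-up names. It does show why multiple hypotheses matter, since the same onsets support a tempo and its integer fractions.

```python
from collections import Counter

def detect_onsets(envelope, threshold):
    """Return frame indices where energy jumps by more than `threshold`."""
    return [i for i in range(1, len(envelope))
            if envelope[i] - envelope[i - 1] > threshold]

def tempo_hypotheses(onsets, frame_rate, top=3):
    """Rank candidate tempi (in BPM) by how many onset pairs support them."""
    votes = Counter()
    for i, a in enumerate(onsets):
        for b in onsets[i + 1:i + 5]:      # compare with the next few onsets
            interval = (b - a) / frame_rate  # seconds between the two onsets
            bpm = round(60.0 / interval)
            if 40 <= bpm <= 240:             # keep only plausible tempi
                votes[bpm] += 1
    return votes.most_common(top)

# Hypothetical energy envelope at 100 frames per second (10 ms frames),
# with a sharp peak every 0.5 seconds, i.e. a 120 BPM pulse.
frame_rate = 100
envelope = [1.0 if i % 50 == 0 else 0.05 for i in range(1000)]

onsets = detect_onsets(envelope, threshold=0.5)
hypotheses = tempo_hypotheses(onsets, frame_rate)
```

Here the strongest hypothesis is 120 BPM, with 60 BPM close behind. On real audio the votes are much noisier, which is where keeping several candidates alive pays off.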

Melody forms the core of Western music and is a strong indicator for the identity of a musical piece ... Estimated melody ... [allows] retrieval based on similar singing voice timbres ... classification based on melodic similarities ... and query by humming .... Melody and bass lines are represented as a continuous temporal-trajectory representation of fundamental frequency (F0, perceived as pitch) or a series of musical notes .... [for] the most predominant harmonic structure ... within an intentionally limited frequency range.
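
For a feel of what estimating F0 within a limited frequency range means, here is a small autocorrelation sketch. To be clear, this is a swapped-in textbook technique that only works on a single clean voice; the predominant-F0 methods the paper surveys handle polyphonic mixtures and are far more involved. The function name, the frequency bounds, and the test tone are assumptions.

```python
import math

def estimate_f0(frame, sample_rate, fmin, fmax):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation,
    searching only the lags that correspond to the range [fmin, fmax]."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation: how well the frame matches itself
        # when shifted by `lag` samples.
        score = sum(frame[i] * frame[i + lag]
                    for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Hypothetical input: a 200 Hz tone plus a weaker second harmonic at 400 Hz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 400 * n / sample_rate)
         for n in range(1000)]

f0 = estimate_f0(frame, sample_rate, fmin=80, fmax=500)
```

Running this per frame gives the kind of continuous F0 trajectory the excerpt describes. The restricted lag range is what keeps the estimate from locking onto a harmonic or a multiple of the period outside the range of interest.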

Audio fingerprinting systems ... seek to identify specific recordings in new contexts ... to [for example] normalize large music content databases so that a plethora of versions of the same recording are not included in a user search and to relate user recommendation data to all versions of a source recording including radio edits, instrumental, remixes, and extended mix versions ... [Another example] is apocrypha ... [where] works are falsely attributed to an artist ... [possibly by an adversary after] some degree of signal transformation and distortion ... Audio shingling ... [of] features ... [for] sequences of 1 to 30 seconds duration ... [using] LSH [is often] employed in real-world systems.

The paper goes into much detail on these topics and also covers other areas such as chord and key recognition, chorus detection, aligning melody and lyrics (for karaoke), approximate string matching techniques for symbolic music data (such as matching noisy melody scores), and difficulties such as polyphonic music or scaling to massive music databases. There is also a nice pointer to publicly available tools for playing with these techniques if you are so inclined.
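
The audio shingling and LSH idea from the last excerpt can be sketched in a few lines. Everything specific here is an assumption: the features are random stand-ins rather than real audio features, random-hyperplane hashing is just one LSH family (the paper's systems may well use another), and the names and parameters are hypothetical.

```python
import random

def shingles(frames, width):
    """Concatenate each run of `width` consecutive feature frames into one vector."""
    return [[x for frame in frames[i:i + width] for x in frame]
            for i in range(len(frames) - width + 1)]

def make_hasher(dim, n_bits, rng):
    """Random-hyperplane LSH: one sign bit per random projection."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def signature(vec):
        return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                     for plane in planes)
    return signature

def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)

# Stand-in for real audio features: 200 frames of 4 values each.
track = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
width = 10
signature = make_hasher(dim=4 * width, n_bits=64, rng=rng)

# Index every shingle of the track by its signature.
index = {}
for pos, sh in enumerate(shingles(track, width)):
    index.setdefault(signature(sh), []).append(pos)

# A slightly distorted copy of one shingle, and an unrelated vector.
original = shingles(track, width)[50]
query = [x + rng.gauss(0, 0.01) for x in original]
unrelated = [rng.gauss(0, 1) for _ in range(4 * width)]

near = hamming(signature(query), signature(original))
far = hamming(signature(unrelated), signature(original))
```

The distorted copy lands within a few bits of the original while the unrelated vector differs in roughly half of them. A practical system would typically use several tables with shorter keys so that near matches collide in at least one bucket without a scan.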

By the way, for a look at an alternative to these kinds of automated analyses of music content, don't miss this last Sunday's New York Times Magazine section article, "The Song Decoders", describing Pandora's effort to manually add fine-grained mood, genre, and style categories to songs and artists and then use them for finding similar music.


Tom Butcher said...

I recently ran across the same paper and liked it as well. I hadn't seen the NYT piece though - thank you for posting that.

One of my colleagues used to work for a music startup that had a similar model to Pandora. Instead of requiring each piece of music to be manually tagged, though, they fed audio features extracted automatically from the signal, along with some training, into a machine learning algorithm.

I think it's interesting that Pandora completely discards collaborative filtering data. Sure, doing that solves the cold start problem, but I feel a combined approach gives the best recommendation results.

I have always wanted to crack the signal to extract features and do an online experiment to see how well it performs. Someone once suggested to me that a recommender based solely on audio features would simply recommend music to you that sounds like but is not as entertaining as what you already know.

Last, if your readers are interested in reading more about this field, allow me to direct them to ISMIR, the International Society for Music Information Retrieval. ISMIR 2009 begins next Monday.


jeremy said...

Content-based music retrieval is, imho, some of the most interesting research out there. It may not be very mass-market, but from a geek perspective, it is utterly fascinating. I worked on it as part of the original OMRAS project from 1998-2005. Wish I were still doing it! :-)

jeremy said...

See also the tutorial/overview from Nicola Orio, a long-time ISMIR cohort, published in 2006:

Norman said...

In the long term I see automatic content-based recommendation for music becoming much more prominent, probably informed by collaborative filtering.

But in the mean time the semantic information collected around it (tags above all, but not just) captures the essence of a track/artist with much less effort.

jeremy said...

"But in the mean time the semantic information collected around it (tags above all, but not just) captures the essence of a track/artist with much less effort."

Could you give some references, Norman? What do you mean by "essence"?

I ask because for a long time I used to teach social dance classes (tango, cha cha, west coast swing, etc.), and a lot of the songs that are good to dance the cha cha to are never actually labeled with the tag "cha cha" in any folksonomy that I've seen.

For example, here are the tags for "Golden Sands" by Paul Weller. It is a great song to "cha cha" to, but most people listening to Paul Weller don't cha cha. So he never actually gets the tag:

Are all the existing tags "true" or "correct"? Pretty much. Do they capture the essence of the song? IMO not at all.

Norman said...

@jeremy: probably "essence" was not the best term. :)
I was referring to the higher-level semantic value, beyond what can be extracted via audio analysis.

It is true that rhythm can be inferred quite reliably with an algorithm, but I have a hard time imagining the same approach for - say - "videogame music".

Tags are just one such source. Wikipedia or MusicBrainz would be able to help much more in your specific case.

Mike said...

@Tom: Can you remember the name of the music startup? I really would love to know since we're also using a machine learning algorithm instead of humans for our content-based recommendations.

Tom Butcher said...

Mike> the name of the startup I mentioned is Mongo Music. Microsoft purchased the company around 2000 or 2001.