Monday, September 10, 2007

Learning forgiving hashes

Googlers Shumeet Baluja and Michele Covell had a paper at IJCAI 2007, "Learning 'Forgiving' Hash Functions: Algorithms and Large Scale Tests" (PDF).

The paper describes finding similar songs using very short audio snippets from the songs, a potential step toward building a music recommendation system. The similarity algorithm used is described as a variant of locality-sensitive hashing.

First, an excerpt from the paper describing LSH:
The general idea is to partition the feature vectors into subvectors and to hash each point into separate hash tables... Neighbors can be [found by] ... each hash casting votes for the entries of its indexed bin, and retaining the candidates that receive some minimum number of votes.
What is interesting (and a bit odd) about this work is that they used neural networks to learn the hash functions for LSH:
Our goal is to create a hash function that also groups "similar" points in the same bin, where similar is defined by the task. We call this a forgiving hash function in that it forgives differences that are small.

We train a neural network to take as input an audio spectrogram and to output a bin location where similar audio spectrograms will be hashed.
A curious detail is that initializing the training data by picking output bins randomly worked poorly, so instead they gradually change the output bins over time, allowing them to drift together.
The primary difficulty in training arises in finding suitable target outputs for each network.

Every snippet of a song is labeled with the same target output. The target output for each song is assigned randomly.

The drawback of this randomized target assignment is that different songs that sound similar may have entirely different output representations (large Hamming distance between their target outputs). If we force the network to learn these artificial distinctions, we may hinder or entirely prevent the network from being able to correctly perform the mapping.

Instead of statically assigning the outputs, the target outputs shift throughout training ... We ... dynamically reassign the target outputs for each song to the [bin] ... that is closest to the network's response, aggregated over that song's snippets.

By letting the network adapt its outputs in this manner, the outputs across training examples can be effectively reordered to avoid forcing artificial distinctions .... Without reordering ... performance was barely above random.
I have not quite been able to make up my mind about whether this is clever or a hack. On the one hand, reinitializing the target outputs to eliminate biases introduced by random initialization seems like a clever idea, perhaps even one that might have broad applicability. On the other hand, it seems like their learning model has a problem in that it does not automatically learn to shift the outputs from their initial settings, and the reordering step seems like a hack to force it to do so.

In the end, I leave this paper confused. Is this a good approach? Or are there ways to solve this problem more directly?

Perhaps part of my confusion is my lack of understanding of what the authors are using for their similarity metric. It never appears to be explicitly stated. Is it that songs should be considered similar if the difference between their snippets is small? If so, is it clear that is what the NNet is learning?

Moreover, as much as I want to like the authors idea, the evaluation only compares their approach to LSH, not to other classifiers. If the goal is to minimize the differences between snippets in the same bin while also minimizing the number of bins used for snippets of any given song, are there better tools to use for that task?


Anonymous said...

Yahoo Research (and Goldsmiths University of London) beat 'em to this idea (LSH for musical similarity) by a year. The following paper was published at ISMIR 2006:

Greg Linden said...

Thanks, Jeremy. That's an excellent paper. I appreciate the reference.

I like the approach of that Yahoo Research work of summarizing the audio snippets (as described in Section 3) much better than the Google approach, which has the feel to me of blindly throwing a NNet at the problem.

However, it is hard to know how the results compare. The Google authors directly compare their approach to LSH near the end of their paper, providing what appears to be compelling evidence of its performance. Do you think the Yahoo Research work would perform better than the LSH variant the Googlers tested? If so, why?

Anonymous said...

I haven't had a chance to fully digest the Google approach, yet. So I don't have an intuition about which would perform better.

But there is a way of finding out: MIREX!


All it takes is for the authors to participate in this evaluation suite. Neither of them have. Yet.

Anonymous said...

D'oh, that last comment should be "jeremy", not "anonymous"