Saturday, August 30, 2008

KDD talk on the Future of Image Search

Jitendra Malik from UC Berkeley gave an enjoyable and often quite amusing invited talk at KDD 2008 on "The Future of Image Search" where he argued that "shape-based object recognition is the key."

The talk started with Jitendra saying image search does not work well. To back this claim, he showed screen shots of searches on Google Image Search and Flickr for [monkey] and highlighted the false positives.

Jitendra then claimed that neither better analysis of the text around images nor more tagging will solve this problem because these techniques are missing the semantic component of images, that is, the shapes, concepts, and categories represented by the objects in the image.

Arguing that we need to move "from pixels to perception", Jitendra pushed for "category recognition" for objects in images where "objects have parts" and form a "partonomy", a hierarchical grouping of object parts. He said current attempts at this have at most 100 parts in the partonomy, but it appears humans commonly use 2-3 orders of magnitude more, 30k+ parts, when doing object recognition.
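
To make the idea of a partonomy concrete, here is a minimal sketch of one as a tree data structure. The class and the toy part names below are my own invention for illustration, not a representation from the talk:

```python
# A minimal sketch of a partonomy as a tree of named parts.
# The structure and part names are invented for illustration;
# the talk did not describe a concrete representation.

class Part:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def count(self):
        """Total number of parts in this subtree, including this one."""
        return 1 + sum(child.count() for child in self.children)

# A toy partonomy for "monkey": the object decomposes into parts,
# and parts decompose further, forming a hierarchy.
monkey = Part("monkey", [
    Part("head", [Part("eye"), Part("eye"), Part("ear"), Part("muzzle")]),
    Part("torso"),
    Part("limb", [Part("hand", [Part("finger")])]),
    Part("tail"),
])

print(monkey.count())  # 11 in this toy; the claim is humans use ~30k
```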

A common theme running through the talk was having a baseline model of a part, then being able to deform the part to match most variations. For example, he showed some famous examples from D'Arcy Thompson's 1917 text "On Growth and Form", then talked about how to do these transformations to find the closest matching generic part for objects in an image.
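
As a rough illustration of what deforming a baseline part to fit might mean computationally, here is a hedged sketch that fits an affine transformation from a template's landmark points to observed points and scores the match by residual error. Malik's own shape matching work uses richer deformation models (thin plate spline warps in his shape context papers), so treat this as a simplified stand-in:

```python
import numpy as np

def affine_match_score(template_pts, observed_pts):
    """Fit an affine transform mapping template landmarks onto observed
    landmarks by least squares; return the mean residual error.
    A lower score means the observed shape is a closer deformation
    of the template."""
    n = len(template_pts)
    # Homogeneous design matrix [x, y, 1] for each template point.
    A = np.hstack([template_pts, np.ones((n, 1))])
    # Solve A @ T ~= observed_pts for the 3x2 affine parameters T.
    T, _, _, _ = np.linalg.lstsq(A, observed_pts, rcond=None)
    warped = A @ T
    return np.mean(np.linalg.norm(warped - observed_pts, axis=1))

# Toy example: a square template and a sheared copy of it.
template = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
sheared = np.array([[0, 0], [1, 0.2], [1, 1.2], [0, 1]], dtype=float)
print(affine_match_score(template, sheared))  # ~0: a pure affine shear
```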

Near the end of the talk, Jitendra contrasted the techniques used for face detection, which he called "a solved problem", with the techniques he thought would be necessary for general object and part recognition. He argued that the face detection approach, sliding variously sized windows across the image looking for face-like patterns, would be too computationally expensive and would produce too many false positives when scaled to 30k objects/parts. Jitendra said something new would have to be created to deal with large scale object recognition.
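
For a rough sense of why, here is a back-of-envelope sketch; the image size, stride, scale count, and per-window false positive rate are my own illustrative assumptions, not figures from the talk:

```python
# Back-of-envelope cost of sliding-window detection at 30k categories.
# All numbers here are my own illustrative assumptions, not the talk's.

width, height = 640, 480    # image size in pixels
stride = 4                  # window step in pixels
scales = 10                 # number of window sizes tried
categories = 30_000         # one detector per object/part

windows_per_scale = (width // stride) * (height // stride)
total_evaluations = windows_per_scale * scales * categories
print(f"{total_evaluations:,} classifier evaluations")  # 5,760,000,000

# Even a one-in-a-million false positive rate per window
# leaves thousands of false detections in a single image.
false_positives = total_evaluations * 1e-6
print(f"~{false_positives:,.0f} expected false positives")
```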

What that something new is remains unclear, but Jitendra seemed to me to be pushing for a biologically inspired system, one that quickly comes up with rough candidate interpretations of parts of the image, discounts interpretations that violate prior knowledge about which objects tend to occur near each other, then repeats at additional levels of detail until it reaches a steady state.
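
As one loose reading of that idea, here is a relaxation-labeling style sketch in which candidate labels for image regions are iteratively reweighted by co-occurrence priors. Everything in it, the labels, compatibilities, and update rule, is my own illustrative assumption:

```python
import numpy as np

# A loose sketch of relaxation labeling with co-occurrence priors.
# This is my reading of the idea, not Malik's method; the labels,
# compatibilities, and initial scores are all invented.

labels = ["monkey", "tree", "car"]
# compat[i][j]: prior compatibility of label i appearing near label j.
compat = np.array([
    [1.0, 0.9, 0.1],   # monkeys are often near trees, rarely near cars
    [0.9, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# Initial per-region label scores from a bottom-up detector
# for two adjacent image regions (each row sums to 1).
beliefs = np.array([
    [0.4, 0.3, 0.3],   # region A: weakly "monkey"
    [0.2, 0.6, 0.2],   # region B: probably "tree"
])

for _ in range(20):
    # Each region's support = compatibility-weighted beliefs of its
    # neighbor (here, the other region), then renormalize.
    support = beliefs[::-1] @ compat.T
    beliefs = beliefs * support
    beliefs /= beliefs.sum(axis=1, keepdims=True)

for region, b in zip("AB", beliefs):
    print(region, dict(zip(labels, b.round(3))))
# "monkey" and "tree" reinforce each other; "car" is discounted.
```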

Doing object recognition quickly, reliably, and effectively to find meaning in images remains a big, hairy, unsolved research problem, and probably will be for some time, but, if Jitendra is correct, it is the only way to make significant progress in image search.

6 comments:

rnc said...

Stuff that Numenta (http://www.numenta.com/) has been doing with hierarchical temporal memory seems to me the way to go for human-like vision/object recognition.

Anonymous said...

I'm totally pro-content-based methods.

In principle, tagging can solve many of these issues. In practice it never will. That's because the tagging effort required, with the intentionality or the granularity that you need, will never keep pace with the rate at which images are created.

Suppose for example that someone does tag a photo with the tag "fish", and you are looking for photos of "fish". Well, what if you also want to specify type of fish? Or color of fish? Or orientation of the fish within the image? Or the thinness or the fatness of the fish?

All of these things are possible to describe in tags. But I seriously doubt that most, if not asymptotically all, photos in the world will ever get that many tags applied to them.

I disagree with Jitendra that more tagging will not solve this problem. I think that more tagging *would* actually solve this problem. The issue, though, is whether you'll ever get enough effort to actually annotate every single image with all these tags.

Again, the answer is no, simply because all those attributes of the image are too far down the "long tail of effort", meaning that you'll never actually get enough people describing enough things about the image, for enough images.

What does Jitendra estimate, 30k attributes PER IMAGE? Let's suppose it is even 500, rather than 30,000. You'll never get that many tags for most images.

So the net outcome is the same... tags won't get you there.

The rate at which new photographs appear far exceeds the rate at which new taggers appear, or even the rate at which existing taggers continue to play tag labeling games.

You need content-based methods.

Greg Linden said...

Hi, Jeremy. Just to clarify, I meant that Jitendra said 30k parts in the visual model (across all possible images), not per image, but I don't think that detracts from your point.

Anonymous said...

I would agree with Ricardo save for one point: while Numenta appears to have a handle on the "partonomy" part of the equation (with their concept of "invariant representations"), the query model for an HTM is a big mismatch with current Web search query interfaces. To query an HTM, you need to give it some form of the thing you're looking for (in part, in whole but deformed, etc). This would seem to be at odds with the keyword-based search that is most popular today and would be a significant hurdle to introducing HTM-based image search for general purpose consumption on a site like Flickr or Google. I remain hopeful, however, as some of the results people are getting with Numenta's stuff are encouraging.

Anonymous said...

Yes, my mistake. The model has 30k parts, not every single image.

But yes, every single image still requires more than the couple of dozen tags that you get using the ESP game to label an image. Especially if you want any sort of granularity at all.

And you absolutely need this granularity. I just did a search for "fish" on Flickr. There are 2.29 million images with this tag. And this isn't even the whole web!

Anonymous said...

Greg, I'd love to hear your thoughts on the face-recognition feature Google released this week in their PicasaWeb app.

Works amazingly well for me, distinguishing between members of my family with similar facial features. Works regardless of lighting, hairstyle, hair color, hats, 3/4 profile, etc.