Thursday, December 07, 2006

YouTube cries out for item authority

When I worked at Amazon, a lot of effort went into recognizing that two items in the catalog were actually the same item. That was called item authority.

I was recently browsing around on YouTube and I noticed how badly the site deals with multiple copies of the same content. For example, on Weird Al's video, "White & Nerdy", look at the related videos.

The first four are all copies of the same video. They are not "related"; they are the same video.

Of the first ten videos in that list, only three are unique. The others are all duplicates.

This problem is not unique to YouTube. On Google Video, for "White & Nerdy", eight of the top ten "related" videos are identical copies of the Weird Al music video.

The point of showing me related content is to help me discover new and interesting content. Showing identical copies of the same video I just watched is not useful to me.

What is useful is helping me find interesting other videos. At a minimum, you could screen out duplicates and then show other Weird Al videos; that would be useful, if a bit obvious. Alternatively, you could show videos that interest people who liked "White & Nerdy", using other customers' actions to help me find interesting content.
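
As a rough illustration of those two ideas, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the filler-word list, and the idea of using a normalized title as the duplicate key are mine, not anything YouTube or Google Video actually does. Real item authority would need far stronger signals than titles.

```python
import re
from collections import Counter

# Hypothetical filler words that uploaders often tack onto a title.
FILLER = {"official", "video", "music", "hq", "full"}

def normalize_title(title):
    """Crude duplicate key: lowercase, '&' -> 'and', drop punctuation
    and filler words, ignore word order."""
    text = title.lower().replace("&", " and ")
    words = re.findall(r"[a-z0-9]+", text)
    return " ".join(sorted(w for w in words if w not in FILLER))

def dedupe_related(watched_title, related_titles):
    """Drop copies of the watched video and repeated copies of each other."""
    seen = {normalize_title(watched_title)}
    unique = []
    for title in related_titles:
        key = normalize_title(title)
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique

def also_liked(video, likes_by_user, top_n=5):
    """Rank other videos by how many users who liked `video` also liked them."""
    counts = Counter()
    for liked in likes_by_user.values():
        if video in liked:
            counts.update(v for v in liked if v != video)
    return [v for v, _ in counts.most_common(top_n)]

related = [
    "White and Nerdy - Weird Al",
    "Weird Al: White & Nerdy (Official Video)",
    "Weird Al - Amish Paradise",
    "WEIRD AL - AMISH PARADISE",
]
unique_related = dedupe_related("Weird Al - White & Nerdy", related)

likes = {
    "u1": {"white-nerdy", "amish-paradise", "fat"},
    "u2": {"white-nerdy", "amish-paradise"},
    "u3": {"fat", "something-else"},
}
suggestions = also_liked("white-nerdy", likes)
```

The title key only catches the easy cases, which is the point: even something this simple would remove the most obvious duplicates before the related list is shown.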

Crawling the world's information is not enough. You need to make that information useful. You must help people find relevant information, help people find the information they need.


leafar said...

For content, I'll even add: be able to find it in different languages (especially if the language is not a barrier, as with music).

Paul said...

It's funny that you write about this topic; I was thinking about the same thing with Findory. Looking at the default Findory front page (the page you see if you are not logged in), right now there are five stories related to Britney's underwear. Unlike YouTube, these are not the same story, but it still seems that the same item authority technique could be used to identify a single representative Britney story, leaving room for other, perhaps more interesting stories.

elias said...

It would also be nice if they figured out some way to deal with sequences. (part 1/n, part 2/n, ...)

RobotsThink said...

Redundancy is always an issue. But how do you overcome it for video? Even if you say we have metadata for the videos, that metadata depends on the user, who can fill it in at random. Can we do a frame-by-frame match? ;)

Jim Rait said...

Reminds me of the quote:
“The ‘surplus society’ has a surplus of similar companies, employing similar people, with similar educational backgrounds, coming up with similar ideas, producing similar things, with similar prices and similar quality.” Kjell Nordstrom and Jonas Ridderstrale, from their book Funky Business.
Perhaps we should write a tome "Wisdom of Herds?"

Greg Linden said...

Those in glass houses, eh, Paul? Heh, heh, that's a good point. Findory does try to do item authority, but makes a fair number of mistakes.

In general, if two articles are about the same topic but have no keywords in common, it is a hard (as in natural language understanding hard) problem to recognize that they are about the same topic. Even humans are likely to disagree sometimes about whether two articles with different perspectives, or written at different points in time about similar events, are the same.

That may be a bit of an excuse for Findory, which is a tiny company, but it is no excuse for the $1.6B YouTube or the $150B Google, especially given that there are a bunch of simple steps YouTube and Google Video could take that might solve 70-80% of the problem.

Ted said...

Re: RobotsThink's comment: Content-based deduping of videos and audio is especially a challenge, because what can be considered the same "essence" can be rendered in multiple formats, sampling rates, frame sizes, &c. In the audio world, this challenge has been addressed by careful design of audio fingerprinting methods.