Thursday, December 31, 2009

YouTube needs to entertain

Miguel Helft at the New York Times has a good article this morning, "YouTube's Quest to Suggest More", on how YouTube is trying "to give its users what they want, even when the users aren't quite sure what that is."

The article focuses on YouTube's "plans to rely more heavily on personalization and ties between users to refine recommendations" and "suggesting videos that users may want to watch based on what they have watched before, or on what others with similar tastes have enjoyed."

What is striking is how little of this has to do with search. As described in the article, what YouTube needs to do is entertain people who are bored but do not entirely know what they want. YouTube wants to move users from spending "15 minutes a day on the site" closer to the "five hours in front of the television." This is entertainment, not search. Passive discovery, playlists of content, deep classification hierarchies, well-maintained catalogs, and recommendations of what to watch next will play a part; keyword search likely will play a lesser role.
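To make the contrast concrete, here is a minimal sketch, in Python, of the "watched together" style of recommendation the article describes. All of the data and names are invented for illustration; this is not YouTube's actual algorithm, just simple item-to-item co-visitation counting.

    # Minimal sketch of item-to-item co-visitation recommendations.
    # Invented data; an illustration, not YouTube's actual system.
    from collections import defaultdict
    from itertools import combinations

    # Each user's watch history (video ids), made up for illustration.
    watch_histories = {
        "alice": ["v1", "v2", "v3"],
        "bob":   ["v2", "v3", "v4"],
        "carol": ["v1", "v3", "v4"],
    }

    # Count how often pairs of videos are watched by the same user.
    covisits = defaultdict(lambda: defaultdict(int))
    for history in watch_histories.values():
        for a, b in combinations(set(history), 2):
            covisits[a][b] += 1
            covisits[b][a] += 1

    def recommend(history, k=3):
        """Score candidates by how often they were co-watched with the user's history."""
        scores = defaultdict(int)
        for video in history:
            for other, count in covisits[video].items():
                if other not in history:
                    scores[other] += count
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(recommend(["v1", "v2"]))  # ['v3', 'v4']

Notice there is no query anywhere in that sketch; the user just watches, and the system suggests what to watch next.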

And it gets back to the question of how different a problem Google is taking on with YouTube. Google is about search, keyword advertising, and finding content other people own. YouTube is about entertainment, discovery, content advertising, and cataloging and managing content it controls. While Google certainly has the talent to succeed in new areas, it seems they are only now realizing how different YouTube is.

If you are interested in more on this, please see my Oct 2006 post, "YouTube is not Googly". Also, for a little on the technical challenges behind YouTube recommendations and managing a video catalog, please see my earlier posts "Video recommendations on YouTube" and "YouTube cries out for item authority".

Monday, December 28, 2009

Most popular posts of 2009

In case you missed them, here is a selection of the most popular posts on this blog from the last year.
  1. Jeff Dean keynote at WSDM 2009
    Describes Google's architecture and computational power
  2. Put that database in memory
    Claims in-memory databases should be used more often
  3. How Google crawls the deep web
    How Google probes and crawls otherwise hidden databases on the Web
  4. Advice from Google on large distributed systems
    Extends the first post above with more of an emphasis on how Google builds software
  5. Details on Yahoo's distributed database
    A look at another large scale distributed database
  6. Book review: Introduction to Information Retrieval
    A detailed review of Manning et al.'s fantastic new book. Please see also a recent review of Search User Interfaces.
  7. Google server and data center details
    Even more on Google's architecture, this one focused on data center cost optimization
  8. Starting Findory: The end
    A summary of and links to my posts describing what I learned at my startup, Findory, over its five years.
Overall, according to Google Analytics, the blog had 377,921 page views and 233,464 unique visitors in 2009. It has about 10k regular readers subscribed to its feed. I hope everyone is finding it useful!

Wednesday, December 16, 2009

Toward an external brain

I have a post up on blog@CACM, "The Rise of the External Brain", on how search over the Web is achieving what classical AI could not, an external brain that supplements our intelligence, knowledge, and memories.

Tuesday, December 08, 2009

Personalized search for all at Google

As has been widely reported, Google is now personalizing web search results for everyone who uses Google, whether logged in or not.

Danny Sullivan at Search Engine Land has particularly good coverage. An excerpt:
Beginning today, Google will now personalize the search results of anyone who uses its search engine, regardless of whether they've opted-in to a previously existing personalization feature.

The short story is this. By watching what you click on in search results, Google can learn that you favor particular sites. For example, if you often search and click on links from Amazon that appear in Google's results, over time, Google learns that you really like Amazon. In reaction, it gives Amazon a ranking boost. That means you start seeing more Amazon listings, perhaps for searches where Amazon wasn't showing up before.

Searchers will have the ability to opt-out completely, and there are various protections designed to safeguard privacy. However, being opt-out rather than opt-in will likely raise some concerns.

There now appears to be a big push at Google for individualized targeting in search, advertising, and news, with the company going full throttle on personalization, choosing it as the way forward to improve relevance and usefulness.

With only one generic relevance rank, Google has found it increasingly difficult to improve search quality because not everyone agrees on how relevant a particular page is to a particular search. At some point, to get further improvements, Google has to customize relevance to each person's definition of relevance. When you do that, you have personalized search.
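As a rough sketch of the kind of per-user boost described in the excerpt above, a re-ranker might blend a generic relevance score with the searcher's affinity for a site, learned from past clicks. The weights, signals, and data below are invented for illustration; this is not Google's actual formula.

    # Simplified sketch of personalized re-ranking by site affinity.
    # Invented weights and data for illustration; not Google's actual formula.
    from collections import Counter
    from urllib.parse import urlparse

    # Hypothetical click history: domains this user has clicked in past results.
    user_clicks = Counter({"amazon.com": 12, "wikipedia.org": 5, "example.com": 1})
    total_clicks = sum(user_clicks.values())

    def personalized_score(url, generic_score, boost_weight=0.2):
        """Blend a generic relevance score with the user's affinity for the site."""
        domain = urlparse(url).netloc
        affinity = user_clicks[domain] / total_clicks  # fraction of past clicks
        return generic_score + boost_weight * affinity

    results = [
        ("http://example.com/widgets", 0.80),
        ("http://amazon.com/widgets", 0.78),
    ]
    reranked = sorted(results, key=lambda r: personalized_score(*r), reverse=True)
    print([url for url, score in reranked])  # amazon.com moves up for this user

Even a small boost like this changes what different people see for the same query, which is exactly why a single generic relevance rank no longer fits everyone.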

For more on recent moves to personalize news and advertising at Google, please see my posts, "Google CEO on personalized news" and "Google AdWords now personalized".

Update: Two hours later, Danny Sullivan writes a second post, "Google's Personalized Results: The 'New Normal' That Deserves Extraordinary Attention", that also is well worth reading.

Thursday, December 03, 2009

Recrawling and keeping search results fresh

A paper by three Googlers, "Keeping a Search Engine Index Fresh: Risk and Optimality in Estimating Refresh Rates for Web Pages" (not available online), is one of several recent papers looking at "the cost of a page being stale versus the cost of [recrawling]."

The core idea here is that people care a lot about some changes to web pages and don't care about others, and search engines need to account for that to keep search results relevant.

Unfortunately, our Googlers punt on the really interesting problem here, determining the cost of a page being stale. They simply assume any page that is stale hurts relevance the same amount.

That clearly is not true. Not only do some pages appear more frequently than other pages in search results, but also some changes to pages matter more to people than others.

Getting at the cost of being stale is difficult, but a good start is "The Impact of Crawl Policy on Web Search Effectiveness" (PDF), recently presented at SIGIR 2009. It uses PageRank and in-degree as a rough estimate of which pages people will see and click on in search results, then explores the impact of recrawling the pages people want more frequently.
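As a sketch of the difference between those two approaches, consider recrawl priority as an estimated change rate weighted by some proxy for how likely the page is to be seen, such as normalized PageRank or in-degree. The numbers below are invented, and this is not either paper's exact model, only an illustration of why the uniform-cost assumption matters.

    # Sketch of recrawl prioritization: uniform staleness cost vs. visibility-weighted cost.
    # Change rates and visibility scores are invented; not either paper's exact model.

    pages = [
        # (url, estimated changes per day, visibility proxy such as normalized PageRank)
        ("http://news.example.com/front",  24.0, 0.90),
        ("http://example.com/blog",         1.0, 0.40),
        ("http://example.com/old-archive",  2.0, 0.01),
    ]

    def uniform_cost_priority(change_rate, visibility):
        # The uniform-cost assumption: every stale page hurts relevance equally.
        return change_rate

    def visibility_weighted_priority(change_rate, visibility):
        # Expected cost of staleness scales with how often the page is actually seen.
        return change_rate * visibility

    for name, policy in [("uniform", uniform_cost_priority),
                         ("weighted", visibility_weighted_priority)]:
        ranked = sorted(pages, key=lambda p: policy(p[1], p[2]), reverse=True)
        print(name, [url for url, _, _ in ranked])

Under the uniform policy, the rarely seen archive page outranks the blog simply because it changes more often; weighting by visibility spends the crawl budget on pages people actually see.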

But that still does not capture whether the change is something people care about. Is the change, for example, below the fold on the page and so less likely to be seen? Is the change correcting a typo or swapping out an advertisement? In general, what is the cost of showing stale information for this page?

"Resonance on the Web: Web Dynamics and Revisitation Patterns" (PDF), recently presented at CHI, starts to explore that question, looking at the relationship between web content change and how much people want to revisit the pages, as well as thinking about the question of what is an interesting content change.

As it turns out, news is an area where change matters and people revisit frequently, and there have been several attempts to treat real-time content such as news differently in search results. One recent example is "Click-Through Prediction for News Queries" (PDF), presented at SIGIR 2009, which describes one method of predicting when people will want to see news articles for a web search query.

But, rather than coming up with rules for when content from various federated sources should be shown, I wonder if we cannot find a simpler solution. All of these works strive toward the same goal: understanding when people care about change. Relevance depends on what we want, what we see, and what we notice. Search results need only appear fresh.

Recrawling high PageRank pages is a rough attempt at making results appear fresh, since high PageRank means a page is more likely to be shown and noticed at the top of search results, but it clearly is only an approximation. What we really want to know is: Who will see a change? If people see it, will they notice? If they notice, will they care?
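Those three questions suggest a simple way to frame the expected cost of serving a stale copy of a page: the probability the page has changed times the probability someone sees the result, notices the change, and cares about it. The sketch below is just that framing with invented numbers; none of these signals come from the papers above.

    # Sketch: expected cost of staleness as a product of hypothetical probabilities.
    # All numbers are invented stand-ins, not signals from the papers discussed above.

    def staleness_cost(p_changed, p_seen, p_noticed, p_cared):
        """Expected cost of serving a stale copy of this page right now."""
        return p_changed * p_seen * p_noticed * p_cared

    # A frequently changing news page shown near the top of many results.
    print(staleness_cost(p_changed=0.9, p_seen=0.6, p_noticed=0.8, p_cared=0.7))   # ~0.30

    # A page whose only recent change was an advertisement swap below the fold.
    print(staleness_cost(p_changed=0.9, p_seen=0.6, p_noticed=0.1, p_cared=0.05))  # ~0.003

The hard part, of course, is estimating those probabilities, which is where signals about what people see and do across the Web come in.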

Interestingly, people's actions tell us a lot about what they care about. Our wants and needs, where our attention lies, all live in our movements across the Web. If we listen carefully, these voices may speak.

For more on that, please see also my older posts, "Google toolbar data and the actual surfer model" and "Cheap eyetracking using mouse tracking".

Update: One month later, an experiment shows that new content on the Web can generally appear in Google search results within 13 seconds.