Geeking with Greg: 10/01/2006

Sunday, October 29, 2006

RSS readers are dead, long live RSS readers

Richard MacManus notes the shutdown of Pluck's feed reader and says:

Microsoft is integrating RSS into Outlook next year, Google will probably have Gmail integration soon, and Yahoo has MyYahoo and Yahoo Mail for feeds.

Consumer RSS Readers are rapidly becoming commodities and will soon be next to worthless ... Bloglines and Rojo both got out while the going was still good, via acquisitions.

Consumer RSS Readers are a dead market now.

This day has long been predicted by many including me. Back in Sept 2004, I said:

RSS is already integrated into Firefox and probably will soon be in Safari, IE, and Mozilla. My Yahoo already has a web-based feed reader in beta; Google and MSN may follow soon.

Will independent feed readers survive the entry of these giants?

But all is not lost. As I said in my earlier post, "RSS sucks and information overload", there is hope for those that do more than the current generation of feed readers:

The problem is that the current generation of feed readers merely reformat RSS for display. They don't do anything else, no prioritization, no filtering, no help dealing with the flood of information.

The problem [should be] scaling attention. Readers have limited time. They don't want information. They want knowledge. Our job is to help them, to help them focus, prioritize, and find what they need.

Next-generation feed readers should help people find knowledge. Cut through the undifferentiated glut of information and find focus. Cut through the noise and discover knowledge.

See also my previous posts, "Getting your grandmother to use RSS" and "A relevance rank for news and weblogs".

More on Netflix contest

Some interesting tidbits on the Netflix movie recommendations contest lately.

First, 27 days after the start of the contest, there are now 36 entries that beat the performance of Netflix Cinematch. The top entry already has a 5% improvement. It is still a long way from the 10% improvement required to win the grand prize of $1M, but the gap is closing.

Second, there is a fun post in the Netflix contest forums by Benji Smith about the "most hated", "most loved", and "most contentious" popular movies according to the Netflix data. [Found via kottke.org]

Third, Netflix's VP of Recommendation Systems Jim Bennett gave a Sept 2006 talk (PDF) about their Cinematch recommender system. The talk mentions that Cinematch uses an item-to-item algorithm -- the same type of algorithm used by Amazon.com's recommender system -- and includes some nice tidbits such as the characteristics of movies that they can accurately predict. At the end of the talk, Jim provides some justification for why Netflix is spending $1M on this contest, saying that higher quality recommendations are "absolutely critical to retaining users." [via Recommenders06]
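
For readers unfamiliar with the term, the general idea of an item-to-item algorithm can be sketched as follows. This is only a minimal illustration using cosine similarity between movie rating vectors; it is not Netflix's or Amazon's actual implementation, and the ratings data and function names are invented for the example.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating on a 1-5 scale}
ratings = {
    "ann": {"Alien": 5, "Aliens": 5, "Amelie": 1},
    "bob": {"Alien": 4, "Aliens": 5},
    "cat": {"Amelie": 5, "Chocolat": 4},
    "dan": {"Alien": 5, "Amelie": 2, "Chocolat": 1},
}

def item_vectors(ratings):
    """Invert user -> movie ratings into movie -> {user: rating}."""
    vectors = defaultdict(dict)
    for user, movies in ratings.items():
        for movie, score in movies.items():
            vectors[movie][user] = score
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def similar_items(target, ratings, top_n=3):
    """Rank other movies by how similarly the same users rated them."""
    vectors = item_vectors(ratings)
    target_vec = vectors.get(target, {})
    scores = [
        (cosine(target_vec, vec), other)
        for other, vec in vectors.items()
        if other != target
    ]
    ranked = sorted(scores, reverse=True)[:top_n]
    return [name for score, name in ranked if score > 0]
```

Calling similar_items("Alien", ratings) ranks "Aliens" first, since the same users rated both highly. The appeal of this family of algorithms is that the item-to-item similarities can be computed offline and then looked up quickly. Raw cosine similarity does not distinguish a low rating from a high one very well, so a real system would likely normalize ratings first.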

See also my original post on the Netflix contest, "Netflix offers $1M prize for improved recs".

Spamming YouTube

In his post, "Gaming YouTube for Fun and Profit", Pete Cashmore looks at how easy it is to get a video featured in YouTube's most popular lists:

YouTube is increasingly being gamed by users.

It's exceptionally easy to rank among the most viewed videos and channels by simply refreshing the page. This widespread gaming might also throw into question YouTube's claim of serving 100 million videos per day.

I grabbed a video called "Sheep" ... and re-uploaded it under the username "themusichall". I realized that I'd lost the audio in the process (converted it to the wrong format), but decided to leave it like that - nobody would voluntarily share a 7 second clip with no audio.

I then set the page to refresh itself over Sunday night and - sure enough - it was among the most viewed clips this morning. Admittedly, 10,000 or so views doesn't get you to the top - it's on the 3rd "Most Viewed" page and ranks 10th in the Comedy category ... [but] it's pretty obvious how you could attain the number one spot.

See also my previous post, "Digg struggles with spam".

See also my earlier post, "YouTube is not Googly".

[Pete Cashmore post found via Matt Marshall]

Update: Just two months later, Mark Cuban looks at the top videos for Dec 2006 on YouTube and says, "That's what Youtube has become. Fake Porn and Commercials ... How long before fake porn just takes over? It was 9 of the top 20 for the week as I write this." Dare Obasanjo adds, "I doubt Google spent $1.62 billion for Youtube just to watch it turn into a haven for fake porn."

Wednesday, October 25, 2006

Going to Web 2.0

I (and therefore Findory) will be at the Web 2.0 Conference in San Francisco November 7-9.

Web 2.0 looks like a great event. The list of speakers and attendees is impressive. It should be a lot of fun!

Tuesday, October 24, 2006

Google Custom Search and instant answers

As discussed by many others ([1] [2] [3] [4] [5] [6] [7]), Google announced a new feature rather boringly named "Google Custom Search Engine" that allows anyone to "create a search engine that reflects your knowledge and interests; looks and feels like your own; and ... make money from the traffic you receive."

The actual implementation is not quite as grand as that. As Greg Sterling said, it is basically "industrial-strength Rollyo" where people can limit to or favor search results from specific domains and bias the search results to favor matches to given keywords.

Even so, it is fun and let me easily test an idea I had been kicking around to bias search results to sites that provide high quality answers like Wikipedia. Here is an example inline:

Search for Answers

and here is the same custom search hosted on Google. Try a search like [San Francisco] on this custom search engine and compare to the same search on Google. Medical searches like a search for [arthritis] also seem a bit more useful to me.

This custom search takes the bias Google Web search results seem to have toward Wikipedia a step further. It also favors a few other sites that focus on providing answers. The results are pretty good for questions and requests for factual information, but it obviously is only a twiddle on top of what Google already does, and it does not help for all types of queries (e.g. navigational queries).
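
The kind of "twiddle" described here, favoring results from particular sites, can be sketched as a simple re-ranking step. This is an illustration only, not how Google Custom Search Engine is implemented or configured; the boost values, the second host name, and the result scores are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical boost factors for sites that focus on providing answers
FAVORED = {"en.wikipedia.org": 2.0, "answers.example.com": 1.5}

def rerank(results, favored=FAVORED):
    """Re-sort (url, base_score) pairs after boosting favored hosts.

    base_score stands in for whatever relevance score the underlying
    search engine assigned; unlisted hosts keep their score unchanged.
    """
    def boosted(item):
        url, score = item
        host = urlparse(url).netloc.lower()
        return score * favored.get(host, 1.0)

    return [url for url, _ in sorted(results, key=boosted, reverse=True)]
```

With results such as [("http://www.sfgate.com/", 0.9), ("http://en.wikipedia.org/wiki/San_Francisco", 0.6)], the Wikipedia page moves to the top because its boosted score (1.2) exceeds 0.9. The boost only reorders what the underlying engine already returned, which is why it cannot help with query types the base ranking handles poorly.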

By the way, it is not obvious to me that a bias toward Wikipedia or other sources is the right thing in the general case. The existing ranking algorithms already attempt to bias toward authoritative sources. At best, if you like these results better, it might mean that a few particular sources should be considered more authoritative than they already are.

In any case, it is fun to be able to play with this kind of thing so easily and quickly.

Ongoing confusion about Live brand

Mary Jo Foley describes the confusion about the Windows Live brand and how it relates to Windows, Microsoft, and MSN:

Microsoft is actually making some real headway in the way it is developing and distributing services, but almost no one knows it, thanks to the abysmal job the company has done in defining Live and updating the various Microsoft constituencies on its progress.

"Live" is Microsoft shorthand for services. Windows Live is not, as many still assume, a new, hosted version of Windows; it is the set of services extensions to Windows. The same is true of Office Live. (Confusingly, however, CRM Live is a hosted version of Microsoft CRM.)

There's ... been mass confusion ... about exactly what Live is and how Microsoft's Live strategy hangs together ... Analyst Matt Rosoff [said], "It seems like end-users who don't particularly follow Microsoft have never heard of Live or confuse it with the next version of Windows, and customers, partners, and advertisers often express puzzlement over the difference between Windows Live, Live (e.g., Live Search), and MSN."

Microsoft has used "Live" to mean several different things, Rosoff added. "We ... view Windows Live as ... essentially the latest chapter in the long story of MSN. But sometimes the Live brand is also used to describe broader concepts, such as software being delivered as a service or subscription-based models for buying software. I don't think the brand has been as misused as .Net was a few years back, but it's still fairly indistinct."

See also my previous posts, "Is it Windows Live, MSN, or Microsoft?", "Is it Live or MSN?", and "Office Live is not Office Live?".

Update: Three months later, after a disappointing quarterly report, Gartner analyst David Smith says, "Microsoft's Live branding has been tremendously confusing and has hurt the company, and it is very likely contributing to the situation they are in right now. They've created another brand and have not differentiated it." [Found on Findory]

Update: Three months later, Mary Jo Foley makes a Herculean effort to understand how the Live brand is used and writes, "After going out and searching for information on all the Live services I could find, I feel like I'm even more confused about what Windows Live is (and isn't) than I was before."

Saturday, October 21, 2006

Geeking with Weird Al

I thought Weird Al Yankovic would never do better at capturing the essence of geek than his song "It's All About the Pentiums" (YouTube video).

But, his new song, "White and Nerdy" (Google Video), also is inspired. One of my favorite parts: "There's no killer app I haven't run / At Pascal, well I'm number one / Do vector calculus just for fun / I ain't got a gat but I got a soldering gun." Yep. As Weird Al says, "How'd I get so white and nerdy?"

If you're a geek and you haven't seen both of these videos, take the time to check them out. Definitely good for a laugh.

Reddit, Digg, and personalized news

Emre Sokullu and Richard MacManus at Read/WriteWeb wrote an article, "Personalized News: A Market Overview", that covers several startups trying to recommend news articles and build personalized front pages for readers.

Some excerpts:

Our guess is that personalized content will become a more popular paradigm in about 1 to 2 years.

Personalized news has a couple of main attractions. Theoretically, if your news is personalized then it's not as vulnerable to gaming as [Digg's] power of masses approach. Plus people are getting busier everyday, so personalized news has a strong appeal as a potential solution for information overload.

The article mostly talks about Reddit. I found this a bit odd, since Reddit strikes me as closer to Digg than a personalized news site, but Reddit does have a "recommended articles" page off their front page.

As the article explains, Reddit appears to use a keyword-based approach like the Bayesian filter Paul Graham developed for spam filtering. I doubt that this simple content-based approach can be made to work well. And, unfortunately, as Emre said, "many [Reddit] users still complain about not receiving relevant news recommendations."

I went back to try Reddit's recommendations again -- I hadn't looked at it in a while -- and, after rating a few articles, it still put up a message that "no links have been recommended for you yet. keep telling reddit what you like and dislike by voting on links, and check back here later for recommended links." I never got to the point where I actually received recommendations. That is a problem. Recommender systems need to work from sparse data in real-time. They need to react immediately to new data.

After rating ten articles, mostly about Google (on GDrive, MapReduce, Sawzall, BigTable), on Reddit, I finally received one recommendation, "Fun with Javascript". After a few more minutes, I received a few more recommendations, one a political article about stem cells, one on messing with telemarketers, and one on the "Father and Son story of the century". So, yes, I agree, not relevant.
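
A keyword-based Bayesian approach of the kind attributed to Reddit above might look roughly like the following. This is a minimal naive Bayes sketch over title words with add-one smoothing, in the spirit of Bayesian spam filtering; it is an assumption for illustration, not Reddit's actual code, and it omits class priors for brevity.

```python
from collections import Counter
from math import log

def train(liked_titles, disliked_titles):
    """Count word occurrences in liked and disliked article titles."""
    liked = Counter(w for t in liked_titles for w in t.lower().split())
    disliked = Counter(w for t in disliked_titles for w in t.lower().split())
    return liked, disliked

def score(title, liked, disliked):
    """Log-odds that a title will be liked; positive means recommend."""
    # Vocabulary size for add-one smoothing (at least 1 to avoid dividing by zero)
    vocab = len(set(liked) | set(disliked)) or 1
    liked_total = sum(liked.values()) + vocab
    disliked_total = sum(disliked.values()) + vocab
    total = 0.0
    for word in title.lower().split():
        p_liked = (liked[word] + 1) / liked_total
        p_disliked = (disliked[word] + 1) / disliked_total
        total += log(p_liked / p_disliked)
    return total
```

After training on a couple of liked titles about Google and one disliked title about telemarketers, a new title containing "Google" scores positive. With only a handful of votes, though, most words in a new title have never been seen, so the score is driven by smoothing rather than by evidence.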

Emre suggests that all personalized news sites are like Reddit. Findory does not use Bayesian analysis over keywords. Instead, Findory uses a form of social filtering where Findory readers anonymously and implicitly share the articles they find and enjoy with other Findory readers.

Oversimplifying a little, Findory works a bit like Digg except that rather than seeing a front page of the generally most popular articles, you see a front page of the articles that are most popular for readers like you. As Emre said, different lists for different people reduces the incentive to game the system by eliminating the winner-takes-all effect.
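
The "most popular for readers like you" idea can be sketched in a few lines. This is a generic illustration of social filtering from implicit reading behavior, not Findory's actual algorithm; the click histories, the Jaccard similarity measure, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical click history: reader -> set of article ids they read
history = {
    "r1": {"gdrive", "mapreduce", "bigtable"},
    "r2": {"mapreduce", "bigtable", "sawzall"},
    "r3": {"stem-cells", "telemarketers"},
    "r4": {"bigtable", "sawzall", "gfs"},
}

def jaccard(a, b):
    """Overlap between two reading histories, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def personalized_front_page(reader, history, top_n=3):
    """Rank unread articles by popularity among similar readers."""
    mine = history[reader]
    votes = Counter()
    for other, theirs in history.items():
        if other == reader:
            continue
        weight = jaccard(mine, theirs)
        # Each similar reader "votes" for articles this reader has not seen
        for article in theirs - mine:
            votes[article] += weight
    return [a for a, w in votes.most_common(top_n) if w > 0]
```

For reader "r1", this returns ["sawzall", "gfs"]: the articles read by people whose histories overlap with r1's, while articles read only by the dissimilar reader "r3" are left out. Because each reader gets a different list, pushing one article to the top for everyone would require resembling every reader at once.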

In general, the power of the masses approach, epitomized by Digg, has two problems with relevance. First, a most popular list is generic and untargeted; it is only relevant as long as your interests match those of the entire community. Second, as power of masses sites reach a mainstream audience, the incentive to spam grows and relevance drops. Personalized news has neither of these issues.

Finally, it is worth noting that personalized news is not limited to crazy little startups. Both Google News and Microsoft's MSN NewsBot have a small widget on their front pages that recommends news stories based on the articles you read.

Findory Q3 2006 traffic

Findory's Q3 traffic numbers are in. The good news is, after a dip, there is a modest increase from the previous quarter. The bad news is that, after exponential growth for two years, Findory's traffic now appears to be flat.

The flat traffic almost certainly is due to lack of resources. Findory remains a self-funded, small startup. That provides a lot of flexibility, but limits the company's ability to address larger projects.

In particular, Findory has not had the resources to address big opportunities such as international expansion, licensing deals, personalized web search, improving our feed reader and customization options, or creating a behavioral advertising network using Findory's personalization.

Findory also has lacked capital for traditional marketing; our advertising budget is not perceptibly different from zero. This has limited Findory's ability to reach out to the mainstream audience it seeks.

It is a challenging question what to do next. Growth clearly requires additional funding for Findory. But, those additional resources create some opportunities and come with constraints that limit others. The company is at a choice point that requires some careful thought.

By the way, I have received some advice to stop posting these traffic numbers since they now may be viewed negatively. While I agree that is a concern, I think that it is best to be transparent with this kind of data. I will continue posting Findory's traffic stats every quarter.

See also my previous posts ([1] [2] [3] [4] [5] [6] [7]) about Findory's traffic over the last several quarters.

Eric Schmidt on personalized information

In Google's Q3 2006 earnings call, Google CEO Eric Schmidt emphasized personalization of information in Google's mission and future plans.

Some excerpts:

We believe that people's information and the information they want to receive ... needs to be accessible when and where they want it for them in a very personalized way.

The interesting thing is that this approach to having your information personalized is a benefit not only for the user who can continue to refine and target information ... but also for businesses who want to know they are spending their money in an effective and targeted way.

As we continue to innovate and bring out ... new products, we'll also continue to ... improve the experiences, bringing the most personalized and targeted information to people, which is ultimately our mission.

[We] provide access to the world's information ... [and] organize it in a very personalized and targeted way. That benefit drives the entire cycle of Google, and it's fundamental.

[Conference call transcript link found via Paul Kedrosky]

Friday, October 20, 2006

What should Google do next?

Jessica Mintz from the AP has a fun article asking a few entrepreneurs and analysts, "What should Google do next?"

An excerpt from the article:

Imagine you're Google right now: darling of Wall Street, $10 billion dollars in the bank, the brand-new owner of one of the hottest video-sharing sites on the map.

What would other high-tech entrepreneurs do now if they were running Google?

"There's so much work that needs to be done around search," [Technorati CEO Dave] Sifry said. [We need] "a Google of intention," driven by users' real-time needs.

Nick Denton, publisher of ... Gawker Media, wrote ... "I would get search working, because the results are cluttered with commercial rubbish that ought really to be in the advertising zone."

Henry Copeland, founder of the BlogAds network, wrote ... "Use the collective intelligence of its users to eradicate spam."

There were, of course, a few more outlandish ideas -- buying online auctioneer eBay Inc., for example -- as well as a few geeky ones involving free database design software.

But all in all, the consensus seems to be that Google needs to stay focused -- deepen the moat around the core search business, as one analyst said.

I am briefly quoted in the article arguing that Google should go after e-commerce. I do think they should do that, but my complete response about what Google should do, appended below, emphasized core search and advertising more than e-commerce:

I would not stray far from Google's area of strength. Google is good at one thing, using technology to help people find the information they need. This job is far from done; abundant opportunities remain.

Aside from expanding their core search and advertising products with question answering, personalization, machine translation, and query refinement, Google should aggressively expand into e-commerce and look to helping content producers make money.

On e-commerce, Google should be helping people find and discover products they want to buy. Froogle, an early Google effort at shopping metasearch, has gone stagnant. This is an opportunity lost, a market the size of eBay and Amazon combined, and one that may grow to the size of Wal-mart. Google should be the first stop for anyone wanting to buy anything online.

On helping content producers, newspapers and magazines see their audience moving online, but are frustrated by the low revenues earned from an online audience. The market opportunity here is the size of the entire media business. Google already has taken a step toward helping small content producers with AdSense, but revenues and relevance need large gains to be able to support the mainstream media. Google should be expanding and improving its advertising network. Google needs to use techniques such as personalization and geolocation to make advertising more relevant, useful, and lucrative for content producers.

It is possible that Google could speed these initiatives along with large acquisitions -- acquiring Amazon.com, for example, would give a big boost in e-commerce -- but they will not. Instead, Google will use their $10B in cash to hire great people, continue building out their server cluster, and do hundreds of very small acquisitions. For the most part, Google will build instead of buying.

I realize my prediction may not be as exciting as a big merger or completely new initiative, but the right thing to do is not always exciting. I think this is what is most likely to build value and what Google is most likely to do.

What do you think? Did I miss a big opportunity for Google? Am I overestimating the value of e-commerce and helping content producers? What do you think Google should do next?

Thursday, October 19, 2006

Google A/B tests for all

Google Website Optimizer appears to allow other websites to run A/B tests to optimize the effectiveness of their designs, features, algorithms, and content.

From their help page:

The Website Optimizer allows you to test changes in the website content of your pages in order to determine what will be most effective in getting conversions.

You choose what parts of a page you'd like to test -- headline, image, promo text -- and we'll run an experiment on a portion of your site traffic to determine which content on your site users respond to best.

When we've collected enough data, we'll provide you with reliable reports and a suggested course of action in order to optimize your site for maximum business results.

Much more detailed information in their technical overview.

It is no longer just Amazon and Google who have powerful A/B testing frameworks. Google appears to have just opened up their A/B testing tools to everyone.
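
As a rough idea of what deciding an A/B test involves once "enough data" is collected, here is a sketch of a two-proportion z-test, a standard way to compare two conversion rates. It is a generic statistical example, not a description of how Website Optimizer produces its reports; the function name and the traffic numbers are hypothetical.

```python
from math import erf, sqrt

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test.

    conv_a, conv_b: conversions seen for variants A and B
    n_a, n_b: visitors shown variants A and B
    Returns (z, p_value); a small p_value suggests a real difference.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Standard normal CDF computed from the error function
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)
```

For example, ab_test(100, 1000, 150, 1000) compares a 10% conversion rate against 15% on a thousand visitors each and yields a z-score above 3, which is strong evidence that variant B converts better. With much smaller samples the same rates would not be distinguishable from noise, which is why a testing tool has to wait before suggesting a course of action.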

[Found via Liam Morrison and Philipp Lenssen]

AdSense will not do behavioral targeting?

In a recent interview, Kim Malone (Director in the AdSense group at Google) said, "Behavioral targeting is not something that Google [AdSense] will do."

Kim could not be more clear in this outright, flat denial, but I simply do not believe it. There is too much to gain. Ads targeted to individual behavior are more relevant, useful, and interesting to users and convert better for advertisers.

If Kim is telling the truth here -- if Google really will not get into behavioral targeting -- I would say this represents a beautiful opportunity for Yahoo, Amazon, and Microsoft to trounce AdSense. Use a little more data, pay a little closer attention to what people want and need, and your ads will be more relevant and less annoying than Google's contextual ads.

While they have been painfully slow to execute, Google's competitors are moving on behavioral targeting.

Microsoft adCenter lets advertisers target ads "to the customers most interested in your products or services." Microsoft adLab, the research arm of adCenter, says their goal is to "change online advertising dramatically in areas such as paid search, behavioral targeting and contextual advertising", claims they have "years of profiling, user behavior, and data mining", and asserts that they "know a heck of a lot more about their audience than Google knows about its own."

Yahoo has been working with Revenue Science "to show text listings on Web pages based on user behavior." Yahoo also recently launched their Panama ad system which, according to Danny Sullivan, will include "audience targeting based on factors that could include demographic information or online behavior" in "future versions."

Amazon Omakase targets ads "based on what the Associate has been successful with in the past; what that user has been interested in; and what the site is about." In my opinion, other than Findory, this is the most dramatic example of fine-grained behavioral targeting out there right now. The Amazon Omakase ads I see on other websites are eerily relevant, targeted to my purchase and clickstream history at Amazon.

It is unbelievable to me that Google is not pursuing behavioral targeting. But, if Google is not pursuing behavioral targeting, it is clear that its competitors cheerfully will.

Friday, October 13, 2006

Talk on Google AdWords and AdSense

Shiva Shivakumar (Director, Google Kirkland) who "led AdSense through beta, launch and hypergrowth" gave a talk at University of Washington Computer Science called "Google Ad Systems" (video available).

The talk is a light introduction to AdWords and AdSense. It mostly covers the history of the products with some brief discussions of the challenges around relevance, scale, optimal auction pricing, and click fraud.

The talk is worthwhile, a good overview of the work involved in building a system like AdSense and AdWords.

I tend to follow this stuff fairly closely, so the only thing that was new to me in the talk was hearing that AdWords is still running on top of a massive MySQL deployment. AdSense, in contrast, is running on top of the Google data infrastructure, GFS and Bigtable. Shiva was not clear about whether the difference was merely due to legacy issues or whether there is something special about the AdWords data access patterns that makes MySQL preferable.

Shiva's talk unfortunately only touches briefly on auction theory and click fraud issues. If you are interested in more details there, you might dive into Jan Pedersen's talk and a related paper I discussed in an earlier post.

Wednesday, October 11, 2006

Yahoo's troubles

Saul Hansell at the NYT tells us that "At Yahoo, All Is Not Well". Some selected excerpts:

Yahoo would seem to have a strong hand. It is the world's most popular Web site, with more than 400 million monthly users ... But in recent months the company has suffered some embarrassing setbacks.

From video programming to social networking -- areas of interest to users and advertisers alike -- the company is losing its initiative. And each time a product fails in the market or is late, Yahoo loses some ability to do more deals and hire more talented employees.

Yahoo has been stymied because its text advertising business has been largely frozen until it completes a new software system. The upgrade is more than a year late ... Yahoo's [old] system produces much less money from every page than Google ... Google has $11 billion in cash and a market value of $131 billion, while Yahoo has $4 billion in cash and is worth $34 billion.

Current and former Yahoo employees say the company has been bogged down by bureaucracy and internal squabbling .... Companies that try to do deals with Yahoo also say they find it to be slow, demanding and inconsistent in negotiations.

Yahoo's faltering image and plunging stock price may also be hurting its ability to recruit talented people ... Yahoo's existing employees are grumbling that with the stock price so low, many of their options have become worthless. Some Yahoo veterans have bolted for trendier start-ups.

Of all the problems at Yahoo, I think the lengthy delays in competing against Google AdWords and AdSense are the worst.

The business is advertising. It is not tagging, sharing, chatting, or socializing. Ads drive everything. To fail to compete on advertising is to fail.

For more on what Microsoft and Yahoo should do to compete with Google, see my previous post, "Kill Google, Vol. 3".

For more on the failure of Microsoft and Yahoo to compete with Google, see my previous posts, "Yahoo and MSN cannot compete?" and "Yahoo gives up?"

Update: John Battelle says, "The main issue dogging [Yahoo]: its lack of a monetization engine as efficient as Google's ... If Yahoo is going to compete against Google ... it has to get search monetization up to snuff."

Monday, October 09, 2006

Progress on Netflix recommendations contest

Seven days after the start of the Netflix contest, the first entry appears on the leaderboard that beats the performance of Netflix's recommender engine Cinematch.

See also my earlier post, "Netflix offers $1M prize for improved recs".

Update: John Chandler-Pepelnjak notes in the comments to this post that a new entry for a team called "The Thought Gang" just qualified for the "Progress Prize" of $50k. Excellent.

Update: Two weeks after the start of the Netflix contest, there are now six entries that beat Netflix Cinematch. The top entry, "NIPS Reject", has a nearly 2% improvement. Impressive.

Update: Things are heating up. Eighteen days after the start of the contest, there are now thirteen entries that beat Netflix and nine that qualify for the Progress Prize. The top entry has a nearly 5% improvement, halfway to the Grand Prize of $1M.

Friday, October 06, 2006

YouTube is not Googly

Kevin Delaney at the WSJ says that "Google Inc. is in talks to acquire popular video-sharing site YouTube Inc. for roughly $1.6 billion."

This is a horrible idea.

YouTube is a collection of uploaded content. They have no interesting technology. All Google would be buying is YouTube's existing content and user base.

Google has never been about owning content and users before. Google makes it easier to find other people's content. Google's core strength is in helping people find and discover information, not in controlling information and people.

This merger would be classic deworsification. If it happens, GooTube will be exactly what Microsoft and Yahoo have been waiting for, a lovely little distraction for Google.

See also my previous post, "The problem with YouTube".

Update: On this note, Danny Sullivan points out an LA Times article that said, "Google admitted this year that its internal audits discovered that the company had been spending too much time on new services to the detriment of its core search engine."

Update: Mark Cuban calls Google buying YouTube "crazy" and "moronic" in his post, "Some thoughts on YouTube and Google". [via Don Dodge]

Update: It's official, Google bought YouTube. All hail GooTube.

Update: Om Malik says Google is a loser in this because he thinks "this is Compaq-DEC, Skype-eBay kind of a deal for them in the long run." John Battelle notes that "this marks Google's first significant out of brand acquisition, the company's first true brand-management challenge."

Update: When we look back at this merger in a few years, I suspect it will be seen as when Google jumped the shark. It is the day Google loses its focus on technology and begins a stumbling effort at trying to become a media company.

Update: Om Malik adds, "It is the distraction factor ... The copyright issues and all those other problems are going to strain google where it is weakest - management and control."

Update: Microsoft CEO Steve Ballmer says, "Right now, there's no business model for YouTube that would justify $1.6 billion. And what about the rights holders? At the end of the day, a lot of the content that's up there is owned by somebody else. The truth is what Google is doing now is transferring the wealth out of the hands of rights holders into Google." [via Todd Bishop]

Update: Ephraim Schwartz at InfoWorld says of the Google-YouTube deal:

YouTube gets something like 100 million page views per day. Does it matter that 99 percent of them are a waste of time? That these homemade videos have no redeeming quality? Not in the slightest. To whom should it matter?

It should matter to Google. What is Google's mission again?

Google's mission is to organize the world's information and make it universally accessible and useful.

If 99% of YouTube page views are a waste of time with no redeeming quality, the YouTube deal is a violation of Google's mission. YouTube content is not useful.

Update: No matter what you think of the YouTube deal, you have to admit that this part of it just reeks:

YouTube's deals with Universal and Sony BMG came hours before it announced its deal with Google ... the music companies rushed to complete the deal ahead of the YouTube deal, in part so that it could benefit in the jump in YouTube's value.

Update: "YouTube purges 30,000 copyright files", the first of many of these announcements we will see, I am sure.

Update: Mark Cuban posts some rumors about the details of the Google-YouTube deal that, if true, make Google's biz dev folks look pretty darn evil. Mike at TechDirt sums it up: "The music labels effectively taking a bribe to cause trouble for Google/YouTube video competitors, ignoring YouTube to let it grow for a while, and pocketing all of the money without giving it back to the artists they supposedly represent ... This whole thing reads pretty sleazy."

Update: The FT reports that "Google is engaged in a frantic round of negotiations ... [to] ward off a potentially crippling round of lawsuits ... offering tens of millions of dollars in upfront payments for the right to broadcast their video content legally on YouTube .... [and even] offering one $100m to license its content over a two-year period." [via Paul Kedrosky]

Update: Google closed the acquisition of YouTube today (Nov 13). Let the lawsuits begin.

Update: Four months later, SeekingAlpha reports, "Despite months of negotiations, Google has been unable to secure a deal to post content from any major media company on YouTube."

Update: Liz Gannes at GigaOm reports that "Google is having quite a bit of trouble ... figuring out how to monetize YouTube and make it legit."

Update: Five months later, Viacom sues GooTube for (insert Austin Powers voice here) one billion dollars. While much of the commentary I have seen seems to be favorable toward Google, I am more inclined toward the opinions of Don Dodge and Mark Cuban who argue that Viacom cannot lose by fighting hard against Google.

Update: About two years later, not only is YouTube failing to make money ("YouTube is still finding it hard to make money ... revenues at around $90 million for 2008" [1] and expenses are high), but also the lawsuits are creating nasty problems for Google as they creep through the courts (e.g. "Google Told to Turn Over User Data of YouTube" [2]).

All of this comes on top of new revelations from Chad Hurley that YouTube needed a Google bail-out at the time of the acquisition ("[Another VC round] would have been hard, we would have been even more threatening to [others] ... and it would have been hard for us to operate in an efficient way. So we decided Google was going to be our answer ... I don't really think [continuing growth] would have been possible without the help of Google." [3]).

Update: Four years later, Nick Bilton at the New York Times writes, "Although Google's purchase of YouTube hasn't paid off financially, it has clearly made Google a giant in the world of online video, displaying more than 13 billion videos during the month of April." So, exactly as predicted, the purchase hasn't paid off financially, but has made Google a content provider.

Update: Nearly five years later, YouTube is still only "close to being profitable", which is to say not yet profitable. I doubt that is what Google had in mind when it acquired YouTube.

Wednesday, October 04, 2006

The advantages of big data and big clusters

Ionut Alex Chitu points to a UC Berkeley talk, "Theorizing from data: Avoiding the capital mistake" by Googler and AI guru Peter Norvig.

I particularly enjoyed Peter's thoughts on the advantages of big data and big clusters. Near the beginning of the talk, Peter said:

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm ... is performing better than the best algorithm on less training data.

Worry about the data first before you worry about the algorithm.

Later, near the end of the talk, Peter extended this point:

Is it just that Google has more data and more machines? But it couldn't be just the more data because [in some competitions] ... everyone got the same data.

So, I think that having more machines is a very important part because it allows us to turn around the experiments much faster than the other guys. So, it's not the online performance where you are actually doing the search that matters, but it's the -- gee, I have an idea, I think we should change this -- and we can get the answer in two hours which I think is a big advantage over someone else who takes two days. And, I think it also helps that we took an engineering approach of -- well, we'll try anything.

Amazon was similar in many ways. While Amazon did not have Google's mighty parallel processing tools and massive cluster, Amazon did have big data (transactional and log data), which it used extensively both for website-visible features like personalization and search query refinements and for backend work like supply chain optimization. In addition, Amazon was very early, if not the first, to run website A/B tests, a framework for rapidly testing new algorithms and designs live on Amazon.com, which encouraged behavior like Google's "try anything" engineering approach.
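For readers who have not seen an A/B test up close, here is a minimal sketch of the two pieces such a framework needs: a stable way to split visitors into groups, and a way to compare the groups afterward. This is not Amazon's or Google's actual system. The function names, the hashing scheme, and the choice of a two-proportion z-test are all assumptions made for illustration.

```python
import hashlib
import math

def assign_bucket(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the (hypothetical) experiment name together with the user id
    keeps each user in the same group on every visit.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    fraction = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if fraction < treatment_share else "control"

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates.

    Group A is the control, group B is the treatment. A positive z means
    the treatment converted at a higher rate.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err

# Made-up numbers: 100 of 1000 control visitors converted, 130 of 1000 treated.
z = two_proportion_z(100, 1000, 130, 1000)
print(assign_bucket("user-42", "new-ranker"), round(z, 2))
```

The faster a test like this can be set up and read out, the cheaper it becomes to "try anything", which is the point Peter makes about turnaround time.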

I find Peter's words particularly interesting when thinking about the Netflix contest. Netflix may be demonstrating how to do a Google-like experimental effort if you do not have Google-scale resources. The Netflix contest uses other people's machine resources and the power of many minds "trying anything" to attempt to find improvements to the Netflix recommender system.

See also notes by Brian Mingus on what appears to be the same talk a few weeks later at U of Colorado at Boulder.

[Ionut post found via Philipp Lenssen]

Monday, October 02, 2006

Google Reader redesigns

Google Reader, a feed reader similar to Ask.com's Bloglines, recently redesigned and is getting rave reviews ([1] [2] [3]).

If you try Google Reader and are in the mood to experiment, please also give Findory's feed reader a try. Unlike other feed readers, it constantly recommends other interesting articles and feeds. It uses what other Findory readers found to help you discover things you might otherwise miss.

See also my previous post, "RSS sucks and information overload", where I said, "The problem is that the current generation of feed readers merely reformat RSS for display."

See also my earlier post with details on Findory's feed reader.

A9 redesigns, simplifies

A9.com, Amazon's web search startup, has done a major redesign that completely changes the site.

It now appears to be a metasearch engine, like Dogpile or Metacrawler, that gives a lot of control over which search engines are used.

I actually like the site better than before -- it seems cleaner and more usable to me -- but the functionality seems minimal, merely showing search results side-by-side from many search engines. Despite the spiffy AJAX UI on top, this is the kind of thing that has been around for a decade.

I would love to see A9 go a big step further and automatically decide which of thousands of search engines to query based on the information need of the searcher, then combine and rerank the results. That is a hard problem, but a very interesting one.
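To make the "combine and rerank" step concrete, here is a minimal sketch using reciprocal rank fusion, a simple and well-known way to merge ranked lists from several engines. It is only one possible approach and not anything A9 is known to use. The function name, the example URLs, and the constant k=60 are assumptions for illustration, and the harder part of the problem, choosing which engines to query in the first place, is not addressed here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked result lists into one ranking.

    Each result earns 1 / (k + rank) from every list it appears in, so
    results ranked highly by several engines rise to the top. Ties are
    broken alphabetically to keep the output deterministic.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical top-3 results from two engines for the same query.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example", "a.example"]
merged = reciprocal_rank_fusion([engine_a, engine_b])
print(merged)
```

A side-by-side display leaves this merging work to the searcher. Even a simple fusion like this one gives a single list to read instead of several.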

Danny Sullivan has some scathing comments about A9's redesign. Danny says, "Frankly, A9's always felt like some type of Amazon plaything, a way for Amazon to say they were in search but also pretend it was all just an experiment, if it failed to succeed. I think the failure is now apparent, and Amazon seems to be cutting its losses pretty dramatically."

Ouch, but there is a lot of truth in Danny's words. A9 spent millions going nowhere instead of attacking the interesting and hard problems in personalized search, federated search, and personalized advertising.

See also my previous post, "What will become of A9?"

Update: The punches keep coming. Paul Kedrosky says Amazon has been "wasting their time in nowhere search efforts." Joe at TechDirt says, "[Amazon] realizes it doesn't have much to bring to search." Ouch.

Update: Six months later, it appears this redesign did nothing to prevent a sharp drop in traffic to A9.com.

Netflix offers $1M prize for improved recs

This is interesting. Netflix is offering $1M in a contest to anyone who can improve the predictive accuracy of their recommendation engine by 10%.

From their Rules page:

We're quite curious, really. To the tune of one million dollars.

We've developed our world-class movie recommendation system: Cinematch. Its job is to predict whether someone will enjoy a movie based on how much they liked or disliked other movies. We use those predictions to make personal movie recommendations based on each customer's unique tastes. And while Cinematch is doing pretty well, it can always be made better.

Now there are a lot of interesting alternative approaches to how Cinematch works that we haven't tried. Some are described in the literature, some aren't. We're curious whether any of these can beat Cinematch by making better predictions. Because, frankly, if there is a much better approach it could make a big difference to our customers and our business.

So, we thought we'd make a contest out of finding the answer. It's "easy" really. We provide you with a lot of anonymous rating data, and a prediction accuracy bar that is 10% better than what Cinematch can do on the same training data set. (Accuracy is a measurement of how closely predicted ratings of movies match subsequent actual ratings.)

If you develop a system that we judge most beats that bar on the qualifying test set we provide, you get serious money and the bragging rights. But (and you knew there would be a catch, right?) only if you share your method with us and describe to the world how you did it and why it works.

Serious money demands a serious bar. We suspect the 10% improvement is pretty tough, but we also think there is a good chance it can be achieved. It may take months; it might take years.

There is no cost to enter, no purchase required, and you need not be a Netflix subscriber. So if you know (or want to learn) something about machine learning and recommendation systems, give it a shot. We could make it really worth your while.

Sounds like fun. I wonder how much time I'll end up wasting on this one.
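For anyone wondering what the 10% bar means in practice, the contest scores entries by root mean squared error (RMSE) between predicted and actual ratings, and the bar is a 10% reduction in that error relative to Cinematch. The sketch below shows the arithmetic. The ratings are made up for illustration and the function names are my own, not anything from the contest's tools.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))

def improvement(baseline_rmse, candidate_rmse):
    """Fractional reduction in error relative to the baseline."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse

# Made-up ratings on a 1-5 star scale, purely for illustration.
actual = [4, 3, 5, 1, 2]
baseline = [3, 3, 4, 2, 3]    # stand-in for the existing system
candidate = [4, 3, 4, 2, 2]   # stand-in for a contest entry

base = rmse(baseline, actual)
cand = rmse(candidate, actual)
print(f"baseline {base:.3f}, candidate {cand:.3f}, "
      f"improvement {improvement(base, cand):.1%}")
```

On five toy ratings a large improvement is easy. On 100M real ratings, against a tuned production system, each additional percent is much harder to find.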

If you are thinking of entering the contest, you might be interested to know that much of the Internet Movie Database (IMDb) is available for download. Another good source for movie content is Amazon Web Services.

[Contest found via Pete Abilla and John Krystynak]

Update: I should explicitly point out that this Netflix data is by far the largest ratings data set available to the research community. Most work on recommender systems outside of companies like Amazon or Netflix has had to make do with the relatively small MovieLens data set (1M ratings) or the EachMovie data set (3M ratings). The Netflix data set has 100M ratings. It will be enormously useful for recommender system research.

Update: The comments on this post are starting to get pretty interesting.

Update: On the idea of using external movie data, Ilya Grigorik published data linking the Netflix movie ids to features extracted from IMDb data.
