Wednesday, July 30, 2008

Easy processing of massive data sets

Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou have an upcoming paper at VLDB 2008, "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets" (PDF), that describes a parallel data processing tool "being used daily ... inside Microsoft" on "large clusters ... of thousands of commodity servers" over "petabytes of data".

Scope is similar to Yahoo's Pig, which is a higher level language on top of Hadoop, or Google's Sawzall, which is a higher level language on top of MapReduce. But, where Pig focuses on and advocates a more imperative programming style, Scope looks much more like SQL.

Some excerpts from the paper:
Internet companies store and analyze massive data sets, such as search logs, web content collected by crawlers, and click streams collected from a variety of web services .... Several companies have developed distributed data storage and processing systems on large clusters of shared-nothing commodity servers including Google's File System, Bigtable, MapReduce, Hadoop, Yahoo!'s Pig system, Ask.com's Neptune, and Microsoft's Dryad.

We present a new scripting language, SCOPE. Users familiar with SQL require little or no training to use SCOPE. Like SQL, data is modeled as a set of rows composed of typed columns. Every rowset has a well-defined schema ... It allows users to focus on the data transformations required to solve the problem at hand and hides the complexity of the underlying platform and implementation details.

SCOPE [also] is highly extensible ... users can easily define their own ... operators: extractors (parsing and constructing rows from a file), processors (row-wise processing), reducers (group-wise processing), and combiners (combining rows from two inputs) ... [which] allows users to solve problems that cannot easily be expressed in traditional SQL.
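To make those four operator types concrete, here is a rough sketch in plain Python (this is not SCOPE syntax; the data, names, and pipeline are my own invention, not from the paper):

```python
# Toy sketch of SCOPE's four user-defined operator types, in plain
# Python. Everything here (the tab-separated log format, the column
# names) is hypothetical.

def extractor(lines):
    """Extractor: parse and construct rows of typed columns from a file."""
    for line in lines:
        url, clicks = line.split("\t")
        yield {"url": url, "clicks": int(clicks)}

def processor(rows):
    """Processor: row-wise processing (here, normalize each URL)."""
    for row in rows:
        yield {**row, "url": row["url"].lower()}

def reducer(rows, key):
    """Reducer: group-wise processing (here, sum clicks per key)."""
    groups = {}
    for row in rows:
        groups[row[key]] = groups.get(row[key], 0) + row["clicks"]
    return groups

def combiner(left, right):
    """Combiner: combine rows from two inputs on matching keys."""
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}

day1 = reducer(processor(extractor(["A.com\t3", "a.com\t2"])), "url")
day2 = reducer(processor(extractor(["a.com\t5"])), "url")
print(combiner(day1, day2))  # {'a.com': (5, 5)}
```

In SCOPE itself, these would be user-defined operators invoked from a SQL-like script over rowsets with well-defined schemas.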
The paper goes on to provide many examples of code written in Scope, describe the underlying Cosmos append-only distributed file system, discuss query plans for executing Scope programs, and touch on some of Scope's compile and run-time optimizations.

Though I probably shouldn't, let me add that, in my experience, Scope is great fun. It's like all the data and computation I can eat. And I can eat a lot.

Please see also my past posts on related work, including "Yahoo, Hadoop, and Pig Latin", "Sample programs in DryadLINQ", "Automatic optimization on large Hadoop clusters", and "Yahoo Pig and Google Sawzall".

Please see also my Dec 2005 post, "Making the impossible possible", which quotes Nobel Prize winner Tjalling Koopmans as saying, "Sometimes the solution to important problems ... [is] just waiting for the tool. Once this tool comes, everyone just flips in their head."

Monday, July 21, 2008

Kai-Fu Lee keynote at SIGIR

Googler Kai-Fu Lee gave a keynote talk yesterday at SIGIR 2008 on "The Google China Experience".

The Google China experience has been fraught with difficulties. As Kai-Fu said, Google was expecting that their search engine could succeed in China with the same interface and approach that had worked so well elsewhere, but was "humbled" by the lack of uptake.

Why did a Google that had worked well elsewhere not succeed in China?

Kai-Fu argued that two unusual features of Chinese internet users, their youth and their language, appear to explain the difference.

While showing examples of sites that did succeed in China, Kai-Fu argued that the young, novice, hurried average Chinese user is looking less for navigation or one authoritative source of information and more for entertainment and broad surveys of information.

Google China was optimized for finding the one site you need to go to, as it is elsewhere, but, Kai-Fu said, according to eyetracking studies and log data, Chinese users tend to be much less task-oriented, read much more of the page, and click many more links than US users.

So, Google started to offer up more opportunities for exploration and discovery -- by being much more aggressive with query suggestions and universal search, for example, and by adding browse links to the Google front page rather than just having a stark page with just a search box -- rather than just trying to give a quick answer and let people move on.

Kai-Fu suggested that the Chinese interest in browsing and exploration also may have roots in the Chinese language itself. Chinese is slower and more work to type than other languages, but more compact to read. This seems to cause a preference for clicking over typing and a preference for pages dense with information over the sparsity Google tended to favor in other countries.

One curious question that Kai-Fu raised was whether these preferences will remain true over time. Expert internet users tend to be more task-oriented than novice users. Google China has had much more success in gaining market share in China among expert users.

It may be the case that, as people gain more experience with the Web and what is available on the Web, their behavior shifts from browse to search, from exploration of what is out there to finding the information they know must exist.

Update: Paul Heymann at Stanford Infolab posts an excellent summary of Kai-Fu Lee's talk, as well as notes on other talks at SIGIR.

Tuesday, July 15, 2008

Learning diversity when learning to rank

Filip Radlinski, Robert Kleinberg, and Thorsten Joachims have an ICML 2008 paper, "Learning Diverse Rankings with Multi-Armed Bandits" (PDF), that attempts to "directly learn a diverse ranking of documents based on users' clicking behavior."

An excerpt:
[We] show how clickthrough data can be used to learn rankings maximizing the probability that any new user will find at least one relevant document high in the ranking .... [even though] web queries often have different meanings for different users.

We propose an online learning approach for learning from usage data. As training data is being collected, it immediately impacts the rankings shown ... The goal is to minimize the total number of poor rankings displayed over all time.
The work appears to be largely theoretical due to very long convergence times -- sadly, investigating "how prior knowledge can be incorporated ... to improve the speed of convergence" is left to future work -- but still is a worthwhile and enjoyable read.
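To give a flavor of the approach, here is a toy Python sketch of the core idea: one bandit per rank position, each learning from clicks. Simple epsilon-greedy stands in for the paper's bandit algorithms, and the documents, the two user intents, and all parameters are made up:

```python
import random

# Toy ranked-bandits sketch: one epsilon-greedy bandit per rank
# position, learning a ranking from simulated clicks. Documents,
# intents, and parameters are all hypothetical.
random.seed(0)
DOCS = ["d0", "d1", "d2", "d3"]
K = 2      # length of the ranking we show
EPS = 0.1  # exploration rate

clicks = [{d: 0 for d in DOCS} for _ in range(K)]
shown = [{d: 1 for d in DOCS} for _ in range(K)]  # start at 1 to avoid /0

def pick_ranking():
    ranking = []
    for pos in range(K):
        if random.random() < EPS:
            choice = random.choice(DOCS)
        else:  # greedy: best empirical click rate at this position
            choice = max(DOCS, key=lambda d: clicks[pos][d] / shown[pos][d])
        if choice in ranking:  # already shown above: pick a substitute
            choice = random.choice([d for d in DOCS if d not in ranking])
        ranking.append(choice)
    return ranking

def user_click(ranking):
    # Two user types with different intents; each clicks the first
    # relevant document, if any (hypothetical relevance sets).
    relevant = random.choice([{"d0"}, {"d2"}])
    for pos, doc in enumerate(ranking):
        if doc in relevant:
            return pos
    return None

for _ in range(5000):
    ranking = pick_ranking()
    hit = user_click(ranking)
    for pos, doc in enumerate(ranking):
        shown[pos][doc] += 1
    if hit is not None:  # credit only the position that got the click
        clicks[hit][ranking[hit]] += 1

print(pick_ranking())
```

With enough simulated users, the position-0 bandit tends to settle on a document for one intent and the position-1 bandit picks one up for the other, so a user of either type finds something relevant near the top.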

Please see my previous post, "Actively learning to rank", that discusses a fun KDD 2007 paper also by Filip Radlinski and Thorsten Joachims on learning to rank from click data.

Update: Two years later, Filip publishes a paper that appears to have a more practical and scalable technique for learning diversity from click data, "Learning optimally diverse rankings over large document collections". More nice work there from Filip.

Automatic optimization on large Hadoop clusters

Chris Olston, Benjamin Reed, Adam Silberstein, and Utkarsh Srivastava at Yahoo Research had a USENIX 2008 paper, "Automatic Optimization of Parallel Dataflow Programs" (PDF), that looks at optimizations for large-scale Hadoop clusters using Pig.

The paper says that it is only attempting "to suggest some jumping-off points for [optimization] work." With much of the paper spending a fair amount of time on "textbook ... optimizations" such as early projection and filtering, operator rewrites, and physical execution plans for joins, parts do read like a survey.

The part that looks at work sharing between jobs, however, goes beyond a survey and is quite interesting. I was particularly intrigued by the discussion of materialized views, which the authors suggest could be used to cache the results of common operations. One common early computation in these clusters is to extract a few fields from a file and then sort the extract based on one or more of the fields before doing additional processing. If that were cached, this could look a bit like lazily creating indexes over the data the first time they are needed. A fun idea, and one that could narrow the gap between Map-Reduce databases and more traditional databases.
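As a rough sketch of what that caching might look like, here is a toy Python version that memoizes a common extract-then-sort step (the cache key and the data are hypothetical; a real system would key on file identity and operation):

```python
# Toy sketch of caching a materialized view: memoize the result of an
# extract-then-sort step keyed by (input, fields, sort key), so later
# jobs reuse it like a lazily built index. All names are made up.

cache = {}

def extract_sorted(records, fields, sort_key):
    key = (id(records), fields, sort_key)  # stand-in for a real cache key
    if key not in cache:
        rows = [tuple(r[f] for f in fields) for r in records]
        cache[key] = sorted(rows, key=lambda row: row[fields.index(sort_key)])
    return cache[key]

log = [{"url": "b", "ms": 20}, {"url": "a", "ms": 5}]
first = extract_sorted(log, ("url", "ms"), "url")
again = extract_sorted(log, ("url", "ms"), "url")
print(first, first is again)  # second call hits the cache
```

The second call returns the cached result, so the sort cost is paid only the first time the extract is needed, much like lazily building an index.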

One other interesting tidbit in the paper is that the authors argue that high level languages like Pig for programming large-scale parallel computations will become dominant over "direct Map-Reduce or Dryad programs" in a "mass migration" as "good optimization technology" is implemented, "analogous to the migration from assembly to C-style languages to Java-style languages."

Please see also a paper by some of the same authors, "Parallel Evaluation of Composite Aggregate Queries" (PDF). That paper looks at reducing expensive data combining operations in Hadoop clusters by implementing a "cross-node data redistribution strategy that takes into account the nested structure of a query ... such that aggregations can be generated locally" for a type of aggregation query.

Please see also my earlier post, "Hadoop and scheduling", that discusses another paper by some of the same authors and its look at a method of combining many jobs on a Hadoop cluster that all want to access the same raw data files.

Social peer-to-peer and Tribler

I finally got to a fun paper I have been meaning to read for some time, "Tribler: A social-based peer-to-peer system" (PDF).

What is interesting about the paper is that it proposes a combination of social networking and P2P for content sharing that not only is designed to produce recommendations from friends and users with similar tastes in the network, but also uses those friend relationships to speed downloads, reduce free-riding, and encourage good behavior.

An excerpt:
Most current P2P file-sharing systems treat their users as anonymous, unrelated entities ... [Tribler is] a novel social-based P2P file-sharing paradigm that exploits social phenomena by maintaining social networks and using these in content discovery, content recommendations, and downloading.

Tribler performs content discovery and recommendation based on the notion of taste buddies, that is, users with the same tastes or interests .... The ... interface facilitate[s] the formation of social groups ... [which] may help reduce anti-social behavior.

A user invokes the help of his friends to speed up downloads ... Peers contribute their bandwidth by joining a swarm even if they are not interested in the content being distributed in this swarm ... [Past work assumed] sufficient altruism ... [which] makes ... [it] impractical ... Our [approach] solves this problem by introducing social-group incentives.
Tribler is an open source project. More information is available on their website, especially on their research page.

Monday, July 14, 2008

Going to SIGIR

I will be at SIGIR 2008 next week in Singapore. If you make it to what is very nearly the antipode for those of us in the US, please say hello!

Monday, July 07, 2008

Google Toolbar data and the actual surfer model

There were a few interesting developments in the past couple weeks from Google that appear to relate to their ability to track much of the movement on the web.

As MG Siegler from VentureBeat noted, among many others, Google Trends now offers a feature that shows the traffic many websites get, much like Alexa does using data from the Alexa toolbar. The new functionality also shows similar pages and queries: people who visited site X also visited Y, and people who visited site X also searched for Y.

Danny Sullivan, when talking about similar features Google launched in a new tool called Google Ad Planner, wrote:
I specifically asked ... [if] the [Google] toolbar is NOT [being used], and [a] "secret sauce" reply ... is all I got.

That makes me think that toolbar data IS being used. In particular, the focus on Google Analytics data feels like a sideshow. Google can't rely on Google Analytics as a core data source for this information, because of the simple reason that not every site runs it. In contrast, using Google Toolbar data would give them a nearly complete sample of all sites out there.
Erick Schonfeld at TechCrunch followed up with a post, "Is Google Ad Planner Getting Its Data From The Google Toolbar?"

There isn't much new about using this data in the way Google Trends has revealed. Alexa and others have been doing it with their toolbars for many years. What is new is that Google Toolbar is installed much more widely, including on every Dell computer and in every installation of Adobe Flash.

I have no information about whether Google is using their toolbar data, but I have a hard time believing they could resist it. Not only can it be used for things like Google Trends, but it could have a dramatic impact in core web search.

PageRank, the original core of Google's search relevance, is an analysis of the links between websites that simulates someone randomly surfing across the links, the so-called random surfer model.

With toolbar data, you no longer need this approximation of a random surfer model. You have the actual surfer model.

You know exactly how people move across the web. You know which sites are popular and which sites are never visited. You know which links are traversed and how often. You know everything about where people go and what people want.
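To make the contrast concrete, here is a toy sketch in Python: the same PageRank-style power iteration run twice, once weighting links uniformly (the random surfer) and once weighting them by hypothetical observed traffic (the actual surfer). The graph and the click counts are invented for illustration:

```python
# Random surfer vs. actual surfer: identical power iteration, but the
# transition probabilities differ. The toy web graph and the
# toolbar-style click counts below are hypothetical.

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # page -> outlinks

# Hypothetical observed counts of which link users actually took.
observed = {("a", "b"): 90, ("a", "c"): 10, ("b", "c"): 100, ("c", "a"): 100}

def pagerank(transition, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for src in pages:
            for dst in links[src]:
                new[dst] += damping * rank[src] * transition(src, dst)
        rank = new
    return rank

uniform = lambda s, d: 1.0 / len(links[s])  # random surfer: any outlink
total = {s: sum(c for (x, _), c in observed.items() if x == s) for s in links}
actual = lambda s, d: observed[(s, d)] / total[s]  # actual surfer

print(pagerank(uniform))
print(pagerank(actual))
```

In the toy graph, page b gains rank under the actual surfer model because users overwhelmingly follow the a-to-b link, even though the link graph alone treats a's two outlinks identically.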

Data like that should allow tremendous advances in relevance. It is hard for me to believe that Google would not be using their toolbar data, not just for Google Trends, but also in search and advertising.

Please see also my earlier post, "Ranking using Indiana University's user traffic", which discusses a paper from WSDM 2008 that attempts to supplement the PageRank random surfer model with an actual surfer model.

Black, white, and gray spam

Scott Yih, Robert McCann, and Alek Kolcz had a paper at CEAS 2007 on "Improving Spam Filtering by Detecting Gray Mail" (PDF).

The paper focuses on training a classifier to detect unwanted e-mail, which they called gray mail, but what excites me more is the broader idea of shades of gray in what we think of as spam. The ideas behind the paper seem to suggest that spam should be more of a continuum. Documents vary from annoying to interesting, with much falling in the middle.

Part of my motivation here comes from feeling bothered for some time at what seems to me to be a weak distinction between spammy pages and useless pages.

If a web page exists but no one ever reads it, does it matter?

If a blog exists but has no subscribers, does it matter that it is not actually spam?

In general, whether some document is uninteresting because it is manipulative or uninteresting for another reason, isn't it still uninteresting?

If we do treat spam and uninteresting as similar, it seems like it could make the spam problem somewhat easier. A false positive on spam is much less costly if the penalized document was uninteresting anyway.
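One toy way to see how the continuum makes false positives cheaper: weight the cost of filtering a document by both the chance it is legitimate and how interesting it is. The scores and threshold here are invented for illustration:

```python
# Toy sketch of spam as a continuum: the expected harm of filtering a
# document is the probability it is legitimate times how much anyone
# would have wanted it. All numbers are hypothetical.

def false_positive_cost(p_spam, interest):
    return (1 - p_spam) * interest

THRESHOLD = 0.1  # filter anything whose expected harm is below this

docs = [
    ("manipulative SEO page", 0.9, 0.10),
    ("popular news story",    0.2, 0.90),
    ("abandoned blog post",   0.3, 0.05),
]
filtered = [name for name, p, i in docs if false_positive_cost(p, i) < THRESHOLD]
print(filtered)  # ['manipulative SEO page', 'abandoned blog post']
```

Under this scoring, the abandoned post is cheap to filter even though it probably is not spam, while the popular story survives a moderate spam probability, which is exactly the point above.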

Please see also my August 2006 post, "Web spam, AIRWeb, and SIGIR".

Please see also Googler Matt Cutts' post, "Using data to fight webspam".

Video recommendations on YouTube

A WWW 2008 paper out of Google, "Video Suggestion and Discovery for YouTube: Taking Random Walks Through the View Graph" (PDF) presents "a novel method based on the analysis of the entire user-video graph to provide personalized video suggestions for users."
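As a rough illustration of the flavor of the idea (a much-simplified cousin of the paper's method, with made-up view data), here is a short random walk over a user-video view graph in Python:

```python
from collections import Counter

# Toy random walk over a bipartite user-video view graph: walk from a
# seed video to its viewers and on to what else they watched, then
# rank candidates by the probability mass that reaches them. The view
# data is hypothetical.

views = {                 # user -> videos watched
    "u1": ["cats", "dogs"],
    "u2": ["cats", "dogs", "news"],
    "u3": ["cats", "news"],
}
watched_by = {}           # video -> users who watched it
for user, vids in views.items():
    for v in vids:
        watched_by.setdefault(v, []).append(user)

def suggest(video, hops=1):
    scores = Counter({video: 1.0})
    for _ in range(hops):  # one hop = video -> users -> videos
        nxt = Counter()
        for v, mass in scores.items():
            users = watched_by[v]
            for u in users:
                share = mass / len(users) / len(views[u])
                for w in views[u]:
                    nxt[w] += share
        scores = nxt
    scores.pop(video, None)  # don't suggest the seed video itself
    return [v for v, _ in scores.most_common()]

print(suggest("dogs"))  # ['cats', 'news']
```

Here the walk from "dogs" reaches "cats" with more probability mass than "news" because more of the viewers of "dogs" also watched "cats".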

Not sure if the same technique was used, but YouTube recently launched ([1] [2]) recommendations.

Update: Ten days later, Googler Sapna Mehta announces video recommendations on Google Video.

These recommendations appear to be quite different than the YouTube recommendations, based on your Google web and search history rather than the YouTube videos you watch and almost certainly using different techniques under the covers.

Interesting to see Google making a push on video recommendations both on YouTube and Google Video.

Please see also a review of the new Google Video recommendation feature on Search Engine Land.

Saturday, July 05, 2008

Google, the press, and tearing down your heroes

The press seems to have a pattern when reporting on successful technology companies. First, these companies can do no wrong, the heroes of our time, bringing us clever new ways of doing things that promise to dramatically change our lives.

Then, reporters start nipping away at the edges. Maybe the hero has flaws? Is our hero really good? At the first sign of blood, the fangs sink in, and a frenzied attack tears down the hero the press itself had once created.

This happened dramatically when I was at Amazon. In 1999, Jeff Bezos was Time Magazine's Person of the Year. In 2000, we were "Amazon.bomb", no longer a revolution of retailing, but a fraud and a scam.

It probably is easier to draw blood in times of recession, when the business is already strained. The bloom came off Amazon's rose in 2000 during the stock market crash. In this recession of 2008, it looks like Google might have to endure similar treatment.

A few years ago, Google was built up into a hero. "The search engine that could" ([1]) had a "godlike view of what us mere mortals are thinking" ([2]). Google was "one of the most remarkable Internet successes of our time ... powered by the world’s most advanced technology ... [and] in a few short years has revolutionized access to information about everything for everybody everywhere." ([3]). The "all-knowing voice [will] always deliver the precise answer to any question in a fraction of a second." ([4]).

Now, the press is nipping away at our hero. Some question whether Google has become evil ([5] [6]). Some ask if our hero knows too much and question its motives ([7] [8]). There was even a rather goofy piece in the NYT today portraying Sergey Brin and Susan Wojcicki as insensitive tyrants who ignore the day care needs of Googler parents.

So far, the press has failed to draw much blood, but I doubt that will deter them. If the Amazon experience is any guide, this will not end until the beast has been fed. The press has built up their hero. They will now tear it back down. The cycle must complete.

For a more lighthearted view on all of this, don't miss The Onion's story, "Google Announces Plan To Destroy All Information It Can't Index".

Friday, July 04, 2008

Digg recommendation engine

Kevin Rose writes that Digg is launching a recommendation engine that "uses your past digging activity to identify what we call Diggers Like You ... [and] suggest stories you might like." Good to see it.

MG Siegler's posts at VentureBeat ([1] [2]) include some early reviews of the feature.

Please see also my previous posts on Digg, especially "Digg struggles with spam" and its discussion of how recommendations could reduce spam by mitigating the winner-takes-all effect.

Wednesday, July 02, 2008

Hadoop and scheduling

A VLDB 2008 paper out of Yahoo Research, "Scheduling Shared Scans of Large Data Files" (PDF) looks at how "to maximize the overall rate of processing ... by sharing scans of the same file ... [in] Map-Reduce systems [like Hadoop]".

I felt the model used for simulations in the paper was a bit questionable -- seems to me the emphasis should be on newer data being accessed by many jobs simultaneously, most often by smaller jobs -- but the specifics of the solution probably are of less interest than the paper's general discussion of scheduling issues that come up as a large Hadoop cluster is put under load from many users.

Please see also my earlier post, "Hadoop summit notes", especially the problems with scheduling using Hadoop on Demand (which makes a cluster look like multiple virtual clusters) and the idea of sharing intermediate results (such as sorted extracts) from previous jobs on the cluster.

Amazon page recommendations

Brady Forrest at O'Reilly Radar points out Amazon's new page recommendation widget in his post, "Amazon's Page Recommender: Foreshadowing A New Web Service?"

Put the widget on your website and, for any page it is on, Amazon can see who visits that page and where they go on your site, and then, from those behavior patterns, generate recommendations on where people might want to go from each page.

It could, for example, be used on a news website to generate news recommendations or, on a shopping site, to recommend products.

It is an interesting move by Amazon, a step toward Aggregate Knowledge and others that offer recommendations as a web service.

Microsoft, big computation, and big data

In his post, "Inside Microsoft's Internet Infrastructure & Its Plans for the Future", Om Malik highlights some interesting "facts about Microsoft-owned data centers":
[Microsoft is] adding 10,000 servers a month

Network backbone ... soon ... [will be] 500 Gigabits.

Data in the near future will soon approach 100s of petabytes.
All the computation you could want and all the data you can eat. Perhaps now it makes more sense why I am at Microsoft?

Please see also my April 2006 post, "Microsoft is building a Google cluster".

Website personas

Netflix has a feature called Profiles that allows multiple people to use the same Netflix account while keeping their queue and recommendations separate.

Recently, they attempted ([1] [2]) to remove the feature because "too many members found the feature difficult to understand and cumbersome, having to consistently log in and out of the website."

It's an interesting example of how difficult it is to allow people to maintain multiple personas on a website. At Amazon, the second biggest complaint about the recommendations, after recommending items already bought from a different store, was that sharing an account, or buying a gift that was not marked as a gift, would mix together recommendations for multiple people.

But, the problem is that there is no easy and convenient way to provide multiple personas. Power users might be able to login and logout of different accounts, but most people don't understand or want to bother with that.

Hard problem. It's not clear to me what the solution might be.