Geeking with Greg: 04/01/2007

Thursday, April 26, 2007

Yahoo Pig and Google Sawzall

It is hard to tell from the limited documentation available, but the Pig project at Yahoo Research seems to have a lot in common with Sawzall at Google. Both are high level programming languages targeting massively parallel processing across huge clusters.

From the Pig project page:

We are creating infrastructure to support ad-hoc analysis of very large data sets. Parallel processing is the name of the game.

Our system runs on a cluster computing architecture, on top of which sit several layers of abstraction that ultimately bring the power of parallel computing into the hands of ordinary users.

The layers in between automatically translate user queries into efficient parallel evaluation plans, and orchestrate their execution on the raw cluster hardware.

This is similar to the motivation behind Sawzall. From the Sawzall paper:

To make effective use of large computing clusters in the analysis of large data sets, it is helpful to restrict the programming model to guarantee high parallelism ... Our approach includes a new programming language called Sawzall.

The language helps capture the programming model by forcing the programmer to think one record at a time, while providing an expressive interface to a novel set of aggregators that capture many common data processing and data reduction problems.

[Users can] write short, clear programs that are guaranteed to work well on thousands of machines in parallel ... The user needs to know nothing about parallel programming; the language and the underlying system take care of all the details.

Just as Google Sawzall is built on top of MapReduce, Yahoo Pig is built on top of Hadoop (an open source clone of MapReduce that is supported by Yahoo).

However, there do appear to be differences in the languages. Sawzall syntax appears heavily influenced by C and Pascal, whereas Pig appears to be motivated by an attempt to extend SQL. For example, the Sawzall paper says:

The syntax of statements and expressions is borrowed largely from C; for loops, while loops, if statements and so on take their familiar form. Declarations borrow from the Pascal tradition.

The Pig documentation spends some time talking about how the language differs from SQL and why SQL is not sufficient:

The analogue to Pig Latin in the SQL world is "relational algebra." Pig Latin differs from the relational algebra in the following important ways:

1. The "join" operator is decomposed into two separate operations: co-group and flatten.

2. Pig Latin has built-in support for nested data (it supports simple projection, filters, and sorting on nested constructs).

Why do we opt for Pig Latin over SQL? One reason is that many programmers don't "like" SQL, because it forces them to do acrobatics with their program logic, just to get it into a declarative "calculus" form. A final (but important) reason is that in general it can be difficult to convert complex SQL statements into efficient parallel programs.

Examples of some code in the two languages might be useful here. Here is an example of Sawzall code:

proto "querylog.proto"
static RESOLUTION: int = 5; # minutes; must be divisor of 60
log_record: QueryLogProto = input;
queries_per_degree: table sum[t: time][lat: int][lon: int] of int;
loc: Location = locationinfo(log_record.ip);
if (def(loc)) {
    t: time = log_record.time_usec;
    m: int = minuteof(t); # within the hour
    m = m - m % RESOLUTION;
    t = trunctohour(t) + time(m * int(MINUTE));
    emit queries_per_degree[t][int(loc.lat)][int(loc.lon)] <- 1;
}
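For readers who do not know Sawzall, here is a rough Python sketch of the same one-record-at-a-time idea. It is only an illustration: the record fields, the `locate` lookup table, and the `Counter` standing in for the `table sum` aggregator are all my assumptions, not part of Sawzall or of Google's infrastructure.

```python
from collections import Counter

RESOLUTION = 5  # minutes; must be a divisor of 60


def locate(ip):
    """Hypothetical stand-in for locationinfo(); returns (lat, lon) or None."""
    table = {"10.0.0.1": (47.6, -122.3), "10.0.0.2": (37.4, -122.1)}
    return table.get(ip)


def process_record(record, emit):
    """Handle exactly one log record, emitting a count to the aggregator."""
    loc = locate(record["ip"])
    if loc is None:
        return  # mirrors the def(loc) check: skip records with no location
    seconds = record["time_usec"] // 1_000_000
    # Truncate the timestamp down to the nearest RESOLUTION-minute boundary.
    bucket = seconds - seconds % (RESOLUTION * 60)
    emit((bucket, int(loc[0]), int(loc[1])), 1)


def run(records):
    """Stand-in for a sum table: adds up the emitted values per key."""
    totals = Counter()

    def emit(key, value):
        totals[key] += value

    for record in records:
        process_record(record, emit)
    return totals
```

The point of the structure is that `process_record` never sees more than one record, so in principle the loop in `run` could be split across many machines and the per-key sums merged afterward.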

And, here is an example of Pig code:

a = COGROUP QueryResults BY url, Pages BY url;
b = FOREACH a GENERATE FLATTEN(QueryResults.(query, position)), FLATTEN(Pages.pagerank);
c = GROUP b BY query;
d = FILTER c BY checkTop5(*);
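To make the co-group and flatten steps a little more concrete, here is a small Python sketch of how the two together behave like a join. The sample rows and the helper names `cogroup` and `flatten_pairs` are made up for illustration, and the sketch covers only the first two statements of the Pig example, not the later grouping or the `checkTop5` filter.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key):
    """Group two lists of dicts by a shared key: {key: (left_bag, right_bag)}."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return dict(groups)


def flatten_pairs(groups):
    """Cross each group's two bags; a group with an empty bag yields nothing."""
    for left_bag, right_bag in groups.values():
        for left_row, right_row in product(left_bag, right_bag):
            yield (left_row["query"], left_row["position"], right_row["pagerank"])


# Made-up sample data, not taken from the Pig documentation.
query_results = [
    {"query": "pig", "url": "a.example", "position": 1},
    {"query": "pig", "url": "b.example", "position": 2},
    {"query": "sawzall", "url": "a.example", "position": 1},
]
pages = [
    {"url": "a.example", "pagerank": 0.9},
    {"url": "c.example", "pagerank": 0.1},
]

joined = sorted(flatten_pairs(cogroup(query_results, pages, "url")))
```

Only the rows for `a.example` survive, since that is the only URL present on both sides. The co-group step by itself keeps the unmatched URLs around with an empty bag, which is what makes it more general than a plain join.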

I have to say, it is good to see Yahoo building these kinds of tools for large scale data manipulation.

Google's massive cluster and the tools built on top of it have been called a "competitive advantage", the "secret source of Google's power", and a "major force multiplier".

As Peter Norvig said, "It allows us to turn around the experiments much faster than the other guys ... We can get the answer in two hours which I think is a big advantage over someone else who takes two days."

Update: There is some discussion of a similar effort at Microsoft, the Dryad project, in the comments for this post.

Wednesday, April 25, 2007

Talk on challenges in Web ad auctions

I enjoyed this Google Tech Talk by MSR post-doc Nicole Immorlica, "Challenges in the Design of Sponsored Search Auctions".

The talk starts with a short history of Web advertising and a description of the second price auction mechanism commonly used for pay-per-click advertising. I suspect many readers of this weblog would be interested in the discussion of click fraud starting at 12:34 that includes a discussion of common methods for detecting click fraud.

My favorite part started at 28:09 when Nicole brought up how advertisers with budget constraints wreak havoc on the efficiency of the generalized second price auction. She gave several examples where advertisers would not be truthful in their bids when they had limited budgets for their ads, yielding suboptimal outcomes for the auction.
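For anyone who has not seen the mechanism before, here is a minimal Python sketch of a generalized second price auction in its simplest textbook form. It is my own simplification, not code from the talk: it ignores budgets, reserve prices, and click-through or quality weighting, and the function name and signature are assumptions.

```python
def gsp_auction(bids, num_slots):
    """Rank bidders by bid; each winner pays the next-highest bid per click.

    bids: dict mapping advertiser name to bid per click.
    Returns a list of (advertiser, price_per_click), best slot first.
    """
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    results = []
    for slot in range(min(num_slots, len(ranked))):
        advertiser = ranked[slot][0]
        # The price is the bid of the advertiser ranked just below, or 0 if none.
        price = ranked[slot + 1][1] if slot + 1 < len(ranked) else 0.0
        results.append((advertiser, price))
    return results
```

For example, with bids of 3.0, 2.0, and 1.0 and two slots, the top bidder wins the first slot and pays 2.0 per click, and the second bidder wins the second slot and pays 1.0.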

Nicole ended with some thoughts on situations that can cause regular fluctuations in advertising auction bidding and how that kind of instability can be avoided.

Much of this talk appears to be covered in more detail in some of the papers Nicole co-authored, including an upcoming WWW 2007 paper, "Dynamics of bid optimization in online advertisement auctions" (PDF), and an older EC 2005 paper, "Multi-Unit Auctions with Budget-Constrained Bidders" (PDF).

Tuesday, April 24, 2007

Social search and the same old problems

Matthew Hurst pointed to a talk by Yahoo Researcher Prabhakar Raghavan on "The changing face of web search" (PDF). The talk covers good ground, including an overview of some issues in web advertising and of a few types of social applications.

What particularly caught my attention were slides 11-14 on social search. On slide 13, Prabhakar lists challenges in social search:

How do we use these tags for better search?

How do you cope with spam?

What's the ratings and reputation system?

Rephrasing slightly, the challenges are: how do we know which user-contributed data matters, and how do we determine how reliable that information is?

But, wait a second. Those are the same problems we face with traditional web search. Web pages and links between web pages are created by people. We need to determine how reliable those pages and links are and what information in those pages is useful.

Prabhakar says as much two slides earlier, noting that "the wisdom of the crowd can be used to search" but also saying "the principle is not new -- anchor text [is] used in 'standard' search." That is, the wisdom of the crowd, in traditional web search, is in the web pages and web links. In social search, users can create additional snippets of content other than web pages. That additional content can also be used in search.

But the problem remains the same. What information is useful? What is the reliability of that information? And, as slide 14 suggests, trying to solve these problems in social search probably looks similar to solving them in traditional search, mostly involving propagating usefulness and trust along the associations and links between the pieces of data.

It makes me wonder how much leverage there is from the idea of social search. Does social search change the nature of the problem? Does the new data somehow help solve what otherwise would be hard problems for traditional search? Or do the same old problems follow those who try to move to social search?

See also my previous post, "Chris Sherman on social search".

A virtual chat room for work

Paul Lamere recently mentioned a project at Sun Labs called "MPK20: Sun's Virtual Workplace".

Paul describes it as "Second Life for work" and that is what it looks like, complete with avatars and a third person view, but adding handy virtual conferencing features like the ability to share documents and use whiteboards.

This reminds me of a crazy idea I was bouncing around a while back with some colleagues who were kind enough to listen to my ramblings about what the next generation of virtual conferencing and chat rooms might look like.

The thought is to try to simulate the view you would have from physically sitting around a table and then enhancing it with a level of document sharing and focused attention that would not be possible in the physical world. The participants would be engaged in conversation, smoothly be able to look at documents, zoom on parts of a presentation, and investigate relevant threads, all while still hearing and participating in the conversation.

I apologize for such an absurdly geeky reference, but I have to admit much of my motivation here comes from a virtual chat room scene from the anime series Ghost in the Shell. The example in this clip shows a virtual chat room where multiple participants are able to see each other, data and documents can be shared, additional information can be accessed and brought into view, and a transcript is maintained within the field of view.

Clearly, a full virtual reality like this is unachievable at the moment, but I wonder how far we might be able to get toward that goal. I think it is an attractive vision, a large step beyond the collaboration possible currently in IM or virtual conferencing.

Update: John Battelle points to HP's HALO. Cool, but telepresence with high resolution video feeds is very expensive and likely to remain so. I wonder how far someone could get with avatars in a virtual conference/chat environment, something that is usable with a minimum of special equipment.

Update: A few months later, Bob Cringely talks about HP's HALO and calls telepresence "The Next Killer App".

Personalization, Google, and discovery

Gord Hotchkiss, a columnist at the popular Search Engine Land, posts some thoughts on Google Web History and Google's aggressive moves in personalized search.

This allows Google to ... [tap] into your current browsing behavior to try to determine what's on your mind right now ... It helps Google interpret just the kind of site you want to see, given your behavior at the present time.

[Google can] find similar sites you may have never considered, based on the characteristics of the sites you have been visiting ... It's not just providing you a shortcut to sites you are already aware of, it's in making you aware of new sites you never knew existed, ranked and prioritized according to the PageRank algorithm.

The promise of personalization is moving Google to be a true recommendation engine.

Search is about finding. Recommendations are about discovering.

It is hard to find something if you do not know it exists. It is hard to find something if you cannot easily say exactly what you want.

Personalization can help you discover information you would not have found on your own, sites you never knew existed, based on your past behavior and interests. Personalization and recommendations aid discovery.

Update: A few days later, Gord Hotchkiss publishes an interview on personalization with Googlers Marissa Mayer and Sep Kamvar.

Update: In yet another article, Gord Hotchkiss says:

As Sep and his team begin to refine personalization, expect it to be aggressively rolled into multiple aspects of your Google experience.

[Personalization is] the engine that will power the future of Google for the foreseeable future. It will eventually surpass the PageRank algorithm in importance, giving Google the ability to match content to very specific and unique user intent on the fly.

Not remaking the internet

Security guru Ed Felten writes about the efforts to redesign and rebuild the internet:

It's folly to think we can or should actually scrap the Net and build a new one.

The Net is working very nicely already. Sure, there are problems, but they mostly stem from the fact that the Net is full of human beings -- which is exactly what makes the Net so great. The Net has succeeded brilliantly at lowering the cost of communication and opening the tools of mass communication to many more people.

Let's stop to think about what would happen if we really were going to redesign the Net. Law enforcement would show up with their requests. Copyright owners would want consideration. ISPs would want some concessions, and broadcasters. The FCC would show up with an anti-indecency strategy. We'd see an endless parade of lawyers and lobbyists. Would the engineers even be allowed in the room?

The original design of the Internet escaped this fate because nobody thought it mattered. The engineers were left alone while everyone else argued about things that seemed more important. That's a lucky break that won't be repeated.

The good news is that despite the rhetoric, hardly anybody believes the Internet will be rebuilt ... For better or worse, we're stuck with the Internet we have.

The temptation to attempt to rearchitect the internet is high. Spam dominates e-mail and clogs servers. Downloads of copyrighted materials are rampant. The internet is poorly suited to traffic that requires real-time guarantees, like VoIP or video, or to broadcast traffic like TV.

But the cure likely would be worse than the disease. When the internet was built, no one had a financial interest in meddling with the details. Now, many would see big profits if they can push for certain outcomes. A new internet would more likely be a sickly pile of legal goo than a product of good engineering.

Dasher: Navigation for text entry

A fun Google Tech Talk, "Dasher: information-efficient text entry", presents an interesting alternative method of text entry to the norm of keyboards, alphanum pads, or graffiti-style handwriting recognition.

David MacKay motivated the system with one of the best one-line descriptions I have ever heard:

Writing is navigating the library of all possible books.

To see what he means, take a look at the demo of Dasher starting at 6:20 in the video. The system feels similar to driving through the text, pointing at the letters you want as you move to and through them.

The demos of navigating using eye tracking (24:52) and the much cheaper head tracking (27:42) are also worthwhile and clearly demonstrate the value of the approach for disabled users.

Despite the suggestion and demo (at 31:20) of Dasher on mobile devices, I suspect the tiny screens on cell phones would remain a problem. However, this kind of data entry interface might be usable and further motivate the trend to larger, all-device touchscreens on mobile devices instead of dedicated buttons and controls (e.g. the Apple iPhone).

The entire talk, not just the demos I linked to, is well worth watching, especially the motivation in the first 5-10 minutes. It really is quite clever.

The Dasher project page has more information, including a downloadable version of the application.

Wednesday, April 18, 2007

StumbleUpon and Google's StumbleUpon clone

Just as StumbleUpon (a popular toolbar that recommends web pages) is rumored to have been acquired by eBay (congrats, Garrett!), Google personalization guru Sep Kamvar announces a similar feature integrated into the Google Toolbar.

StumbleUpon recommends web pages based on pages you explicitly rate while the Google Toolbar recommendation feature learns from the implicit information in your Google search history to determine what kinds of web pages you seem to like. In my usage, both return a lot of misses and some hits, enough hits to be fun and occasionally useful exploration tools.

See also coverage by Inside Google ([1] [2]), Search Engine Land ([1] [2]), GigaOM ([1] [2]), SearchViews, and Google Blogoscoped.

Update: Mike Arrington notes that the Google feature is "strikingly similar" to StumbleUpon and John Battelle says:

The strategy ... [is to] have something in the wings, in case [they] don't win the acquisition game. This case is small - StumbleUpon. But from sources I've talked to, Google had built an entire Doubleclick killer, in case it did not win there.

Update: Five months later, a BusinessWeek article claims eBay may use StumbleUpon "to recommend items to users based on their shopping history and the shopping histories of other people like them." I wonder if that feature would be in addition to or a replacement of the web page recommender toolbar.

Tuesday, April 17, 2007

What is Microsoft's CloudDB?

It appears that Microsoft is working on a clone of Google's BigTable codenamed Blue/CloudDB.

Mary Jo Foley reports that "CloudDB is a file-system-based storage system ... 'Blue' also seems to refer to the query processor that builds on top of the cheap, file-system-based storage enabled by CloudDB."

That sounds quite similar to BigTable, a distributed database built on top of GFS.

It seems hard to track down any other information about this effort. I was able to find a Microsoft job posting for a senior SDE that says:

Our team is building a geo-distributed, reliable and high performance internet facing service. You will help internal and external partners to realize Microsoft's vision of seamless access to your data - anywhere, anytime, any device.

You will learn about the Blue/CloudDB storage system, you will work on an API definition, solve problems around security and Quality of Service guarantees.

Do you find it fascinating to build a service that is scalable, geo-distributed, has efficient queuing, load balancing? We will solve together problems about performance, reliability and intelligent optimization for different traffic patterns. The product will be used also by internal Windows Live Services and you will have the opportunity to interact with multiple groups in the company that need easy, efficient and simple storage in the cloud.

Join us and help Microsoft go head to head with Amazon, Google, and Yahoo in cloud storage.

From that description, it sounds like Microsoft's CloudDB may be intended to compete with several efforts, including Amazon S3, AOL's XDrive, the Yahoo-supported Hadoop, and Google's BigTable, Google File System, and GDrive.

I am not even sure if it is from Microsoft, but I did also find something that looks like a leak of some pieces of an internal document called "CloudDB Living Spec v0.51.doc". The web page mentions the needs of Microsoft services (including MSN Shopping and Fremont/Expo) and echoes terminology used in Google's BigTable paper. Some selected excerpts:

SSTable's efficient data is less than 40%, column name occupy much room in each item. Maybe we should use the column id to reduce this part of overload.

Sparse columns baked into the app. This scenario arises when an MSN service defines a large set of potential attributes for a type of an object, but only very few of these attributes are expected to be set on any given object.

For example ... [in] MSN Shopping ... the total set of attributes that products can have (e.g. "Pixel Resolution") is very large, but any given product only has a few (a vacuum cleaner doesn't have 'Pixel Resolution').

[A] service wants to be able to add 'attributes' to people, without having to waste space in every person for every attribute ... The most prominent example of this is Freemont --- this service wants to allow users to add their own columns to their listings. The total number of columns used across all objects can be huge (millions).

Note: Google Base advertises support (and is optimized) for cases where even a single object can have millions of columns.
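To illustrate the sparse column scenario and the column id suggestion in those excerpts, here is a toy Python sketch. It is purely my own illustration of the general idea. The class name and methods are hypothetical, and it says nothing about how CloudDB or BigTable actually store data.

```python
class SparseTable:
    """Rows store only the attributes that are set, keyed by integer column IDs."""

    def __init__(self):
        self._column_ids = {}    # column name -> integer ID
        self._column_names = []  # integer ID -> column name
        self._rows = {}          # row key -> {column ID: value}

    def _column_id(self, name):
        """Return the ID for a column name, assigning a new one if needed."""
        if name not in self._column_ids:
            self._column_ids[name] = len(self._column_names)
            self._column_names.append(name)
        return self._column_ids[name]

    def put(self, row_key, name, value):
        self._rows.setdefault(row_key, {})[self._column_id(name)] = value

    def get(self, row_key, name, default=None):
        column_id = self._column_ids.get(name)
        if column_id is None:
            return default
        return self._rows.get(row_key, {}).get(column_id, default)

    def columns_of(self, row_key):
        """List the names of the columns actually set on this row."""
        return sorted(self._column_names[c] for c in self._rows.get(row_key, {}))
```

In this sketch a vacuum cleaner row costs nothing for "Pixel Resolution", since unset attributes are simply absent, and each stored item carries a small integer rather than the full column name.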

Monday, April 16, 2007

DoubleClick on Google

It is being widely reported that Google is acquiring DoubleClick for $3.1B.

Reaction to this announcement has been interesting. Microsoft's Don Dodge says, "DoubleClick ... two years ago ... [was] valued at less than $1 billion ... They later sold off two divisions for $525 million. Yesterday Google paid $3.1 billion for what remained of DoubleClick."

Henry Blodget writes that this is "a big management challenge, a significant price tag, [and] an admission on the largest scale to date that it sometimes makes sense to buy instead of build."

Google VP Susan Wojcicki claims this is "the next step in Google advertising." Susan's words sounded unintentionally ominous, however, after I read Philipp Lenssen's highlights of DoubleClick's history, which includes a number of issues with privacy violations.

Paul Kedrosky sees this as "a brazen attempt to cut off Microsoft's future air supply" and "a strategic and offensive buy." I disagree with Paul here. I have a hard time seeing this as anything but a defensive move. Microsoft is drawing little air from advertising revenue and will not be for some time. I think it is clear that Google is trying to ward off Microsoft's attempts to pinch off its advertising revenue air supply, not vice versa.

Nathan Weinberg, normally a huge fan of all things Google, slams the acquisition in two posts ([1] [2]), calling it "crazy" and "arrogant". Ionut Alex Chitu notes a passage from John Battelle's book "The Search" on Google's decision to build its own advertising engine back in 1999: "DoubleClick's ads were often gaudy and irrelevant. They represented everything Page and Brin felt was wrong with the Internet."

As for me, I am not sure what to think. On the one hand, this acquisition does bring in some additional revenue, adds some important relationships with existing DoubleClick advertisers, and helps defend Google's advertising market share, its air supply.

On the other hand, this is a very expensive acquisition of a large company with a history of privacy issues. DoubleClick appears to be culturally different than Google, posing an integration challenge. Put simply, like YouTube, DoubleClick is not Googly. I foresee problems down the road as these two beasts mash together to try to bear useful offspring.

Google 411 from Google Labs

Right after Microsoft acquired TellMe, Google Labs launched a local search service using voice recognition at 1-800-GOOG-411.

In my tests, Google Voice Local Search worked fairly well, good enough that I intend to use it for most local searches from my cell phone. Others report more mixed results.

Some of the reaction to this has been interesting. Paul Kedrosky says, "Google just killed the directory assistance business," a statement that probably gives too much credit to Google, but may accurately describe the long-term trend.

Tim O'Reilly mentions Googler Peter Norvig's love for big data and wonders if this is "designed to harvest voice data to build Google's own speech database" to create a "competitive advantage in speech recognition."

Probably true, but I doubt that is their primary motivation. Rather, I suspect they are trying to work around the UI problems of search on mobile devices.

I am convinced that the long-term success of mobile search will require overcoming the issue of the tiny screens on cell phones and other mobile devices.

While there are a few promising paths to manage with the small screens -- including Patrick Baudisch's clever UI work at Microsoft Research, search personalization to display only the most relevant data in the limited space available, or the iPhone's use of the entire surface of the device as a screen -- I am concerned that these approaches may rapidly hit their limits.

Rather, I suspect we will have to replace the tiny screen with something else. One good option is a voice recognition interface like Google Voice Search or the one TellMe has developed. These make the tiny displays unnecessary.

Another approach may be to make the tiny screen a huge screen. The most promising work I have seen here is the Virtual Retinal Display being developed at the HIT Lab at University of Washington. By drawing video images directly on your retina, it replaces tiny displays with a massive one that covers your full field of view.

The constrained input and display on mobile devices cripple the potential of mobile search. Resolving these issues within the form factor of these devices is not trivial. A voice interface, like Google Voice Local Search, may be one of the better solutions.

Ask's Edison and clickstream analysis

Barry Schwartz at Search Engine Land reports that Ask.com is attempting to integrate an expanded version of DirectHit's clickstream analysis technology with its Teoma relevance rank algorithm.

Barry quotes Ask VP Rahul Lahiri as saying:

It's a next generation algorithm that ... synthesizes modernized versions of Teoma and DirectHit technologies ... It's much more complicated than ... just counting clicks ... The technologies we have ... go way beyond that ... taking a deeper look at communities and calculating the authorities in those communities.

We ... [are] looking into the universe of user behavior, and what that could tell us, and the social fabric of the Web itself, and what that tells us.

It is hard to tell from this description, but it sounds like they are using clickstream data to attempt to identify good or trustworthy sites, then propagating that information through the link graph. If so, it might be similar to the ideas described in TrustRank.
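To show what I mean by propagating trust through the link graph, here is a small Python sketch in the spirit of TrustRank, which is essentially PageRank with the random jump restricted to a set of trusted seed pages. This is my guess at the general shape of such a technique, not a description of what Ask actually does. In this sketch the seed set is simply given, where Ask would presumably derive it from clickstream data, and the function name and parameters are assumptions.

```python
def propagate_trust(links, seeds, damping=0.85, iterations=20):
    """Spread trust from seed pages along outgoing links.

    links: dict mapping each page to the list of pages it links to.
    seeds: non-empty set of pages assumed to be trustworthy.
    Returns a dict mapping every page to a trust score.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    seed_score = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_score)
    for _ in range(iterations):
        # Each round, seeds get a fresh injection of trust...
        updated = {p: (1.0 - damping) * seed_score[p] for p in pages}
        # ...and every page passes a damped share of its trust to its targets.
        for page, targets in links.items():
            if targets:
                share = damping * trust[page] / len(targets)
                for target in targets:
                    updated[target] += share
        trust = updated
    return trust
```

Pages reachable from a seed pick up some trust, with less the further away they are, while pages that no trusted page links to, directly or indirectly, stay at zero.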

It does not sound like Ask intends to use clickstream data to show different results to different people, to do personalized search, at least not in this particular project.

Tuesday, April 03, 2007

Creative destruction on a small scale

Findory has a few paragraphs in an article in the Wall Street Journal today, "When a Tech Start-Up's Dreams Turn Prosaic" (subscription required).

Excerpts on Findory:

The absence of many tech IPOs in recent years also means the financial fallout and public awareness of start-up disappointments will be more limited. Findory.com Inc. shows just how quiet and contained the disappointments can be.

Former Amazon.com [employee] Greg Linden launched the Web personalization start-up in January 2004. User traffic to the site, which recommended content to users based on the Web sites they visited, doubled each quarter for the first two years until it began to plateau in early 2006, increasing only about 5% quarterly. Mr. Linden couldn't persuade venture capitalists to fund the company so it could make a major marketing push to reach more consumers.

While he could have continued to finance the site himself with the help of ad revenue, he wasn't excited about the business opportunities such as licensing the Findory technology to other firms. "When you start a company around this Google-size vision, it's hard to be really passionate about switching to that," says Mr. Linden.

In January, he announced he would stop working on Findory and let the site run on autopilot for awhile before shutting it down. But repercussions are limited: Mr. Linden was the only Findory employee [by that point] ... "This is kind of creative destruction on a small scale," he says.

See also my Jan 2007 post, "Findory rides into the sunset".

See also my post, "Starting Findory: In the beginning", that discusses the motivation behind creating Findory.

Recap of recent posts

There has been a flurry of long posts on papers and lectures here in the last week. It might have been a bit overwhelming. I would not be surprised if your eyes glazed over on a Monday morning -- ugh, too much to read -- and the posts passed you by.

But, there is some really good stuff in there. In case you missed them, I wanted to highlight a couple key posts on a couple particularly interesting topics:

"Knowledge extraction from search queries" talks about a couple papers out of Google on extracting facts from the Web. Question answering -- correctly answering questions such as "How old is Larry Page?" -- is an important and promising path to improving web search. This Google work is particularly unusual in that they propose using query logs and the information in them to help with knowledge extraction and question answering.

"The end of federated search?" and "Google and the deep web" discuss Google's efforts to crawl the deep web, data normally hidden in private databases behind HTML forms. Deep web data would make web search more comprehensive and, because the data often is well structured, could be particularly useful for improving question answering. The key part of the Google work is that it rejects a common technique of accessing deep web data in real-time, instead proposing copying everyone else's data to Google's servers.

"More on data center in a trailer" talks about Microsoft's and others' efforts to factory-install thousands of computers in a shipping container and the efficiencies gained from that approach.

Sunday, April 01, 2007

Knowledge extraction from search queries

Googler Marius Pasca is first author on a couple recent papers on knowledge extraction from the Web.

"Organizing and Searching the World Wide Web of Facts - Step One: the One-Million Fact Extraction Challenge" (PDF) is "a first concrete step towards building large searchable repositories of factual knowledge" from the information scattered across the World Wide Web:

A particularly useful type of knowledge for Web search consists in binary relations associated to named entities ... e.g. the facts that "the capital of Australia is Canberra", or "Mozart was born in 1756", or "Apple Computer is headquartered in Cupertino".

A search engine with access to hundreds of millions of such Web-derived facts can answer directly fact-seeking queries, including fully-fledged questions and database-like queries (e.g., "companies headquartered in Mountain View"), rather than providing pointers to the most relevant documents that may contain the answers.

Moreover, for queries referring to named entities, which constitute a large portion of the most popular Web queries, the facts provide alternative views of the search results, e.g., by presenting the birth year and bestselling album for singers, or headquarters, name of CEO and stock symbol for companies, etc.

As described, this Google paper seems to have a lot in common with the KnowItAll project (which is cited) and other past attempts to do knowledge extraction from the Web.

Things get more interesting once we add in the second paper, "What You Seek is What You Get: Extraction of Class Attributes from Query Logs" (PDF). This paper focuses on using user behavior expressed in query logs for knowledge extraction. From the paper:

In a significant departure from previous approaches to large-scale information extraction, the target information (in this case, class attributes) is not mined from document collections. Instead, we explore the role of Web query logs, rather than documents, as an alternative source of class attributes.

To our knowledge this corresponds to the first endeavor in large-scale knowledge acquisition from query logs.

At first sight, choosing queries over documents as the source data may seem counterintuitive ... Indeed, common wisdom suggests that textual documents tend to assert information ... Comparatively, search queries can be thought of as ... approximations of often underspecified user information needs (interrogations).

However, users formulate their queries based on the common-sense knowledge that they already possess at the time of the search. Therefore, search queries play two roles simultaneously: in addition to requesting new information, they also indirectly convey knowledge in the process ... If knowledge is generally prominent or relevant, people will (eventually) ask about it.

The Web as a whole represents a huge repository of human knowledge ... Web search queries as a whole also mirror a significant amount of knowledge.

I love this idea of using the implicit information in people's behavior to assist with knowledge extraction. Searchers indirectly convey knowledge when they search. That knowledge is concealed within search histories and clickstreams.

Every day, millions seek knowledge on the Web. These actions, if we can understand them, can act as a guide to others, people helping people find the information they seek.

It appears Google will be expanding on both of these papers in an upcoming WWW 2007 paper, "Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds" (abstract). That paper will present a method for "weakly supervised extraction of class attributes (e.g., 'side effects' and 'generic equivalent' for drugs) from anonymized query logs" that has "accuracy levels significantly exceeding current state of the art."

Update: "Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds" (PDF) is now available and well worth reading. As the author says, "The most intriguing aspect of [search log] queries is ... their ability to indirectly capture human knowledge."

The costs of annoying advertising

A study in CACM, "The effects of online advertising", has data on what many have argued for some time, that pop-up and pop-under ads annoy and repel users:

Pop-ups were [rated] 24% more intrusive than in-line ads and pop-unders were 33.1% more intrusive than in-line ads.

Subjects who were not exposed to ads reported they were 11% more likely to return or recommend the site to others than those who were exposed to ads.

Subjects exposed to in-line ads remembered 3.4% more of the material in the site than those exposed to pop-ups.

Designers should realize the magnitude of ill effects caused by advertising ... Reducing the likelihood of a person's return by 11% might be a cost that is too great for a site host to bear.

I wonder if websites that continue to annoy with pop-ups and pop-unders are only measuring short-term gains (e.g. clicks on the pop-up ad) and not long-term costs (e.g. loss in retention, higher abandon rate, lower return rate).

Paper on large scale IR at IBM

It may have been the cute title that attracted me to this paper, "A funny thing happened on the way to a billion..." (PS), but it turns out to be a good read on "hard-won lessons" on supporting "queries on billions of documents" on the IBM Semantic Super Computing platform.

Some of the lessons are familiar to those working on a large scale cluster. For example, in Section 5.1, they talk about "what can go wrong when a query is sent out to 256 nodes", including that "some of the nodes will almost certainly not be online", "a few might be either hung or responding very slowly", "the network ... may become overwhelmed", and errors or performance issues related to configuration problems on the boxes.

Section 2.3 on sampling the data offers a useful suggestion. The IBM system indexed on a 128 bit random unique identifier for the documents, allowing them to quickly answer some types of aggregate queries after only examining a small sample of the documents. The authors write that "using an index that already provides a uniform and random distribution" allows "orders of magnitude" fewer disk seeks than alternative approaches that would sample from a sorted index. It is a good point that grouping data by a random UID before sampling turns what otherwise would be random disk accesses into sequential reads, substantially reducing disk seeks.
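Here is a toy Python sketch of that sampling trick as I understand it. It is a simplification under my own assumptions, not the IBM implementation: an in-memory list stands in for the on-disk index, and the function names are hypothetical. Because the documents are ordered by a random ID, any prefix of the index is already a uniform random sample, so a sequential read of the first few entries is enough for an estimate.

```python
import random


def build_index(documents, seed=0):
    """Give each document a random 128-bit ID and return the list sorted by ID."""
    rng = random.Random(seed)
    keyed = [(rng.getrandbits(128), doc) for doc in documents]
    keyed.sort(key=lambda pair: pair[0])
    return keyed


def estimate_count(index, predicate, sample_size):
    """Estimate how many documents match by scanning only a prefix of the index."""
    prefix = index[:sample_size]
    if not prefix:
        return 0.0
    hits = sum(1 for _, doc in prefix if predicate(doc))
    # Scale the fraction seen in the sample up to the full collection.
    return hits / len(prefix) * len(index)
```

The estimate is only approximate, of course, and its error shrinks as the prefix grows. The appeal is that reading a longer prefix is still one sequential scan rather than many scattered seeks.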

Other lessons focus more on the user experience of search. The authors write that "discovery is more than search" and argue for tools to help explore data and documents. They also write that "people are not going to learn something new unless they have to" and suggest that the tools should be easy to use at first, but provide power underneath that makes "complex things ... possible."

Several other lessons are in there as well, from issues with disk I/O to testing to cluster management.
