Wednesday, March 24, 2010

Asking questions of your social network

Google recently acquired the startup Aardvark for $50 million. The idea behind Aardvark is to provide a way to ask complicated or subjective questions of your friends and colleagues.

Two papers to be published at upcoming conferences provide useful details on this idea. The first is actually by two members of the Aardvark team -- co-founder and CTO Damon Horowitz and ex-Googler and Aardvark advisor Sep Kamvar -- and will appear at the upcoming WWW 2010 conference. The paper is called "The Anatomy of a Large-Scale Social Search Engine" (PDF). An excerpt:
With Aardvark, users ask a question, either by instant message, email, web input, text message, or voice. Aardvark then routes the question to the person in the user's extended social network most likely to be able to answer that question.

Aardvark queries tend to be long, highly contextualized, and subjective -- in short, they tend to be the types of queries that are not well-serviced by traditional search engines. We also find that the vast majority of questions get answered promptly and satisfactorily, and that users are surprisingly active, both in asking and answering.
The paper is well written and convincing, establishing that the idea works reasonably well for a small group (50k users) of enthusiastic early adopters. It does not answer whether the idea will work at large scale with a less motivated, lower quality mainstream audience. Nor does it provide the data needed to evaluate the common criticism of asking questions of your social network: that, at large scale, a flood of often irrelevant incoming questions creates too much pain for too little benefit.
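
The heart of Aardvark is that routing step: scoring everyone in the asker's extended network on how likely they are to give a good, timely answer to this particular question. Here is a minimal sketch of that kind of routing rank in Python. The fields and weights are my own illustrative assumptions, not the model from the paper.

```python
# Minimal sketch of Aardvark-style question routing: rank candidate
# answerers by topic affinity, social connectedness, and availability.
# The scoring weights and data shapes are illustrative assumptions,
# not the model from the paper.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    topics: dict           # topic -> expertise score in [0, 1]
    connectedness: float   # social proximity to the asker, in [0, 1]
    available: bool        # e.g., currently online on IM

def route_question(question_topics, candidates, top_n=3):
    """Return the best candidates to forward the question to."""
    def score(c):
        if not c.available:
            return 0.0
        # Topic match: how well the candidate's expertise covers the
        # topics extracted from the question.
        topic_match = sum(c.topics.get(t, 0.0) for t in question_topics)
        topic_match /= max(len(question_topics), 1)
        # Blend topic match with social proximity (weights are arbitrary).
        return 0.7 * topic_match + 0.3 * c.connectedness

    ranked = sorted(candidates, key=score, reverse=True)
    return [c.name for c in ranked if score(c) > 0][:top_n]

friends = [
    Candidate("alice", {"restaurants": 0.9, "seattle": 0.8}, connectedness=0.6, available=True),
    Candidate("bob",   {"hiking": 0.7},                      connectedness=0.9, available=True),
    Candidate("carol", {"restaurants": 0.5},                 connectedness=0.4, available=False),
]

print(route_question({"restaurants", "seattle"}, friends))
# ['alice', 'bob'] -- alice wins on topic match, carol is dropped as unavailable
```

Keeping scores like these meaningful when the audience is less active and less expert is exactly the scale question the paper leaves open.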

To get a bit more illumination on that question, turn to another upcoming paper, this one out of Microsoft Research and to be published at CHI 2010. The paper is "What Do People Ask Their Social Networks, and Why? A Survey Study of Status Message Q&A Behavior" (PDF). Some excerpts:
50.6% ... used their status messages to ask a question .... [on] sites like Facebook and Twitter .... 24.3% received a response in 30 minutes or less, 42.8% in one hour or less, and 90.1% within one day .... 69.3% ... who received responses reported they found the responses helpful.

The most common reason to search socially, rather than through a search engine, was that participants had more trust in the responses provided by their friends [24.8%]. A belief that social networks were better than search engines for subjective questions, such as seeking opinions or recommendations, was also a common explanation [21.5%].

The most common motivation given for responding to a question was altruism [37.0%]. Expertise was the next biggest factor [31.9%], with respondents being motivated because they felt they had special knowledge of the topic ... Nature of the relationship with the asker was an important motivator [13.7%], with closer friends more likely to get answers ... [as well as] the desire to connect socially [13.5%] ... free time [12.3%] ... [and] earning social capital [10.5%].

Many indicated they would prefer a face-to-face or personal request, and ignored questions directed broadly to the network-at-large .... [But] participants enjoyed the fun and social aspects of posing questions to their networks.
The key insight in the CHI paper is that people view asking questions of their social network as fun, entertaining, part of building relationships, and a form of gift exchange. The Aardvark paper focuses on a topical relevance rank over your social network, but maintaining relevance is going to be difficult at large scale with an unmotivated, lower quality, mainstream audience. The CHI paper might offer a path forward, suggesting we instead focus on game playing, entertainment, and the social rewards people enjoy when answering questions from their network.

Tuesday, March 23, 2010

Security advice is wrong

An insightful, funny, and thought-provoking paper by Cormac Herley at Microsoft Research, "So Long, And No Thanks for the Externalities: The Rational Rejection of Security Advice by Users" (PDF), looks at why people ignore security advice.

The surprising conclusion is that some of the security advice we give people -- such as inspect URLs carefully, pay attention to https certificate warnings, and use complicated passwords that change frequently -- does more harm than good. Following the advice costs a person far more than the benefit that person can expect to get from it.

Extended excerpts from the paper:
It is often suggested that users are hopelessly lazy and unmotivated on security ... [This] is entirely rational from an economic perspective ... Most security advice simply offers a poor cost-benefit tradeoff to users and is rejected.

[Security] advice offers to shield [people] from the direct costs of attacks, but burdens them with increased indirect costs ... Since victimization is rare, and imposes a one-time cost, while security advice applies to everyone and is an ongoing cost, the burden ends up being larger than that caused by the ill it addresses.

To make this concrete, consider an exploit that affects 1% of users annually, and they waste 10 hours clearing up when they become victims. Any security advice should place a daily burden of no more than 10/(365 * 100) hours or 0.98 seconds per user in order to reduce rather than increase the amount of user time consumed. This generated the profound irony that much security advice ... does more harm than good.

We estimate US annual phishing losses at $60 million ... Even for minimum wage any advice that consumes more than ... 2.6 minutes per year to follow is unprofitable [for users] from a cost benefit point of view ... Banks [also] have more to fear from ... indirect losses such as support costs ... than direct losses. For example ... an agent-assisted password reset by 10% of their users would cost $48 million, easily dwarfing Wells Fargo's share of the overall $60 million in phishing losses.

Users are effectively trained to ignore certificate warnings by seeing them repeatedly when there is no real security threat .... As far as we can determine, there is no evidence of a single user being saved from harm by a certificate error, anywhere, ever ... The idea that certificate errors are a useful tool in protecting [users] from harm ... is entirely abstract and not evidence-based. The effort we ask of [users] is real, while the harm we warn them of is theoretical.

Advice almost always ignores the cost of user effort ... The main reason security advice is ignored is that it makes an enormous miscalculation: it treats as free a resource that is actually worth $2.6 billion an hour ... Advice-givers and policy-mandaters demand far more effort than any user can give .... User education is a cost borne by the whole population, while offering benefit only to the fraction that fall victim ... The cost of any security advice should be in proportion to the victimization rate .... [lest] in trying to defend against everything we end up defending nothing.

It is not users who need to be better educated on the risks of various attacks, but the security community. Security advice simply offers a bad cost-benefit tradeoff to users .... We must respect users' time and effort. Viewing the user's time as worth $2.6 billion an hour is a better starting point than valuing it at zero ... When we exaggerate all dangers we simply train users to ignore us.
The paper also has a great discussion of password policies and how they appear to be counter-productive. When system administrators require passwords with weird special characters that must be changed regularly, they make passwords difficult to remember and impose a substantial burden on users, but the benefit of the policy appears to be minimal.
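
To make the break-even arithmetic in those excerpts concrete, here it is as a short script. The 180 million online users and $7.25/hour minimum wage are my own round assumptions for illustration, which is why the second answer lands near, but not exactly on, the paper's 2.6 minutes.

```python
# Back-of-the-envelope version of the paper's break-even arithmetic.
# The user-population and wage figures below are illustrative assumptions,
# so the results land near, but not exactly on, the paper's numbers.

SECONDS_PER_HOUR = 3600

# Example 1: an exploit hits 1% of users per year, costing each victim
# 10 hours of cleanup. Expected cost per user per year:
expected_hours_per_user = 0.01 * 10
daily_budget_seconds = expected_hours_per_user / 365 * SECONDS_PER_HOUR
print(f"Advice must cost under {daily_budget_seconds:.2f} seconds/user/day")  # ~0.99

# Example 2: $60M in annual US phishing losses spread over ~180M online
# users, valuing user time at the 2009 federal minimum wage of $7.25/hour.
annual_losses = 60e6
online_users = 180e6   # assumption
hourly_wage = 7.25     # assumption
loss_per_user = annual_losses / online_users
yearly_budget_minutes = loss_per_user / hourly_wage * 60
print(f"Anti-phishing advice must cost under {yearly_budget_minutes:.1f} minutes/user/year")  # ~2.8
```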

[Paper found via Bruce Schneier]

Sunday, March 21, 2010

World of Goo on Google

This is too funny not to post. The clever and tremendously fun puzzle game World of Goo, in one of its last levels, takes a hilarious jab at search engines, personalized search, and personalized advertising.

An excerpt from the dialogue with "MOM ... [the] Automated Search Engine Companion":
What would you like to ask MOM today?

Do I smell cookies? ... Yes! I am baking your personal information into each one.

What personal information is in your cookies? ... Cookies may contain:
your location
complete search history
whipped soy product
online purchases
medical records
telephone logs
web logs
email logs
streaming video from your current location
and more!
That's MOM's special recipe for offering free, convenient, relevant information to valued users like you.

Is my personal information safe with you? ... Your personal information is stored in a secure database and will never be shared with anyone*
* Unless they ask.
** Or if someone says they are you and takes your cookies
*** Or if the venture firm finds out it's profitable.
**** Or if unhappy employees release copies of your cookies to other online databases.
***** Or if outsourced data centers sell illegal copies of your cookies to other sites.
****** Or if my parent corporation is acquired and my data including your cookies becomes part of a larger aggregate system without your knowledge or consent.

Delete my cookies ... Are you sure? ... Yes ... Your cookies have been deleted*.
* Cookies may not actually be deleted.
** Cookies may be stored indefinitely for evaluation and training purposes to better serve you.
*** MOM knows best.

Everyone loves receiving special offers from MOM and the MOM's affiliate network of adver-bots.
Video is available as well as a full transcript (search for [Conversation with MOM transcript]).

Tuesday, March 16, 2010

Designing search for re-finding

A WSDM 2010 paper out of Microsoft Research, "Large Scale Query Log Analysis of Re-finding" (PDF), is notable not so much for its statistics on how often people search for what they have searched for before as for its fascinating list of suggestions (in Section 6, "Design Implications") for what search engines should do to support re-finding.

An extended excerpt from that section:
The most obvious way that a search tool can improve the user experience given the prevalence of re-finding is for the tool to explicitly remember and expose that user's search history.

Certain aspects of a person's history may be more useful to expose ... For example, results that are re-found often may be worth highlighting ... The result that is clicked first ... is more likely to be useful later, and thus should be emphasized, while results that are clicked in the middle may be worth forgetting ... to reduce clutter. Results found at the end of a query session are more likely to be re-found.

The query used to re-find a URL is often better than the query used initially to find it ... [because of] how the person has come to understand this result. [Emphasize] re-finding queries ... in the history ... The previous query may even be worth forgetting to reduce clutter.

When exposing previously found results, it is sometimes useful to label or name those results, particularly when those results are exposed as a set. Re-finding queries may make useful labels. A Web browser could even take these bookmark queries and make them into real bookmarks.

A previously found result ... may be what the person is looking for ... even when the result [normally] is not going to be returned ... For example, [if] the user's current query is a substring of a previous query, the search engine may want to suggest the results from the history that were clicked from the longer query. In contrast, queries that overlap with but are longer than previous queries may be intended to find new results.

[An] identical search [is] highly predictive of a repeat click ... [We] can treat the result specially and, for example, [take] additional screen real estate to try to meet the user's information need with that result ... [with] deep functionality like common paths and uses in [an expanded] snippet. For results that are re-found across sessions, it may make sense instead to provide the user with deep links to [some] new avenues within the result to explore.

At the beginning of a session, when people are more likely to be picking up a previous task, a search engine should provide access into history. In the middle of the session ... focus on providing access to new information or new ways to explore previously viewed results. At the end of a session ... suggest storing any valuable information that has been found for future use.
Great advice from Jaime Teevan at Microsoft Research. For more on this, please see my earlier post, "People often repeat web searches", which summarizes a 2007 paper by Teevan and others on the prevalence of re-finding.
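
To make one of these suggestions concrete, here is a toy sketch of the substring heuristic: if the current query is contained in an earlier, longer query, offer up the results that were clicked from that longer query. The in-memory history and naive matching are my own simplifications, not how the paper proposes implementing it.

```python
# Toy sketch of one design implication from the paper: when the current
# query is a substring of an earlier, longer query, suggest the results
# the user clicked from that earlier query. The in-memory history and
# simple matching are illustrative, not a production implementation.

from collections import defaultdict

class RefindingHistory:
    def __init__(self):
        self.clicks = defaultdict(list)   # query -> list of clicked URLs

    def record_click(self, query, url):
        self.clicks[query.lower()].append(url)

    def suggestions(self, current_query):
        """URLs clicked from past queries that contain the current query."""
        q = current_query.lower()
        suggested = []
        for past_query, urls in self.clicks.items():
            if q != past_query and q in past_query:
                suggested.extend(urls)
        return suggested

history = RefindingHistory()
history.record_click("jaime teevan refinding wsdm 2010", "http://example.com/refinding-paper")
print(history.suggestions("teevan refinding"))   # re-finding with a shorter query
```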

Sunday, March 14, 2010

GFS and its evolution

A fascinating article in the latest issue of CACM, "GFS: Evolution on Fast-Forward", interviews Googler Sean Quinlan and exposes the problems Google has had with the legendary Google File System as the company has grown.

Some key excerpts:
The decision to build the original GFS around [a] single master really helped get something out the door ... more rapidly .... [But] problems started to occur ... going from a few hundred terabytes up to petabytes, and then up to tens of petabytes ... [because of] the amount of metadata the master had to maintain.

[Also] when you have thousands of clients all talking to the same master at the same time ... the average client isn't able to command all that many operations per second. There are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests and the master would be under a fair amount of duress.

64MB [was] the standard chunk size ... As the application mix changed over time, however, ways had to be found to let the system deal efficiently with large numbers of files [of] far less than 64MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all those [small] files made on the centralized master .... There are only a finite number of files you can accommodate before the master runs out of memory.

Many times, the most natural design for some application just wouldn't fit into GFS -- even though at first glance you would think the file count would be perfectly acceptable, it would turn out to be a problem .... BigTable ... [is] one potential remedy ... [but] I'd say that the people who have been using BigTable purely to deal with the file-count problem probably haven't been terribly happy.

The GFS design model from the get-go was all about achieving throughput, not about the latency at which it might be achieved .... Generally speaking, a hiccup on the order of a minute over the course of an hour-long batch job doesn't really show up. If you are working on Gmail, however, and you're trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up. We had similar issues with our master failover. Initially, GFS had no provision for automatic master failure. It was a manual process ... Our initial [automated] master-failover implementation required on the order of minutes .... Trying to build an interactive database on top of a file system designed from the start to support more batch-oriented operations has certainly proved to be a pain point.

They basically try to hide that latency since they know the system underneath isn't really all that great. The guys who built Gmail went to a multihomed model, so if one instance of your GMail account got stuck, you would basically just get moved to another data center ... That capability was needed ... [both] to ensure availability ... [and] to hide the GFS [latency] problems.

The model in GFS is that the client just continues to push the write until it succeeds. If the client ends up crashing in the middle of an operation, things are left in a bit of an indeterminate state ... RecordAppend does not offer any replay protection either. You could end up getting the data multiple times in a file. There were even situations where you could get the data in a different order ... [and then] discover the records in different orders depending on which chunks you happened to be reading ... At the time, it must have seemed like a good idea, but in retrospect I think the consensus is that it proved to be more painful than it was worth. It just doesn't meet the expectations people have of a file system, so they end up getting surprised. Then they had to figure out work-arounds.
Interesting to see the warts of Google File System and Bigtable exposed like this. I remember being surprised, when reading the Bigtable paper, that Bigtable was layered on top of GFS. Those early decisions to use a file system designed for logs and batch processing of logs as the foundation for Google's interactive databases appear to have caused a lot of pain and workarounds over the years.
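
To see the file-count problem in the interview a bit more concretely, here is a back-of-the-envelope calculation of master memory. The roughly 100 bytes of metadata per file and per chunk is my own round assumption for illustration, not a figure from the article.

```python
# Rough illustration of why lots of small files hurt a single GFS master:
# master memory grows with the number of files and chunks, not with the
# bytes stored. The ~100 bytes of metadata per file and per chunk is a
# round illustrative assumption, not a figure from the article.

PB = 2**50
MB = 2**20
METADATA_BYTES = 100   # assumed, per file and per chunk

def master_memory_gb(total_bytes, avg_file_size, chunk_size=64 * MB):
    files = total_bytes // avg_file_size
    # Every file occupies at least one chunk, even if it is tiny.
    chunks_per_file = max(1, -(-avg_file_size // chunk_size))  # ceiling division
    chunks = files * chunks_per_file
    return (files + chunks) * METADATA_BYTES / 2**30

# 10 PB stored as large, 1 GB log-style files: modest metadata.
print(f"{master_memory_gb(10 * PB, 2**30):.1f} GB of master metadata")
# The same 10 PB stored as 1 MB Gmail-style files: the master runs out of memory.
print(f"{master_memory_gb(10 * PB, 1 * MB):.1f} GB of master metadata")
```

The second case needs on the order of terabytes of master RAM for the same number of bytes stored, which is, in effect, the file-count problem the interview describes.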

On a related topic, a recent paper out of Google, "Using a Market Economy to Provision Compute Resources Across Planet-wide Clusters" (PDF), looks at another problem Google faces: prioritizing all the MapReduce batch jobs and maximizing utilization of its clusters. The paper only describes a test of one promising solution -- auctioning off cluster time to give developers an incentive to move their jobs to off-peak times and idle compute resources -- but it is still an interesting read.
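
To picture the mechanism, here is a toy sketch of auction-style scheduling: jobs bid for capacity in each time window, contested peak windows clear at a higher price, and flexible batch jobs get nudged toward cheap off-peak slots. This is entirely illustrative and not the actual mechanism from the paper.

```python
# Toy sketch of auction-style cluster scheduling: jobs bid for capacity in
# hourly windows, and each window clears at the price of the last accepted
# bid, so contended peak hours get expensive and flexible batch jobs are
# nudged toward cheap off-peak windows. Entirely illustrative; not the
# mechanism from the paper.

def clear_window(bids, capacity):
    """bids: list of (job, bid_price, slots). Returns winners and clearing price."""
    accepted, used, clearing_price = [], 0, 0.0
    for job, price, slots in sorted(bids, key=lambda b: b[1], reverse=True):
        if used + slots > capacity:
            break
        accepted.append(job)
        used += slots
        clearing_price = price
    return accepted, clearing_price

peak_bids = [("web-index", 10.0, 600), ("ads-logs", 4.0, 300), ("ml-train", 3.0, 500)]
winners, price = clear_window(peak_bids, capacity=1000)
print(winners, price)   # ml-train is priced out of the peak window
```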

Update: A year later, rumor has it that a new version of GFS (called Colossus, aka GFS2) resolves the problems with Bigtable I described here. Quoting: "[Bigtable] is [now] underpinned by Colossus, the distributed storage platform also known as GFS2. The original Google File System ... didn't scale as well as the company would like. Colossus is specifically designed for BigTable, and for this reason it's not as suited to 'general use' as GFS was."

Thursday, March 11, 2010

The Onion on Google's data

The Onion has a hilarious article, "Google Responds To Privacy Concerns With Unsettlingly Specific Apology", that should be enjoyable for this crowd. An excerpt as a teaser:
Acknowledging that Google hasn't always been open about how it mines the roughly 800 terabytes of personal data it has gathered since 1998, [CEO Eric] Schmidt apologized to users -- particularly the 1,237,948 who take daily medication to combat anxiety -- for causing any unnecessary distress, and he expressed regret -- especially to Patricia Fort, a single mother taking care of Jordan, Sam, and Rebecca, ages 3, 7, and 9 -- for not doing more to ensure that private information remains private.

Monday's apology comes after the controversial launch of Google Buzz, a social networking platform that publicly linked Gmail users to their most e-mailed contacts by default.

"I'd like nothing more than to apologize in person to everyone we've let down, but as you can see, many of our users are rarely home at this hour," said Google cofounder and president Sergey Brin, pointing to several Google Map street-view shots of empty bedroom and living room windows on a projection screen behind him. "And, if last night's searches are any indication, Boston's Robert Hornick is probably out shopping right now for the spaghetti and clam sauce he'll be cooking tonight ... Either that, or hunting down that blond coworker of his, Samantha, whose Picasa photos he stares at every night."
[Article found via Bruce Schneier]