Saturday, February 27, 2010

Personalization and differential pricing

Google's Chief Economist Hal Varian has a new paper out, "Computer Mediated Transactions" (PDF). An excerpt of his predictions on personalization:
Instead of a "one size fits all" model, the web offers a "market of one" ... [powered by] suggestions of things to buy based on your previous purchases, or on purchases of customers like you.

Not only content, but prices may also be personalized, leading to various forms of differential pricing ... [But] the ability of firms to extract surplus [may be] quite limited when consumers are sophisticated ... [And] perfect price discrimination and free entry ... pushes profits to zero, conferring all benefits to the customers.

The same sort of personalization can occur in advertising ... Google and Yahoo ... [already] allow users to specify their areas of interest and then see ads related to those interests. It is also relatively common for advertisers ... to show ads based on previous responses of users to related ads.
Back in 2000, Amazon got slammed (e.g. [1]) for an experiment with differential pricing, but Hal appears to be predicting differential pricing will rise again.

The paper also talks briefly about how experimentation changes how companies make decisions ("when experiments are cheap, they are likely to provide more reliable answers than opinions"), data mining, online advertising, legal contracts that use computer monitoring to enforce their terms, and cloud computing. The paper is from the 2010 Ely Lecture at the American Economic Association and video of the talk is available.

Tuesday, February 23, 2010

How we all teach Google to Google

Steven Levy at Wired just posted an article, "How Google's Algorithm Rules the Web", with some fun details on how Google uses constant experimentation, logs of searches and clicks, and many small tweaks to keep improving their search results.

Well worth reading. Some excerpts as a teaser:
[Google Fellow Amit] Singhal notes that the engineers in Building 43 are exploiting ... the hundreds of millions who search on Google. The data people generate when they search -- what results they click on, what words they replace in the query when they're unsatisfied, how their queries match with their physical locations -- turns out to be an invaluable resource in discovering new signals and improving the relevance of results.

"On most Google queries, you're actually in multiple control or experimental groups simultaneously," says search quality engineer Patrick Riley. Then he corrects himself. "Essentially," he says, "all the queries are involved in some test." In other words, just about every time you search on Google, you're a lab rat.

This flexibility -- the ability to add signals, tweak the underlying code, and instantly test the results -- is why Googlers say they can withstand any competition from Bing or Twitter or Facebook. Indeed, in the last six months, Google has [found and] made more than 200 improvements.
Even so, this raises the question of where the point of diminishing returns is with more data and more users. While startups lack Google's heft, Yahoo and Bing are big enough that -- if they continuously experiment, tweak, and learn from their data as much as Google does -- search quality differences likely would be in an imperceptibly small chunk of long tail queries.
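
To make Riley's remark about being in several experimental groups at once more concrete, here is a minimal sketch of overlapping experiment assignment. It is not Google's implementation; the experiment names, the treatment percentages, and the hashing scheme are all assumptions for illustration. Hashing the user ID together with the experiment name gives each experiment its own repeatable split, so one searcher can be in treatment for one test and control for another.

```python
import hashlib


def bucket(user_id: str, experiment: str, num_buckets: int = 100) -> int:
    """Deterministically map a (user, experiment) pair to a bucket in [0, num_buckets)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def assign(user_id: str, experiments: dict) -> dict:
    """Return 'treatment' or 'control' for every experiment, independently of the others."""
    return {
        name: "treatment" if bucket(user_id, name) < percent else "control"
        for name, percent in experiments.items()
    }


# Hypothetical experiments mapped to the percentage of users who get the treatment.
EXPERIMENTS = {"snippet_length": 10, "freshness_boost": 50, "spelling_model": 5}

if __name__ == "__main__":
    # The same user always lands in the same groups, across all experiments at once.
    print(assign("user-12345", EXPERIMENTS))
```

Because the assignment is a pure function of the IDs, no per-user state needs to be stored to keep someone in the same group from one search to the next.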

Google Reader recommends articles

In a post on the official Google Reader blog, "May we recommend...", Laurence Gonsalves describes a new recommendation feature for Google Reader that recommends articles based on what you have read in the past. An excerpt:
Many of you wanted to see even more personalized recommendations ... [Now], we've started inserting items selected just for you inside the Recommended items section. This is great if you've got interests that are less mainstream. If you love Lego robots, for example, then you should start to notice more of them in your Recommended items.
Sadly, no additional details appear to be available. In my usage, there were rare gems in the recommendations, but a lot of randomness, and a strong bias toward very popular items. The lack of explanation -- why was this item recommended? -- and lack of a way to correct the recommendations likely will make people less forgiving of these problems. I also saw recommendations for items I had already read; items you have already seen always should be filtered from recommendations.
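
Filtering items the reader has already seen is cheap to do as a last step. A minimal sketch, assuming recommendations arrive as a ranked list of item IDs and the read history is available as a collection of IDs (both data shapes are my assumptions, not anything Google Reader has described):

```python
def filter_seen(recommendations, already_read):
    """Drop recommended item IDs the reader has already seen, preserving rank order."""
    seen = set(already_read)
    return [item for item in recommendations if item not in seen]


if __name__ == "__main__":
    ranked = ["lego-robots", "popular-story", "rss-tips", "lego-robots-2"]
    read = ["popular-story", "rss-tips"]
    print(filter_seen(ranked, read))  # ['lego-robots', 'lego-robots-2']
```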

For more on that, you might enjoy some of my previous posts on this topic, such as the Mar 2009 "What is a good recommendation algorithm?" and the much older Dec 2006 "The RSS beast".

Tuesday, February 02, 2010

New details on LinkedIn architecture

Googler Daniel Tunkelang recently wrote a post, "LinkedIn Search: A Look Beneath the Hood", that has slides from a talk by LinkedIn engineers along with some commentary on LinkedIn's search architecture.

What makes LinkedIn search so interesting is that the search does real-time updates (the "time between when user updates a profile and being able to find him/herself by that update need to be near-instantaneous"), faceted search (">100 OR clauses", "NOT support", complex boolean logic, some facets are hierarchical, some are dynamic over time), and personalized relevance ranking of search results (ordered by distance in your LinkedIn social graph).
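
As a rough illustration of ranking by distance in a social graph, the sketch below runs a breadth-first search from the searcher and sorts matching profiles by hop count. The tiny graph, the member names, and the function names are hypothetical, and this says nothing about how LinkedIn actually computes or stores distances at their scale.

```python
from collections import deque


def graph_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable member."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def rank_by_distance(graph, searcher, matches):
    """Order matching profiles by closeness to the searcher; unreachable members go last."""
    dist = graph_distances(graph, searcher)
    return sorted(matches, key=lambda member: dist.get(member, float("inf")))


if __name__ == "__main__":
    # Hypothetical undirected connection graph, stored as adjacency lists.
    graph = {
        "me": ["ann", "bob"],
        "ann": ["me", "cat"],
        "bob": ["me"],
        "cat": ["ann", "dan"],
        "dan": ["cat"],
        "eve": [],
    }
    print(rank_by_distance(graph, "me", ["dan", "eve", "ann", "cat"]))
    # ['ann', 'cat', 'dan', 'eve']
```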

LinkedIn appears to use a combination of aggressive partitioning, keeping data in-memory, and a lot of custom code (mostly modifications to Lucene, some of which have been released open source) to handle these challenges. One interesting tidbit is, going against current conventional wisdom, LinkedIn appears to only use caching minimally, preferring to spend their efforts and machine resources on making sure they can recompute results quickly rather than on hiding poor performance behind caching layers.

Friday, January 29, 2010

Gmail launches personalized ads

Google's popular mail service, Gmail, has launched advertising targeted not just to the particular e-mail message you are reading, but to other e-mails you might have read recently. An excerpt:
Sometimes, there aren't any good ads to match to a particular message. From now on, you'll sometimes see ads matched to another recent email instead.

For example, let's say you're looking at a message from a friend wishing you a happy birthday. If there aren't any good ads for birthdays, you might see the Chicago flight ads related to your last email instead.
It is a significant move toward personalized advertising and, as the Google post notes, is a big change for Google, as they previously "had specified that ads alongside an email were related only to the text of the current message." For example, here, Google says, "Ads and links to related pages only appear alongside the message that they are targeted to, and are only shown when the Google Mail user ... is viewing that particular message."
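
The fallback described in the excerpt can be sketched in a few lines. This is only an illustration under my own assumptions: that each candidate ad carries a relevance score, and that "good" means clearing a threshold. The scores, the threshold, and the function name are hypothetical.

```python
def pick_ads(current_ads, recent_ads, min_score=0.5):
    """Prefer ads matched to the current message; fall back to a recent email's ads.

    Each argument is a list of (ad, relevance_score) pairs.
    """
    good_current = [ad for ad, score in current_ads if score >= min_score]
    if good_current:
        return good_current
    return [ad for ad, score in recent_ads if score >= min_score]


if __name__ == "__main__":
    birthday_message_ads = [("party balloons", 0.2)]
    last_email_ads = [("Chicago flights", 0.9)]
    print(pick_ads(birthday_message_ads, last_email_ads))  # ['Chicago flights']
```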

For more on personalized ads that target not only the current content but also previously viewed content that has strong purchase intent, please see my July 2007 post, "What to advertise when there is no commercial intent?"

Wednesday, January 27, 2010

Yahoo on personalizing content and ads

Yahoo CEO Carol Bartz had a few tidbits on personalized relevance for content and advertising in the recent Yahoo Q4 2009 earnings call. Some excerpts:
We generate value ... [through] the vast amount of data we gather and use to deliver a better, more personal experience for users and a better, more targeted audience for our advertisers.

Since we began pairing our content optimization technology with editorial expertise we have seen click through rates in the Today module more than double ... We are making additional improvements to the technology that will make the user experience even more personally relevant.

Truth be told, no one has uncovered the holy grail of making advertising as relevant as content is 100% of the time. Beyond just offering advertisers a specific bucket, say women aged 35-45 and have children, we instead need to deliver many more specific attributes of scale. For example, women aged 35-45 with kids under three who are shopping for a minivan, and on and on and on and on. If we can do this we can create a better experience for both the user and the advertiser.

We have been letting great data about the consumers, data that is very attractive to advertisers fall to the floor ... We simply aren't even close to maximizing the value of our massive audience for advertisers.
Sounds like the goal is right, but the pace is slow. For more on that, please see also my June 2009 post, "Yahoo CEO Carol Bartz on personalization".

Sunday, January 24, 2010

Hybrid, not artificial, intelligence

Google VP Alfred Spector gave a talk last week at University of Washington Computer Science on "Research at Google". Archived video is available.

What was unusual about Al's talk was his focus on cooperation between computers and humans to allow both to solve harder problems than they might be able to otherwise.

Starting at 8:30 in the talk, Al describes this as a "virtuous cycle" of improvement using people's interactions with an application, allowing optimizations and features like learning to rank, personalization, and recommendations that might not be possible otherwise.

Later, around 33:20, he elaborates, saying we need "hybrid, not artificial, intelligence." Al explains, "It sure seems a lot easier ... when computers aren't trying to replace people but to help us in what we do. Seems like an easier problem .... [to] extend the capabilities of people."

Al goes on to say the most progress on very challenging problems (e.g. image recognition, voice-to-text, personalized education) will come from combining several independent, massive data sets with a feedback loop from people interacting with the system. It is an "increasingly fluid partnership between people and computation" that will help both solve problems neither could solve on their own.

This being a Google Research talk, there was much else covered, including the usual list of research papers out of Google, solicitation of students and faculty, pumping of Google as the best place to access big data and do research on big data, and a list of research challenges. The most interesting of the research challenges were robust, high performance, transparent data migration in response to load in massive clusters, ultra-low power computing (e.g. powered only by ambient light), personalized education where computers learn and model the needs of their students, and getting outside researchers access to the big data they need to help build hybrid, not artificial, intelligence.

Wednesday, January 20, 2010

Predictions for 2010

It's that time of year again. Many are making their predictions for the tech industry for 2010.

It's been a while since I played this game -- last time was my dark prediction for a dot-com crash in 2008 ([1] [2]) -- but I thought I'd try again this year.

I wrote up my predictions in a post over at blog@CACM, "What Will 2010 Bring?"

Because it is for the CACM, the predictions focus more on computing in general than on startups, recommendations, or search. And, they are phrased as questions rather than as predictions.

I think the answer to some of the questions I posed may be no. For example, I doubt tablets will succeed this time around, don't think enterprises will move to the public cloud as much as expected, and am not sure that personalized advertising will always be used to benefit consumers. I do think netbooks are a dead market, mobile devices will become standardized and more like computers, and that 2010 will see big advances in local search and augmented reality on mobile devices.

If you have any thoughts on these predictions or some of your own to add, please comment either here or at blog@CACM.

Update: Another prediction, not in that list, that might be worth including here, "Who Needs Massively Multi-Core?"

Monday, January 04, 2010

Lectures on Computational Advertising

Slides from all the lectures of Andrei Broder's recent Computational Advertising class at Stanford University now are available online. Andrei is a VP and Chief Scientist at Yahoo and leads their Advertising Technology Group.

Lecture 6 (PDF) is particularly interesting with its coverage of learning to rank. Lecture 8 (PDF) has a tidbit on behavioral advertising and using recommender systems for advertising, but it is very brief. The first few lectures are introductory; don't miss lecture 3 (PDF) if you are new to sponsored search and want a good dive into the issues and techniques.

Thursday, December 31, 2009

YouTube needs to entertain

Miguel Helft at the New York Times has a good article this morning, "YouTube's Quest to Suggest More", on how YouTube is trying "to give its users what they want, even when the users aren't quite sure what that is."

The article focuses on YouTube's "plans to rely more heavily on personalization and ties between users to refine recommendations" and "suggesting videos that users may want to watch based on what they have watched before, or on what others with similar tastes have enjoyed."
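
For readers who have not seen it before, suggesting videos based on "what others with similar tastes have enjoyed" can be sketched as a very simple form of collaborative filtering. This is a toy version under my own assumptions (watch histories as sets, similarity as raw overlap), not YouTube's method; the users and video IDs are made up.

```python
from collections import Counter


def recommend(history, user, top_n=3):
    """Suggest unwatched videos, weighted by how much each other viewer overlaps with the user.

    history maps a user ID to the set of video IDs that user has watched.
    """
    watched = history[user]
    scores = Counter()
    for other, videos in history.items():
        if other == user:
            continue
        overlap = len(watched & videos)  # crude taste similarity
        if overlap == 0:
            continue
        for video in videos - watched:  # never suggest what was already watched
            scores[video] += overlap
    return [video for video, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    history = {
        "alice": {"a", "b"},
        "bob": {"a", "b", "c"},
        "carol": {"b", "d"},
        "dave": {"e"},
    }
    print(recommend(history, "alice"))  # ['c', 'd']
```

Real systems need far better similarity measures and have to cope with popularity bias, but the shape of the computation is the same.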

What is striking about this is how little this has to do with search. As described in the article, what YouTube needs to do is entertain people who are bored but do not entirely know what they want. YouTube wants to get from users spending "15 minutes a day on the site" closer to the "five hours in front of the television." This is entertainment, not search. Passive discovery, playlists of content, deep classification hierarchies, well maintained catalogs, and recommendations of what to watch next will play a part; keyword search likely will play a lesser role.

And it gets back to the question of how different of a problem Google is taking on with YouTube. Google is about search, keyword advertising, and finding content other people own. YouTube is about entertainment, discovery, content advertising, and cataloging and managing content they control. While Google certainly has the talent to succeed in new areas, it seems they are only now realizing how different YouTube is.

If you are interested in more on this, please see my Oct 2006 post, "YouTube is not Googly". Also, for a little on the technical challenges behind YouTube recommendations and managing a video catalog, please see my earlier posts "Video recommendations on YouTube" and "YouTube cries out for item authority".

Monday, December 28, 2009

Most popular posts of 2009

In case you might have missed them, here is a selection of some of the most popular posts on this blog in the last year.
  1. Jeff Dean keynote at WSDM 2009
    Describes Google's architecture and computational power
  2. Put that database in memory
    Claims in-memory databases should be used more often
  3. How Google crawls the deep web
    How Google probes and crawls otherwise hidden databases on the Web
  4. Advice from Google on large distributed systems
    Extends the first post above with more of an emphasis on how Google builds software
  5. Details on Yahoo's distributed database
    A look at another large scale distributed database
  6. Book review: Introduction to Information Retrieval
    A detailed review of Manning et al.'s fantastic new book. Please see also a recent review of Search User Interfaces.
  7. Google server and data center details
    Even more on Google's architecture, this one focused on data center cost optimization
  8. Starting Findory: The end
    A summary of and links to my posts describing what I learned at my startup, Findory, over its five years.
Overall, according to Google Analytics, the blog had 377,921 page views and 233,464 unique visitors in 2009. It has about 10k regular readers subscribed to its feed. I hope everyone is finding it useful!

Wednesday, December 16, 2009

Toward an external brain

I have a post up on blog@CACM, "The Rise of the External Brain", on how search over the Web is achieving what classical AI could not, an external brain that supplements our intelligence, knowledge, and memories.

Tuesday, December 08, 2009

Personalized search for all at Google

As has been widely reported, Google is now personalizing web search results for everyone who uses Google, whether logged in or not.

Danny Sullivan at Search Engine Land has particularly good coverage. An excerpt:
Beginning today, Google will now personalize the search results of anyone who uses its search engine, regardless of whether they've opted-in to a previously existing personalization feature.

The short story is this. By watching what you click on in search results, Google can learn that you favor particular sites. For example, if you often search and click on links from Amazon that appear in Google's results, over time, Google learns that you really like Amazon. In reaction, it gives Amazon a ranking boost. That means you start seeing more Amazon listings, perhaps for searches where Amazon wasn't showing up before.

Searchers will have the ability to opt-out completely, and there are various protections designed to safeguard privacy. However, being opt-out rather than opt-in will likely raise some concerns.
There now appears to be a big push at Google for individualized targeting and personalization in search, advertising, and news. The company seems to be going full throttle, choosing personalization as the way forward to improve relevance and usefulness.

With only one generic relevance rank, Google has been finding it is increasingly difficult to improve search quality because not everyone agrees on how relevant a particular page is to a particular search. At some point, to get further improvements, Google has to customize relevance to each person's definition of relevance. When you do that, you have personalized search.
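
The site-level boost Danny Sullivan describes can be sketched as a re-ranking step on top of the generic ranking. This is a minimal illustration under assumptions of my own: the multiplicative boost, its 0.5 weight, and the use of the URL's host as the "site" are all hypothetical, and Google has not published its formula.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Use the host part of the URL as a stand-in for the site."""
    return urlparse(url).netloc


def personalize(results, clicked_urls, boost=0.5):
    """Re-rank (url, base_score) pairs, favoring sites this searcher has clicked on before."""
    clicks = Counter(site_of(url) for url in clicked_urls)
    total = sum(clicks.values())

    def score(item):
        url, base = item
        share = clicks[site_of(url)] / total if total else 0.0
        return base * (1.0 + boost * share)

    return sorted(results, key=score, reverse=True)


if __name__ == "__main__":
    results = [("http://example.org/review", 1.0), ("http://www.amazon.com/item", 0.9)]
    past_clicks = ["http://www.amazon.com/a", "http://www.amazon.com/b"]
    for url, _ in personalize(results, past_clicks):
        print(url)
```

With no click history the generic order is unchanged; with a history dominated by one site, that site's listings move up, which is the behavior the excerpt describes.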

For more on recent moves to personalize news and advertising at Google, please see my posts, "Google CEO on personalized news" and "Google AdWords now personalized".

Update: Two hours later, Danny Sullivan writes a second post, "Google's Personalized Results: The 'New Normal' That Deserves Extraordinary Attention", that also is well worth reading.

Thursday, December 03, 2009

Recrawling and keeping search results fresh

A paper by three Googlers, "Keeping a Search Engine Index Fresh: Risk and Optimality in Estimating Refresh Rates for Web Pages" (not available online), is one of several recent papers looking at "the cost of a page being stale versus the cost of [recrawling]."

The core idea here is that people care a lot about some changes to web pages and don't care about others, and search engines need to respond to that to make search results relevant.

Unfortunately, our Googlers punt on the really interesting problem here, determining the cost of a page being stale. They simply assume any page that is stale hurts relevance the same amount.

That clearly is not true. Not only do some pages appear more frequently than other pages in search results, but also some changes to pages matter more to people than others.

Getting at the cost of being stale is difficult, but a good start is "The Impact of Crawl Policy on Web Search Effectiveness" (PDF) recently presented at SIGIR 2009. It uses PageRank and in-degree as a rough estimate of what pages people will see and click on in search results, then explores the impact of recrawling pages people want more frequently.

But that still does not capture whether the change is something people care about. Is, for example, the change below the fold on the page, so less likely to be seen? Is the change correcting a typo or changing an advertisement? In general, what is the cost of showing stale information for this page?

"Resonance on the Web: Web Dynamics and Revisitation Patterns" (PDF), recently presented at CHI, starts to explore that question, looking at the relationship between web content change and how much people want to revisit the pages, as well as thinking about the question of what is an interesting content change.

As it turns out, news is something where change matters and people revisit frequently, and there have been several attempts to treat real-time content such as news differently in search results. One recent example is "Click-Through Prediction for News Queries" (PDF), presented at SIGIR 2009, that describes one method of trying to know when people will want to see news articles for a web search query.

But, rather than coming up with rules for when content from various federated sources should be shown, I wonder if we cannot find a simpler solution. All of these works strive toward the same goal, understanding when people care about change. Relevance depends on what we want, what we see, and what we notice. Search results need only to appear fresh.

Recrawling high PageRank pages is one attempt at making results appear fresh, since high PageRank means a page is more likely to be shown and noticed at the top of search results, but it clearly is a very rough approximation. What we really want to know is: Who will see a change? If people see it, will they notice? If they notice, will they care?
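
One way to think about those three questions is as a chain of probabilities multiplied into an expected cost of staleness, which can then order the recrawl queue. The sketch below is my own framing, not something from the papers above; the independence assumption, the field names, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    p_changed: float  # chance the page changed since the last crawl
    p_seen: float     # chance a searcher is shown the page
    p_noticed: float  # chance the searcher notices the change, given they see the page
    p_care: float     # chance the searcher cares, given they notice


def staleness_cost(page):
    """Expected cost of serving a stale copy, treating the factors as independent."""
    return page.p_changed * page.p_seen * page.p_noticed * page.p_care


def recrawl_order(pages):
    """Recrawl the pages with the highest expected staleness cost first."""
    return sorted(pages, key=staleness_cost, reverse=True)


if __name__ == "__main__":
    pages = [
        Page("footer-ad-rotation", 0.9, 0.5, 0.1, 0.1),
        Page("news-front-page", 0.9, 0.5, 0.8, 0.9),
        Page("static-reference", 0.01, 0.9, 0.9, 0.9),
    ]
    for page in recrawl_order(pages):
        print(page.url)
```

Note that a page that changes constantly but only in ways nobody notices ends up below a popular page that rarely changes, which PageRank or change frequency alone would not capture.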

Interestingly, people's actions tell us a lot about what they care about. Our wants and needs, where our attention lies, all live in our movements across the Web. If we listen carefully, these voices may speak.

For more on that, please see also my older posts, "Google toolbar data and the actual surfer model" and "Cheap eyetracking using mouse tracking".

Update: One month later, an experiment shows that new content on the Web can be generally available on Google search within 13 seconds.

Thursday, November 19, 2009

Continuous deployment at Facebook

E. Michael Maximilien has a post, "Extreme Agility at Facebook", on blog@CACM. The post reports on a talk at OOPSLA by Robert Johnson (Director of Engineering at Facebook) titled "Moving Fast at Scale".

Here is an interesting excerpt on very frequent deployment of software and how it reduces downtime:
Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and applied directly to parts of the infrastructure. The idea is to quickly find issues and their impacts on the rest of system and surely fixing any bugs that would result from these frequent small changes.

Second, there is limited QA (quality assurance) teams at Facebook but lots of peer review of code. Since the Facebook engineering team is relatively small, all team members are in frequent communications. The team uses various staging and deployment tools as well as strategies such as A/B testing, and gradual targeted geographic launches.

This has resulted in a site that has experienced, according to Robert, less than 3 hours of down time in the past three years.
For more on the benefits of deploying software very frequently, not just for Facebook but for many software companies, please see also my post on blog@CACM, "Frequent Releases Change Software Engineering".

Monday, November 16, 2009

Put that database in memory

An upcoming paper, "The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM" (PDF), makes some interesting new arguments for shifting most databases to serving entirely out of memory rather than off disk.

The paper looks at Facebook as an example and points out that, due to aggressive use of memcached and caches in MySQL, the memory they use already is about "75% of the total size of the data (excluding images)." They go on to argue that a system designed around in-memory storage with disk just used for archival purposes would be much simpler, more efficient, and faster. They also look at examples of smaller databases and note that, with servers getting to 64 GB of RAM and higher and most databases just a couple terabytes, it doesn't take that many servers to get everything in memory.
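
The back-of-the-envelope arithmetic is easy to check. In this sketch I take "a couple terabytes" to mean 2 TB and assume all 64 GB on each server is usable for data, both simplifications of mine; the helper name and its parameters are hypothetical.

```python
import math

GB = 1024 ** 3
TB = 1024 ** 4


def servers_needed(data_bytes, ram_per_server_bytes, replicas=1):
    """Servers required to hold every copy of the data entirely in RAM."""
    return math.ceil(data_bytes * replicas / ram_per_server_bytes)


if __name__ == "__main__":
    print(servers_needed(2 * TB, 64 * GB))              # 32
    print(servers_needed(2 * TB, 64 * GB, replicas=3))  # 96
```

Even with three in-memory copies for redundancy, that is under a hundred machines for this hypothetical database.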

An excerpt from the paper:
Developers are finding it increasingly difficult to scale disk-based systems to meet the needs of large-scale Web applications. Many people have proposed new approaches to disk-based storage as a solution to this problem; others have suggested replacing disks with flash memory devices.

In contrast, we believe that the solution is to shift the primary locus of online data from disk to random access memory, with disk relegated to a backup/archival role ... [With] all data ... in DRAM ... [we] can provide 100-1000x lower latency than disk-based systems and 100-1000x greater throughput .... [while] eliminating many of the scalability issues that sap developer productivity today.
One subtle but important point the paper makes is that the slow speed of current databases has made web applications both more complicated and more limited than they should be. From the paper:
Traditional applications expect and get latency significantly less than 5-10 μs ... Because of high data latency, Web applications typically cannot afford to make complex unpredictable explorations of their data, and this constrains the functionality they can provide. If Web applications are to replace traditional applications, as has been widely predicted, then they will need access to data with latency much closer to what traditional applications enjoy.

Random access with very low latency to very large datasets ... will not only simplify the development of existing applications, but they will also enable new applications that access large amounts of data more intensively than has ever been possible. One example is ... algorithms that must traverse large irregular graph structures, where the access patterns are ... unpredictable.
The authors point out that data access patterns currently need to be heavily optimized, carefully ordered, and must conservatively acquire extra data in case it is later needed, all things that mostly go away if you are using a database where access has microsecond latency.

While the authors do not go as far as to argue that memory-based databases are cheaper, they do argue that they are cost competitive, especially once developer time is taken into account. It seems to me that you could go a step further here and argue very low latency databases bring such large productivity gains to developers and benefits to application users that they are in fact cheaper, but the paper does not try to do that.

If you don't have time to read the paper, slides (PDF) from a talk by one of the authors are also available and are very quick to skim.

If you can't get enough of this topic, please see my older post, "Replication, caching, and partitioning", which argues that big caching layers, such as memcached, are overdone compared to having each database shard serve most data out of memory.

HT, James Hamilton, for first pointing to the RAMClouds slides.

Thursday, November 12, 2009

The reality of doing a startup

Paul Graham has a fantastic article up, "What Startups Are Really Like", with the results of what happened when he asked all the founders of the Y Combinator startups "what surprised them about starting a startup."

A brief excerpt summarizing the findings:
Unconsciously, everyone expects a startup to be like a job, and that explains most of the surprises. It explains why people are surprised how carefully you have to choose cofounders and how hard you have to work to maintain your relationship. You don't have to do that with coworkers. It explains why the ups and downs are surprisingly extreme. In a job there is much more damping. But it also explains why the good times are surprisingly good: most people can't imagine such freedom. As you go down the list, almost all the surprises are surprising in how much a startup differs from a job.
There are 19 surprises listed in the essay. Below are excerpts from some of them:
Be careful who you pick as a cofounder ... [and] work hard to maintain your relationship.

Startups take over your life ... [You will spend] every waking moment either working or thinking about [your] startup.

It's an emotional roller-coaster ... How low the lows can be ... [though] it can be fun ... [But] starting a startup is fun the way a survivalist training course would be fun, if you're into that sort of thing. Which is to say, not at all, if you're not.

Persistence is the key .... [but] mere determination, without flexibility ... may get you nothing.

You have to do lots of different things ... It's much more of a grind than glamorous.

When you let customers tell you what they're after, they will often reveal amazing details about what they find valuable as well what they're willing to pay for.

You can never tell what will work. You just have to do whatever seems best at each point.

Expect the worst with deals ... Deals fall through.

The degree to which feigning certitude impressed investors .... A lot of what startup founders do is just posturing. It works.

How much of a role luck plays and how much is outside of [your] control ... Having skill is valuable. So is being determined as all hell. But being lucky is the critical ingredient ... Founders who succeed quickly don't usually realize how lucky they were.
Definitely worth reading the entire article if you are at all considering a startup.

For my personal take on some surprises I hit, please see my earlier post on Starting Findory.

Tuesday, November 10, 2009

Scary data on Botnet activity

An amusingly titled paper to be presented at the CCS 2009 conference, "Your Botnet is My Botnet: Analysis of a Botnet Takeover" (PDF), contains some not-so-funny data on how sophisticated the hijacking of computers has become, the data the hijackers are able to collect, and the profits that fuel the development of more and more dangerous botnets.

Extended excerpts from the paper, focusing on the particularly scary bits:
We describe our experience in actively seizing control of the Torpig (a.k.a. Sinowal, or Anserin) botnet for ten days. Torpig ... has been described ... as "one of the most advanced pieces of crimeware ever created." ... The sophisticated techniques it uses to steal data from its victims, the complex network infrastructure it relies on, and the vast financial damage that it causes set Torpig apart from other threats.

Torpig has been distributed to its victims as part of Mebroot. Mebroot is a rootkit that takes control of a machine by replacing the system's Master Boot Record (MBR). This allows Mebroot to be executed at boot time, before the operating system is loaded, and to remain undetected by most anti-virus tools.

Victims are infected through drive-by-download attacks ... Web pages on legitimate but vulnerable web sites ... request JavaScript code ... [that] launches a number of exploits against the browser or some of its components, such as ActiveX controls and plugins. If any exploit is successful ... an installer ... injects a DLL into the file manager process (explorer.exe) ... [that] makes all subsequent actions appear as if they were performed by a legitimate system process ... loads a kernel driver that wraps the original disk driver (disk.sys) ... [and] then overwrite[s] the MBR of the machine with Mebroot.

Mebroot has no malicious capability per se. Instead, it provides a generic platform that other modules can leverage to perform their malicious actions ... Immediately after the initial reboot ... [and] in two-hour intervals ... Mebroot contacts the Mebroot C&C server to obtain malicious modules ... All communication ... is encrypted.

The Torpig malware ... injects ... DLLs into ... the Service Control Manager (services.exe), the file manager, and 29 other popular applications, such as web browsers (e.g., Microsoft Internet Explorer, Firefox, Opera), FTP clients (CuteFTP, LeechFTP), email clients (e.g., Thunderbird, Outlook, Eudora), instant messengers (e.g., Skype, ICQ), and system programs (e.g., the command line interpreter cmd.exe). After the injection, Torpig can inspect all the data handled by these programs and identify and store interesting pieces of information, such as credentials for online accounts and stored passwords. ... Every twenty minutes ... Torpig ... upload[s] the data stolen.

Torpig uses phishing attacks to actively elicit additional, sensitive information from its victims, which, otherwise, may not be observed during the passive monitoring it normally performs ... Whenever the infected machine visits one of the domains specified in the configuration file (typically, a banking web site), Torpig ... injects ... an HTML form that asks the user for sensitive information, for example, credit card numbers and social security numbers. These phishing attacks are very difficult to detect, even for attentive users. In fact, the injected content carefully reproduces the style and look-and-feel of the target web site. Furthermore, the injection mechanism defies all phishing indicators included in modern browsers. For example, the SSL configuration appears correct, and so does the URL displayed in the address bar.

Consistent with the past few years' shift of malware from a for-fun (or notoriety) activity to a for-profit enterprise, Torpig is specifically crafted to obtain information that can be readily monetized in the underground market. Financial information, such as bank accounts and credit card numbers, is particularly sought after. In ten days, Torpig obtained the credentials of 8,310 accounts at 410 different institutions ... 1,660 unique credit and debit card numbers .... 297,962 unique credentials (username and password pairs) .... [in] information that was sent by more than 180 thousand infected machines.
The paper estimates the value of the data collected by this sophisticated piece of malware to be between $3M and $300M per year on the black market.

[Paper found via Bruce Schneier]

Saturday, November 07, 2009

Starting Findory: The end

This is the end of my Starting Findory series.

Findory was my first startup and a nearly five-year effort. Its goal of personalizing information was almost laughably ambitious, a joy to pursue, and I learned much.

I learned that cheap is good, but too cheap is bad. It does little good to avoid burning too fast only to starve yourself of what you need.

I re-learned the importance of a team, one that balances the weaknesses of some with the strengths of others. As fun as learning new things might be, trying to do too much yourself costs the startup too much time in silly errors born of inexperience.

I learned the necessity of good advisors, especially angels and lawyers. A startup needs people who can provide expertise, credibility, and connections. You need advocates to help you.

And, I learned much more, some of which is detailed in the other posts in the Starting Findory series:
  1. The series
  2. In the beginning
  3. On the cheap
  4. Legal goo
  5. Launch early and often
  6. Startups are hard
  7. Talking to the press
  8. Customer feedback
  9. Marketing
  10. The team
  11. Infrastructure and scaling
  12. Hardware go boom
  13. Funding
  14. Acquisition talks
  15. The end
I hope you enjoyed these posts about my experience trying to build a startup. If you did like this Starting Findory series, you might also be interested in my Early Amazon posts. They were quite popular a few years ago.

Wednesday, November 04, 2009

Using only experts for recommendations

A recent paper from SIGIR, "The Wisdom of the Few: A Collaborative Filtering Approach Based on Expert Opinions from the Web" (PDF), has a very useful exploration into the effectiveness of recommendations using only a small pool of trusted experts.

The results suggest that using a small pool of a couple hundred experts, possibly your own experts or experts selected and mined from the web, has quite a bit of value, especially in cases where big data from a large community is unavailable.

A brief excerpt from the paper:
Recommending items to users based on expert opinions .... addresses some of the shortcomings of traditional CF: data sparsity, scalability, noise in user feedback, privacy, and the cold-start problem .... [Our] method's performance is comparable to traditional CF algorithms, even when using an extremely small expert set .... [of] 169 experts.

Our approach requires obtaining a set of ... experts ... [We] crawled the Rotten Tomatoes web site -- which aggregates the opinions of movie critics from various media sources -- to obtain expert ratings of the movies in the Netflix data set.
The authors certainly do not claim that using a small pool of experts is better than traditional collaborative filtering.

What they do say is that using a very small pool of experts works surprisingly well. In particular, I think it suggests a good alternative to content-based methods for bootstrapping a recommender system. If you can create a high quality pool of experts, even a fairly small one, you may have good results starting with that while you work to gather ratings from the broader community.
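
To make the idea concrete, here is a minimal sketch of expert-based collaborative filtering in Python. It is my own illustration, not the paper's code: the function names, the mean-centered cosine similarity, and the `min_sim` and `min_experts` parameters are all assumptions chosen for simplicity. The user's rating for an unseen item is predicted from the ratings of similar experts only, without any other users' data.

```python
from math import sqrt


def mean(ratings):
    """Average rating in a {item: rating} dict."""
    return sum(ratings.values()) / len(ratings)


def similarity(user, expert):
    """Mean-centered cosine similarity over items both have rated.

    An assumed similarity measure for illustration; other measures would work.
    """
    shared = set(user) & set(expert)
    if not shared:
        return 0.0
    user_mean, expert_mean = mean(user), mean(expert)
    dot = sum((user[i] - user_mean) * (expert[i] - expert_mean) for i in shared)
    user_norm = sqrt(sum((user[i] - user_mean) ** 2 for i in shared))
    expert_norm = sqrt(sum((expert[i] - expert_mean) ** 2 for i in shared))
    if user_norm == 0 or expert_norm == 0:
        return 0.0
    return dot / (user_norm * expert_norm)


def predict(user, experts, item, min_sim=0.0, min_experts=1):
    """Predict the user's rating for `item` from similar experts.

    Returns None when fewer than `min_experts` sufficiently similar
    experts have rated the item (hypothetical thresholds).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    used = 0
    for expert in experts:
        if item not in expert:
            continue
        sim = similarity(user, expert)
        if sim <= min_sim:
            continue
        # Weight each expert's deviation from their own average rating.
        weighted_sum += sim * (expert[item] - mean(expert))
        weight_total += sim
        used += 1
    if used < min_experts or weight_total == 0:
        return None
    return mean(user) + weighted_sum / weight_total


# Made-up ratings: one user and two critics, on a 1-5 scale.
user = {"Movie A": 5, "Movie B": 1}
experts = [
    {"Movie A": 5, "Movie B": 1, "Movie C": 5},  # agrees with the user
    {"Movie A": 1, "Movie B": 5, "Movie C": 1},  # disagrees, so ignored
]
print(predict(user, experts, "Movie C"))
```

Because only the expert ratings and the one user's own ratings are needed, a prediction like this could be computed without sharing the user's data with a large community, which is part of the privacy and cold-start appeal the excerpt mentions.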