Saturday, September 18, 2010

Eric Schmidt on automatic search

Google CEO Eric Schmidt talks up automatic search from mobile devices:
Ultimately, search is not just the web but literally all of your information - your email, the things you care about, with your permission - this is personal search, for you and only for you.

The next step of search is doing this automatically. When I walk down the street, I want my smartphone to be doing searches constantly - 'did you know?', 'did you know?', 'did you know?', 'did you know?'.

This notion of autonomous search - to tell me things I didn't know but am probably interested in, is the next great stage - in my view - of search.
While I agree with the idea of heavily localized and personalized searches, especially on mobile devices, I think this autonomous search feature sounds really annoying. You don't want to get in people's way. You don't want to interrupt them with something unimportant, especially if you are interrupting someone who is trying to get something done.

Perhaps what might be desirable would be better described as recommendations and personalized advertising, not as some Googly version of Clippy popping up and chirping, "Did you know? Did you know?"

Update: Interesting discussion in the comments about whether what Google is building is really personalized advertising, not search.

Causing internal competition and low morale through compensation policy

Over at Mini-Microsoft, Microsoft employees are listing the details of their compensation changes after their performance reviews.

Reading through them, it is pretty clear that almost everyone is unhappy, both with their reviews and with their relative gains. Which is exactly what you would expect.

This is an instructive example of how forced rank, and fine-grained compensation adjustments based on forced rank, hurt morale and end up pitting people inside a company against each other.

What you want in an organization is people focused on working together as a team. But, when you use forced rank, a fixed compensation budget for the group, and compensation changes tied to rankings, success for everyone becomes a zero-sum game. You can do just as well by bringing down people on your team -- so you look relatively better -- as by helping people on your team. In fact, it is probably easier to do better by dragging down your colleagues because you have direct control over that.

Performance-based compensation sounds great in theory, but never works in practice, partly because managers lack the information and objectivity to implement it well, and partly because people rarely remember what they did badly and so are almost always angered by their reviews and compensation adjustments.

For more on this topic, please see my earlier posts, "The problem with forced rank", "Management and total nonsense", and "Joel Spolsky on management methods".

Cuil is dead

Cuil is dead:
Cuil, the much maligned search engine that at one time had hopes of toppling Google, has gone offline ... It may be done for good. Those employees who are still with the company apparently weren't paid this week, and they're starting to say they’re looking for new jobs.
Flashback to the hype of July 2008 around Cuil:
Take yesterday's over-hyped launch of stealth search startup Cuil, which was quickly followed by a backlash when everyone realized that it was selling a bill of goods. This was entirely the company's own fault. It pre-briefed every blogger and tech journalist on the planet, but didn’t allow anyone to actually test the search engine before the launch.

The company's founders have a good pedigree ... But creating a big index is only half the battle. A good search engine has to bring back the best results from that haystack as well. Here Cuil falls short ... The results Cuil returns aren't particularly great, and sometimes completely off the mark.
And what happened soon after:
The launch of the search engine was nothing but a classic PR trainwreck, with much hype and little to show for. Cuil failed to deliver good enough results to drive anyone to change their search behavior, and quickly became the subject of backlash and criticism because of their poor performance and indexing methods that actually took websites down in the process.

I took a peek at how they're doing traffic-wise out of sheer curiosity. After all, with no less than $33 million in funding and a founding management team consisting of ex-Google search experts, something had to give, right? Well, no. Cuil isn't performing well any way you look at it ... search engine traffic is nearing rock bottom.

Tuesday, September 07, 2010

Machine learning on top of GFS at Google

Googler Tushar Chandra recently gave a talk at the LADIS 2010 workshop on "Sibyl: A system for large scale machine learning" (slides available as PDF), which discusses a system that Google has built to do large scale machine learning on top of MapReduce and GFS.

This system probably is just one of many inside the secretive search giant, but we don't often get these kinds of peeks inside. What I found most interesting was the frank discussion of the problems the Google team encountered and how they overcame them.

In particular, the Googlers talk about how building on top of a system intended for batch log processing caused some difficulties, which they overcame by using a lot of local memory and being careful about how they arranged and moved data. Even so, the last few slides mention that they kept causing localized network and GFS master brownouts, impacting other work on the cluster.
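To make concrete why iterative learning sits awkwardly on a batch system, here is a minimal sketch of gradient descent where every training pass over the data is one map/reduce job. This is my own illustration, not Sibyl's actual design (its internals are not public); it just shows the shape of the computation that makes caching shards in local memory, rather than rereading them from GFS each pass, so valuable:

```python
# Sketch only: iterative learning expressed as repeated map/reduce passes.
# Each iteration rereads every shard, so per-pass overhead (task startup,
# reading from the distributed filesystem) dominates unless shards are
# cached in local memory, which is roughly the workaround described.

def map_phase(shard, w):
    """Mapper: compute a partial least-squares gradient for one shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return grad

def reduce_phase(partials):
    """Reducer: sum the partial gradients from all shards."""
    return [sum(g) for g in zip(*partials)]

def train(shards, dims, iterations=100, lr=0.05):
    """Gradient descent where each iteration is one full map/reduce pass."""
    w = [0.0] * dims
    n = sum(len(shard) for shard in shards)
    for _ in range(iterations):
        partials = [map_phase(shard, w) for shard in shards]  # map
        grad = reduce_phase(partials)                         # reduce
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]     # update model
    return w
```

In a real deployment each `map_phase` call would be a mapper reading its shard off the distributed filesystem, and the cost of restarting that whole pipeline every iteration is exactly the overhead an iterative learner feels on a system built for one-shot batch log processing.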

That last problem seems to have been an issue again and again in cloud computing systems. That pesky network is a scarce, shared resource, and it often takes a network brownout to remind us that virtual machines are not all it takes to get everyone playing nice.

On a related topic, please see my earlier post, "GFS and its evolution", and its discussion of the pain Google hit when trying to put other interactive workloads on top of GFS. And, if you're interested in Google's work here, you might also be interested in the open source Mahout, which is a suite of machine learning algorithms mostly intended for running on top of Hadoop clusters.

Friday, September 03, 2010

Insights into the performance of Microsoft's big clusters

A recent article from Microsoft in IEEE Micro, "Server Engineering Insights for Large-Scale Online Services" (PDF), has surprisingly detailed information about the systems running Hotmail, Cosmos (Microsoft's MapReduce/Hadoop), and Bing.

For example, the article describes the scale of Hotmail's data as being "several petabytes ... [in] tens of thousands of servers" and the typical Hotmail server as "dual CPU ... two attached disks and an additional storage enclosure containing up to 40 SATA drives". The typical Cosmos server apparently is "dual CPU ... 16 to 24 Gbytes of memory and up to four SATA disks". Bing uses "several tens of thousands of servers" and "the main memory of thousands of servers" where a typical server is "dual CPU ... 2 to 3 Gbytes per core ... and two to four SATA disks".

Aside from disclosing what appear to be some previously undisclosed details about Microsoft's clusters, the article could have been interesting for its insights into the performance of these clusters on the Hotmail, Bing, and Cosmos workloads. Unfortunately, the article takes too much as a given, does not explore the complexity of the interactions between CPU, memory, flash memory, and disk on these workloads, and does not attempt to explain the many oddities in the data.

Those oddities are fun to think about though. To take a few that caught my attention:
  • Why are Bing servers CPU bound? Is it because, as the authors describe, Bing uses "data compression on memory and disk data ... causing extra processing"? Should Bing be doing so much data compression that it becomes CPU bound (when Google, by comparison, uses fast compression)? If something else is causing Bing servers to be CPU bound, what is it? In any case, does it make sense for the Bing "back-end tier servers used for index lookup" to be CPU bound?
  • Why do Bing servers, with only 4-6G of RAM each, not have more memory when they mostly want to keep indexes in memory, appear to be hitting disk, and are "not bound by memory bandwidth"? Even if the boxes are CPU bound, and even if it somehow makes sense for them to be CPU bound, would more memory across the cluster allow them to do things (like faster but weaker compression) that would relieve the pressure on the CPUs?
  • Why is Cosmos (the batch-based log processing system) CPU bound instead of I/O bound? Does that make sense?
  • Why do Cosmos boxes have more memory than Bing boxes when Cosmos is designed for sequential data access? What is the reason that Cosmos "services maintain much of their data in [random access] memory" if they, like Hadoop and MapReduce, are intended for sequential log processing?
  • If Hotmail is mostly "random requests" with "insignificant" locality, why is it designed around sequential data access (many disks) rather than random access (DRAM + flash memory)? Perhaps the reason that Hotmail is "storage bound under peak loads" is that it uses sequential storage for its randomly accessed data?
Thoughts?
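On the first question, a rough back-of-envelope comparison makes the compression tradeoff concrete. Every throughput number below is an assumption of mine for illustration (the article gives no codec or disk rates); it is the shape of the result, not the exact figures, that matters:

```python
# Back-of-envelope sketch. All rates are assumed, illustrative numbers,
# not figures from the article: roughly 100 MB/s sequential reads per
# SATA disk, ~50 MB/s per core decompressing with a strong codec, and
# ~500 MB/s per core with a fast, weaker codec (Snappy/LZO-class).

disks = 4
disk_seq_mb_s = 100
cores = 8
heavy_decompress_mb_s = 50
fast_decompress_mb_s = 500

io_bw = disks * disk_seq_mb_s                 # total disk bandwidth
cpu_bw_heavy = cores * heavy_decompress_mb_s  # throughput with a strong codec
cpu_bw_fast = cores * fast_decompress_mb_s    # throughput with a fast codec

# Under these assumptions, a strong codec leaves the CPUs only just
# keeping pace with the disks, so any other work makes the box CPU
# bound, while a fast codec leaves roughly 10x of CPU headroom.
print(io_bw, cpu_bw_heavy, cpu_bw_fast)
```

That is one way Bing's choice of heavy "data compression on memory and disk data" could, by itself, turn index-serving boxes CPU bound where a faster, weaker codec would not.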

Update: An anonymous commenter points out that the Bing servers probably are two quad core CPUs -- eight cores total -- so, although there is only 2-3G per core, there likely is a total of 16-24G of RAM per box. That makes more sense and would make them similar to the Cosmos boxes.

Even with the larger amount of memory per Bing box, the questions about the machines still hold. Why are the Bing boxes CPU bound and should they be? Should Cosmos boxes, which are intended for sequential log processing, have the same memory as Bing boxes and be holding much of their data in memory? Why are Cosmos machines CPU bound rather than I/O bound and should they be?

Update: Interesting discussion going on in the comments to this post.