Tuesday, May 30, 2006

First, kill all the managers

Googler Steve Yegge has a post up about management in the tech industry. Some selected excerpts:
The catch-22 of software management is that the ones who want it most are usually the worst at it ...

Software companies that prize managers above engineers are guided by their own Invisible Hand to become a henhouse of clucky managers pecking viciously at harried engineers. Sadly, most software companies fall into this trap, because they're borrowing traditional ideas about management from non-tech industries ...

I don't think anyone's figured out how to make a no-management structure work for an org with hundreds or thousands of engineers. I do know one great company that's come really close ...

Unfortunately I have neither time, nor space, nor in all likelihood permission to explain their recipe for success with almost no management. You'll just have to take my word for it: if you take all the managers away, great engineers will still build great things. Maybe even faster.
First, kill all the managers, eh, Steve?

If you have seen the harm middle managers can create, it is hard not to be sympathetic with this view. I do think Steve is right that a software engineering company can do very well with almost no management.

Mentoring and program management are probably better off being independent of technical management. Forming a unified vision for an organization, an important part of leading a group, does not require layers and layers of middle management.

Steve may not be willing to talk about Google's recipe for success, but, to the limited extent that I know it, I can.

Google has almost no management. In 2003, managers were at the director level or higher and had 50 or so reports. More managers have been added since then, but I believe that 20+ reports is the norm.

Program management is done in a separate organization. The PMs have no power over the engineers, not even an appeal to engineering managers, since there are none. The PMs try to bring order to the chaos, but they must do so by convincing people, not by commanding them.

Mentoring is done by other engineers. People learn by doing. You want people to dive into the code and learn from those who are closest to the problem.

Parts of the vision emerge from everywhere, brought together, clarified, and unified by the few managers that exist. Despite a few people wandering up other peaks, most are guided up the same hill.

Communication is direct through informal networks, not through the management hierarchy. Transparency and pressure from peers provide for accountability and limit free riding.

Titles are unimportant. A "software engineer" could be a former tenured professor or a recent college graduate. A "program manager" could be a former CTO.

To imitate Google, it is important to realize that there is more to do here than just suddenly sending your middle managers out to sleep with the fishes.

Tasks often done by managers need to be moved out of a management hierarchy. Informal networks and a culture of transparency need to be encouraged. Hierarchies must be destroyed, titles made irrelevant, and compensation and rewards redesigned.

If you can do all those things, then let's bring out the cement shoes.

Thursday, May 25, 2006

Newspapers and local content

In an AJR article on the future of newspapers, "Adapt or Die", McClatchy CEO Gary Pruitt is quoted as saying:
"It's no longer sufficient just to have that core daily newspaper; instead, we need to leverage off of it this whole portfolio of products, including, and most importantly, the leading local Internet site."

"In each market we're the leading local media company, the leading local Internet company," capable of delivering news continuously.
Newspapers face a challenge because their lucrative control of local distribution is fading. But they still have a major competitive advantage in producing valuable local content.

It seems to me that newspapers should own local. When I want information about Microsoft, Amazon, or other Seattle area companies, the best source should always be the Seattle PI. When I want information about local restaurants, I should think the obvious place to go is the Seattle PI. When I want information about concerts, events, parks, politics, traffic, entertainment, news, anything local, the best place to go should be the Seattle PI.

Even more important, local newspapers should own local advertising. When I want to run ads for small Seattle businesses, I should look to the Seattle PI. I do not know all the small local businesses. I do not have connections into them. But the Seattle PI does. Similarly, when local businesses want to advertise to local customers, the obvious choice should be the advertising network provided by the Seattle PI.

Sites like Citysearch should look hollow and pathetic next to the content provided by your newspaper. The Wall Street Journal should seek out the Seattle PI for access to their in-depth reporting on Microsoft. Google Local and Yahoo Local should be begging the Seattle PI for access to their pool of local advertisers.

Newspapers should be the broker for local content. Newspapers should be the master of news and advertising content for their communities. Newspapers should be the experts of local.

See also my previous posts, "The problem for newspapers", "Matching content to audiences", and "The best of the old and the new".

See also my previous post, "Local search lacking local ads".

[AJR article via Cyberjournalist.net]

Yahoo and eBay, Amazon and Microsoft

It is being widely reported that Yahoo and eBay have formed a partnership.

To start, Yahoo will provide advertising on eBay's site, and eBay will provide PayPal payments to Yahoo (replacing Yahoo's defunct PayDirect feature). I am sure we can expect more announcements from this coupling in the future.

I would like to see a similar deal between Amazon and Microsoft. Amazon already switched its web searches from using Google to Microsoft. Danny Sullivan speculated that this deal probably already includes advertising on Amazon sites from the upcoming Microsoft adCenter.

Amazon also has a payments system similar to PayPal. Microsoft could build their own, but why bother when Amazon has one ready to go?

This would match the Yahoo/eBay deal, Microsoft running the advertising on Amazon properties and Amazon running payments on Microsoft properties. But, why stop there?

Amazon's excellent catalog and reviews could be integrated and featured in MSN Shopping, Windows Live Shopping, and Windows Live Products. Amazon's personalization technology could be deployed. Amazon's Alexa data could be surfaced in MSN Search results. Amazon's A9 could be repurposed within Microsoft Live Labs.

Amazon and Microsoft are neighbors, but have always been cold to each other. Perhaps it is time these two kissed and made up.

Wednesday, May 24, 2006

Human computation and playing games

I very much enjoyed watching this fun, exciting, and light UW CS talk, "Human Computation", by Luis von Ahn from CMU.

The talk mostly focuses on how you can get people to do useful work for free by making it fun, by turning it into a game. His motivation comes, at least in part, from noticing the millions of hours people waste playing solitaire on the computer and wondering if that effort could produce useful output instead. His games are carefully constructed to be fun, produce useful data, and avoid cheating.

Luis spent most of his time talking about the ESP Game, a popular Web-based game that, as a side effect, gets people to label images with keywords. In aggregate, this kind of data is useful for image search. People enjoy playing the game for many hours. The researchers have collected over 10M image labels as a side effect of the game play.
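To make the mechanic concrete, here is a toy sketch (in Python, and definitely not the real implementation) of the agreement rule at the heart of the ESP Game: two randomly paired players independently type labels for the same image, and a label is only recorded when both players produce it, which makes junk and spam labels unlikely to get through.

```python
# Toy sketch of the ESP Game's agreement mechanic, not the real implementation.
# Two players independently suggest labels for the same image; a label is
# accepted only when both players produce it, which filters out noise and spam.

def agreed_labels(player1_guesses, player2_guesses, taboo_words=()):
    """Return labels both players typed, excluding already-known taboo words."""
    common = set(player1_guesses) & set(player2_guesses)
    return common - set(taboo_words)

# Example round: the matched label becomes a keyword for the image.
print(agreed_labels(["dog", "grass", "frisbee"], ["dog", "park"]))  # {'dog'}
```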

Luis then discussed a newer game, Peekaboom, that takes the ESP Game one step further, from just labeling images to identifying the region of the image that corresponds to the label (e.g. the "man" is here). Peekaboom has already collected millions of data points. This data set could be quite useful for computer vision research.

Luis mentioned a new game they are about to release called Verbosity (PDF). The game resembles many popular word-guessing games and, as a side effect, produces a stream of common sense knowledge (e.g. "milk is a liquid" and "milk is often found near cereal"). This data set would be very useful for AI researchers.

By the way, Luis was one of the inventors of captchas, those twisted images of text that you are sometimes asked to read and type in on websites. Captchas are designed to distinguish humans from computers by asking you to do a task computers cannot do very well. The ESP Game and other projects are similar in that they get humans to solve problems that computers cannot easily solve.

I really like some of the lessons from Luis' work. If you have a problem that is very hard for computers to solve but easy for people to solve, see if you can make it a game. If it is fun, a lot of people will do it for free.

However, as Luis said, you do have to expect people to cheat; features to increase data quality and hinder cheating and spam were built-in as part of their early game design.

This is a great talk, well worth watching. A lot of fun and many good ideas here.

Monday, May 22, 2006

AI and the future of search

Saeed Shah at The Independent (UK) reports that:
Google's ultimate aim is to create a search engine with artificial intelligence to exactly answer any question a user puts to it.

Larry Page, the co-founder and president ... said: "People always make the assumption that we're done with search. That's very far from the case. We're probably only 5 per cent of the way there. We want to create the ultimate search engine that can understand anything ... some people could call that artificial intelligence."

Mr Page ... said it was not possible to predict when Google would achieve this goal, although he pointed out that "a lot of our systems already use learning techniques".
See also my previous posts, "Google and The Happy Searcher", "Search without searching" and "Google and question answering".

[Found on Findory]

Update: An excerpt from a similar article by Richard Wray in the UK Guardian:
"The ultimate search engine would understand everything in the world. It would understand everything that you asked it and give you back the exact right thing instantly," Mr Page [said] ... "You could ask 'what should I ask Larry?' and it would tell you."

Mr Page said one thing that he had learned since Google launched eight years ago was that technology can change faster than expected, and that AI could be a reality within a few years.
One thing you have to say about the Google founders, they certainly are ambitious.

See also more of my earlier posts, "The perfect search", "Perfect search and the clickstream", and "Different visions of the future of search".

[UK Guardian article via Threadwatch]

Sunday, May 21, 2006

KnowItAll talk

I finally got a chance to watch Oren Etzioni's Feb 2006 talk, "All I Really Need to Know I Learned from Google".

Oren is a fun and engaging speaker. The talk is quite interesting, a good overview of the KnowItAll family of research projects.

Oren and his team are seeking to extract knowledge from the massive number of documents on the web. The goal is to use this knowledge for a next generation of search that presents information summarized and collated from many web documents, like question answering, but deeper.

Some compelling examples of this idea come from the Opine project, a spin-off from KnowItAll that focuses on product reviews. It attempts to summarize many separate reviews into percentages of favorable and unfavorable opinions and the product features that went into those opinions.

If this kind of stuff sounds like something the search giants might want, you'd be right. Oren mentions in the talk that his graduate students seem to keep getting poached.

See also my previous posts, "Google and question answering" and "Mining the peanut gallery".

Google's other services and market share

I was surprised to see such low market share for Google Maps (13% of Mapquest) and Google News (30% of Yahoo News) in this Hitwise data.

Given that Google is the dominant web search engine, I expected prominent placement of Google Maps and Google News at the top of Google web search results to cause more people to switch.

For example, I normally end up at Google Maps because I lazily drop addresses into Google web search, then click on the first link I see. I would have thought more people would do the same.

I wonder what is driving all the traffic to Mapquest. Are people going there directly any time they need a map? Or is it partnerships with websites? What is supporting Mapquest's high market share?

Very curious.

[via Richard MacManus and Niall Kennedy]

Update: Richard MacManus also points to ComScore numbers that show Google's overall web search market share is growing.

Looking at the slower progress in the verticals, Richard argues, as some have in the comments to this post, that the low Google market share is because the products have only been around for a couple of years. I agree, but it is still surprising to me that prominent placement at the top of Google web search results is not causing more rapid adoption of Google Maps and Google News.

Friday, May 19, 2006

Talk on Web security

Mike Andrews gave a long but light talk at Google, "How to Break Web Software".

It is a nice overview of a lot of Web development security issues. It is good for a refresher if you've seen all this before and should be required watching for any web developer that hasn't.

The advice basically comes down to one thing: Never trust the client.

Whether it is form input, URLs, cookies, or XML from AJAX apps, always assume that anything from the web browser needs to be validated, filtered, and verified.

One interesting tidbit from the talk was that a lot of security issues come from the way Javascript in HTML mixes code and data; cross-site scripting (XSS) attacks are the biggest issue right now, bigger than SQL injection.
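As a tiny illustration of that advice (a sketch only, not a complete defense, and the names and patterns here are made up for the example), server-side whitelisting of input plus escaping on output blocks the most basic reflected XSS:

```python
# Minimal sketch of "never trust the client": validate inputs against a
# whitelist and escape anything echoed back into HTML. Illustrative only,
# not a substitute for framework- and context-specific defenses.
import html
import re

USERNAME_RE = re.compile(r"^[A-Za-z0-9_-]{1,32}$")  # whitelist, not blacklist

def validate_username(raw):
    if not USERNAME_RE.match(raw):
        raise ValueError("invalid username")
    return raw

def render_greeting(raw_name):
    # Escape on output so a <script> tag from the client renders as inert text.
    return "<p>Hello, %s</p>" % html.escape(raw_name)

print(render_greeting('<script>alert("xss")</script>'))
# <p>Hello, &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```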

Mike expected web security issues to get worse with increased use of AJAX, both because it moves more processing out to untrusted clients and because there is a lot more data flying back and forth between client and server.

Microsoft drops forced rank, increases perks

Todd Bishop at the Seattle PI reports that Microsoft is dropping its much hated forced rank system and increasing employee perks.

An excerpt:
We are retiring the 2.5-5.0 rating scale and introducing a three point Commitment Rating scale of Exceeded, Achieved and Underperformed. ... There will be no forced distribution (i.e. curve) associated with this commitment rating, which allows managers and employees to have a more candid discussion about performance.

We're planning to provide on-campus access to a variety of services, including laundry and dry cleaning, grocery delivery from Safeway and opening convenience stores -- all of which are designed to ease the burden given the hectic pace of life. We will expand and upgrade dining services adding great new retail food in select cafes, dinners to go from Wolfgang Puck and other services. We are also arranging discounts on a variety of home services including house keeping, yard care, pet care, auto services and more.
See also this WashTech article that blamed the forced rank reviews and compensation based on forced rank for poor morale at Microsoft.

See also my previous post, "Early Amazon: Just do it", where I said, "While merit pay sounds like a great idea in theory, it seems it never works in any large organization ... It never makes people happy."

See also my previous posts, "Free food at Google" and "Microsoft cuts benefits".

Thursday, May 18, 2006

Accuracy of Alexa metrics

Alexa has a feature that shows traffic data for websites. For example, here is an Alexa traffic chart for Findory.com.

Given how many people appear to use Alexa data to make serious business decisions, it seems like Alexa traffic charts are thought of as reliable and accurate. But are they?

As the Alexa site itself warns, the data is not accurate or reliable. An excerpt:
Alexa computes traffic rankings by analyzing the Web usage ... of Alexa Toolbar users.

The Alexa Toolbar works only with the Internet Explorer browser ... The Alexa Toolbar works only on Windows operating systems ... The rate of adoption of Alexa software in different parts of the world may vary widely.

Sites with relatively low traffic will not be accurately ranked by Alexa.
The data is only for Alexa toolbar users, a small and heavily biased sample of Web users.

I recently analyzed the Findory.com logs to see how large this sample might be. On May 17, only 241 hits on the Findory.com website were from people with the Alexa Toolbar installed. That is less than 0.1% of the total hits on Findory.com that day, a tiny fraction. And those 241 hits came from only 49 unique visitors.

It is so small a number that it would be trivial to manipulate. Simply installing the Alexa Toolbar and browsing daily through the site could double the page views reported by Alexa. Asking a few dozen regular Findory users to install the Alexa Toolbar could double the reported reach.
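Here is the back-of-the-envelope version of that. The site-wide total below is a made-up stand-in; the real figure only matters in that it keeps the toolbar sample under 0.1%.

```python
# Back-of-the-envelope on how fragile the Alexa sample is, using the numbers
# from Findory's May 17 logs. total_hits is hypothetical, consistent with the
# toolbar sample being less than 0.1% of traffic.
toolbar_hits = 241
toolbar_visitors = 49
total_hits = 300_000  # hypothetical daily total

print("Toolbar sample: %.3f%% of hits" % (100.0 * toolbar_hits / total_hits))
# Recruiting roughly this many new toolbar users would double the reported reach.
print("Visitors needed to double reach:", toolbar_visitors)
```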

I think it would be pretty lame to do that. Trying to manipulate metrics instead of building things that are real always strikes me as a foolish waste of time. However, it does appear that many people are spending a lot of time manipulating Alexa numbers, probably because real money flows to those who do.

Clearly, Alexa traffic charts should be used only with careful caveats. Only for large sites, over 10M page views per day, would I consider the data reliable. Otherwise, the tiny, biased sample easily can be manipulated.

Update: Six months later, Jason Calacanis runs an experiment to game Alexa metrics. Interesting.

Update: Eight months later, Peter Norvig writes about problems with Alexa data in his post, "Alexa Toolbar and the Problem of Experiment Design".

Wednesday, May 17, 2006

Mark Fletcher at Startup SIG

Lots of great stuff in Niall Kennedy's transcript of Bloglines founder Mark Fletcher's Startup SIG talk. Well worth reading.

Pretty impressive that Bloglines was built, run for two years, and sold to Ask.com, spending less than $200k from start to finish.

Update: Mark also posted his slide deck from the talk.

Microsoft combining desktop and web search

Allison Linn at the AP reports that Microsoft will start integrating desktop and intranet search results with their internet web search.

From the initial reports, the new Microsoft product sounds very similar to Google Desktop Search, which nicely integrates search results from your desktop with all of your web searches.

Longer term, this likely is the beginning of a Microsoft strategy to win the search war by leveraging their control of the desktop. For more on that, see my earlier post, "Using the desktop to improve search".

[Found on Findory]

AJAX development with GWT

Bret Taylor posts that Google has publicly released the Google Web Toolkit, a framework for AJAX web application development.

You write your code in Java, then compile the Java code into HTML and Javascript.

GWT supports all browsers, does not hork the back button, makes RPC easy and more robust, and can be debugged using the Java debugger. Very cool.

AJAX development by hand is a royal pain, error prone and difficult to debug. The trendy Ruby on Rails owes a lot of its success to bundling with the Script.aculo.us AJAX library to make it super quick and easy to produce snazzy AJAX web apps. GWT appears to offer similar benefits.

Tuesday, May 16, 2006

C-Store and Google BigTable

I came across an interesting VLDB 2005 paper, "C-Store: A Column-oriented DBMS" (PDF).

What attracted me to this paper, other than that Mike Stonebraker is lead author, was that the goals seem to have a lot in common with what appeared to motivate Google's BigTable.

C-Store is column-oriented (values for a column are stored contiguously) instead of row-oriented like most databases. It is optimized for reads. It is designed for sparse table structures and compresses data. It is designed for high availability on a large cluster. It has relaxed consistency on reads to minimize lock contention. It is extremely fast, two orders of magnitude faster than normal row-oriented databases on reads in their preliminary tests.

Google's BigTable is also column-oriented (storing compressed <row, column, timestamp> triples in the SSTable structures). It is optimized for reads. It is designed for sparse table structures and compresses data. It has relaxed consistency. It is extremely fast.

There are some big differences. BigTable is not designed to support arbitrary SQL; it is a very large, distributed map. BigTable emphasizes massive data and high availability on very large clusters more than C-Store. BigTable is designed to support historical queries (e.g. get data as it looked at time X). BigTable does not require explicit table definitions and strings are the only data type.
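To make the data model concrete, here is a toy sketch of the kind of sparse (row, column, timestamp) map BigTable exposes, including the "as it looked at time X" lookups. This ignores everything that makes BigTable interesting at scale (SSTables, tablets, compression, distribution); it is only meant to illustrate the model, and the class and row names are made up.

```python
# Toy sketch of BigTable's data model: a sparse map from
# (row, column, timestamp) to a string value, with historical lookups.
import bisect

class TinyTable:
    def __init__(self):
        self.cells = {}  # (row, column) -> sorted list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, as_of):
        """Return the value as it looked at time `as_of`, or None."""
        versions = self.cells.get((row, column), [])
        i = bisect.bisect_right(versions, (as_of, chr(0x10FFFF)))
        return versions[i - 1][1] if i > 0 else None

t = TinyTable()
t.put("com.example/index.html", "contents", 100, "old page")
t.put("com.example/index.html", "contents", 200, "new page")
print(t.get("com.example/index.html", "contents", as_of=150))  # old page
```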

These unusual database implementations are fascinating. I am not familiar with any other very large scale, high availability, distributed map like BigTable, nor have I heard of any RDBMS with the same very large scale, high availability, read-optimized goals as C-Store.

See also my previous post, "I want a big, virtual database".

Update: Speaking of Michael Stonebraker, his new startup, Vertica, just raised $16.5M to build a new database that "provides extremely fast ad hoc SQL query performance, even for very large databases." The underlying technology apparently is based on C-Store.

Yahoo home page cries out for personalization

Havi Hoffman has the post on the Yahoo Search blog announcing an AJAX-y redesign of the Yahoo home page. Richard MacManus offers a useful and detailed review.

Most reviews appear to be mixed. I personally find the page a cluttered, confusing, and poorly prioritized mess.

Some of this is caused by redundancy. For example, I see four links to News, two links to Mail, two links to Weather, three links to Entertainment, and several links to Travel on my Yahoo home page.

Some of this seems to be due to trying to wedge too much content on the page. For example, the inline message advertising "Yahoo! Answers: Ask a question | Answer questions" is a distraction from the more useful web search.

But, the main issue is that the page does not do a good job at what should be the goal of the Yahoo home page, helping people get to interesting content and services at Yahoo.

I think the Yahoo home page should focus on doing two things:
  1. Get people quickly to parts of Yahoo they like
  2. Help people discover parts of Yahoo they might like but don't know about
That's it. Help me get where I want to go. Help me find new, useful, and interesting stuff.

To help me get where I want to go, the page should feature things I use at Yahoo.

To help me discover new stuff, the site should recommend things based on what I already use. These are essentially internal ads for Yahoo content. If I fail to show interest, the ad should disappear and be replaced with something else that might be useful to me.
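Roughly the kind of logic I have in mind, sketched in Python. The module names, slot counts, and thresholds are all made up for illustration; the point is just familiar modules first, a few discovery slots from similar users, and retiring promos that keep getting ignored.

```python
# Rough sketch: rank Yahoo modules by the user's own usage, mix in a few
# "discovery" slots for modules similar users like, and retire a promoted
# module once the user has ignored it enough times.

def pick_modules(own_usage, peer_usage, ignored_counts,
                 num_slots=8, discovery_slots=2, ignore_limit=5):
    # Modules the user already uses, most-used first.
    familiar = sorted(own_usage, key=own_usage.get, reverse=True)
    page = familiar[:num_slots - discovery_slots]

    # Discovery: modules popular with similar users that this user has not
    # tried, skipping anything they have repeatedly ignored.
    candidates = sorted(peer_usage, key=peer_usage.get, reverse=True)
    for module in candidates:
        if len(page) >= num_slots:
            break
        if module not in own_usage and ignored_counts.get(module, 0) < ignore_limit:
            page.append(module)
    return page

print(pick_modules(
    own_usage={"Mail": 120, "News": 40, "Finance": 12},
    peer_usage={"Answers": 80, "Maps": 60, "Finance": 50},
    ignored_counts={"Answers": 7}))
# ['Mail', 'News', 'Finance', 'Maps']
```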

Yahoo must stop trying to squeeze content for every group at Yahoo onto the home page. That only generates a cluttered and useless mess, filled with distractions.

Different people should see different Yahoo home pages based on their interests and needs. The Yahoo home page should focus on helping me find and discover useful Yahoo content.

See also my October 2004 post, "Yahoo's clutter", and my June 2004 post, "Yahoo finally testing new home page".

Monday, May 15, 2006

Stats on Yahoo Answers

The Yahoo Answers team posts on the Yahoo Search blog that over 10M answers have been submitted on the site in five months.

It is hard to know what to make of that number. On the one hand, it shows a lot of traffic and activity on the site. On the other hand, it is unclear how many of those answers are useful answers to interesting questions.

I thought more data might help here, so I tried to gather some additional statistics. Here is what I was able to find. As of this morning:
  • 1,294,389 questions have been asked on Yahoo Answers.
  • 1,038,040 questions are "resolved", meaning a "best" answer was picked.
  • 356,455 questions were resolved with an 80-100% "thumbs up" rating.
  • 103,795 questions have been asked in the last 7 days.
Interesting data. It appears the average question gets roughly eight answers. Surprisingly high.

On the last number, it might indicate very strong growth (roughly 10% of the questions in the system asked in the last week), but I suspect the "last 7 days" number includes questions that Yahoo will delete (due to lack of an answer), so it probably is not correct to come to that conclusion.

Before I started digging into these stats, I was guessing that about 1-2% of the 10M answers were useful. Given that 356k questions were resolved with 80-100% approval on the final answer, it appears the quality might be better than I thought. By one measure, at least 3-4% of 10M answers seem to be useful, probably more.
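For the curious, here is the arithmetic behind those estimates, assuming each resolved question contributes one "best" answer.

```python
# The arithmetic behind the estimates above, assuming one "best" answer per
# resolved question.
answers = 10_000_000
questions = 1_294_389
high_approval = 356_455

print("Answers per question: %.1f" % (answers / questions))                      # ~7.7
print("High-approval best answers: %.1f%%" % (100.0 * high_approval / answers))  # ~3.6%
```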

None of this directly answers the question about how useful Yahoo Answers will be, but this initial usage data is interesting and surprisingly positive.

For my much more pessimistic views on Yahoo Answers, see my earlier posts, "Yahoo Answers and wisdom of the crowd", "Summing collective ignorance", and "MSN Answers coming?".

Sunday, May 14, 2006

Findory in the Wall Street Journal

In her article "Me, Me, Me", Jessica Mintz at the Wall Street Journal talks about the dream of a personalized newspaper and efforts to apply personalization to news.

Rojo, Newsvine, and Findory are given as examples of "a new generation of Web start-ups" that "help readers deal with the sheer volume of material that's out there" by learning from "the reading habits of their users" and "then [using] that data to make suggestions to individuals based on what others like them are reading."

An excerpt on Findory from the article:
Findory.com ... relies heavily on the behavior of its users, but doesn't require them to list their interests, select feeds or vote on stories.

Instead, it works on the same principle as Amazon.com's recommendations ... Findory ... looks at an individual's reading history, compares it with similar readers' tastes, and offers up links to stories that similar readers have enjoyed ...

Each time the user returns to the Findory home page after clicking on an article, he or she will find the page reconfigured with a different mix of stories.
Just read articles, that's it. Findory learns from the articles you read, adapts to your interests, and builds you a personalized front page. Findory gets better and better the more you read.
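For readers curious what that looks like in code, here is a toy sketch of the general "readers who read this also read that" idea. To be clear, this illustrates the flavor of the approach, not Findory's actual algorithm, and the article names are made up.

```python
# Toy sketch of item co-occurrence recommendations: score candidate articles
# by how often they appear in other readers' histories alongside articles
# this reader has read.
from collections import Counter

def recommend(reader_history, all_histories, num_recs=5):
    scores = Counter()
    read = set(reader_history)
    for other in all_histories:
        overlap = read & set(other)
        if not overlap:
            continue  # this reader tells us nothing about our user
        for article in other:
            if article not in read:
                scores[article] += len(overlap)  # weight by shared reading
    return [article for article, _ in scores.most_common(num_recs)]

histories = [
    ["google-ai", "yahoo-ebay", "mysql-cluster"],
    ["google-ai", "bigtable", "c-store"],
    ["celebrity-gossip", "sports"],
]
print(recommend(["google-ai", "c-store"], histories))
# ['bigtable', 'yahoo-ebay', 'mysql-cluster']
```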

Tim O'Reilly and defining Web 2.0

In his commencement speech at UC Berkeley, Tim O'Reilly finally delivered a simple, concise definition of Web 2.0: Harnessing collective intelligence.

Excerpts from his speech:
A true Web 2.0 application is one that gets better the more people use it. Google gets smarter every time someone makes a link on the web. Google gets smarter every time someone makes a search. It gets smarter every time someone clicks on an ad. And it immediately acts on that information to improve the experience for everyone else.

It's for this reason that I argue that the real heart of Web 2.0 is harnessing collective intelligence.

The world of Web 2.0 *can* be one in which we share our knowledge and insights, filter the news for each other, find out obscure facts, and make each other smarter and more responsive. We can instrument the world so it becomes something like a giant, responsive organism.
Harnessing collective intelligence. That is something I can get behind. Simple, clean, concise, and compelling.

It's nothing like Tim's first attempt at defining "Web 2.0", a verbose and confusing essay that lumbered in at five pages. His second "compact definition" remained a baffling mess of buzzwords with no real clarity or compelling direction.

No surprise that Joel Spolsky said Web 2.0 is "a big, vague, nebulous cloud of pure architectural nothingness" and that "when people use the term Web 2.0, I always feel a little bit stupider for the rest of the day." Many others felt lost in the Web 2.0 fog, unsure if it was about mashups, AJAX, RSS, or something else entirely.

I like this new definition of Web 2.0, "harnessing collective intelligence." I like the idea we are building on the expertise and information of the vast community of the Web. I like the idea that web applications should automatically learn, adapt, and improve based on needs.

I also like the idea that "Web 2.0" should include many companies that people were trying to classify as "Web 1.0". Amazon.com, with its customer reviews and personalized pages, clearly is harnessing the collective wisdom of Amazon shoppers. Google also is constantly improving based on the behavior of searchers.

Web 2.0 applications get better and better the more people use them. Web 2.0 applications learn from the behavior of their users. Web 2.0 applications harness collective intelligence.

Saturday, May 13, 2006

Microsoft investors grumbling about search war

Dina Bass at Bloomberg News reports that:
Microsoft Corp. Chief Executive Steve Ballmer is getting an earful from investors ...

Ballmer is spending in areas to help compete against Google ... Microsoft said last month that it would spend $2 billion more than analysts expected on new projects in its next fiscal year, triggering an 11 percent one-day drop in shares ...

Analysts worry that Ballmer is trying to build a Google within Microsoft. Investors are concerned that they may never see the payoff.
See also my earlier post, "Microsoft is building a Google cluster".

[Found on Findory]

Thursday, May 11, 2006

Behavioral advertising buzz

Jennifer Slegg at SEW writes about rising interest in personalized advertising. Some excerpts:
Behavioral ... advertising ... is targeted to a specific individual based on that user's previous surfing behavior. This is quite different from the more common targeting method of displaying ads matched to the specific content of an individual page or to all users in general. With behavior targeting, this would mean that two people could see vastly different ads when viewing the identical webpage at the same time.

Studies have shown that conversions are higher when people are targeted through behavior rather than content because behavior can determine a person's actions. Whether it is looking at specific sections of an online newspaper or visiting a certain type of site more than once, those actions are used to determine each user's interests.
See also my previous posts, "Google wants to change advertising", "Microsoft adLab and targeted ads", and "Yahoo testing ads targeted to behavior".

See also my previous post, "Is personalized advertising evil?"

Wednesday, May 10, 2006

MySQL Cluster and my big virtual database

MySQL Cluster is an in-memory database running across a cluster of machines that tries to be robust to failures of nodes.

I had been looking at it off and on for a while. My curiosity was piqued when I watched Stewart Smith's talk on Google Video, "A Googly MySQL Cluster Talk", and investigated a few questions I had after that talk.

Unsurprisingly, MySQL Cluster is not a giant, simple, virtual database that runs transparently over a cluster, nodes dropping in and out of service at will, with read-write replication and data migration all done automatically and robustly. MySQL Cluster is cool, but it is not quite that cool.

The core design behind MySQL Cluster seems to have started with one thought: Instead of storing tables in the local filesystem, what if we stored them in the memory of nearby machines on the network?

This is a really neat idea. I remember some fun research work a while back that tried to exploit the fact that it is an order of magnitude faster for a process to access data from RAM on a remote machine than it is to access data from local disk. Yes, disks are that slow.

Most of this research work focused on experiments with paging out to the free memory of other machines on the network instead of local disk, but the idea still seems similar to what the MySQL folks are doing with MySQL Cluster.

So, I am guessing MySQL folks started by deciding to experiment with a new type of in-memory storage engine, one that would say, this data isn't in this machine's memory, but in that machine's memory over there.

Then, it looks like they added a bunch of stuff on top to try to make this robust. Logging of transactions out to the local filesystem, replicas of the table fragments, and goodies like that. MySQL Cluster was born.
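Here is a very rough cartoon of that idea, sketched in Python. This is how I picture the concept (rows hashed into fragments that live in the memory of data nodes, with each fragment also kept on a replica node), not how MySQL Cluster's NDB storage engine actually works inside.

```python
# Cartoon of the concept: rows live in the RAM of data nodes, hashed into
# fragments, with each fragment also written to a replica node so a single
# node failure loses nothing. Names and layout are invented for illustration.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.memory = {}  # primary key -> row, held in RAM

    def store(self, key, row):
        self.memory[key] = row

class TinyCluster:
    def __init__(self, nodes, replicas=2):
        self.nodes = nodes
        self.replicas = replicas

    def write(self, key, row):
        # Hash the key to pick a fragment, then write to `replicas` nodes.
        start = hash(key) % len(self.nodes)
        for i in range(self.replicas):
            self.nodes[(start + i) % len(self.nodes)].store(key, row)

    def read(self, key):
        start = hash(key) % len(self.nodes)
        for i in range(self.replicas):
            node = self.nodes[(start + i) % len(self.nodes)]
            if key in node.memory:  # survives the loss of one replica
                return node.memory[key]
        return None

cluster = TinyCluster([DataNode("ndb%d" % i) for i in range(4)])
cluster.write(42, {"title": "hello"})
print(cluster.read(42))
```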

There's a lot of hard work left to be done though. According to their limitations page, new data nodes cannot be added without a restart of the entire cluster, so you cannot add new boxes in to increase your capacity with the system live.

Other serious issues appear to include performance issues with some queries, windows of opportunity for data loss, potential for stray locks, temporary transaction failures when nodes are dropped or rebooted, complicated configuration and maintenance of the cluster, and other problems.

Even so, it is impressive what the MySQL folks have built. It may not be a big, virtual database, but it is a big step in the right direction.

Update: There are more details on how MySQL handles failures and recovery in a 2005 VLDB paper, "Recovery Principles of MySQL Cluster 5.1" (PDF).

Tuesday, May 09, 2006

Recent Google talks on Google Video

There have been several interesting talks at Google that recently were published on Google Video.

They take a long time to watch, but I enjoyed a bunch of them. Among my favorites were the informative (but troubling) "A Googly MySQL Cluster Talk", Brion Vibber's "Wikipedia and MediaWiki", and Barry Schwartz's excellent "The Paradox of Choice - Why More Is Less".

[via Nathan Weinberg]

Update: If you want to see any future Google TechTalks as they are published, it appears an RSS feed is available. Not sure if that is supported by Google -- they don't link to that feed anywhere -- but it seems to work fine.

Wikipedia and databases

Brion Vibber from Wikimedia gave an interesting talk recently at Google about Wikipedia. Most of the talk was about scaling Wikipedia under the recent massive spike in demand.

Their scaling strategy relies heavily on a large caching layer that focuses on caching entire HTML pages. The idea here is that the vast majority of accesses to Wikipedia are anonymous reads, so the same pages can be served up to those people.

This does work pretty well -- apparently, 78% of accesses are served from the Squid caches and another 7% on top of that get served from Memcached -- but it appears they basically have to toss the cache if anything on the page is different, including for logged-in users.

During the talk, I was a little surprised they were not more focused on caching at the data layer, focusing on making the underlying databases serve data rapidly instead of trying to avoid the databases.

If I understand it right, the Wikipedia architecture has all the Wikipedia data thrown into one giant database. Everything is in there, all the tables, even the large indexes for full text search. Then, they tossed on a few slave databases that appear to take replicated copies of the entire database.

All this data on one machine appears to mean they are hitting disk a lot on the database servers.

That doesn't seem necessary. The entire Wikipedia database appears to be a few hundred gigabytes. It should be possible to get most of that in-memory in a horizontally partitioned database cluster of a couple dozen machines.

In an ideal world, we'd be talking about something like the Google Cluster, where shards of the data are distributed across many machines and accessed in parallel, or a big virtual database like I craved in an earlier post.

But, let's stick with low hanging fruit here. So, to start, I would pull text search out of MySQL. Yes, I know, it's so lazily convenient to use MySQL full text search, but the performance doesn't seem to be where it needs to be. Moreover, it almost certainly is desirable for something like search to have its own dedicated hardware, not to be competing for resources with an RDBMS on the same box.

Then, I'd partition the data. Get it so each box holds a shard of the data small enough that the disk stays quiet. I really am tempted to start rambling about something other than a simple partition of the existing MySQL tables and MySQL replication here, but we're talking about low hanging fruit, so I'll stick with MySQL and what is simple to get done.
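Here is a rough sketch of the kind of partitioning I mean. The shard count and connection strings are made up, and a real setup would also need replication and a plan for resharding, but the routing itself is this simple.

```python
# Rough sketch of hash partitioning: route each article to one of N database
# shards by hashing its key, so each box holds a slice of the data small
# enough to stay in memory. Shard count and DSN naming are hypothetical.
import zlib

NUM_SHARDS = 24  # e.g. a couple dozen boxes for a few hundred GB of data

def shard_for(article_title, num_shards=NUM_SHARDS):
    # Stable hash so every client maps the same title to the same shard.
    return zlib.crc32(article_title.encode("utf-8")) % num_shards

def connection_for(article_title):
    # Hypothetical DSN naming; in practice this comes from configuration.
    return "mysql://wikidb%02d/wikipedia" % shard_for(article_title)

print(connection_for("Seattle"))
```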

If almost all the Wikipedia database accesses never touched a disk, I suspect performance of the database layer might be good enough that parts of that HTML caching layer become unnecessary or even counterproductive. If so, the freed up boxes could be shifted from the HTML caching layer to the database layer, improving performance and scalability even further.

Near the end of the talk, Brion seemed to suggest that Wikipedia was going to focus its scaling efforts on boosting cache hit rates, more work on that last layer right before the client. It appeared to me that this might involve some fairly complicated work to figure out exactly when parts of the coarse-grained HTML cache are valid.

I wonder if the focus might be better spent on the data layer, getting the performance there to the point that caching further out becomes unnecessary.

I have to say, it is great fun looking at this kind of large scale problem. I wish I could dig in deeper and pore over profile data for the site.

The efforts of the tiny Wikipedia development team are really impressive. Not only have they managed to scale to crushing loads, but also they have done it on an unbelievably meager budget. They have much deserved loyalty from a community that wants to see them continue to thrive and grow.

Update: Interesting March 2006 messages ([1] [2]) on the Wikitech mailing list from Brion Vibber show that Wikipedia is entertaining the idea of switching to MySQL Cluster. An excerpt:
We don't support MySQL Cluster at this time, as it's currently limited in many ways. (Everything must fit in memory, can't have certain field types, etc.)

Currently waiting for the upcoming versions which allow disk-backed data, etc.

No point in rushing in when our own contributors who are MySQL employees are telling us to wait for it to mature a bit.
MySQL Cluster looks tempting but, as Brion said, it wouldn't work right now since everything wouldn't fit in memory within the maximum cluster size and other restrictions. Probably better to look at manually partitioning the data to get it in memory.

See also my post, "MySQL Cluster and my big virtual database".

Despair on corporate spin

Maybe it is too dark for some, but the same side of me that loves Dilbert cartoons really enjoyed the Despair, Inc. video podcasts on corporate spin. Painfully hilarious.

If you like that, you might also enjoy the classic Demotivators posters.

Thursday, May 04, 2006

Search without searching

Mike Shields at MediaWeek summarizes parts of a talk by Chris Payne (Microsoft VP of Windows Live Search, former GM at Amazon.com).

I found this tidbit particularly interesting:
MSN is working on making its search product more personalized, incorporating individual users' behavior ...

Eventually, according to Payne, MSN search will become so sophisticated that users will receive search results proactively - i.e. before they even know they want to search for something.
When I was at Amazon working on personalization, we used to joke that the ideal Amazon website would just display a giant picture of one book, the next book you want to buy.

Maybe the ideal of information retrieval is to provide relevant, helpful, and useful information even in the face of very limited explicit data about what each person wants. Maybe the ideal search engine requires no search at all.

It may be a goal that we never completely can reach, but still one to which we should aspire.

See also my previous post, "Finding and discovering", where I talk about the Implicit Query project at Microsoft Research.

See also my previous post, "Personalized search at PC Forum", where I said, "If you need to read minds to prevent [people] from having to do work, well then you better read minds. They'll think it's your fault, not theirs, if you don't give them what they need."

Database war stories: Findory

Tim O'Reilly posted a short interview with me that discusses the backend architecture of Findory.

Is it Windows Live, MSN, or Microsoft?

Jennifer Slegg at SEW reports that the upcoming MSN adCenter product has changed its name to Microsoft adCenter.

Jennifer asks, "With all search related products branded under the MSN name, why change MSN adCenter to Microsoft adCenter?" Danny Sullivan adds, "I guess the biggest surprise is that they didn't call it Windows Live adCenter."

Richard MacManus is a little more blunt, saying:
It's indicative of the general branding chaos that has been evident at Microsoft in recent times. From the MSN vs Live confusion, to the just announced re-naming of adCenter from MSN adCenter to Microsoft adCenter. At the very least it looks like MSN as a brand name is being, rather clumsily, ushered out the door.
See also my previous post, "Is it Live or MSN?"

See also Dare Obasanjo's post, "One of These is Not Like the Others".

Update: Now this is getting funny. Microsoft apparently decided to name two entirely different products "Windows Live Search", a move Mary Jo Foley calls "a new low, in terms of bad naming choices" and Todd Bishop labels the "George Foreman naming strategy".

Wednesday, May 03, 2006

Early Amazon: The end

I spent several more years at Amazon. Amazon grew and grew.

Amazon expanded from a tiny online bookstore into an online superstore, selling books, music, videos, software, video games, electronics, toys, hardware, clothing, jewelry, and much, much more.

In later years, I went on to lead the software team in the Personalization group. We did great things. I am proud of everything we accomplished.

But, those first years will always hold a special place in my memories. It is something I likely will see only once in my lifetime.

Below are links to all the posts in my Early Amazon series. I hope you enjoyed reading them as much as I enjoyed writing them.

The Early Amazon series:
  1. The series
  2. The first week
  3. Group discounts
  4. Door desks
  5. BookMatcher
  6. boy-am-i-hard-to-please
  7. Inventory cache
  8. Dogs
  9. Xmas at the warehouse
  10. Splitting the website
  11. Pagers, pagers
  12. Interviews
  13. Similarities
  14. 1996 holiday party
  15. Oracle down
  16. Recommendations
  17. Shopping cart recommendations
  18. Just do it
  19. Auctions
  20. The end
I feel privileged to have had the opportunity to work at Amazon.com alongside such talented, dedicated, and passionate people. It was a remarkable experience.

Update: See also my post about Amazon.cult.

Yahoo, here comes Microsoft?

Robert Guth and Kevin Delaney at the WSJ report that Microsoft is considering acquiring part of Yahoo. Some excerpts:
One faction within Microsoft Corp. is promoting a bold strategy in the company's battle with Google Inc: Join forces with Yahoo Inc.

A Microsoft-Yahoo combination could merge complementary strengths. To succeed in Internet-search advertising -- the business driving Google's growth -- a competitor needs three core elements: strong technology, a mass of consumers and a universe of different advertisers.

Microsoft is spending untold hundreds of millions of dollars on the technology piece, but it doesn't yet have enough consumers using its MSN service to entice the needed advertisers.

A tie-up with Yahoo could address part of that problem. It has more than 100 million people visiting its site a month, making it the most popular Web site in the U.S. So far it is losing the race to Google when it comes to the technology for matching ads to consumer search queries, though it plans to unveil an upgrade to its system this month.

Combined, MSN and Yahoo would have all three pieces and, at least on paper, could leapfrog Google.

Behind the scenes at Microsoft there are two factions of thinking about a Yahoo deal, say people familiar with Microsoft.

One, largely led by MSN veterans, has been focused on Microsoft building its own answer to Google. So far that group has prevailed.

Pushing for more is Hank Vigil, a Microsoft senior vice president who internally is advocating for Microsoft to do a major deal such as a tie-up with Yahoo.
If a Microsoft/Yahoo partnership were to happen, I suspect it would be motivated more by frustration than anything. MSN management has been promising for years that they are "six months" away from catching Google, a window that seems to keep shifting forward.

While the article correctly argues that MSN's focus should be on internet advertising, mashing the MSN and Yahoo beasties together is unlikely to yield beautiful offspring. If the problem is increasing the number of websites carrying Microsoft's advertising, as the article suggests, I would think MSN would sign on many smaller players rather than becoming entangled with Yahoo.

This is not the first time Microsoft has considered a mega-merger like this. Eight months ago, MSN was in talks with AOL, talks that apparently included the possibility of merger. For more on that, see my previous post, "MSN and AOL, the kissing behemoths".

In any case, I do think we will see a frenzy of biz dev deals coming out of MSN. For example, Amazon recently switched to Microsoft for web search instead of Google, a deal that, as Danny Sullivan mentions, may include running MSN adCenter ads on Amazon.com websites. I am sure there is much more of this to come.

[WSJ article found via Danny Sullivan]

Update: Five months later, this idea of Microsoft acquiring Yahoo is coming up again.

Tuesday, May 02, 2006

Data on Google Mobile Search

A couple Googlers wrote a CHI 2006 paper (PDF) that summarizes the characteristics of the searches on Google Mobile Search.

Update: On a related note, I enjoyed this talk by Patrick Baudisch from Microsoft Research about overcoming the challenges of the small displays on mobile devices.

Personalized news from Spotback

Michael Arrington reports that a new startup, Spotback, has launched a personalized news site.

From the Spotback FAQ:
Spotback is a new breed of personalized news service. It is designed to quickly learn each user's fields of interest and style by analyzing how users rate and interact with news information.

It then offers users the most interesting, relevant and hard to filter news information personally tailored to their taste. Spotback uses sophisticated algorithms that analyze social behavior.
The site is clean, responsive, and easy to use. No login is required, just start rating articles. When you rate an article, a similar article immediately will slide on to your screen in a nifty AJAX-y way.

Unlike Findory or the recommended stories section in Google News or MSN Newsbot, articles you click on and read do not change your recommendations. On Spotback, only explicitly rated articles are used for the personalization.

Spotback recommended articles are marked with a button that says, "[people] also liked". Clicking on that button brings up an explanation of why the article was recommended. The explanations list people similar to you, suggesting that Spotback is using a form of collaborative filtering or, most likely, user clustering to power its recommendation engine.
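As a guess at the general flavor of that approach (and it is only a guess, not Spotback's actual system, with made-up article names), here is a toy sketch of recommending from the raters who most agree with you:

```python
# Toy sketch of user-similarity recommendations from explicit ratings: find
# the users whose ratings most agree with yours, then surface articles they
# rated up that you have not rated yet.

def similarity(my_ratings, their_ratings):
    shared = set(my_ratings) & set(their_ratings)
    # Count agreements minus disagreements on articles both users rated.
    return sum(1 if my_ratings[a] == their_ratings[a] else -1 for a in shared)

def recommend(my_ratings, all_users, num_neighbors=2):
    neighbors = sorted(all_users, key=lambda u: similarity(my_ratings, u),
                       reverse=True)[:num_neighbors]
    recs = set()
    for u in neighbors:
        recs |= {a for a, r in u.items() if r > 0 and a not in my_ratings}
    return recs

users = [
    {"google-ai": 1, "yahoo-ebay": 1},
    {"google-ai": 1, "bigtable": 1},
]
print(recommend({"google-ai": 1, "adcenter": -1}, users))
# {'yahoo-ebay', 'bigtable'}
```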

In my experience, the recommendations seemed off. Rating three articles about Google positively brought up recommendations for "Guerrilla marketing Euro Cafes", "Online and Live Poker", and "People Talking about Architecture" among other things.

This could indicate a problem with the recommendations algorithms or could be due to lack of user behavior data. It will be interesting to watch Spotback over time and see how the recommendations change.

Spotback gives me flashbacks to Memigo, a news recommendation site that preceded Findory, though Memigo is missing the whiz-bang AJAX features.

Interesting to see new startups launching around the idea of personalized information.

Monday, May 01, 2006

Recommending advertisements

Omid Madani and Dennis DeCoste at Yahoo Research authored a short paper, "Contextual Recommender Problems" (PDF), that discusses targeting advertisements as a recommendations problem.

Some extended excerpts:
When a user visits a page, the systems task is to pick a certain ad topic, and from that ad topic pick a certain ad to show. The objective is to maximize click rate over some period of time.

We could ... [treat this as a] ... Bayesian formulation of the n-armed bandit problem ... and figure out which arms work best for that person.

A major problem we face with this approach is the problem of sparsity: there may be many ad topics available (thousands and beyond) ... the number of interactions we may get from a single user ... may be very small ... The average baseline click rate ... is very low (e.g., below one percent).

Information about user behavior and potential user and arm similarities that can ... help the choice of displayed arms.

When we want to select ads to display to a single user ... the problem is very similar to recommendation problems and involves many similar issues: users ... with similar tastes, missing values, and our choice of the columns/items to show.

Several recommender solutions methods, in particular collaborative filtering approaches, as well as techniques such as dimensionality reduction and clustering, nearest neighbors and other machine learning methods, apply ... as well.

The challenge here is how to do exploration and exploitation with the understanding that information obtained about a single user can help the whole community of users, and information about the community can help better serve a single user.

We are not aware of prior research that addresses exploration and exploitation in such large dimensional spaces, taking community (collaborative filtering effects) into account.
Targeting advertisements requires matching millions of ads to millions of users on billions of web pages.

Data is extremely sparse on which ads are most effective for specific people and pages. Overcoming the sparse data will require combining contextual information about the ads and pages with knowledge of user interests and behavior to determine similar ads, people, and pages.

The problem then becomes a recommendation problem. For a specific user seeing a specific page, the most relevant ads most likely will be similar to ads that similar users found relevant on similar pages.

The combination of ads we pick will depend on whether we are exploring, learning more about how well some ads work for some pages and some users, or exploiting, seeking to maximize revenue based on the data we already have.
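To make the explore/exploit tradeoff concrete, here is a minimal epsilon-greedy sketch. The paper discusses richer Bayesian bandit formulations that also share information across similar users, ads, and pages; this strips the idea down to per-ad click-rate estimates with occasional random exploration, and the ad names are invented for the example.

```python
# Minimal epsilon-greedy sketch of the explore/exploit tradeoff: mostly show
# the ad with the best observed click rate, but occasionally show a random ad
# to keep learning about the others.
import random
from collections import defaultdict

class AdSelector:
    def __init__(self, ads, epsilon=0.1):
        self.ads = ads
        self.epsilon = epsilon
        self.shows = defaultdict(int)
        self.clicks = defaultdict(int)

    def click_rate(self, ad):
        return self.clicks[ad] / self.shows[ad] if self.shows[ad] else 0.0

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.ads)           # explore
        return max(self.ads, key=self.click_rate)    # exploit

    def record(self, ad, clicked):
        self.shows[ad] += 1
        self.clicks[ad] += int(clicked)

selector = AdSelector(["ad-travel", "ad-camera", "ad-insurance"])
ad = selector.choose()
selector.record(ad, clicked=False)
```

In practice, as the paper argues, per-ad counters like these would be far too sparse; the interesting work is in pooling the statistics across similar ads, users, and pages.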