Monday, December 19, 2005

The folly of ignoring scaling

David Heinemeier Hansson at 37signals (and creator of Ruby on Rails) wrote what I thought of as a pretty extreme post two weeks ago, "Don't scale", that argues that startups should ignore scaling and performance.

Ironically, in the following two weeks, many popular Web 2.0 startups have had problems, including a multi-day outage at del.icio.us, an 18+ hour outage at SixApart's blogging service Typepad, performance that has "sucked eggs" at Bloglines, and, as GrabPerf reports, slowness and outages at Technorati, Feedster, BlogPulse, BlogDigger, and Digg.

Stepping back for a second, a toned down version of David's argument is clearly correct. A company should focus on users first and infrastructure second. The architecture, the software, the hardware cluster, these are just tools. They serve a purpose, to help users, and have little value on their own.

But this extreme argument that scaling and performance don't matter is clearly wrong. People don't like to wait and they don't like outages. Getting what people need quickly and reliably is an important part of the user experience. Scaling does matter.

See also Om Malik's post, "The Web 2.0 hit by outages".

Update: Several months later, one of the first blog search engines, Daypop, goes offline because of scaling issues. The site says, "Daypop no longer has enough memory to calculate the Top 40 and other Top pages ... Daypop won't be back up until a new search/analysis engine is in place." Daypop has been down for a few months since this message was posted.

Update: Sixteen months later, in an interview, Twitter Developer Alex Payne says:
Twitter is the biggest Rails site on the net right now. Running on Rails has forced us to deal with scaling issues - issues that any growing site eventually contends with - far sooner than I think we would on another framework.

The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there's no facility in Rails to talk to more than one database at a time.

The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it's not just cost, it's time, and time is that much more precious when people can['t] reach your site.

None of these scaling approaches are as fun and easy as developing for Rails. All the convenience methods and syntactical sugar that makes Rails such a pleasure for coders ends up being absolutely punishing, performance-wise. Once you hit a certain threshold of traffic, either you need to strip out all the costly neat stuff that Rails does for you (RJS, ActiveRecord, ActiveSupport, etc.) or move the slow parts of your application out of Rails, or both.
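As a rough illustration of the read-only slave approach Alex mentions, here is a minimal sketch of read/write splitting, written in Python rather than Rails. The `ReadWriteRouter` class, its method name, and the connection labels are hypothetical stand-ins, not anything from Twitter's code or from Rails itself. It assumes that any statement starting with SELECT is safe to send to a replica.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary when no replicas are configured.
        self._readers = itertools.cycle(replicas or [primary])

    def connection_for(self, sql):
        # Assumption: plain SELECT statements are reads; all else is a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._readers) if is_read else self.primary


# Strings stand in for real database connections in this sketch.
router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.connection_for("SELECT * FROM statuses"))
print(router.connection_for("INSERT INTO statuses VALUES (1)"))
```

Even in a sketch this small, the hard parts are visible by their absence: replicas lag behind the primary, so a read issued right after a write may return stale data, which is part of why this is not a quick fix.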

10 comments:

Anonymous said...

You're absolutely right. Friendster used the same centralized DB model that architecturally limits most of the web2.0 startups, and it cost them dearly. Being able to scale doesn't mean overprovisioning or overdesign when you haven't figured your product out and don't have any users. It does mean architecting your system in a way that it's at least possible to scale, when the traffic comes and you have funds to pay for more servers. But many services are built in ways that make them very difficult to scale once they hit a success curve.

Greg Linden said...

Hi, Matt. Sorry, I don't agree.

David starts by saying that startups don't need 99.999% uptime. This is clearly a strawman, so I ignored it. No one is arguing that startups need 99.999% uptime (an outage of no more than 5 minutes per year), not Jeremy Wright, not anyone.

But then David says "don't scale" and says it repeatedly. He goes on to argue that startups shouldn't even try to get "two 9's", meaning 99% uptime (a 15 minute outage every day). He says that startups should deal "with problems as they arise," meaning after the performance problems show up to users. In the comments, he says he "totally" disagrees "that you have to think about scalability on day 1 or it'll be more expensive later", slandering it as "fortune telling", and claims that using LAMP is sufficient.
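For anyone who wants to check the uptime figures quoted above, the arithmetic is short. This is a sketch assuming a 365-day year; the function name is made up for illustration. The "5 minutes" and "15 minute" figures are rounded versions of what it prints.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # assumes a non-leap year


def allowed_downtime_minutes(uptime_percent, period_minutes):
    """Minutes of outage permitted in a period at a given uptime percentage."""
    return period_minutes * (100 - uptime_percent) / 100


five_nines_per_year = allowed_downtime_minutes(99.999, MINUTES_PER_YEAR)
two_nines_per_day = allowed_downtime_minutes(99, MINUTES_PER_DAY)

print(f"99.999% uptime: {five_nines_per_year:.2f} minutes per year")  # 5.26
print(f"99% uptime: {two_nines_per_day:.1f} minutes per day")         # 14.4
```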

It's dangerous advice. Scaling and performance should not be an afterthought. It should not be something you deal with only when your site starts melting down.

Jeremy Wright's post (which David says "purports a common misconception") does basically have it right. Jeremy says that startups shouldn't try for 99.9% uptime, but that people should think a bit about scaling and performance instead of the common pattern of getting traffic, then melting down, then furiously trying to rewrite all the code to dig themselves out of the hole.

My recommendation is modest: always code thinking about whether the system would still work if it had 10x the load. Refactor and rearchitect as necessary but also preemptively. Don't spend a huge amount of time on it, but don't paint yourself into a corner either.

Finally, Matt, I disagree with your last point, that energy spent on scaling is unnecessary and takes away from the user experience. A responsive, reliable website is part of the user experience. People don't like to wait and they don't like outages. Scaling does matter.

Greg Linden said...

By the way, Jeremy Wright has a good followup post, "Does scaling matter?"

Anonymous said...

Greg, I read the Bloglines interview where Mark says it is difficult to build scalability before you know what it looks like. How do you respond to the scaling discussion in this article:

http://www.niallkennedy.com/blog/archives/2006/05/mark-fletcher-bloglines-onelist.html#transcript

Greg Linden said...

Hi, Rob. I think Mark and I agree on this.

Mark's quote was, "I think it's more important to get something out there, initially, as soon as possible, as opposed to get something out there that scales perfectly but may take an additional year ... You may not even know what the usage pattern of your site is until after it starts getting used. So you don't even know what needs to be scaled and what doesn't."

You don't want to take an additional year. You don't want to do premature optimization of code paths that may be unimportant. On the other hand, it is worth spending a little time thinking about scaling and performance. And Mark did.

For example, in a different interview, Mark said, "The vast majority of data within Bloglines ... are stored in a data storage system that we wrote ourselves. This system is based on flat files that are replicated across multiple machines ... We make extensive use of memcached to try to keep as much data in memory as possible to keep performance as snappy as possible."

Mark is thinking about scaling and performance. He went as far as to build a custom data store for performance reasons. But, he's not willing to spend a lot of time or money on it.
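To make the caching idea concrete, here is a minimal cache-aside sketch in Python. It is not Bloglines code: a plain dict stands in for memcached, and `CacheAside` and the loader function are hypothetical names chosen for illustration. The point is only the pattern of checking memory first and falling back to the slower store on a miss.

```python
class CacheAside:
    """Check an in-memory cache first, then fall back to a slower store."""

    def __init__(self, load_from_store):
        self._load = load_from_store  # hypothetical slow lookup (disk, database)
        self._cache = {}              # dict standing in for memcached
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        # Cache miss: pay for the slow lookup once, then remember the result.
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        # Drop a cached entry after the underlying data changes.
        self._cache.pop(key, None)


feeds = CacheAside(lambda key: f"contents of {key}")
feeds.get("feed:42")
feeds.get("feed:42")
print(feeds.misses)  # only the first lookup reached the slow store
```

A real deployment also has to decide when entries expire and how much memory to give the cache, which this sketch leaves out.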

Dossy Shiobara said...

Web 2.0: Web 1.0 without the uptime.

Anonymous said...

If you're starting a new company, shouldn't the business plan give you a reasonable estimate of the expected traffic? I can see getting caught by surprise, but that's very different from not doing your homework.

If you know you're building a system for 1 million users, why would you NOT build a system for 1 million users? For any modestly successful Web 2.0 site, scalability is one of the base requirements. Throwing money at it later doesn't necessarily fix the issue (and it could be too late). Someone once told me that it was easy to spend $1.05 to make $1.

Anonymous said...

Yes, but then someone wrote a 75 line Ruby program and Twitter could then use multiple databases, and it took a single day.

So basically Twitter was just groaning. If they knew what abstractions they were using, and had an idea of the limitations of them, they could have solved the problem themselves. In fact it would have taken less time to solve the problem than complain about it.

Not to mention MySQL clustering, which existed before they had problems.

The real moral of the story is that scaling does not matter, as long as you know how to scale.

Anonymous said...

Yes I agree Twitter was just moaning... I think the bigger problem with startups and scaling is not having anyone who can respond to scaling issues.

You don't need a lot of tech knowhow to write a Ruby on Rails app, but when shit hits the fan, so to speak, you better have someone who can handle this.

It's OK to be obsessed with scaling - as long as when you start to write your app you don't have to worry about it so much :) I doubt every piece of code in Amazon has to obsess over scale - I bet it's built on top of some very efficient architecture.

Anonymous said...

You know I think "ignore scaling" can be replaced by another axiom that may be equally contentious but one which I think overlaps and which I have a lot more fun defending.

That axiom is "count on re-coding from the ground up every 6 months or as needed". This isn't nearly as burdensome as it sounds. It's a great software development practice if it's done right and when it's applied in the right situations (namely when your application has fewer than 10,000 features (MS Office go home!)).