Monday, December 19, 2005

The folly of ignoring scaling

David Heinemeier Hansson of 37signals (and creator of Ruby on Rails) wrote what I thought was a pretty extreme post two weeks ago, "Don't scale", arguing that startups should ignore scaling and performance.

Ironically, in the following two weeks, many popular Web 2.0 startups have had problems, including a multi-day outage at del.icio.us, an 18+ hour outage at SixApart's blogging service Typepad, performance that has "sucked eggs" at Bloglines, and, as GrabPerf reports, slowness and outages at Technorati, Feedster, BlogPulse, BlogDigger, and Digg.

Stepping back for a second, a toned-down version of David's argument is clearly correct. A company should focus on users first and infrastructure second. The architecture, the software, the hardware cluster: these are just tools. They serve a purpose, to help users, and have little value on their own.

But this extreme argument that scaling and performance don't matter is clearly wrong. People don't like to wait and they don't like outages. Getting what people need quickly and reliably is an important part of the user experience. Scaling does matter.

See also Om Malik's post, "The Web 2.0 hit by outages".

Update: Several months later, one of the first blog search engines, Daypop, goes offline because of scaling issues. The site says, "Daypop no longer has enough memory to calculate the Top 40 and other Top pages ... Daypop won't be back up until a new search/analysis engine is in place." Daypop has been down for a few months since this message was posted.

Update: Sixteen months later, in an interview, Twitter Developer Alex Payne says:
Twitter is the biggest Rails site on the net right now. Running on Rails has forced us to deal with scaling issues - issues that any growing site eventually contends with - far sooner than I think we would on another framework.

The common wisdom in the Rails community at this time is that scaling Rails is a matter of cost: just throw more CPUs at it. The problem is that more instances of Rails (running as part of a Mongrel cluster, in our case) means more requests to your database. At this point in time there's no facility in Rails to talk to more than one database at a time.

The solutions to this are caching the hell out of everything and setting up multiple read-only slave databases, neither of which are quick fixes to implement. So it's not just cost, it's time, and time is that much more precious when people can['t] reach your site.

None of these scaling approaches are as fun and easy as developing for Rails. All the convenience methods and syntactical sugar that makes Rails such a pleasure for coders ends up being absolutely punishing, performance-wise. Once you hit a certain threshold of traffic, either you need to strip out all the costly neat stuff that Rails does for you (RJS, ActiveRecord, ActiveSupport, etc.) or move the slow parts of your application out of Rails, or both.
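To make the fix Alex describes a little more concrete, here is a minimal sketch of that read/write split. This is not Twitter's code: sqlite3 in-memory connections stand in for a MySQL master and its read-only slaves, and replication is assumed to happen out of band.

    import random
    import sqlite3

    class ReadWriteRouter:
        """Send writes to the master; spread reads across read-only slaves."""

        def __init__(self, master, slaves):
            self.master = master
            self.slaves = slaves

        def write(self, sql, params=()):
            # All mutations go to the single master.
            cur = self.master.execute(sql, params)
            self.master.commit()
            return cur

        def read(self, sql, params=()):
            # Pick a slave at random; results may lag the master slightly
            # because replication is asynchronous.
            slave = random.choice(self.slaves)
            return slave.execute(sql, params).fetchall()

    master = sqlite3.connect(":memory:")
    slaves = [sqlite3.connect(":memory:") for _ in range(2)]
    router = ReadWriteRouter(master, slaves)
    router.write("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")
    # With real replication in place, router.read("SELECT body FROM tweets")
    # would now be served by one of the slaves.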

12 comments:

Anonymous said...

You're absolutely right. Friendster used the same centralized DB model that architecturally limits most Web 2.0 startups, and it cost them dearly. Being able to scale doesn't mean overprovisioning or overdesigning when you haven't figured your product out and don't have any users. It does mean architecting your system in a way that it's at least possible to scale when the traffic comes and you have funds to pay for more servers. But many services are built in ways that make them very difficult to scale once they hit a success curve.

Matt James said...

I'm afraid I've gotten a different impression from David's post. He isn't saying that you should let yourself sink into extended periods of inoperability. He's just saying that an obsession with uptime is misplaced when you're starting a new company, and that you should worry about getting users first and scale when you need to. I think he's spot on: a lot of the energy spent scaling unnecessarily is energy taken away from the front-facing user experience.

Greg Linden said...

Hi, Matt. Sorry, I don't agree.

David starts by saying that startups don't need 99.999% uptime. This is clearly a strawman, so I ignored it. No one is arguing that startups need 99.999% uptime (an outage of no more than 5 minutes per year), not Jeremy Wright, not anyone.

But then David says "don't scale" and says it repeatedly. He goes on to argue that startups shouldn't even try for "two 9's", meaning 99% uptime (roughly a 15 minute outage every day). He says that startups should deal "with problems as they arise," meaning after the performance problems show up for users. In the comments, he says he "totally" disagrees "that you have to think about scalability on day 1 or it'll be more expensive later", dismissing it as "fortune telling", and claims that using LAMP is sufficient.
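For reference, the downtime budgets behind those uptime figures are just unit conversion (a quick sketch):

    # Downtime allowed per year and per day at each uptime level.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    for uptime in (99.0, 99.9, 99.999):
        per_year = MINUTES_PER_YEAR * (1 - uptime / 100)
        print(f"{uptime}% uptime -> ~{per_year:,.0f} min of downtime/year "
              f"(~{per_year / 365:.1f} min/day)")

That works out to roughly 5,256 minutes a year (about 14 a day) at 99%, and about 5 minutes a year at 99.999%.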

It's dangerous advice. Scaling and performance should not be afterthoughts, something you deal with only when your site starts melting down.

Jeremy Wright's post (which David says "purports a common misconception") does basically have it right. Jeremy says that startups shouldn't try for 99.9% uptime, but that people should think a bit about scaling and performance instead of the common pattern of getting traffic, then melting down, then furiously trying to rewrite all the code to dig themselves out of the hole.

My recommendation is modest: always code thinking about whether the system would still work with 10x the load. Refactor and rearchitect as necessary, but also preemptively. Don't spend a huge amount of time on it, but don't paint yourself into a corner either.
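To make that concrete, the check can be as crude as back-of-envelope arithmetic. All the numbers in this sketch are hypothetical:

    # Would the current setup survive 10x the load?
    peak_load = 40            # requests/sec at today's peak (hypothetical)
    cost_per_request = 0.05   # CPU-seconds of work per request (hypothetical)
    servers = 4
    capacity = servers / cost_per_request  # ~80 requests/sec across the cluster

    for multiplier in (1, 10):
        load = peak_load * multiplier
        verdict = "fits" if load <= capacity else "melts down"
        print(f"{multiplier}x: {load} req/s against ~{capacity:.0f} req/s -> {verdict}")

If the 10x line says "melts down", that's the cue to refactor before the traffic arrives, not after.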

Finally, Matt, I disagree with your last point, that energy spent on scaling is unnecessary and takes away from the user experience. A responsive, reliable website is part of the user experience. People don't like to wait and they don't like outages. Scaling does matter.

Matt James said...

Being able to scale is important, I totally agree. No one wants to (as you so eloquently put it) "paint themselves into a corner". I don't think David would advocate that either. Ideally, the adventure into scaling would take place when you know that you'll be finished performance-refining your app by the time you need it. Of course, this is hardly ever the case, as predicting human behavior can be very difficult. Knowing that your store will hit 1,000 concurrent shoppers within 2 weeks of launch is no easy task.

Although David might sensationalize the underlying argument so that it has a greater effect on the audience, I believe his warning is applicable. Generally speaking, it is a warning to programmers not to over-architect. You could spend years working on the base architecture for something only to find out within hours of launch that you were wrong. David (and Jason) have both spoken on many occasions about getting real-world data and basing changes on it. The mentality is that you can't predict the future, and that you need to quickly respond to problems that are actually occurring rather than try to plan for every possible problem in advance.

In essence, the way that David (and Jason) put forth their opinions can sometimes seem arrogant or extreme, making the underlying material lose credibility when, in fact, many people would agree with it wholeheartedly if it were conveyed in a different manner.

I think that the ability to scale and change quickly is built into the abstraction of the app itself. If you only have one focal point that needs to be addressed to move from 1 server to 10, you are in good shape. It's when you have to hit 5 or 20 places just to make something happen that you run into trouble.
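As a rough sketch of that one focal point (hypothetical names throughout): route every lookup through a single function, and growing the server list becomes a change in exactly one place.

    import hashlib

    # The one focal point: every caller asks this function where a key lives.
    # Moving from 1 server to 10 means changing SERVERS and nothing else.
    SERVERS = ["db1.example.com"]  # later: db1 through db10

    def server_for(key):
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return SERVERS[digest % len(SERVERS)]

    print(server_for("user:42"))  # -> db1.example.com, for now

(Naive modulo hashing remaps most keys when the list grows; consistent hashing avoids that. But the point here is the single point of change.)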

All in all, I think almost everyone will agree that some scalability needs to be addressed, but not to go crazy on it.

Greg Linden said...

By the way, Jeremy Wright has a good followup post, "Does scaling matter?"

Anonymous said...

Greg, I read the Bloglines interview where Mark says it is difficult to build scalability before you know what it looks like. How do you respond to the scaling discussion in this article:

http://www.niallkennedy.com/blog/archives/2006/05/mark-fletcher-bloglines-onelist.html#transcript

Greg Linden said...

Hi, Rob. I think Mark and I agree on this.

Mark's quote was, "I think it's more important to get something out there, initially, as soon as possible, as opposed to get something out there that scales perfectly but may take an additional year ... You may not even know what the usage pattern of your site is until after it starts getting used. So you don't even know what needs to be scaled and what doesn't."

You don't want to take an additional year. You don't want to do premature optimization of code paths that may be unimportant. On the other hand, it is worth spending a little time thinking about scaling and performance. And Mark did.

For example, in a different interview, Mark said, "The vast majority of data within Bloglines ... are stored in a data storage system that we wrote ourselves. This system is based on flat files that are replicated across multiple machines ... We make extensive use of memcached to try to keep as much data in memory as possible to keep performance as snappy as possible."

Mark is thinking about scaling and performance. He went as far as to build a custom data store for performance reasons. But, he's not willing to spend a lot of time or money on it.
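As a rough sketch of the pattern Mark describes, sometimes called cache-aside: check the cache, fall back to the backing store on a miss, and populate the cache on the way out. A plain dict stands in for a real memcached client here, and the storage function is a hypothetical stand-in for Bloglines' flat-file system.

    cache = {}

    def load_from_store(key):
        # Pretend this is a slow read from the flat-file store.
        return "value-for-" + key

    def get(key):
        value = cache.get(key)
        if value is None:                 # cache miss
            value = load_from_store(key)  # slow path: hit the backing store
            cache[key] = value            # next read is served from memory
        return value

    print(get("feed:123"))  # miss on the first call, cached afterwards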

Dossy Shiobara said...

Web 2.0: Web 1.0 without the uptime.

Anonymous said...

If you're starting a new company, shouldn't the business plan give you a reasonable estimate of the expected traffic? I can see getting caught by surprise, but that's very different from not doing your homework.

If you know you're building a system for 1 million users, why would you NOT build a system for 1 million users? For any modestly successful Web 2.0 site, scalability is one of the base requirements. Throwing money at it later doesn't necessarily fix the issue (and it could be too late). Someone once told me that it was easy to spend $1.05 to make $1.00.

Anonymous said...

Yes, but then someone wrote a 75-line Ruby program that let Twitter use multiple databases, and it took a single day.

So basically Twitter was just groaning. If they knew what abstractions they were using, and had an idea of their limitations, they could have solved the problem themselves. In fact, it would have taken less time to solve the problem than to complain about it.

Not to mention MySQL clustering, which existed before they had problems.

The real moral of the story is that scaling does not matter, as long as you know how to scale.

Anonymous said...

Yes, I agree Twitter was just moaning... I think the bigger problem with startups and scaling is not having anyone who can respond to scaling issues.

You don't need a lot of tech know-how to write a Ruby on Rails app, but when the shit hits the fan, so to speak, you'd better have someone who can handle it.

It's OK to be obsessed with scaling - as long as when you start to write your app you don't have to worry about it so much :) I doubt every piece of code in Amazon has to obsess over scale - I bet it's built on top of some very efficient architecture.

Anonymous said...

You know, I think "ignore scaling" can be replaced by another axiom that may be equally contentious, but one which I think overlaps with it and which I have a lot more fun defending.

That axiom is "count on re-coding from the ground up every 6 months or as needed". This isn't nearly as burdensome as it sounds. It's a great software development practice if it's done right and applied in the right situations, namely when your application has fewer than 10,000 features (MS Office, go home!).