Saturday, October 06, 2007

Starting Findory: Hardware go boom

Computers go down, a lot more often than we might like.

For most of Findory's four years, it ran on six servers. In that time, those servers had one drive failure, one bad memory chip, and four power supply failures.

Of these, only two of the power supply failures caused outages on the site, one lasting an hour and the other a painful eight hours. There were a few other short outages due to network problems in the data center, typically problems that took the entire data center offline temporarily.
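As a rough back-of-the-envelope check on those numbers, here is a small sketch. It assumes all six servers ran for the full four years, and it counts only the nine hours of power supply outages, since the durations of the short network outages are not given. The function names are made up for illustration.

```python
# Back-of-the-envelope failure rate and availability.
# Assumptions: six servers for a full four years; only the one hour and
# eight hour outages are counted (network outage durations are unknown).

HOURS_PER_YEAR = 365.25 * 24


def failures_per_server_year(failures, servers, years):
    """Average number of hardware failures per server per year."""
    return failures / (servers * years)


def availability(outage_hours, years):
    """Fraction of the period during which the site was up."""
    return 1.0 - outage_hours / (years * HOURS_PER_YEAR)


# One drive, one memory chip, four power supplies.
hardware_failures = 1 + 1 + 4
rate = failures_per_server_year(hardware_failures, servers=6, years=4)
uptime = availability(outage_hours=1 + 8, years=4)

print(f"{rate:.2f} hardware failures per server-year")
print(f"{uptime:.2%} availability (power supply outages only)")
```

Under those assumptions, that works out to about one hardware failure per server every four years and roughly 99.97% availability before the network outages are counted.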

Findory's six servers were all cheap commodity Linux boxes, typically with a single-core low-end AMD processor, 1 GB of RAM, and a single IDE disk. Findory was cheap, cheap, cheap.

The reliability of Findory over its lifetime was perfectly acceptable, even quite good compared to other little startups with similar levels of traffic. Still, I think it is interesting to consider what might have prevented the outages without considerable expense.

Surprisingly, RAID disks would not have helped much, though they would have made it easier to recover from the one drive failure that did occur on a backend machine. Better redundancy on the network may have helped, but would have been expensive. Splitting servers and replicating across data centers may have helped, but would have been both expensive and complicated for a site of Findory's size and resources.

Probably the biggest issue was that Findory did not have a hot standby for its small database. A hot standby database would have avoided both the one hour outage and the eight hour outage. Those outages were caused by losing first one, then a second power supply on the database machine.

Looking back at all of these, I think it was particularly silly not to have the database hot standby. The cost would have been minimal and, not only would it have avoided both outages, it would also have reduced the risk of data loss by providing a constantly updated copy of the database. I almost added the hot standby many times, but kept putting it off. While I may have mostly gotten away with it, it clearly was a mistake.
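To make the idea concrete, here is a minimal sketch of what the client side of a hot standby could look like. This is not what Findory ran; the host names, the port, and the function names are all hypothetical. It also assumes the standby is already kept current by replication, and it ignores harder problems such as deciding when it is safe to promote the standby.

```python
import socket


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_database(primary, standby, check=is_reachable):
    """Prefer the primary; fall back to the standby if the primary is down.

    primary and standby are (host, port) tuples. check is injectable so the
    selection logic can be exercised without touching the network.
    """
    for host, port in (primary, standby):
        if check(host, port):
            return (host, port)
    raise RuntimeError("neither primary nor standby database is reachable")


# Demonstration with a stubbed health check (hypothetical host names).
down_hosts = {"db-primary.example"}


def fake_check(host, port):
    return host not in down_hosts


chosen = choose_database(
    ("db-primary.example", 3306),
    ("db-standby.example", 3306),
    check=fake_check,
)
print(chosen)
```

With the primary marked as down, the stubbed run picks the standby. A real setup would also need the standby to accept writes after failover, which is where most of the actual work lies.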

Other thoughts? Anyone think it was foolish not to run in RAID and not to be split across data centers? Or, maybe you have the opposite opinion, that small startups should not worry much about uptime and small outages are just fine? How about the hardware failure rates, are they typical in your experience?

Please see also my other posts in the "Starting Findory" series.

8 comments:

Anonymous said...

Not sure I'd classify it as "foolish," but I've always gone with RAID simply because it's relatively inexpensive and it means that a failed drive goes from "crap, I need to get out to the data center *now*" to "crap, I need to go out to the data center at some point later tonight," which can be incredibly valuable when you're trying to focus.

codeslinger said...

Since Findory is dead and gone now, are you going to be posting your database design and schema and whatnot? I'd be particularly interested in a couple of things:

1. Why you were using an RDBMS in the first place

2. What kind of schema and table layout you had for that kind of workload

I'm +1 for a post-op database design/internals post for Findory in case a poll should be necessary ;-)

Greg Linden said...

Hi, Codeslinger. To be clear, Findory was only using MySQL for storing users' histories, which is frequently written data.

Almost all of the other data was read-only, stored in Berkeley DB, and replicated locally.

The schema was pretty simple and boring. More relevant might be that Findory was using MyISAM, not InnoDB, which made the queries on the database very fast but less reliable.

If you have not seen it already, there are a few more details in a post a few days ago, "Starting Findory: Infrastructure and scaling".

Joshua said...

Why not get servers with redundant power supplies? Just like RAID, they are fairly inexpensive and eliminate a common class of downtime-causing failure.

Of course you're right, not having at least a "warm" standby for your single customer database seems like a rookie mistake :-)

burtonator said...

With six servers you won't have many hardware failures. The MTBF of a site with any component breaking the site is the median MTBF of all the components / the number of components.

So if you just have a few servers you're mostly fine. It's when you have lots where things come into play and engineering costs rise because you have to build in fault tolerance.
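To put rough numbers on that rule of thumb, here is a small sketch. It assumes failures are independent and occur at a constant rate, and that the site goes down when any one server fails. The four-year per-server MTBF is a hypothetical figure chosen only for illustration. Under those assumptions failure rates add, so with identical servers the result reduces to the per-server MTBF divided by the number of servers.

```python
def system_mtbf(component_mtbfs):
    """MTBF of a system that fails when any single component fails.

    Assumes independent components with constant failure rates, so the
    system failure rate is the sum of the component failure rates.
    """
    return 1.0 / sum(1.0 / mtbf for mtbf in component_mtbfs)


SERVER_MTBF_YEARS = 4.0  # hypothetical per-server figure

for servers in (6, 60, 600):
    years = system_mtbf([SERVER_MTBF_YEARS] * servers)
    print(f"{servers:4d} servers: a failure about every {years * 365.25:.0f} days")
```

With six servers, a failure somewhere shows up every several months. With six hundred, it is every few days, which is the point at which fault tolerance has to be built in.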

In 5-10 more years this stuff will all be done in software though

Nagender Parimi said...

Greg,
On a related note, it'd be great to hear more about your experience with MySQL. I have used it sparingly myself, but it was in a high reads and writes environment. My experience was that MySQL did not quite hold up well (and this was MyISAM on machines better than the ones you described in this post). That, coupled with MySQL's non-standard-compliant behavior (e.g. behavior of nulls and default values) has not left me very comfortable with using MySQL.
This is a great series of posts by the way.

Anonymous said...

A couple of years ago, yeah hardware was still an expensive item. I can remember back just 5 years ago, you could easily spend $100K on a Netapp storage device.

However, nowadays there's no reason to write half the code that people write. You could put together the multi-master replication that MySQL or PostgreSQL has, but honestly, why bother? Why do you want to own a more complicated system that pages you twice as much now?

Instead of plowing the effort into a more scalable system written in low-level languages that are impossible to code in, just buy the right hardware, which at this point is fantastically cheap.

For example, you can buy 1TB of usable space via a StoreVault server (a division of Netapp - http://www.storevault.com/) for $12K. Too expensive for you? Well, they're coming out with a 500GB "Personal Sized" one for $6K in a few months.

Something like a StoreVault has a RAID-DP storage interface (read: crazy RAID). What's more, if you buy a second one (yes, even the $6K one), you can actually just have the hardware replicate the data automatically between the two devices and stick the second one in another colo.

Just recently, we bought a dual Quad Core server for $3K. Let me repeat that for those who haven't bought a server as of late. For $3K, you can get *a single machine with TWO Quad Core CPUs*. How insane is that, anyway?

So for $6K + $6K + $3K + $3K == $18K, you could have had all the processing power you needed with site redundancy across multiple data centers. I'm betting that 6 servers probably cost you about $2K per server, or about $12K in total. Isn't that worth the additional $6K?

Anonymous said...

RAID is a basic redundancy solution, and the database server should use it. In addition to RAID, MySQL should run in master-slave mode.