Computers go down, a lot more often than we might like.
For most of Findory's four years, it ran on six servers. In that time, those servers had one drive failure, one bad memory chip, and four power supplies fail.
Of these, only the two power supply failures caused outages on the site, one of one hour and one a painful eight hour outage. There were a few other short outages due to network problems in the data center, typically problems that took the entire data center offline temporarily.
Findory's six servers were all cheap commodity Linux boxes, typically a single core low-end AMD processors, 1G of RAM, and a single IDE disk. Findory was cheap, cheap, cheap.
The reliability of Findory over its lifetime was perfectly acceptable, even quite good compared to other little startups with similar levels of traffic, but I think it is interesting to think about what may have been able to prevent the outages without considerable expense.
Surprisingly, RAID disks would not have helped much, though they would have made it easier to recover from the one drive failure that did occur on a backend machine. Better redundancy on the network may have helped, but would have been expensive. Splitting servers and replicating across data centers may have helped, but would have been both expensive and complicated for a site of Findory's size and resources.
Probably the biggest issue was that Findory did not have a hot standby running on its small database. A hot standby database would have avoided both the one hour outage and the eight hour outage. Those outages were caused by losing the first, then a second power supply on the database machine.
Looking back at all of these, I think it was particularly silly not to have the database hot standby. The cost of that would have been minimal and, not only would it have avoided the outage, but it would have reduced the risk of data loss by having a constant database backup. I almost added the hot standby many times, but kept holding off on it. While I may have mostly gotten away with it, it clearly was a mistake.
Other thoughts? Anyone think it was foolish not to run in RAID and not to be split across data centers? Or, maybe you have the opposite opinion, that small startups should not worry much about uptime and small outages are just fine? How about the hardware failure rates, are they typical in your experience?
Please see also my other posts in the "Starting Findory" series.