Sunday, August 28, 2005

Findory and the geek detective

Findory pages us if she's feeling unhappy. This happens surprisingly infrequently, every 2-3 months or so, which I think is pretty remarkable for a website this complex.

But it does happen. Unfortunately, last night the alarm bells went off.

When the website is ill, the cause isn't always immediately obvious. Sometimes, you have to play detective, gather clues, find evidence, and build a case that fits all the facts.

Last night was one of those times. Findory was intermittently responsive. Some requests would hang until they timed out; others would get a web page back.

Huh. What's going on? The first step is to find a reproducible test case, something that reliably hangs and can be debugged. I found one, a page that, at the time, seemed to hang every time I loaded it.

This first piece of evidence would seem to point to a problem with our web servers. If other pages load but this page does not, something that happens in the act of generating this page must be responsible.

But what is it? At this point, I pulled out the debugger and walked through what Findory does when it generates that particular page. After a while, I isolated the problem to a line where Findory connects to a remote database.

Now, this is a little strange. We have two pieces of evidence that are somewhat contradictory. Hanging on a specific page would seem to indicate a problem with something particular to that page. But debugging the issue seems to point to the database connection, something that happens on most or all pages. Hmm...

Still suspicious that the problem had something to do with that page and the data on that page, I started looking at the data used on that page. Any errors in the database logs? No. Any evidence of database corruption? No, the tables and files checked fine.

Could it be network trouble? Ping times were good, no packet loss. DNS lookups seemed to be succeeding just fine in a couple quick tests. Hmm...

I need more clues. I look in the error logs for the website. Are the errors occurring just on some pages or across many pages? The errors seem to be happening on many pages, more than I realized at first. This piece of evidence combined with the debugging trace would seem to point firmly at the database.

Now, I try connecting to the database from several remote boxes that have access. Huh, they're hanging. This is completely outside of Findory code now, just database client to database server, and the connection is hanging. It is definitely the connection to the database.

So, not Findory code, not database corruption, but new connections to the database are hanging. Why?

Could it be running out of connections? No, changing the connection limit didn't help and, I realized belatedly, running out of connections would return an error, not hang.

What could it be? Maybe I should check network and DNS again. Network checks out. DNS... oh wait! The first DNS server in /etc/resolv.conf is unpingable! I didn't notice that during my first quick check of DNS since dig lookups worked.
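A check that queries each server listed in /etc/resolv.conf on its own, rather than relying on the resolver falling through to the next entry, would likely have caught this sooner. Here is a minimal Python sketch of that idea, not something Findory actually runs: the function names, the example.com test hostname, and the two-second timeout are all assumptions made for illustration.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID so we can match the reply to our query


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments, which start with '#' or ';'.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def build_query(hostname):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", QUERY_ID, 0x0100, 1, 0, 0, 0)
    labels = [p for p in hostname.split(".") if p]
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = Internet.
    return header + qname + struct.pack(">HH", 1, 1)


def server_answers(server, hostname="example.com", timeout=2.0):
    """Return True if this one server replies to a query within the timeout."""
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    return reply[:2] == struct.pack(">H", QUERY_ID)
```

Calling server_answers(s) for each s in parse_nameservers(open("/etc/resolv.conf").read()) reports dead servers one by one, even when an ordinary lookup still succeeds through a later entry.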

So, we have another clue. Could a bad DNS server cause our database to hang? Where and why is our database doing DNS lookups?

Turns out that our database does reverse DNS lookups for access control. While we weren't using this feature, we also neglected to disable it.
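To see why one dead resolver can stall every new connection, it helps to picture a reverse lookup with no deadline on it. The sketch below is a hypothetical illustration in Python, not the database's actual code: it runs the reverse lookup in a worker thread so the caller gives up after a deadline instead of hanging. The name reverse_lookup, the resolver parameter, and the two-second default are assumptions.

```python
import socket
import threading


def reverse_lookup(ip, timeout=2.0, resolver=socket.gethostbyaddr):
    """Return the hostname for ip, or None if the lookup fails or takes too long."""
    result = {}

    def worker():
        try:
            # gethostbyaddr returns (hostname, aliases, addresses).
            result["name"] = resolver(ip)[0]
        except OSError:
            # No reverse record, or the resolver reported an error.
            pass

    # A daemon thread, so a stuck lookup cannot keep the process alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return None  # still waiting on DNS; stop waiting and move on
    return result.get("name")
```

Without a bound like this, the caller simply waits for as long as the system resolver does, which is the hang described above.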

So, the solution turned out to be to remove the ill-behaved DNS server from /etc/resolv.conf, disable reverse DNS lookups in our database, and add some additional error handling to the Findory database layer to make any similar problem easier to debug in the future. An hour of downtime, but we're back.
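The extra error handling could take many forms. One possible shape, sketched in Python under the assumption of a plain TCP service that sends a greeting when a client connects (as some database servers do), is a probe that puts a deadline on every step and reports which step stalled. probe_server and its result strings are hypothetical names, not Findory's database layer.

```python
import socket


def probe_server(host, port, timeout=3.0):
    """Say how a TCP service responds, waiting at most `timeout` per step."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "connect timed out"
    except OSError as exc:
        return f"connect failed: {exc}"
    with sock:
        try:
            # The connect timeout also applies to this read.
            data = sock.recv(1)
        except socket.timeout:
            return "connected, but no greeting"
        except OSError as exc:
            return f"read failed: {exc}"
    return "greeting received" if data else "closed without greeting"
```

A result of "connected, but no greeting" in the logs would separate a server that accepts connections and then stalls from one that is down or refusing them.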

In retrospect, there were a couple things I could have done to debug this faster. First, I should have suspected network and DNS earlier; 4 of the 6 outages we've ever had have been due to network or DNS problems in our datacenter, not a problem isolated to our code. Second, a netstat on the database server might have shown hanging DNS connections, pointing to the problem much earlier.

Good to have the problem resolved. Frankly, if we didn't feel such pressure to get Findory back up for our readers quickly, this kind of detective work would almost be fun. But that's just the inner geek talking.


Jeff said...

MySQL handles DNS problems very poorly. Turn it off and don't use any wildcard host restrictions if you want to sleep at night.

Robert Cottrell said...

Greg, great post on troubleshooting the website difficulties. I agree that this kind of detective work can be a lot of fun, especially after the problem has been solved and your nerves have had a chance to calm down.

I'm surprised by how many times DNS issues have bitten the sites that I've worked on. And no matter how many times we have had to deal with it, looking for DNS failures is usually one of the last things we think of trying.