Friday, July 22, 2005

Statistical bug isolation

Warning: If you read this weblog only because you are interested in personalization and not because you're a computer science geek, you probably won't be interested in this post.

One of the few nice things about flying around the country is that I get to catch up on some of my reading.

Alex Aiken from Stanford CS will be giving a talk on Cooperative Bug Isolation at the University of Washington. It sounds really cool:
This talk presents techniques for the systematic monitoring of thousands to millions of distributed program executions. We discuss how to exploit this information in a particular application: Using partial information gathered from many program executions to automatically isolate the causes of bugs.
As much as I'd like to attend, I won't be able to, so I pulled up one of his papers instead, Scalable Statistical Bug Isolation.

The paper is great. It's a simple but really compelling idea. Instrument the code with predicates (e.g. (x == NULL)). Run the code in production, watching for crashes or other failure scenarios. Then do credit assignment back to the predicates to determine the cause of the crashes.
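To make the idea concrete, here's a minimal sketch of what that instrumentation might look like. The class and method names are my own invention for illustration, not the paper's actual tooling; the point is just that each instrumented site records whether a predicate was observed and whether it was true, and each run is labeled success or failure.

```python
# Hypothetical sketch of predicate instrumentation. At each instrumented
# site we record whether the predicate was reached and whether it held;
# each run is also labeled as a success or a failure.
from collections import Counter

class Run:
    """Predicate counts for a single program execution."""
    def __init__(self):
        self.observed = Counter()  # times each predicate site was reached
        self.true = Counter()      # times the predicate held at that site
        self.failed = False        # did this run end in a crash/failure?

    def check(self, pred_id, value):
        """Called at an instrumented site, e.g. check('x == NULL', x is None)."""
        self.observed[pred_id] += 1
        if value:
            self.true[pred_id] += 1

# Example: one run where "x == NULL" held once, and the run failed.
run = Run()
x = None
run.check("x == NULL", x is None)
run.failed = True
print(run.true["x == NULL"])  # 1
```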

Sounds easy, but there are a few nice tricks in here. They adaptively sample the data based on how frequently each site executes, making the performance impact of the instrumentation undetectable. The credit assignment is much harder than you might think at first -- as they explain in depth -- but they discover a nifty way of floating the root cause of the bugs to the top, eliminating redundancies, and avoiding predicates that reflect the behavior of multiple bugs or are merely correlated with (but not causal to) the bugs.
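The heart of the credit assignment, as I understand the paper, is comparing two conditional failure rates per predicate: how often runs fail when the predicate is true, versus how often runs fail merely when the predicate's site is reached. Here's a rough sketch of that statistic; the counts below are made up for illustration.

```python
# Rough sketch of the paper's scoring idea. For a predicate P:
#   Failure(P) = fraction of failing runs among runs where P was true
#   Context(P) = fraction of failing runs among runs where P was observed at all
#   Increase(P) = Failure(P) - Context(P)
# A large Increase(P) means P being *true* predicts failure beyond
# merely reaching P's site, which filters out innocent bystanders.

def increase(f_true, s_true, f_obs, s_obs):
    """f_true/s_true: failing/successful runs where P was true;
       f_obs/s_obs: failing/successful runs where P's site was reached."""
    failure = f_true / (f_true + s_true)
    context = f_obs / (f_obs + s_obs)
    return failure - context

# P was true in 30 failing and 10 successful runs, and was observed
# (whether true or false) in 40 failing and 60 successful runs.
print(round(increase(30, 10, 40, 60), 2))  # 0.35
```

A predicate that is true in every run, failing or not, scores near zero here, which is exactly the "correlated but not causal" filtering the paper is after.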

What I found particularly compelling was that they were able to detect the root cause of bugs that caused heap corruption. As most software engineers know, heap corruption is painful to debug because, by the time the program crashes, the stack trace may be nowhere near the code that originally corrupted the heap. Many systems, including Microsoft Windows, rely on sending stack traces back from users to help diagnose bugs that made it into production, but the stack trace usually doesn't give you the data you need to track memory corruption bugs down. Their technique does.

Anyway, if you're a computer dork like me and feel like geeking out, the paper is worth downloading. It's a good read.

Update: Professor Ben Liblit (first author on the paper) contacted me about this post. He recommended another paper, "Bug Isolation Via Remote Program Sampling", which covers the technical details of how they instrument the programs, and mentioned that people can help out the effort by downloading and using some of their instrumented binaries for Fedora Linux.
