Monday, August 13, 2012

Fixing Bugs - if you can't reproduce them, you can't fix them




"Generally, if you can’t reproduce it, it’s almost impossible to fix".

Anonymous programmer, Practices of Software Maintenance, Janice Singer

Fixing a problem usually starts with reproducing it – what Steve McConnell calls “stabilizing the error”.

Technically speaking, you can’t be sure you are fixing the problem unless you can run through the same steps, see the problem happen yourself, fix it, and then run through the same steps and make sure that the problem went away. If you can’t reproduce it, then you are only guessing at what’s wrong, and that means you are only guessing that your fix is going to work.

But let’s face it – it’s not always practical or even possible to reproduce a problem. Lots of bug reports don’t include enough information for you to understand what the hell the problem actually was, never mind what was going on when the problem occurred – especially bug reports from the field. In The Art of Collecting Bug Reports (from the book Making Software), Rahul Premraj and Thomas Zimmermann found that the two most important factors in determining whether a bug report will get fixed are:

  1. Is the description well written? Can the programmer understand what was wrong, or why the customer thought something was wrong?
  2. Does it include steps to reproduce the problem, or at least basic information about what the customer was doing when the problem happened?

It’s not a lot to ask – from a good tester at least. But you can’t reasonably expect this from customers.

There are other cases where you have enough information, but don’t have the tools or expertise to reproduce a problem – for example, when a pen tester has found a security bug using specialist tools that you don’t have or don’t understand how to use.

Sometimes you can fix a problem without seeing it happen in front of you, by coming up with a theory on your own and trusting your gut – especially if this is code that you recently worked on. But reproducing the problem first gives you the confidence that you aren’t wasting your time and that you actually fixed the right thing. Trying to reproduce the problem should almost always be your first step.

What’s involved in reproducing a bug?


What you want to do is to find, as quickly as possible, a simple test that consistently shows the problem, so that you can then run a set of experiments, trace through the code, isolate what’s wrong, and prove that it went away after you fixed the code.

The best explanation that I’ve found of how to reproduce a bug is in Debug It! where Paul Butcher patiently explains the pre-conditions (identifying the differences between your test environment and the customer’s environment, and trying to control as many of them as possible), and then how to walk backwards from the error to recreate the conditions required to make the problem happen again. Butcher is confident that if you take a methodical approach, you will (almost) always be able to reproduce the problem successfully.
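One way to start on the environment pre-conditions is to write down what each environment looks like and compare the two. The Python sketch below is only an illustration of that idea, not something taken from Debug It!: the function names `snapshot` and `diff_environments` and the particular facts collected are assumptions, and a real system would need its own list (library versions, locale, time zone, configuration, data volumes and so on).

```python
import platform
import sys
from typing import Dict, Tuple


def snapshot() -> Dict[str, str]:
    """Collect a few environment facts (hypothetical selection) for comparison."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "default_encoding": sys.getdefaultencoding(),
    }


def diff_environments(
    mine: Dict[str, str], theirs: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return only the keys whose values differ, as (mine, theirs) pairs."""
    differences = {}
    for key in sorted(set(mine) | set(theirs)):
        left = mine.get(key, "<missing>")
        right = theirs.get(key, "<missing>")
        if left != right:
            differences[key] = (left, right)
    return differences
```

If the customer can run `snapshot()` and send back the result, every entry that `diff_environments` reports is a difference you can try to eliminate, or at least rule out, before you start chasing the bug itself.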

In Why Programs Fail: A Guide to Systematic Debugging, Andreas Zeller, a German Comp Sci professor, explains that it’s not enough just to make the problem happen again. Your goal is to come up with the simplest set of circumstances that will trigger the problem – the smallest set of data and dependencies, the simplest and most efficient test(s) with the fewest variables, the shortest path to making the problem happen. You need to understand what is not relevant to the problem, what’s just noise that adds to the cost and time of debugging and testing – and get rid of it. You do this with binary search techniques (Zeller calls the approach delta debugging): split the input data set into parts, throw away the parts that the failure does not depend on, and repeat on the data and other variables that are left until the problem starts to become clear.
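The idea of slicing up the input can be sketched in a few lines of Python. This is a simplified reduction loop loosely modeled on the delta debugging technique from Zeller's book, not his exact algorithm. The names `minimize` and `fails` are made up for this example, and `fails` stands in for whatever test tells you that the problem still happens with a given input.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def minimize(data: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `data` to a smaller input for which `fails` still returns True."""
    current = list(data)
    if not fails(current):
        raise ValueError("the full input does not reproduce the problem")

    chunks = 2  # start by cutting the input roughly in half
    while len(current) >= 2:
        size = max(1, len(current) // chunks)
        reduced = False
        for start in range(0, len(current), size):
            # Try the input with one slice removed.
            candidate = current[:start] + current[start + size:]
            if candidate and fails(candidate):
                # The removed slice was noise: keep the smaller input.
                current = candidate
                chunks = max(chunks - 1, 2)
                reduced = True
                break
        if not reduced:
            if size == 1:
                break  # no single element can be removed any more
            chunks = min(len(current), chunks * 2)  # slice finer and retry
    return current
```

For example, if a (hypothetical) failure only happens when records 3 and 7 are both present, `minimize(list(range(10)), lambda xs: 3 in xs and 7 in xs)` returns `[3, 7]`. The loop only works if `fails` gives the same answer every time for the same input, so a flaky reproduction has to be stabilized first.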

Code Complete’s chapter on Debugging is another good guide on how to reproduce a problem following a set of iterative steps, and how to narrow in on the simplest and most useful set of test conditions required to make the problem happen. It also covers common places to look for bugs: code that has been changed recently, code that has a history of other bugs, and code that is difficult to understand (if you find it hard to understand, there’s a good chance that the programmers who worked on it before you did too).

Replay Tools


One of the most efficient ways to reproduce a problem, especially in server code, is by automatically replaying the events that led up to the problem. To do this you’ll need two things: a time-sequenced record of what happened, usually captured from an audit log, and a driver to read the events and play them against the system. And for this to work properly, the behavior of the system needs to be deterministic – given the same set of inputs in the same sequence, the same results will occur each time. Otherwise you’ll have to replay the logs over and over and hope for the right set of circumstances to occur again.

On one system that I worked on, the back-end engine was a deterministic state machine designed specifically to support replay. All of the data and events, including configuration and control data and timer events, were recorded in an inbound event log that we could replay. There were no random factors or unpredictable external events – the behavior of the system could always be recreated exactly by replaying the log, making it easy to reproduce bugs from the field. It was a beautiful thing, but most code isn’t designed to support replay in this way.
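To make the idea concrete, here is a toy sketch in Python of an engine built this way. It is not the system described above: the `Engine` class, the event types, and the JSON log format are all invented for illustration. The point it tries to show is that everything that can change state, including the passage of time, comes in as a logged event.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Engine:
    """Toy deterministic state machine: state depends only on the events applied."""

    balances: Dict[str, int] = field(default_factory=dict)
    clock: int = 0

    def apply(self, event: dict) -> None:
        kind = event["type"]
        if kind == "timer":
            # Time is an input like any other; the engine never reads the wall clock.
            self.clock = event["now"]
        elif kind == "deposit":
            account = event["account"]
            self.balances[account] = self.balances.get(account, 0) + event["amount"]
        elif kind == "withdraw":
            account = event["account"]
            self.balances[account] = self.balances.get(account, 0) - event["amount"]
        else:
            raise ValueError(f"unknown event type: {kind}")


def record(engine: Engine, event: dict, log: List[str]) -> None:
    """Append the event to the inbound log, then apply it to the live engine."""
    log.append(json.dumps(event, sort_keys=True))
    engine.apply(event)


def replay(log: Iterable[str]) -> Engine:
    """Rebuild state by playing the logged events, in order, against a fresh engine."""
    engine = Engine()
    for line in log:
        engine.apply(json.loads(line))
    return engine
```

If every input goes through `record`, then `replay(log)` ends up in the same state as the live engine, and you can stop the replay just before the event that triggers a bug and step through it in a debugger. The moment the engine reads the clock, a random number, or an external service directly, that guarantee is gone.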

Recent research in virtual machine technology has led to the development of tools that snapshot and replay events in a virtual machine. VMware Workstation, for example, included a cool replay debugging facility for C/C++ programmers which was “guaranteed to have instruction-by-instruction identical behavior each time.” Unfortunately, this was an expensive thing to make work, and it was dropped in version 8, at the end of last year.

Replay Solutions provides replay for Java programs, creating a virtual machine to record the complete stream of events (including database I/O, network I/O, system calls, interrupts) as the application is running, and then later letting you simulate and replay the same events against a copy of the running system, so that you can debug the application and observe its behavior. They also offer similar application record and replay technology for mobile HTML5 and JavaScript applications. This is exciting stuff, especially for complex systems where it is difficult to set up and reproduce problems in different environments.

