Beautiful Debugging > Minimizing Input

28.6. Minimizing Input

The interesting thing about delta debugging (or any other automation of the scientific method) is that it is very general. Rather than search for causes in a set of changes, you can also search for causes in other search spaces. For instance, you can easily apply delta debugging to search for failure causes in program input, which Ralf Hildebrandt and I did in 2002.

When searching for causes in program input, the program code stays the same: no application of changes, no reconstruction, just execution. Instead, it is the input that changes. Think of a program that works on most inputs, but fails on one specific input. One can easily have delta debugging isolate the failure-inducing difference between the two inputs: "The cause of the web browser crashing is the <SELECT> tag in line 40."

One can also modify the algorithm so that it returns a minimized input: "To have the web browser crash, just feed it a web page containing <SELECT>." In minimized input, every single remaining character is relevant for the failure to occur. Minimized inputs can be very valuable for debuggers because they make things simple: they lead to shorter executions and smaller states to be examined. As an important (and perhaps beautiful) side effect, they capture the essence of the failure.

I once met some programmers who were dealing with bugs in a third-party database. They had very complex, machine-generated SQL queries that sometimes would cause the database to fail. The vendor did not categorize these bugs as high priority because "you are our only customer dealing with such complex queries." Then the programmers simplified a one-page SQL query to a single line that still triggered the failure. Faced with this single line, the vendor immediately gave the bug the highest priority and fixed it.

How does one achieve minimization? The easiest way is to feed dd() with an empty c_pass, and to have a passing test return "pass" only if the input is empty, and "unresolved" otherwise. c_pass remains unchanged while c_fail becomes smaller and smaller with each failing test.

Again, all that is required to isolate such failure causes is an automated test and a means to split the input into smaller pieces—that is, a splitting function that has some basic knowledge about the syntax of the input.