Decompilation Possible

Program-Transformation.Org: The Program Transformation Wiki

Is Decompilation Possible?

Almost every week requests for decompilation programs are made in newsgroups (like comp.lang.c), and these are usually replied with: It is not possible! People claim that decompilation is similar to converting a hamburger back into a cow, or unscrambling an omelette back to an egg. Here is a typical FAQ entry from C++-FAQ-Lite, and here is my refutation of it. People even write tech reports on the subject. They are far from the truth.

pigsFromSausages.jpg

Certainly, fully automated decompilation of arbitrary machine-code programs is not possible -- this problem is theoretically equivalent to the Halting Problem, an undecidable problem in Computer Science. What this means is that automatic (no expert intervention) decompilation cannot be achieved for all possible programs that are ever written. Further, even if a certain degree of success is achieved, the automatically generated program will lack meaningful variable and function names as these are not normally stored in an executable file (except when stored for debugging purposes).

However, many useful algorithms have this sort of theoretical limitation. It is not possible to write a computer program that can perfectly route any arbitrary printed circuit board automatically. However, the best ones do a good job that can be improved with some expert intervention. They save an enormous amount of work. Similarly, it is not possible in general to solve problems as simple as the travelling salesman problem. Despite this, salesmen continue to travel! The interesting question is not "can we decompile arbitrary machine code programs?", because the answer is no. The interesting question is rather "what proportion of real-world programs can be decompiled to source code that is useful for various purposes?". The answer to this question is very much open at this point.

Disassembling

Some people believe it is only possible to disassemble a machine code program (recover the assembly sources; this in itself is not a trivial problem), again, due to its equivalence to the Halting Problem. However, in practice, there have been approaches to deal with disassembly and decompilation. The more successful ones to date make use of extra information (e.g. knowledge of the compiler used) or require human input at the hard parts of the disassembly process. It may even be possible for a decompiler to automatically (i.e. with no human input) decompile a large fraction of real-world machine code programs. MikeVanEmmerik has shown in his confirmation report that a machine code disassembler (recover assembly language) has three fundamental problems to solve, while a machine code decompiler (recover high-level language) has seven such problems. Both require the separation of code and data, and of pointers and constants. In effect, decompilation is "merely" an extension of disassembly. If you can produce assembly language source for a program, then you can produce high-level language source for that program with more effort. Note that producing assembly language source for a program is a lot harder than just running a dumb disassembler over the machine code. Compare saving the result from running the tool objdump -d (a dumb disassembler) to running a sophisticated disassembler sich as IDA Pro (an interactive disassembler), taking weeks to manually separate the code and data, adding meaningful names, comments, etc.

The processor

Still skeptical? Consider that the processor, by executing the machine language program, can successfully figure out what to do with any input, to create the correct output. A machine code program is a set of instructions, plus a set of data bytes. Each machine code instruction has semantics - you can define precisely what each instruction does. A high-level language is more abstract than a machine code program - details (such as individual instructions, registers, and so on) are abstracted away. In a way, decompilation is the judicious deleting of unneeded information. Those individual machine semantics can be represented in an intermediate form - that's what the front end of a decompiler does. This internal representation is already a decompilation of sorts - it is a representation of the input program. A series of transformations can remove the machine specific aspects of the program semantics. A technique called propagation (forward substitution) can aggregate the individual semantics into more complex expressions. Unused assignments can be removed. Machine code features such as individual registers disappear with this aggregation. Finally, the intermediate representation can be structured, replacing jump instructions with loops and conditionals. Structuiring is a graph transformation, and is well understood. Although the details are quite tricky, there are no theoretical issues with this phase.

Commercial example

Perhaps the final proof is in the doing. Boomerang (an open source decompiler) has already been used as an aid in real-world decompilation; see What Boomerang has done. Granted, this was a special case, where the client already had some (out dated and incomplete) source code, and Boomerang was not yet then capable of doing much without a great deal of hand editing. Nevertheless, it was useful, and gave a glimpse of a time when decompilers may well become an indispensable tool for binary reverse engineering work.

Results


CategoryDecompilation

Transform.DecompilationPossible moved from Transform.DeCompilationPossible on 13 Feb 2003 - 21:54 by MikeVanEmmerik - put it back