Evaluating correctness of disassemblers

Abstract

The capabilities and limitations of disassembly are not always clearly defined or understood, making it difficult for researchers and reviewers to judge the practical feasibility of techniques based on it. At the same time, dis- assembly is the backbone of research in any static binary analysis, which will benefited if the disassembly is accurate.

The output of a disassembler is used for many different purposes (e.g., debugging and reverse engineering). Therefore, disassemblers represent the first link of a long chain of stages on which any high-level analysis of machine code depends upon. In this project we aim to demonstrate that many disassemblers fail to decode certain instructions and thus that the first link of the chain is very weak.

Related work

There is few very pre-work that directly addresses this problem and based on testing a disassembler w.r.t to some notion of ground truth.

N-version Disassembly: Differential Testing of x86 Disassemblers

Idea is to compare the output of n disassemblers against each other, where one of them is a special instruction decoder the author developed, that leverages the physical CPU to provide accurate results, while the others are eight off-the-shelf disassemblers.
Dis assemblers have 2 phases: (I) instruction decoding and (II) selection of the next instruction of the program to decode. This work is focused only automated testing the first phase only.

Critique

The basis of above comparison is syntactic and works well if the disassemblers decode a single instruction at a time to a mnemonic which can then be compared against each other syntactically (after normalization). But the problem comes if the disassemblers try to decode multiple instructions at a time, because they might generate syntactically different instruction sequences which may (or may not) be semantically equivalent.

Proposal

The proposal is about evaluating the correctness of off-the-shelf disassembler on the basis of the semantic correctness of the their disassembled output of straight line byte sequences.

The pseudo code could be as follows:

Generate a random byte sequence as inputs where the byte sequence must decode to non loop code.
Use the CPU to ensure if the sequence corresponds to valid set of instructions. The method could be in line with what is used in the above work.
If CPU says invalid and that is corroborated by other disassemblers, then skip to (1).
If the CPU and a disassembler D disagree on the validation decision, the D is buggy. Go to (1)
For each pair of dissemblers D1 and D2,

If the decoded instruction sequences matches syntactically, skip to (5)
Use Z3 to formally verify the two decoded instruction sequences using the semantics generated by strata.
- If Z3 result to EQUIVALENT, skip to (5)
- If Z3 result to NOT EQUIVALENT, use the counter example to find if either of D1 or D2 or both are buggy by considering CPU execution as the ground truth.
- If Z3 result to CANT PROOF, skip to (5)

At the end we might have different equivalent classes of disassemblers. The one with highest cardinality will determine the currently functioning class for the input random byte sequence.
Repeat with (1).

Provide feedback

Saved searches