Skip to content

Empirical study of decompilation primitives

Sandeep Dasgupta edited this page Oct 10, 2017 · 2 revisions

Empirical study of the recovered attributes based on their applicability or effectiveness on specific clients

Over the last couple of decades, a large amount of research has been carried out in recovering \emph{richer} source level attributes (like variable offset, type, function prototype information) from binary, but little has been established on the effectiveness or usability of those on specific clients. In this work we are planning to to experimentally evaluate the effect of the recovered attributes (with varying precision) on various clients like pointer analysis or source level symbolic executors, e.g. KLEE. This work broadly aims at finding the minimal set of attributes and precision of each of those attributes, which can optimally benefit a particular client.

One way to conduct the study is to extract the high level attributes from binary at various precision levels and then to establish the benefit of those levels based on a particular client. For extracting a particular attribute, we may either implement the relevant binary analysis from the existing literature or we may assume that the binary is shipped with debug information containing that attribute. For the purposes of the empirical study, we are using debug information which is a practical choice for a one-time experiment instead of implementing a whole bunch of binary analyses. Lifting this debug information to LLVM IR will provide us with the most precise and richest level of IR w.r.t that attribute. Also we can create less precise version of attributes ( and hence the lifted IR) by some variations of the most precise one. There are few engineering challenges involved in lifting process as well. For example, McSema maintained a in-memory structure for all the hardware registers and lifting the debug information require to replace the individual members of the aggregates with typed variables.

This study will motivate which kinds of information we need to extract from binaries and which of those need new research.

Minimal compiler-independent annotations to reconstruct high-quality IR

There are cases where annotations shipped along with binary might be useful in binary analysis to extract information which the current state of the art techniques failed to extract or give imprecise results. The above mentioned study will help in hypothesizing the attribute information we really need for a specific application to provide meaningful results.

For example, accurately identifying records (e.g. C structs) in the stack might be empirically proved to be useful for a particular malware analysis. But even the state of the art binary analysis techniques find it very challenging to recover those when they are only accessed through the base pointer of the stack frame in which case even the member variables of the record can be correctly identified, but the record that stores them becomes indistinguishable. Coming up with minimal annotations which help in precise extraction of these attributes is one of the research goals in this proposal.

Implementation

Our goal is not to the debug information to mimic an existing approach (like TIE or second write) because of the following:

  1. Unavailability of implementation.
  2. In approaches which use a type lattice (like TIE), types are recovered based on solving some constraints to reach some point in the type lattice. How to get that precision without running the same analysis.
  3. Works like secondWrite and tie report metrics like conservatieness and precision on a particular benchmark.

The better option is to look for custom precision levels which may overlap partially or fully with existing work.

Attributes to be recovered

  • Per procedure physical stack frames
  • Stack Variables
  • Stack Variable with Types

We can have the following variation on Types

  • Signed Integer (signed/unsigned), float, pointers

    • A program could move a 32 bit integer variable one Byte at a time using four instructions
    • In a 32 bit architecture it could move a C long long variable (64b) using two instructions that move 4B at a time.
    • There exist functions (e.g.,memcpy ) that copy a source memory range to a destination memory range without regard to the internal structure of the source memory range.
    • Treating pointers as integers
  • Records [full/partial]

    • Partial: The idea comes from the fact that most of the analysis able to recover those fields which are accessed in the program. The rest become invisible. The idea is to find a application where the invisible fields adds some value.
      • Treat a n field struct as n individual variables.
      • some grouping of those n fields
  • Arrays [full/partial]

    • Consider an int array[n] as n variables each of size int.

    Loop code: for(i = 0; i < 64; i++) arr1[i] = i;

    And two possible executions:

     addr = arr1;
     addr = arr1;
     for(i = 0; i < 64; i++){
       *addr = i;
       addr += 1;
     }
    
     for(i = 0; i < 16; i++) {
       *addr = i<<2;
       *(addr + 1) = i<<2+1;
       *(addr + 2) = i<<2+2;
       *(addr + 3) = i<<2+3;
        addr += 4;
     }
    

    You may infer type as: an array of single fields or an array of structs having 4 fields.

  • Unions

  • Recursive Types ?? Existing work use dynamic shape analysis.

  • Function Types

Evaluation Metrices

  • Run Mcsema lifted IR with stack variable recovered and / or recovered types with highest opt level and compared the runtime with a McSema lifted IR without those.
  • Recovering stack variable might help in identifying the induction variable and hence help in parallelization.
  • Pointers for vulnerability and exploit analysis
  • Bug finding and debugging tools like KLEE and ESD [2] cannot work on such program

Results

Close Encounters with Symbolic Execution shows that it is possible to make the reads symbolic even without extracting the variables!

Close Encounters with Symbolic Execution (Part 2) shows how to build klee for the llvm version which McSema emits.

Reference

  1. The paper, 2016, CSUR, is a survey on work of type inferencing binary code.
  2. Zamfir and G. Candea. Execution synthesis: A techniqu e for automated debugging. In ACM EuroSys European Conf. on Com-puter Systems , 2010
  3. Sinz, S. Falke, and F. Merz. A precise memory model for low-level bounded model checking. In 5th Intl. Workshop on Systems Software Verification , 2010