Cr4sh / openreil

Open source library that implements translator and tools for REIL (Reverse Engineering Intermediate Language)

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

OpenREIL

OpenREIL is open source library that implements translator and tools for REIL (Reverse Engineering Intermediate Language).

About

REIL was initially developed by Zynamics as part of their BinNavi framework, proprietary code analysis software written in Java. To learn more about REIL read the following documents:

  • «REIL − The Reverse Engineering Intermediate Language» (link)

  • «The REIL language» (part 1, part 2, part 3, part 4)

  • «Applications of the Reverse Engineering Language REIL» (PDF)

  • «REIL: A platform-independent intermediate representation of disassembled code for static code analysis» (PDF)

However, after Zynamics was acquired by Google they abandoned BinNavi, so, I decided to develop my own implementation of REIL. I made it relatively small and portable in comparison with original, the translator itself is just a single library written in C++, it can be statically linked with any program for static or dynamic code analysis. The higher level API of OpenREIL is written in Python, so, it can be easily utilized in plugins and scripts for your favourite reverse engineering tool (almost all modern debuggers and disassemblers has Python bindings).

OpenREIL is not a 100% compatible with Zynamics REIL, it has the same ideology and basics, but there's some changes in IR instruction set and representation of the traget hardware platform features.

OpenREIL is based on my custom fork of libasmir − IR translation library from old versions of BAP framework. I removed some 3-rd party dependencies of libasmir (libbfd, libopcodes) and features that irrelevant to translation itself (binary files parsing, traces processing, etc.), than I did some minor refactoring/bugfixes, switched to Capstone for instruction length disassembling and implemented BAP IR → REIL translation logic on the top of libasmir.

Because libasmir uses VEX (production-quality library, part of Valgrind), full code translation sequence inside of OpenREIL is looks as binary → VEX IR → BAP IR → REIL. It's kinda ugly from engineering point of view, but it allows us to have a pretty robust and reliable support of all general instructions of x86. Current version of OpenREIL still has no support of other architectures, but I'm working on x86_64 and ARMv7.

UPDATE: Current version of OpenREIL already has much or less usable ARMv7 support with dynamic switching of translation context between ARM and Thumb execution modes. There is no proper documentation on ARM yet, but you can check some unit tests for this functionality as usage examples: 1, 2, 3

Please note, that currently OpenREIL is a far away from stable release, so, I don't recommend you to use it for any serious purposes.

Installation

OpenREIL supports Linux, Mac OS X and Windows. Structure of the project source code directory:

  • VEX − VEX source code with BAP patches applied.
  • capstone − Capstone engine source code will be downloaded here.
  • docs − documentation directory.
  • libasmir − libasmir source code.
  • libopenreil − OpenREIL translator source code.
  • pyopenreil − OpenREIL Python API.
  • tests − unit tests launcher and stand-alone test programs.

Linux and OS X

To build OpenREIL under *nix operating systems you need to install git, gcc, make, Python 2.x with header files, NumPy and Cython. After that you can run configure and make just as usual.

Example for Debian:

$ sudo apt-get install git gcc make python python-dev python-numpy cython
$ git clone https://github.com/Cr4sh/openreil.git
$ cd openreil
$ ./autogen.sh
$ ./configure
$ make
$ make test
$ sudo make install

If you planning to use MongoDB as IR code storage you need to install some additional dependencies:

$ sudo apt-get install mongodb python-pymongo python-setuptools
$ sudo easy_install cachetools

OpenREIL can be used not only for debugger/disassembler plugins, but also for standalone code analysis tools. For loading of executable files it uses pybfd (ELF, Mach-O) and pefile (Portable Execute) Python libraries:

$ sudo apt-get install libbfd-dev
$ sudo easy_install pefile pybfd

Windows

Building OpenREIL on Windows requires MinGW build environment and all of the dependencies that was mentioned above.

You also can download compiled Win32 binaries of OpenREIL:

IR format

Available instructions

REIL language traditionally uses three address form of IR instruction where arguments a and b are source, and c is destination.

Example of IR code for mov eax, ecx x86 instruction:

;
; asm: mov eax, ecx
; data (2): 89 c8
;
00000000.00     STR         R_ECX:32,                 ,          V_00:32
00000000.01     STR          V_00:32,                 ,         R_EAX:32

OpenREIL offers 23 different instructions:

Mnemonic Pseudocode Instruction description
STR c = a Store to register
STM *c = a Store to memory
LDM c = *a Load from memory
ADD c = a + b Addition
SUB c = a − b Subtraction
NEG c = −a Negation
MUL c = a * b Multiplication
DIV c = a / b Division
MOD c = a % b Modulus
SHL c = a << b Shift left
SHR c = a >> b Shift right
AND c = a & b Binary AND
OR c = a | b Binary OR
XOR c = a ^ b Binary XOR
NOT c = ~a Binary NOT
EQ c = a == b Equation
LT c = a < b Less than
SMUL c = a * b Signed multiplication
SDIV c = a / b Signed division
SMOD c = a % b Signed modulus
JCC Jump to c if a is not zero
NONE No operation
UNK Untranslated instruction

Each instruction argument can have 1, 8, 16, 32 or 64 bits of length. Most of arithmetic instructions operates with unsigned arguments. SMUL, SDIV and SMOD instructions operates with signed arguments in two’s complement signed number representation form.

Most of instructions supposes the same size of their source and destination arguments, but there’s a several exceptions:

  • Result (argument c) of EQ and LT instructions always has 1 bit size.
  • Argument a of STM and argument c of LDM must have 8 bits or gather size (obviously, there is no memory read/write operations with single bit values).
  • Result of OR instruction can have any size. It may be less, gather or equal than size of it’s input arguments.

OpenREIL uses OR operation for converting to value of different size. For example, converting value of 8 bit argument V_01 to 1 bit argument V_02 (1-7 bits of V_01 will be ignored):

00000000.00      OR           V_01:8,              0:8,           V_02:1

Converting 1 bit argument V_02 to 32 bit argument V_03 (1-31 bits of V_03 will be zero extended):

00000000.00      OR           V_02:1,              0:1,          V_03:32

Instruction argument can have following type:

  • A_REG − CPU register (example: R_EAX:32, R_ZF:1).
  • A_TEMP − temporary register (example: V_01:8, V_02:32).
  • A_CONST − constant value (example: 0:1, fffffff4:32).
  • A_LOC − jump location that consists from machine instruction address and IR instruction number (example: 8048360.0, 100000.3).
  • A_NONE − argument is not used by instruction.

Address of each IR instruction consists from two parts: original address of translated machine instruction and IR instruction logical number (inum). First IR instruction of each machine instruction always has inum with value 0.

Group of IR instructions that represent one machine instruction can set value of some temporary register only once, so, A_TEMP arguments are in static single assignment form within the confines of machine instruction.

Optional instruction flags

In addition to address, operation code and arguments IR instruction also has flags that used to store useful metainformation:

  • IOPT_CALL − this JCC instruction represents a function call.
  • IOPT_RET − this JCC instruction represents a function exit.
  • IOPT_ASM_END − last IR instruction of machine instruction.
  • IOPT_BB_END − last IR instruction of basic block.
  • IOPT_ELIMINATED − whole machine instruction was replaced with I_NONE IR instruction during dead code elimination.

Representation of x86 registers

IR code for x86 architecture operates only with 32-bit length general purpose CPU registers (R_EAX:32, R_ECX:32, etc.), OpenREIL has no half-sized registers (AX, AL, AH, etc.), translator converts all machine instructions that operates with such registers to full length form.

Example of IR code for mov ah, al x86 instruction:

;
; asm: mov ah, al
; data (2): 88 c4
;
00000000.00     AND         R_EAX:32,             ff:8,           V_00:8
00000000.01     AND         R_EAX:32,      ffff00ff:32,          V_01:32
00000000.02      OR           V_00:8,             0:32,          V_02:32
00000000.03     SHL          V_02:32,             8:32,          V_03:32
00000000.04      OR          V_01:32,          V_03:32,         R_EAX:32

Many operating systems uses FS segment register to access certain system structures. OpenREIL represents this segment register as R_FS_BASE:32. Here is an example of function that gets _PEB address of the current Windows process:

ULONG_PTR get_peb(void)
{
#ifdef _X86_

    return __readfsdword(0x30);

#else _AMD64_

    return __readgsqword(0x60);

#endif
}

And IR code of this function:

;
; asm: mov eax, dword ptr fs:[0x30]
; data (6): 64 a1 30 00 00 00
;
00401000.00     STR          R_FS:16,                 ,          V_00:16
00401000.01      OR          V_00:16,             0:16,          V_01:32
00401000.02     ADD     R_FS_BASE:32,            30:32,          V_02:32
00401000.03     STR          V_02:64,                 ,          V_03:32
00401000.04     LDM          V_03:32,                 ,          V_04:32
00401000.05     STR          V_04:32,                 ,         R_EAX:32
;
; asm: ret
; data (1): c3
;
00401006.00     LDM         R_ESP:32,                 ,          V_01:32
00401006.01     ADD         R_ESP:32,             4:32,         R_ESP:32
00401006.02     JCC              1:1,                 ,          V_01:32

Representation of x86 EFLAGS

OpenREIL uses separate 1-bit registers to represent ZF, PF, CF, AF, SF and OF bits of EFLAGS register.

Example of IR code for test eax, eax x86 instruction that sets some of these bits:

;
; asm: test eax, eax
; data (2): 85 c0
;
00000000.00     STR         R_EAX:32,                 ,          V_00:32
00000000.01     STR              0:1,                 ,           R_CF:1
00000000.02     AND          V_00:32,             ff:8,           V_01:8
00000000.03     SHR           V_01:8,              7:8,           V_02:8
00000000.04     SHR           V_01:8,              6:8,           V_03:8
00000000.05     XOR           V_02:8,           V_03:8,           V_04:8
00000000.06     SHR           V_01:8,              5:8,           V_05:8
00000000.07     SHR           V_01:8,              4:8,           V_06:8
00000000.08     XOR           V_05:8,           V_06:8,           V_07:8
00000000.09     XOR           V_04:8,           V_07:8,           V_08:8
00000000.0a     SHR           V_01:8,              3:8,           V_09:8
00000000.0b     SHR           V_01:8,              2:8,           V_10:8
00000000.0c     XOR           V_09:8,           V_10:8,           V_11:8
00000000.0d     SHR           V_01:8,              1:8,           V_12:8
00000000.0e     XOR           V_12:8,           V_01:8,           V_13:8
00000000.0f     XOR           V_11:8,           V_13:8,           V_14:8
00000000.10     XOR           V_08:8,           V_14:8,           V_15:8
00000000.11     AND           V_15:8,              1:1,           V_16:1
00000000.12     NOT           V_16:1,                 ,           R_PF:1
00000000.13     STR              0:1,                 ,           R_AF:1
00000000.14      EQ          V_00:32,             0:32,           R_ZF:1
00000000.15     SHR          V_00:32,            1f:32,          V_17:32
00000000.16     AND             1:32,          V_17:32,          V_18:32
00000000.17      EQ             1:32,          V_18:32,           R_SF:1
00000000.18     STR              0:1,                 ,           R_OF:1

For machine instructions that operates with whole EFLAGS value (pushfd, popfd, etc.) OpenREIL composes it's value from values of 1-bit registers:

Example of IR code for pushfd x86 instruction:

;
; asm: pushfd
; data (1): 9c
;
00000000.00     STR         R_ESP:32,                 ,          V_00:32
00000000.01     SUB          V_00:32,             4:32,          V_01:32
00000000.02     STR          V_01:32,                 ,         R_ESP:32
00000000.03     AND      fffffffe:32,      fffffffb:32,          V_02:32
00000000.04     AND      R_EFLAGS:32,          V_02:32,          V_03:32
00000000.05     AND      ffffffbf:32,      ffffff7f:32,          V_04:32
00000000.06     AND      ffffffef:32,          V_04:32,          V_05:32
00000000.07     AND          V_05:32,      fffff7ff:32,          V_06:32
00000000.08     AND          V_03:32,          V_06:32,      R_EFLAGS:32
00000000.09      OR           R_CF:1,             0:32,          V_07:32
00000000.0a     SHL          V_07:32,             0:32,          V_08:32
00000000.0b      OR           R_PF:1,             0:32,          V_09:32
00000000.0c     SHL          V_09:32,             2:32,          V_10:32
00000000.0d      OR          V_08:32,          V_10:32,          V_11:32
00000000.0e      OR      R_EFLAGS:32,          V_11:32,          V_12:32
00000000.0f      OR           R_AF:1,             0:32,          V_13:32
00000000.10     SHL          V_13:32,             4:32,          V_14:32
00000000.11      OR           R_ZF:1,             0:32,          V_15:32
00000000.12     SHL          V_15:32,             6:32,          V_16:32
00000000.13      OR           R_SF:1,             0:32,          V_17:32
00000000.14     SHL          V_17:32,             7:32,          V_18:32
00000000.15      OR          V_16:32,          V_18:32,          V_19:32
00000000.16      OR          V_14:32,          V_19:32,          V_20:32
00000000.17      OR           R_OF:1,             0:32,          V_21:32
00000000.18     SHL          V_21:32,             b:32,          V_22:32
00000000.19      OR          V_20:32,          V_22:32,          V_23:32
00000000.1a      OR          V_12:32,          V_23:32,      R_EFLAGS:32
00000000.1b     STR      R_EFLAGS:32,                 ,          V_24:32
00000000.1c      OR          V_24:32,           202:32,          V_25:32
00000000.1d     STR       R_DFLAG:32,                 ,          V_26:32
00000000.1e     AND          V_26:32,           400:32,          V_27:32
00000000.1f      OR          V_25:32,          V_27:32,          V_28:32
00000000.20     STR      R_IDFLAG:32,                 ,          V_29:32
00000000.21      OR             15:8,             0:32,          V_30:32
00000000.22     SHL          V_29:32,          V_30:32,          V_31:32
00000000.23     AND          V_31:32,        200000:32,          V_32:32
00000000.24      OR          V_28:32,          V_32:32,          V_33:32
00000000.25     STR      R_ACFLAG:32,                 ,          V_34:32
00000000.26      OR             12:8,             0:32,          V_35:32
00000000.27     SHL          V_34:32,          V_35:32,          V_36:32
00000000.28     AND          V_36:32,         40000:32,          V_37:32
00000000.29      OR          V_33:32,          V_37:32,          V_38:32
00000000.2a     STM          V_38:32,                 ,          V_01:32

Also, as you can see from IR code above, OpenREIL uses 32-bit R_DFLAG register to represent direction flag of x86. This register is used in such instructions like movsb, lodsb, etc. as increment value for source/destination index, so, R_DFLAG must be 1:32 when direction flag is not set, and ffffffff:32 (minus one) when it's set.

Example of IR code for instruction that clears direction flag:

;
; asm: cld
; data (1): fc
;
00000000.00     STR             1:32,                 ,       R_DFLAG:32

... and for instruction that sets direction flag:

;
; asm: std
; data (1): fd
;
00000001.00     STR      ffffffff:32,                 ,       R_DFLAG:32

C API

OpenREIL C API is declared in reil_ir.h (IR format) and libopenreil.h (translator API) header files.

Here is an example of the test program that translates machine code to IR instructions and prints them to stdout:

#include <stdio.h>
#include <libopenreil.h>

#define MAX_ARG_STR 50

const char *inst_op[] =
{
    "NONE", "UNK", "JCC",
    "STR", "STM", "LDM",
    "ADD", "SUB", "NEG", "MUL", "DIV", "MOD", "SMUL", "SDIV", "SMOD",
    "SHL", "SHR", "AND", "OR", "XOR", "NOT",
    "EQ", "LT"
};

int arg_size[] = { 1, 8, 16, 32, 64 };

char *arg_print(reil_arg_t *arg, char *arg_str)
{
    memset(arg_str, 0, MAX_ARG_STR);

    switch (arg->type)
    {
    case A_NONE:

        snprintf(arg_str, MAX_ARG_STR - 1, "");
        break;

    case A_REG:
    case A_TEMP:

        snprintf(arg_str, MAX_ARG_STR - 1, "%s:%d", arg->name, arg_size[arg->size]);
        break;

    case A_CONST:

        snprintf(arg_str, MAX_ARG_STR - 1, "%llx:%d", arg->val, arg_size[arg->size]);
        break;
    }

    return arg_str;
}

void inst_print(reil_inst_t *inst)
{
    char arg_str[MAX_ARG_STR];

    // print address and mnemonic
    printf(
        "%.8llx.%.2x %7s ",
        inst->raw_info.addr, inst->inum, inst_op[inst->op]
    );

    // print instruction arguments
    printf("%16s, ", arg_print(&inst->a, arg_str));
    printf("%16s, ", arg_print(&inst->b, arg_str));
    printf("%16s  ", arg_print(&inst->c, arg_str));

    printf("\n");
}

int inst_handler(reil_inst_t *inst, void *context)
{
    // increment IR instruction counter
    *(int *)context += 1;

    // print IR instruction to the stdout
    inst_print(inst);

    return 0;
}

int translate_inst(reil_arch_t arch, unsigned char *data, int len)
{
    int translated = 0;

    // initialize REIL translator
    reil_t reil = reil_init(arch, inst_handler, (void *)&translated);
    if (reil)
    {
        // translate single instruction to REIL
        reil_translate(reil, 0, data, len);
        reil_close(reil);
        return translated;
    }

    return -1;
}

int main(int argc, char *argv[])
{
    unsigned char *test_data = "\x33\xC0"; // xor eax, eax
    int len = 2;

    if (translate_inst(ARCH_X86, test_data, len) >= 0)
    {
        return 0;
    }

    return -1;
}

Python API

Low level translation API

OpenREIL has low level translation API that returns IR instructions as Python tuple. Here is an example of decoding push eax x86 instruction using this API:

from pyopenreil import translator
from pyopenreil.REIL import ARCH_X86

# create REIL translator instance
tr = translator.Translator(ARCH_X86)

# translate machine instruction to REIL
insn_lst = tr.to_reil('\x50', addr = 0)

for insn in insn_lst:

    # print single IR instruction
    print insn
    

Program output:

((0L, 1), 0, 3, ((1, 3, 'R_EAX'), (), (2, 3, 'V_00')), {0: ('push', 'eax'), 1: ‘\x50'})
((0L, 1), 1, 3, ((1, 3, 'R_ESP'), (), (2, 3, 'V_01')), {})
((0L, 1), 2, 7, ((2, 3, 'V_01'), (3, 3, 4L), (2, 3, 'V_02')), {})
((0L, 1), 3, 3, ((2, 3, 'V_02'), (), (1, 3, 'R_ESP')), {})
((0L, 1), 4, 4, ((2, 3, 'V_00'), (), (2, 3, 'V_02')), {2: 8L})

pyopenreil.translator module is written in Cython, it’s stands for bridge between C API and high level Python API of OpenREIL. Also, OpenREIL uses JSON representation of these tuples to store translated instruction into the file or MongoDB collection.

IR constants (operation codes, argument types, etc.) are declared in pyopenreil.IR module.

High level Python API

High level Python API of OpenREIL has abstractions for IR instructions format, translator, machine instructions reader and IR instructions storage:

OpenREIL Python API diagram

The most important modules:

  • pyopenreil.IR − IR constants.
  • pyopenreil.REIL − translation and analysis API.
  • pyopenreil.symbolic − represents IR as symbolic expressions.
  • pyopenreil.VM − IR emulation.
  • pyopenreil.utils.asm − instruction reader for x86 assembly language.
  • pyopenreil.utils.bin_PE − instruction reader for PE binaries.
  • pyopenreil.utils.bin_BFD − instruction reader for ELF and Mach-O binaries.
  • pyopenreil.utils.kd − instruction reader that uses pykd API.
  • pyopenreil.utils.GDB − instruction reader that uses GDB Python API.
  • pyopenreil.utils.IDA − instruction reader that uses IDA Pro Python API.
  • pyopenreil.utils.mongodb − instruction storage that uses MongoDB.

Usage example:

from pyopenreil.REIL import *

# create in-memory storage for IR instructions
storage = CodeStorageMem(ARCH_X86)

# initialise raw instruction reader
reader = ReaderRaw(ARCH_X86, '\x50', addr = 0)

# create translator instance
tr = CodeStorageTranslator(reader, storage)

# translate single machine instruction
insn_list = tr.get_insn(0)

# enumerate and print machine instruction IR instructions
for insn in insn_list:

    print insn

After machine instruction was translated you can fetch it’s IR instructions from the storage:

insn_list = storage.get_insn(0)

It’s also possible to get a single IR instruction:

# get 2-nd IR instruction of machine instruction at address 0
insn = storage.get_insn(( 0, 1 ))

Copying contents of one storage into another:

# src -> dst
storage_2 = CodeStorageMem(ARCH_X86)
storage.to_storage(storage_2)

# dst <- src
storage_3 = CodeStorageMem(ARCH_X86)
storage_3.from_storage(storage_2)

In memory storage contents also can be saved into the JSON file:

storage.to_file('test.json')

Loading storage contents from JSON file:

# using new instance
storage = CodeStorageMem(ARCH_X86, from_file = 'test.json')

# using existing instance
storage.from_file('test.json')

Instead of raw instruction reader you also can use x86 assembly instruction reader that based on GAS. OpenREIL uses this reader in unit tests, it also might be useful for other purposes:

from pyopenreil.utils import asm

# create reader instance
reader = asm.Reader(ARCH_X86, ( 'push eax' ), addr = 0)

MongoDB as IR code storage

Using of CodeStorageMem() class as IR code storage has one significant disadvantage: real programs are very complex nowadays, so you may need to analyze amount of code that is too big to fit into the RAM.

To solve scalability problem I implemented other storage class that uses MongoDB collections to store translated code, example of it's usage:

from pyopenreil.utils import mongodb

# create storage instance
storage = mongodb.CodeStorageMongo(ARCH_X86,
          collection = 'test_fib', db = 'openreil',
          host = 'localhost', port = 27017) 

# clear old contents of collection
storage.clear()

Here is an example of MongoDB collection record format that is used by OpenREIL:

{ 
  "_id" : ObjectId("54e2a2fad87fe0462f000000"), 
  "addr" : NumberLong(1094795585), 
  "size" : 1
  "inum" : 0, 
  "op" : 3,
  "a" : [ 1, 3, "V_00" ], 
  "b" : [ ], 
  "c" : [ 2, 3, "V_01" ],  
  "attr" : [ [ 2, 0 ] ]
}

This record stores the following IR instruction:

41414141.00     STR          V_00:32,                 ,          V_01:32

MongoDB module uses [('addr', ASCENDING), ('inum', ASCENDING)] as collection index. To improve storage performance you also can join several instances of MongoDB into cluster.

Translating of basic blocks and functions

CodeStorageTranslator class of pyopenreil.REIL is also can do code translation at basic blocks or functions level. Example of basic block translation:

# initialise raw instruction reader
reader = ReaderRaw(ARCH_X86, '\x33\xC0\xC3', addr = 0)

# Create translator instance.
# We are not passing storage instance, translator will create
# instance of CodeStorageMem automatically in this case.
tr = CodeStorageTranslator(reader)

# translate single basic block
bb = tr.get_bb(0)

# print basic block information
print 'addr = %s, size = %d bytes' % (bb.ir_addr, bb.size)

# print address of the first and last instruction
print 'first = 0x%x, last = 0x%x' % (bb.first.addr, bb.last.addr)

Example of function translation:

# translate whole function
func = tr.get_func(0)

# print function information
print 'function at 0x%x, stack args count: %d' % (func.addr, func.stack_args)

# enumerate function code chunks
nun = 0
for ch in func.chunks:

    # print chunk information
    print 'chunk %d: addr = 0x%x, size = %d' % (ch.addr, ch.size)
    num += 1

# enumerate function basic blocks
for bb in func.bb_list:

    # print basic block information
    print bb

Symbolic expressions

OpenREIL can convert basic blocks and instructions into the symbolic form, it might be useful for such tasks like symbolic execution, interacting with SMT solver, decompilation and analysis of code semantics.

Let's write a program that translates assembly function into IR and prints a set of symbolic expressions for it's first basic block:

from pyopenreil.REIL import *
from pyopenreil.utils import asm

# create assembly instruction reader
reader = asm.Reader(ARCH_X86, ( 'push ebp',
                                'mov ebp, esp',
                                'xor eax, eax',
                                'test ecx, ecx', 
                                'jz _quit',
                                'inc eax',
                                '_quit: leave',
                                'ret'), addr = 0)

# create translator instance
tr = CodeStorageTranslator(reader)

# translate first basic block
bb = tr.get_bb(0)

# Get symbolic representation of basic block.
# False value of temp_regs argument indicates that
# we don't need expressions for REIL temp registers.
sym = bb.to_symbolic(temp_regs = False)

# print SymState object
print sym

Program output (symbolic representation of first basic block that contains first 5 machine instructions of the test code):

R_ESP = (R_ESP - 0x4)
*(R_ESP - 0x4) = R_EBP
R_EBP = (R_ESP - 0x4)
R_EAX = 0x0
R_CF = 0x0
R_PF = ~((((((R_ECX & 0xff) >> 0x7) ^ ((R_ECX & 0xff) >> 0x6)) ^ (((R_ECX & 0xff) >> 0x5) ^ ((R_ECX & 0xff) >> 0x4))) ^ ((((R_ECX & 0xff) >> 0x3) ^ ((R_ECX & 0xff) >> 0x2)) ^ (((R_ECX & 0xff) >> 0x1) ^ (R_ECX & 0xff)))) & 0x1)
R_AF = 0x0
R_ZF = (R_ECX == 0x0)
R_SF = (0x1 == (0x1 & (R_ECX >> 0x1f)))
R_OF = 0x0
@IP = ((((R_ECX == 0x0) | 0x0) & 0x1)) ? 0xa : 0x9

Symbolic expressions are implemented in pyopenreil.symbolic module; symbolic.Sym is a base class for other classes that represents expression members: symbolic.SymVal, symbolic.SymConst, symbolic.SymExp, symbolic.SymCond, symbolic.SymPtr and others.

You can get individual expressions from symbolic representation:

from pyopenreil.symbolic import *

# get expression for ZF register value
exp_zf = sym.query(SymVal('R_ZF'))

# prints (R_ECX == 0x0)
print exp_zf

Getting expression that represents instruction pointer value at the end of the basic block:

# get expression for EIP register value
exp_eip = sym.query(SymIP())

# prints ((((R_ECX == 0x0) | 0x0) & 0x1)) ? 0xa : 0x9
print exp_ip

You also can compare expressions, OpenREIL will care about commutativity and other important things. To match any valid expression symbolic.SymAny object is used:

# compare expressions
assert exp_zf == SymExp(I_EQ, SymVal('R_ECX'), SymConst(0, U32))
assert exp_zf == SymExp(I_EQ, SymVal('R_ECX'), SymAny())
assert exp_zf == SymAny()

For extracting information about input and output arguments of symbolic.SymState it has arg_in() and arg_out() methods:

print 'IN:', ', '.join(map(lambda a: str(a), sym.arg_in()))
print 'OUT:', ', '.join(map(lambda a: str(a), sym.arg_out()))

Console output:

IN: R_ESP, R_EBP, R_ECX
OUT: R_ESP, *(R_ESP - 0x4), R_EBP, R_EAX, R_CF, R_PF, R_AF, R_ZF, R_SF, R_OF, @IP

Control flow graphs

OpenREIL can build control flow graph (CFG) of IR code. Let's write some test program in C to demonstrate it:

#include <stdio.h>
#include <stdlib.h>

// computes a member of fibbonnaci sequence at given position
int fib(int n)
{
    int result = 1, prev = 0;

    while (n > 0)
    {
        int current = result;

        result += prev;
        prev = current;

        n -= 1;
    }

    return result;
}

int main(int argc, char *argv[])
{
    int n = 0, val = 0;

    if (argc < 2)
    {
        // input argument was not specified
        return -1;
    }

    n = atoi(argv[1]);
    
    // call test function
    return fib(n);
}

Now compile it with GCC and get virtual address of fib() function using objdump:

$ gcc tests/fib.c -o tests/fib
$ objdump -t tests/fib | grep '.text'
08048360 l    d  .text  00000000              .text
08048390 l     F .text  00000000              __do_global_dtors_aux
080483f0 l     F .text  00000000              frame_dummy
08048540 l     F .text  00000000              __do_global_ctors_aux
08048530 g     F .text  00000002              __libc_csu_fini
08048532 g     F .text  00000000              .hidden __i686.get_pc_thunk.bx
080484c0 g     F .text  00000061              __libc_csu_init
08048360 g     F .text  00000000              _start
0804844b g     F .text  0000006e              main
08048414 g     F .text  00000037              fib

OpenREIL allows to work with CFG's using REIL.CFGraph, REIL.CFGraphNode and REIL.CFGraphEdge objects; REIL.CFGraphBuilder object accepts translator instance and performs CFG traversing starting from specified address and until the end of the current machine code function.

Let's build control flow graph of fib() function (that located at address 0x08048414) of tests/fib executable:

from pyopenreil.REIL import *
from pyopenreil.utils import bin_BFD

# create executable image reader 
reader = bin_BFD.Reader('tests/fib')

# VA of the fib() function inside fib executable
addr = 0x08048414

# create translator instance
tr = CodeStorageTranslator(reader)

# build control flow graph
cfg = CFGraphBuilder(tr).traverse(addr)

You can save generated graph as file of Graphviz DOT format:

cfg.to_dot_file('cfg.dot')

Then you can render this file into the PNG image using Graphviz dot utility:

$ dot -Tpng cfg.dot > cfg.png

Rendered control flow graph of the fib() function:

OpenREIL Python API diagram

Please note, that generated IR code may have more complicated CFG layout than it's source machine code (because of x86 instructions with REP prefix, for example). Here is IR code for function that consists from rep movsb and ret instructions:

;
; asm: rep movsb byte ptr es:[edi], byte ptr [esi]
; data (2): f3 a4
;
00001337.00     STR         R_ECX:32,                 ,          V_00:32
00001337.01      EQ          V_00:32,             0:32,           V_01:1
00001337.02     JCC           V_01:1,                 ,          1339:32
00001337.03     SUB          V_00:32,             1:32,          V_02:32
00001337.04     STR          V_02:32,                 ,         R_ECX:32
00001337.05     STR       R_DFLAG:32,                 ,          V_03:32
00001337.06     STR         R_EDI:32,                 ,          V_04:32
00001337.07     STR         R_ESI:32,                 ,          V_05:32
00001337.08     LDM          V_05:32,                 ,           V_06:8
00001337.09     STM           V_06:8,                 ,          V_04:32
00001337.0a     ADD          V_04:32,          V_03:32,          V_07:32
00001337.0b     STR          V_07:32,                 ,         R_EDI:32
00001337.0c     ADD          V_05:32,          V_03:32,          V_08:32
00001337.0d     STR          V_08:32,                 ,         R_ESI:32
00001337.0e     JCC              1:1,                 ,          1337:32
;
; asm: ret
; data (1): c3
;
00001339.00     STR         R_ESP:32,                 ,          V_00:32
00001339.01     LDM          V_00:32,                 ,          V_01:32
00001339.02     ADD          V_00:32,             4:32,          V_02:32
00001339.03     STR          V_02:32,                 ,         R_ESP:32
00001339.04     JCC              1:1,                 ,          V_01:32

As you can see, this IR code has branch instructions at 1337.02 and 1337.0e, so, it's DFG has 3 nodes:

OpenREIL Python API diagram

Data flow graphs

As example, let's take a simple function that returns sum of two arguments:

long __fastcall sum(long a, long b)
{
    // a comes in ECX, b in EDX
    return a + b;
}

It's Assembly code:

_sum:

mov     eax, ecx
add     eax, edx
ret

Generated IR code:

;
; asm: mov eax, ecx
; data (2): 89 c8
;
00000000.00     STR         R_ECX:32,                 ,          V_00:32
00000000.01     STR          V_00:32,                 ,         R_EAX:32
;
; asm: add eax, edx
; data (2): 01 d0
;
00000002.00     STR         R_EAX:32,                 ,          V_00:32
00000002.01     STR         R_EDX:32,                 ,          V_01:32
00000002.02     ADD          V_00:32,          V_01:32,          V_02:32
00000002.03     STR          V_02:32,                 ,         R_EAX:32
00000002.04     ADD          V_00:32,          V_01:32,          V_03:32
00000002.05      LT          V_03:32,          V_00:32,           R_CF:1
00000002.06     AND          V_03:32,             ff:8,           V_04:8
00000002.07     SHR           V_04:8,              7:8,           V_05:8
00000002.08     SHR           V_04:8,              6:8,           V_06:8
00000002.09     XOR           V_05:8,           V_06:8,           V_07:8
00000002.0a     SHR           V_04:8,              5:8,           V_08:8
00000002.0b     SHR           V_04:8,              4:8,           V_09:8
00000002.0c     XOR           V_08:8,           V_09:8,           V_10:8
00000002.0d     XOR           V_07:8,           V_10:8,           V_11:8
00000002.0e     SHR           V_04:8,              3:8,           V_12:8
00000002.0f     SHR           V_04:8,              2:8,           V_13:8
00000002.10     XOR           V_12:8,           V_13:8,           V_14:8
00000002.11     SHR           V_04:8,              1:8,           V_15:8
00000002.12     XOR           V_15:8,           V_04:8,           V_16:8
00000002.13     XOR           V_14:8,           V_16:8,           V_17:8
00000002.14     XOR           V_11:8,           V_17:8,           V_18:8
00000002.15     AND           V_18:8,              1:1,           V_19:1
00000002.16     NOT           V_19:1,                 ,           R_PF:1
00000002.17     XOR          V_00:32,          V_01:32,          V_20:32
00000002.18     XOR          V_03:32,          V_20:32,          V_21:32
00000002.19     AND            10:32,          V_21:32,          V_22:32
00000002.1a      EQ             1:32,          V_22:32,           R_AF:1
00000002.1b      EQ          V_03:32,             0:32,           R_ZF:1
00000002.1c     SHR          V_03:32,            1f:32,          V_23:32
00000002.1d     AND             1:32,          V_23:32,          V_24:32
00000002.1e      EQ             1:32,          V_24:32,           R_SF:1
00000002.1f     XOR          V_01:32,      ffffffff:32,          V_25:32
00000002.20     XOR          V_00:32,          V_25:32,          V_26:32
00000002.21     XOR          V_00:32,          V_03:32,          V_27:32
00000002.22     AND          V_26:32,          V_27:32,          V_28:32
00000002.23     SHR          V_28:32,            1f:32,          V_29:32
00000002.24     AND             1:32,          V_29:32,          V_30:32
00000002.25      EQ             1:32,          V_30:32,           R_OF:1
;
; asm: ret
; data (1): c3
;
00000004.00     STR         R_ESP:32,                 ,          V_00:32
00000004.01     LDM          V_00:32,                 ,          V_01:32
00000004.02     ADD          V_00:32,             4:32,          V_02:32
00000004.03     STR          V_02:32,                 ,         R_ESP:32
00000004.04     JCC              1:1,                 ,          V_01:32

Now let's construct DFG of this function:

from pyopenreil.REIL import *
from pyopenreil.utils import asm

# create assembly instruction reader
reader = asm.Reader(ARCH_X86, ( 'mov eax, ecx',
                                'add eax, edx', 
                                'ret' ), addr = 0)

# create translator instance
tr = CodeStorageTranslator(reader)

# build data flow graph
dfg = DFGraphBuilder(tr).traverse(0)

Rendered data flow graph picture (link) for this simple code looks quite complex because add instruction (like any other arithmetic instruction of x86) is doing a lot of EFLAGS computations.

REIL.DFGraph allows to apply some basic data flow code optimizations to translated IR code, currently it supports such well known compiler optimizations as dead code elimination, constant folding and very basic common subexpressions elimination.

Let's apply these optimizations to fib() function code:

# run some of available optimizations
dfg.eliminate_dead_code()
dfg.constant_folding()

# store results
dfg.store(tr.storage)

# print optimized IR code
print tr.storage.get_func(0)

Optimized IR code:

;
; asm: mov eax, ecx
; data (2): 89 c8
;
00000000.00     STR         R_ECX:32,                 ,          V_00:32
00000000.01     STR          V_00:32,                 ,         R_EAX:32
;
; asm: add eax, edx
; data (2): 01 d0
;
00000002.00     STR         R_EAX:32,                 ,          V_00:32
00000002.01     STR         R_EDX:32,                 ,          V_01:32
00000002.02     ADD          V_00:32,          V_01:32,          V_02:32
00000002.03     STR          V_02:32,                 ,         R_EAX:32
;
; asm: ret
; data (1): c3
;
00000004.00     STR         R_ESP:32,                 ,          V_00:32
00000004.01     LDM          V_00:32,                 ,          V_01:32
00000004.02     ADD          V_00:32,             4:32,          V_02:32
00000004.03     STR          V_02:32,                 ,         R_ESP:32
00000004.04     JCC              1:1,                 ,          V_01:32

As you can see, all x86 flags computations was removed during optimization, because dead code elimination assumes that these values will not be used after the function exit.

Let's export and view DFG of achieved code:

dfg.to_dot_file('dfg.dot')

OpenREIL Python API diagram

It's obvious to figure, that optimized code is still no prefect: it uses additional 7 temp registers (1 in first instruction, 3 in second and third instructions) to represent our machine code, while only one temp register (V_01:32 inside IR code of ret) is enough.

REIL.DFGraph is also allows to optimize temp registers usage with common subexpressions elimination:

dfg.eliminate_subexpressions()
dfg.store(tr.storage)

print tr.storage

Now translated code looks pretty close to original machine code:

;
; asm: mov eax, ecx
; data (2): 89 c8
;
00000000.00     STR         R_ECX:32,                 ,         R_EAX:32
;
; asm: add eax, edx
; data (2): 01 d0
;
00000002.00     ADD         R_EAX:32,         R_EDX:32,         R_EAX:32
;
; asm: ret
; data (1): c3
;
00000004.00     LDM         R_ESP:32,                 ,          V_01:32
00000004.01     ADD         R_ESP:32,             4:32,         R_ESP:32
00000004.02     JCC              1:1,                 ,          V_01:32

Data flow graph of final code after all optimizations:

OpenREIL Python API diagram

REIL.CFGraph is also has optimize_all() method that runs all available optimizations.

Please note, that you need to specify True value for keep_flags argument of CFGraph.eliminate_dead_code() if you want to keep flag register values at function exit.

It also will be necessary to say, that described optimizations was designed not for defeating code obfuscation or something, but rather for reducing amount of ineffective code produced by libopenreil translator.

Handling of unknown instructions

During static code analysis it will be useful to have ability to get metainformation about machine instruction (source and destination registers, etc.) even if translation of this instruction to IR was unsuccessful. With such ability it will be not necessary to interrupt code analysis on unknown/untranslated machine instruction, you'll just loose some analysis accuracy.

VEX library used in OpenREIL has excellent support of x86 general purpose instructions, but it can't translate to IR some certain system instructions like rdmsr, wrmsr, cpuid, sidt, sgdt, sldt and some others. In this case libopenreil will try to use Capstone engine to get information about registers reads and writes, and pass this information to upper abstraction levels. For representing of unknown/untranslated instructions OpenREIL has I_UNK IR instruction.

Let's try to translate to IR some code that uses cpuid instruction to get information about x86 processor info and features:

from pyopenreil.REIL import *
from pyopenreil.utils import asm

# create assembly instruction reader
reader = asm.Reader(ARCH_X86, ( 'mov eax, 00000001h',
                                'mov ecx, 0',
                                'cpuid',
                                'mov eax, ecx',
                                'ret' ), addr = 0)

# create translator instance
tr = CodeStorageTranslator(reader)

# translate function and build it's DFG
dfg = DFGraphBuilder(tr).traverse(addr)

# Run all available code optimizations and update 
# storage with new function code.
dfg.optimize_all(tr.storage)

# print IR of translated and optimized function
print tr.get_func(0)

Generated IR code:

;
; asm: mov eax, 1
; data (5): b8 01 00 00 00
;
00000000.00     STR             1:32,                 ,         R_EAX:32
;
; asm: mov ecx, 0
; data (5): b9 00 00 00 00
;
00000005.00     STR             0:32,                 ,         R_ECX:32
;
; asm: cpuid -- reads: R_EAX, R_ECX; writes: R_EAX, R_EBX, R_ECX, R_EDX
; data (2): 0f a2
;
0000000a.00     UNK                 ,                 ,
;
; asm: mov eax, ecx
; data (2): 89 c8
;
0000000c.00     STR         R_ECX:32,                 ,         R_EAX:32
;
; asm: ret
; data (1): c3
;
0000000e.00     LDM         R_ESP:32,                 ,          V_01:32
0000000e.01     ADD         R_ESP:32,             4:32,         R_ESP:32
0000000e.02     JCC              1:1,                 ,          V_01:32

As you can see, it has I_UNK instruction with a.00 IR address.

Data flow graph of IR code with I_UNK node that represents cpuid instruction with R_EAX, R_ECX as source and R_EDX, R_ECX, R_EBX as destination arguments:

OpenREIL Python API diagram

IR code emulation

OpenREIL has emulator for IR code, currently engine uses it mostly for tests. The main idea of such tests − run machine code natively first, then translate it to IR and run under emulation, and finally − compare execution results. Example of programs that preforms such tests: tests/test_fib.py and tests/test_rc4.py.

Let's run some simple code under emulation to demonstrate it's API usage:

from pyopenreil.REIL import *
from pyopenreil.VM import *

from pyopenreil.utils import asm

stack_addr = 0x50000
func_addr = 0x1000

# create assembly instruction reader
reader = asm.Reader(ARCH_X86, ( 'mov eax, ecx',
                                'add eax, edx', 
                                'ret' ), addr = func_addr)

# create translator instance
tr = CodeStorageTranslator(reader)

# create virtual CPU instance
cpu = Cpu(ARCH_X86)

# set up stack pointer and input args
cpu.reg('esp').val = stack_addr
cpu.reg('ecx').val = 1 # first arg of test function
cpu.reg('edx').val = 2 # second arg of test function

# run test code until ret
try: cpu.run(tr, func_addr)
except MemReadError as e: 

    # IR code of ret instruction will generate
    # memory access exception because we passed 
    # an invalid stack pointer to test code.
    # So, here we just stopping emulation when such 
    # exception occurs.
    if e.addr != stack_addr: raise

# check for correct return value (1 + 2 = 3)
assert cpu.reg('eax').val == 3

# print current CPU context info into console
cpu.dump()

Console output of the cpu.dump() call:

   R_EAX: 0000000000000003
   R_ECX: 0000000000000001
   R_EDX: 0000000000000002
   R_ESP: 0000000000050000
    R_AF: 0000000000000000
    R_PF: 0000000000000001
    R_OF: 0000000000000000
    R_ZF: 0000000000000000
    R_CF: 0000000000000000
    R_SF: 0000000000000000

For more user friendly interacting with emulated code OpenREIL has pyopenreil.VM.Abi class that allows to allocate emulator memory and place arbitrary data there (Abi.string() and Abi.buffer() methods), or call native code functions using Abi.stdcall(), Abi.cdecl() and Abi.fastcall() methods. There's also pyopenreil.VM.Stack class to represent a stack memory.

Here is example of calling test code from above using pyopenreil.VM.Abi:

# Abi constructor accepts translator and virtual CPU instances
abi = Abi(cpu, tr)

# call our test code using fastcall call convention
ret = abi.fastcall(func_addr, 1, 2)

# check for correct return value
assert ret == 3

Let's try more complex IR code emulation example. Here is the source code of test program that performs RC4 encryption/decryption of some data:

#include <stdio.h>
#include <string.h>

#define MIN(a, b) (((a) < (b)) ? (a) : (b))

typedef struct _RC4_CTX 
{
    unsigned char S[256];
    unsigned char x, y;

} RC4_CTX;

void rc4_swap(unsigned char *a, unsigned char *b)
{
    unsigned char c = *a;

    *a = *b;
    *b = c;
}

void rc4_set_key(RC4_CTX *ctx, unsigned char *key, int key_len)
{
    int i = 0;
    unsigned char x = 0, y = 0;    
    unsigned char *S = ctx->S;

    ctx->x = x = 0;
    ctx->y = y = 0;

    for (i = 0; i < 256; i++)
    {
        S[i] = (unsigned char)i;
    }
    
    for (i = 0; i < 256; i++)
    {
        y = (y + S[i] + key[x]) % 256;

        rc4_swap(&S[i], &S[y]);

        x = (x + 1) % key_len;
    }
}

void rc4_crypt(RC4_CTX *ctx, unsigned char *data, int data_len)
{
    int i = 0;
    unsigned char x = 0, y = 0;
    unsigned char *S = ctx->S;

    for (i = 0; i < data_len; i++)
    {
        x = (x + 1) % 256;
        y = (y + S[x]) % 256;

        rc4_swap(&S[x], &S[y]);

        data[i] ^= S[(S[x] + S[y]) % 256];
    }

    ctx->x = x;
    ctx->y = y;
}

#define MAX_DATA_LEN 255

RC4_CTX m_ctx;

int main(int argc, char *argv[])
{
    if (argc < 3)
    {
        return -1;
    }
    
    // get encryption key and data
    char *key = argv[1], *data = argv[2];
    int key_len = strlen(key);
    int data_len = strlen(data);

    // copy data to local in/out buffer
    unsigned char buff[MAX_DATA_LEN];
    memcpy(buff, data, MIN(MAX_DATA_LEN, data_len));
 
    // perform encryption
    rc4_set_key(&m_ctx, (unsigned char *)key, key_len);
    rc4_crypt(&m_ctx, buff, data_len);

    printf("Encrypted data: ");

    for (i = 0; i < data_len; i++)
    {
        printf("%.2x ", buff[i]);
    }

    printf("\n");

    return 0;
}

Compile it and get VA's of rc4_set_key() and rc4_crypt() functions:

$ gcc tests/rc4.c -o tests/rc4
$ objdump -t tests/rc4 | grep '.text'
080483e0 l    d  .text  00000000              .text
08048410 l     F .text  00000000              __do_global_dtors_aux
08048470 l     F .text  00000000              frame_dummy
08048980 l     F .text  00000000              __do_global_ctors_aux
08048970 g     F .text  00000002              __libc_csu_fini
080484b9 g     F .text  000000df              rc4_set_key
08048972 g     F .text  00000000              .hidden __i686.get_pc_thunk.bx
08048494 g     F .text  00000025              rc4_swap
08048900 g     F .text  00000061              __libc_csu_init
080483e0 g     F .text  00000000              _start
08048674 g     F .text  00000280              main
08048598 g     F .text  000000dc              rc4_crypt

Now let's write a test program that encrypts some data using these functions under emulation and compares encryption results with built-in Python implementation of RC4:

from pyopenreil.REIL import *
from pyopenreil.VM import *

from pyopenreil.utils import bin_BFD

# test input data for RC4 encryption
test_key = 'somekey'
test_val = 'foobar'

# VA's of required functions
rc4_set_key = 0x080484b9
rc4_crypt = 0x08048598

# load test program image
reader = bin_BFD.Reader('tests/rc4')
tr = CodeStorageTranslator(reader)

def code_optimization(addr):

    # construct dataflow graph for given function
    dfg = DFGraphBuilder(tr).traverse(addr)
  
    # run available optimizations
    dfg.optimize_all(tr.storage)

code_optimization(rc4_set_key)
code_optimization(rc4_crypt)

# create CPU and ABI
cpu = Cpu(ARCH_X86)
abi = Abi(cpu, tr)

# allocate buffers for arguments of emulated function
ctx = abi.buff(256 + 4 * 2) # rc4_ctx structure
val = abi.buff(test_val) # data to encrypt

# emulate rc4_set_key() function call
abi.cdecl(rc4_set_key, ctx, test_key, len(test_key))

# emulate rc4_crypt() function call
abi.cdecl(rc4_crypt, ctx, val, len(test_val))

# read results of RC4 encryption
val_1 = abi.read(val, len(test_val))
print 'Encrypted value:', repr(val_1)

# compare results with Python build-in RC4 module
from Crypto.Cipher import ARC4
rc4 = ARC4.new(test_key)
val_2 = rc4.encrypt(test_val)

assert val_1 == val_2

It's took around 5 seconds to execute this code, it shows that Python implementation of IR code emulator is a quite slow. I'm not sure if OpenREIL emulation features will be useful for any research purposes (it seems that no), but as was said above, it helps me a lot with translator testing.

Debugging OpenREIL

OpenREIL Python and C API providies some settings to control debug information output, error messages, etc.

libopenreil has reil_log_init() and reil_log_close() functions that declared in libopenreil.h header file:

/*
    Debug information type constants for mask argument of reil_log_init()
*/
#define REIL_LOG_INFO   0x00000001  // regular message
#define REIL_LOG_WARN   0x00000002  // error
#define REIL_LOG_ERR    0x00000004  // warning
#define REIL_LOG_BIN    0x00000008  // instruction bytes
#define REIL_LOG_ASM    0x00000010  // instruction assembly code
#define REIL_LOG_VEX    0x00000020  // instruction VEX code
#define REIL_LOG_BIL    0x00000040  // instruction BAP IL code

// all log messages
#define REIL_LOG_ALL 0x7FFFFFFF

// disable log messages
#define REIL_LOG_NONE 0

// by default we prints warnings, errors and info to stderr
#define REIL_LOG_DEFAULT (LOG_INFO | LOG_WARN | LOG_ERR)

...

/*
    Initialize debug log. 
    * Individual bits of mask argument tells what kind of information
    we need to include into the log (see REIL_LOG_* constants).
    * Log file path can be specified in path argument, NULL value 
    indicates that we need to print debug information only to stderr.
*/
int reil_log_init(uint32_t mask, const char *path);

/*
    Closes currently opened log file.
*/
void reil_log_close(void);

pyopenreil has similar functions REIL.log_init() and REIL.log_close(), here you can see simple.py Python program that uses debug output file debug.log:

from pyopenreil.REIL import *
from pyopenreil.utils import asm

# set up debug logging
log_init(LOG_ALL, 'debug.log')

# create assembly instruction reader
reader = asm.Reader(ARCH_X86, ( 'xchg dword ptr [esp], edx'), addr = 0)

# create translator instance
tr = CodeStorageTranslator(reader)

# translate first machine instruction
print tr.get_insn(0)

# close currently opened debug log
log_close()

After running this code debug output file will contain the following data:

------------------------ After instrumentation ------------------------

IRSB {
   t0:I32   t1:I32   t2:I32   t3:I32   t4:I32   t5:I1   t6:I32
   ------ IMark(0x1337, 3, 0) ------
   t2 = GET:I32(24)
   t0 = LDle:I32(t2)
   t1 = GET:I32(16)
   t3 = CASle(t2::t0->t1)
   t5 = CasCmpNE32(t3,t0)
   if (t5) { PUT(68) = 0x1337:I32; exit-Boring }
   PUT(16) = t0
   PUT(68) = 0x133A:I32; exit-Boring
}

VexExpansionRatio 3 65   216 :10

BIN { 00001337: 87 14 24 }
ASM { 00001337: xchg dword ptr [esp], edx ; len = 3 }
BAP {
   label pc_0x1337:
   var T_32t0:reg32_t;
   var T_32t1:reg32_t;
   var T_32t2:reg32_t;
   var T_32t3:reg32_t;
   var T_32t4:reg32_t;
   var T_1t5:reg1_t;
   var T_32t6:reg32_t;
   T_32t2:reg32_t = R_ESP:reg32_t;
   T_32t0:reg32_t = mem[T_32t2:reg32_t];
   T_32t1:reg32_t = R_EDX:reg32_t;
   T_32t3:reg32_t = mem[T_32t2:reg32_t];
   cjmp((T_32t3:reg32_t<>T_32t0:reg32_t),name(L_1),name(L_0));
   label L_0:
   mem[T_32t2:reg32_t] = T_32t1:reg32_t;
   label L_1:
   T_1t5:reg1_t = (T_32t3:reg32_t<>T_32t0:reg32_t);
   cjmp(T_1t5:reg1_t,name(pc_0x1337),name(L_2));
   label L_2:
   R_EDX:reg32_t = T_32t0:reg32_t;
   // BAP label pc_0x1337 at 00001337.00
   // BAP label L_0 at 00001337.07
   // BAP label L_1 at 00001337.08
   // BAP label L_2 at 00001337.0b
}

As you can see, it has information about each machine instruction and it's binary, assembly, VEX IR and BIL code that might be very useful for translation errors investigation.

Also, pyopenreil allows you to pass debug logging mask and file path to any OpenREIL program using environment variables REIL_LOG_MASK and REIL_LOG_PATH. Example:

$ REIL_LOG_MASK=0x7fffffff REIL_LOG_PATH=debug.log python simple.py

Source code of pyopenreil has a lot of built-in tests that uses unittest unit testing framework, you can run all tests using test make target:

$ make test

Or run all unit tests manually using tests/run_unittest.py program:

$ REIL_LOG_MASK=0x7fffffff python tests/run_unittest.py

Or run individual unit test:

$ REIL_LOG_MASK=0x7fffffff python tests/run_unittest.py TestArchX86.test_xchg

Using with third party tools

Most of examples in this document was focused on using OpenREIL to develop stand-alone code analysis tools, but it's also possible to use it with any existing reverse engineering tool that supports Python scripting. OpenREIL has build-in support of IDA Pro, GDB and WinDbg as machine code sources.

IDA Pro

Sample IDA Pro script that comes with OpenREIL, ida_translate_func.py, can translate machine functions to IR code and save it into the JSON files:

import sys, os
import idc
from pyopenreil.REIL import *
from pyopenreil.utils import IDA

DEF_ARCH = ARCH_X86

# get address of the current function
addr = idc.ScreenEA()

arch = DEF_ARCH
path = os.path.join(os.getcwd(), 'sub_%.8X.ir' % addr)

# initialise OpenREIL stuff
reader = IDA.Reader(arch)
storage = CodeStorageMem(arch)
tr = CodeStorageTranslator(reader, storage)

# translate function and enumerate it's basic blocks
func = tr.get_func(addr)

for bb in func.bb_list: print bb

print '[*] Saving instructions into the %s' % path
print '[*] %d IR instructions in %d basic blocks translated' % (len(func), len(func.bb_list))

# save function IR into the JSON file
storage.to_file(path)

This script utilises IDAPython API and instructions reader from pyopenreil.utils.IDA to interact with IDA Pro.

To use this script run «File» → «Script File» (Alt-F7) IDA command and open pyopenreil/scripts/ida_translate_func.py, than script will translate machine code of current subroutine into the IR:

OpenREIL Python API diagram

In other script you can use from_file() method of REIL.CodeStorageMem class to load translated instructions from JSON file and do some code analysis.

GDB

OpenREIL translation plugin for GDB is located at pyopenreil/scripts/gdb_reiltrans.py, it implements reiltrans command that allows to generate IR for given instruction, basic block or function and print it to console or save to file.

Usage example:

$ gdb tests/fib
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /vagrant_data/openreil/tests/fib...(no debugging symbols found)...done.
(gdb) source pyopenreil/scripts/gdb_reiltrans.py
(gdb) reiltrans
USAGE: reiltrans insn|block|func <address> [options]
(gdb) info address fib
Symbol "fib" is at 0x8048414 in a file compiled without debugging.
(gdb) reiltrans func 0x8048414 --to-file fib.ir
205 instructions saved to fib.ir

To fetch machine instructions from GDB in your own script − just use Reader class from pyopenreil.utils.GDB module:

import gdb
from pyopenreil.utils import GDB

# Inferriors in GDB API are representing programs
# that being run under debugger.
inferrior = gdb.selected_inferior()

# create instruction reader
reader = GDB.Reader(ARCH_X86, inferrior)

WinDbg (kd, cdb)

WinDbg (and also kd, cdb and others) from Microsoft with pykd Python bindings is also allows to use OpenREIL in similar way.

Plugin for WinDbg is implemented in pyopenreil/scripts/kd_reiltrans.py file, copy it into the winext dir of your WinDbg installation (latest version is usually located at C:\Program Files (x86)\Windows Kits\8.1\Debuggers\x86\winext).

Usage example:

OpenREIL Python API diagram

Machine code reader for WinDbg is located in pyopenreil.utils.kd module:

from pyopenreil.utils import kd

# create instruction reader
reader = kd.Reader(ARCH_X86)

Links and examples

I made a test program that uses OpenREIL and Microsoft Z3 to implement a simple symbolic execution engine for solving Kao's Toy Project crackme.

  • tests/test_kao.py - program source code (link).
  • «Automated algebraic cryptanalysis with OpenREIL and Z3» - my article that explains how the program works (link).
  • «Kao’s “Toy Project” and Algebraic Cryptanalysis» - Dcoder's article with detailed description of Kao's Toy Project crackme and it's solution (PDF).
  • «CONFidence CTF 2015: Reversing 400 "Deobfuscate Me" writeup» - article by H4x0rPsch0rr team about using OpenREIL to solve CTF task (link).

TODO

  • x86_64 support (VEX already has it).
  • Floating point instructions support for x86 and x86_64.
  • SSE instructions support for x86 and x86_64.
  • AVX instructions support for x86 and x86_64.
  • Complete API reference.
  • SMT solvers support.
  • Symbolic execution.
  • REIL → LLVM IR → binary code translation.
  • Examples of libopenreil usage in PIN and DynamoRIO modules.
  • DynamoRIO based tracer that saves IR traces into the MongoDB collection.
  • Out of the box support of LLDB and Immunity Debugger.

About

Open source library that implements translator and tools for REIL (Reverse Engineering Intermediate Language)

License:Other


Languages

Language:C 94.3%Language:C++ 3.6%Language:Python 1.7%Language:Makefile 0.2%Language:Perl 0.1%Language:M4 0.0%Language:Assembly 0.0%Language:Haskell 0.0%Language:Shell 0.0%