yegord / snowman

I might be completely crazy, but I started decompiling a shared object the other day and found the output really confusing. Looking at objdump's output, I realized that returns and jumps are quite often followed by an odd number of 0x00 bytes, which makes the next opcode decode as an add instead of a mov (moves are what I see most often). I might be completely wrong about this, but after some very tedious hex editing of a copy of the library to replace 0x00 with 0x90 (the x86_64 nop), the decompiled output became clearer in both snowman and objdump.

Am I right in this evaluation? I haven't been able to look through how snowman parses the opcodes yet, but I would love some feedback before going down a rabbit hole fit for a noob.

It would be great to see a minimal example.

You can disassemble a range of addresses using either the GUI or the command-line decompiler (--from, --to, --print-instructions). If a range starting with 0x00 0x00, followed by a sensible instruction, is disassembled correctly, and a range starting with a single 0x00, followed by the same sensible instruction, is not, we probably have a bug.

Could you provide such an example (essentially, several bytes of executable code)?

I'll need some time to get everything set up on my box at home, but in the meantime...
What I see out of objdump is something like this:
12c366: c3 ret
12c367: 00 00 add
12c369: 00 00 add
12c36b: 00 48 89 add
12c36e: 5c pop
12c36f: 24 e8 and
12c371: 48 89 6c 24 f0 mov

Somewhere else in the code I might see a call or jump to address 12c36c, which seems to make the add at 12c36b erroneous.

snowman's output for the same function shows the add operations, and if I am following decode.c properly, it sets the instruction offset at the same incorrect address.
I'll try to get a full snippet from the binary later.

Probably what you see is the following.

The decompiler is pretty stupid at deciding where an instruction begins.
It just tries to disassemble starting from the 0th byte in the code section.
If disassembling succeeds, it takes the size of the instruction, adds it to the address of the instruction, and tries to disassemble the instruction at this address.
If disassembling fails, it adds 1 (on x86) to the address of the previous attempt and tries again.

So, if for "00 48 89" libudis86 says this is a 3-byte-long add instruction, the decompiler trusts it and goes on with disassembling the immediately following "5c".
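The sweep described above can be modelled in a few lines. This is only a rough sketch, not snowman's code: `decode` is a hypothetical stand-in for libudis86 that knows just the handful of opcodes appearing in this thread, and the byte string is reconstructed from the objdump excerpt quoted earlier (five zero bytes after the ret, then two movs).

```python
def modrm_size(code, i):
    """Size of ModRM plus optional SIB and displacement starting at code[i]."""
    mod, rm = code[i] >> 6, code[i] & 7
    size = 1
    if mod != 3 and rm == 4:                      # a SIB byte follows
        size += 1
        if mod == 0 and (code[i + 1] & 7) == 5:
            size += 4                             # SIB with disp32, no base
    if mod == 1:
        size += 1                                 # disp8
    elif mod == 2 or (mod == 0 and rm == 5):
        size += 4                                 # disp32
    return size


def decode(code, i):
    """Toy decoder: return (mnemonic, size) at offset i, or None on failure."""
    try:
        op = code[i]
        if op == 0xC3:
            result = ("ret", 1)
        elif op == 0x5C:
            result = ("pop", 1)
        elif op == 0x24:
            result = ("and", 2)                               # and al, imm8
        elif op == 0x00:
            result = ("add", 1 + modrm_size(code, i + 1))     # add r/m8, r8
        elif op == 0x48 and code[i + 1] == 0x89:
            result = ("mov", 2 + modrm_size(code, i + 2))     # mov r/m64, r64
        else:
            return None
    except IndexError:
        return None
    return result if i + result[1] <= len(code) else None


def linear_sweep(code, base):
    """Decode from offset 0; on success advance by the size, on failure by 1."""
    out, i = [], 0
    while i < len(code):
        d = decode(code, i)
        if d is None:
            i += 1
            continue
        out.append((base + i, d[0], d[1]))
        i += d[1]
    return out


# Bytes reconstructed from the objdump excerpt in this thread.
CODE = bytes.fromhex("c3 00 00 00 00 00 48 89 5c 24 e8 48 89 6c 24 f0")

for addr, mnemonic, size in linear_sweep(CODE, 0x12C366):
    print(f"{addr:x}: {mnemonic} ({size} bytes)")
```

Starting the same sweep at 0x12c36c (that is, `linear_sweep(CODE[6:], 0x12C36C)`) yields the two movs, which is why a call or jump to that address makes the add at 0x12c36b look wrong.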

Ideally, one should do something more clever: like first identifying jump destinations, then disassembling everything starting from these jump destinations, and then trying to disassemble all the rest using something like the current approach.

So, IR generation (when you have IR, you know what is a jump, and can estimate its destination) should be intertwined with disassembling.
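A minimal sketch of that two-pass idea, assuming the jump and call destinations are already known (in the real decompiler they would have to come out of IR generation, which is the hard part). `PATTERNS`, `decode`, and `disassemble` are hypothetical names; the toy decoder only recognises the exact byte sequences quoted in this thread.

```python
# Toy stand-in for a real decoder such as libudis86: it matches only the
# exact byte patterns seen in this thread (an assumption for illustration).
PATTERNS = {
    bytes.fromhex("c3"): "ret",
    bytes.fromhex("0000"): "add",
    bytes.fromhex("004889"): "add",
    bytes.fromhex("5c"): "pop",
    bytes.fromhex("24e8"): "and",
    bytes.fromhex("48895c24e8"): "mov",
    bytes.fromhex("48896c24f0"): "mov",
}


def decode(code, i):
    """Return (mnemonic, size) for the instruction at offset i, or None."""
    for pattern, mnemonic in PATTERNS.items():
        if code.startswith(pattern, i):
            return mnemonic, len(pattern)
    return None


def disassemble(code, base, targets):
    """Decode from known targets first, then sweep whatever bytes are left."""
    found = {}                       # offset -> (mnemonic, size)
    covered = [False] * len(code)    # bytes already claimed by an instruction

    def try_claim(i):
        """Record the instruction at i unless decoding fails or it overlaps."""
        d = decode(code, i)
        if d is None or any(covered[i:i + d[1]]):
            return None
        found[i] = d
        covered[i:i + d[1]] = [True] * d[1]
        return d

    # Pass 1: trust each target and follow fall-through until a ret.
    for target in targets:
        i = target - base
        while 0 <= i < len(code):
            d = try_claim(i)
            if d is None or d[0] == "ret":
                break
            i += d[1]

    # Pass 2: linear sweep over the rest, never crossing a claimed byte.
    i = 0
    while i < len(code):
        if covered[i]:
            i += 1
            continue
        d = try_claim(i)
        i += d[1] if d else 1

    return [(base + i, found[i][0]) for i in sorted(found)]


CODE = bytes.fromhex("c3 00 00 00 00 00 48 89 5c 24 e8 48 89 6c 24 f0")
for addr, mnemonic in disassemble(CODE, 0x12C366, targets=[0x12C36C]):
    print(f"{addr:x}: {mnemonic}")
```

With 0x12c36c given as a target, the 3-byte add at 0x12c36b is rejected because it would overlap an instruction that is already trusted, and the stray zero byte is simply left undecoded.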

Agreed. I tried building a test object but couldn't get the padding to show up between functions. I was, however, able to get a simple hack working that skips some cases of zeros after UD_Iret and UD_Ijmp. I'm still trying to trace it all out. It's not as simple as I had thought.
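For illustration, a hack of this kind might look roughly like the sketch below. It is a guess at the shape of the idea, not the actual patch: the mnemonic strings stand in for UD_Iret and UD_Ijmp, and `PATTERNS`, `decode`, and `sweep_skipping_padding` are hypothetical names built around a toy decoder that knows only the bytes quoted in this thread.

```python
# Toy decoder table limited to byte patterns from this thread (assumption).
PATTERNS = {
    bytes.fromhex("c3"): "ret",
    bytes.fromhex("0000"): "add",
    bytes.fromhex("004889"): "add",
    bytes.fromhex("5c"): "pop",
    bytes.fromhex("24e8"): "and",
    bytes.fromhex("48895c24e8"): "mov",
    bytes.fromhex("48896c24f0"): "mov",
}


def decode(code, i):
    """Return (mnemonic, size) for the instruction at offset i, or None."""
    for pattern, mnemonic in PATTERNS.items():
        if code.startswith(pattern, i):
            return mnemonic, len(pattern)
    return None


def sweep_skipping_padding(code, base):
    """Linear sweep that treats 0x00 bytes right after ret/jmp as padding."""
    out, i = [], 0
    after_branch = False
    while i < len(code):
        if after_branch and code[i] == 0x00:
            i += 1                    # skip one padding byte, keep looking
            continue
        d = decode(code, i)
        if d is None:
            i += 1
            after_branch = False
            continue
        out.append((base + i, d[0]))
        after_branch = d[0] in ("ret", "jmp")
        i += d[1]
    return out


CODE = bytes.fromhex("c3 00 00 00 00 00 48 89 5c 24 e8 48 89 6c 24 f0")
for addr, mnemonic in sweep_skipping_padding(CODE, 0x12C366):
    print(f"{addr:x}: {mnemonic}")
```

This is a heuristic with an obvious weakness: 00 00 is a valid instruction in its own right, so real code that begins with a zero byte right after a ret or jmp would be skipped by mistake.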

Do blank lines between instructions have any specific meaning?

1dce5b: mov [rsp], edx

1dce5f: jz 0x1dce5f

1dce62: inc dword [rax]

I think this is getting misinterpreted too. Right after this is another group of adds from odd zero padding.

Blank lines between instructions mean that an instruction does not begin where the previous one ends:

snowman/src/nc/core/arch/Instructions.cpp

Line 76 in aba7afe

if (instr->addr() != successorAddress) {
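The effect of that check can be sketched as follows. This is a small model of the printing logic, not the code from Instructions.cpp: `format_listing` is a hypothetical name, and the instruction sizes in the example are assumed for illustration, since the listing above does not show the raw bytes.

```python
def format_listing(instructions):
    """Format (address, size, text) tuples, sorted by address.

    A blank line is emitted whenever an instruction does not start at the
    address where the previous one ended.
    """
    lines = []
    successor = None                  # address right after the previous instruction
    for addr, size, text in instructions:
        if successor is not None and addr != successor:
            lines.append("")
        lines.append(f"{addr:x}: {text}")
        successor = addr + size
    return "\n".join(lines)


# Addresses and text from the listing in this thread; sizes are assumptions.
EXAMPLE = [
    (0x1DCE5B, 3, "mov [rsp], edx"),
    (0x1DCE5F, 2, "jz 0x1dce5f"),
    (0x1DCE62, 2, "inc dword [rax]"),
]
print(format_listing(EXAMPLE))
```

Under these assumed sizes each instruction leaves a one-byte gap before the next, so every pair is separated by a blank line, which suggests bytes in between that failed to decode or were skipped.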

odd number of 0x00s after return or jumps possibly causing misinterpretation of next opcode