Track source-level program state when debug info is present

Question

jryans opened this issue 2 years ago

Context

KLEE tracks program state at the LLVM IR level. For some applications, it would be helpful to know how this maps back to some source-level state in whichever language was compiled to IR.

For example, the following C function...

int example(int n) {
  int y = 0;
  for (unsigned int i = 0; i < n; i++) {
    y += 4 + n;
  }
  return y;
}

...becomes something like the following IR using Clang 13 (-O1)...

define i32 @example(i32 %0) local_unnamed_addr #0 {
  %2 = icmp eq i32 %0, 0
  br i1 %2, label %9, label %3

3:                                                ; preds = %1
  %4 = add i32 %0, -1
  %5 = add i32 %0, 4
  %6 = mul i32 %4, %5
  %7 = add i32 %6, %0
  %8 = add i32 %7, 4
  br label %9

9:                                                ; preds = %3, %1
  %10 = phi i32 [ 0, %1 ], [ %8, %3 ]
  ret i32 %10
}

...which makes no mention of source-level variables like y, so KLEE is unable to follow them as it executes. It also means KLEE cannot report errors in terms of source-level variables.
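To make the gap between source and IR concrete, the loop can be compared against a hand transliteration of the optimised IR into C. This is only an illustrative sketch: the names example_o1, v0 and v4 to v8 (mirroring %0 and %4 to %8) and check_equivalence are made up for this comparison, unsigned arithmetic is assumed to stand in for the wrapping i32 operations, and the explicit cast in the loop condition just spells out the conversion the original comparison already performs.

```c
#include <assert.h>
#include <stdint.h>

/* Source-level function from the issue. The cast makes the implicit
   int -> unsigned conversion in `i < n` explicit; behaviour is unchanged. */
int example(int n) {
  int y = 0;
  for (unsigned int i = 0; i < (unsigned int)n; i++) {
    y += 4 + n;
  }
  return y;
}

/* Hand transliteration of the -O1 IR (hypothetical helper, not compiler output). */
int example_o1(int n) {
  uint32_t v0 = (uint32_t)n;
  if (v0 == 0) {
    return 0;                 /* icmp eq + phi taking the constant 0 */
  }
  uint32_t v4 = v0 - 1u;      /* %4 = add i32 %0, -1 */
  uint32_t v5 = v0 + 4u;      /* %5 = add i32 %0, 4  */
  uint32_t v6 = v4 * v5;      /* %6 = mul i32 %4, %5 */
  uint32_t v7 = v6 + v0;      /* %7 = add i32 %6, %0 */
  uint32_t v8 = v7 + 4u;      /* %8 = add i32 %7, 4  */
  return (int)v8;             /* phi taking %8 */
}

/* Compare both versions for every n in [0, limit]. Keep limit small so
   the signed accumulation in example() cannot overflow. */
void check_equivalence(int limit) {
  for (int n = 0; n <= limit; n++) {
    assert(example(n) == example_o1(n));
  }
}
```

The straight-line arithmetic is the closed form (n - 1) * (n + 4) + n + 4 = n * (n + 4), so the loop and the running value of y have disappeared entirely, and no single IR value stands for y during the computation.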

Desired outcome

Compilers like Clang can add debug info to the LLVM IR (enabled via the -g flag), which traditionally is emitted to a native binary and then read by debuggers like GDB, LLDB, etc. While current KLEE does use the file / line / column annotations in debug info when reporting stack traces, it could go further. As a future enhancement, it would be great for KLEE to use the variable debug info to map its IR-level program state up to source-level constructs when reporting to the user.
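As a rough illustration of what such a mapping might look like, here is a minimal sketch of a table that binds each source-level variable to the IR value currently holding it, loosely modelled on the way LLVM's llvm.dbg.value records say "from here on, this variable has this value". Everything in it is hypothetical: struct binding, struct var_map, var_map_bind and var_map_lookup are names invented for this sketch and are not KLEE's data structures, and a concrete int32_t stands in for what would really be a symbolic expression.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_VARS 8

/* Hypothetical binding: which IR value currently holds a source variable. */
struct binding {
  const char *source_name; /* e.g. "y" */
  const char *ir_value;    /* e.g. "%10" */
  int32_t value;           /* concrete stand-in for a symbolic expression */
};

/* Hypothetical per-stack-frame table of bindings. */
struct var_map {
  struct binding entries[MAX_VARS];
  size_t count;
};

/* Record that source_name now lives in ir_value, replacing any earlier
   binding for the same variable. Returns 0 on success, -1 if the table is full. */
int var_map_bind(struct var_map *m, const char *source_name,
                 const char *ir_value, int32_t value) {
  assert(m != NULL && source_name != NULL && ir_value != NULL);
  for (size_t i = 0; i < m->count; i++) {
    if (strcmp(m->entries[i].source_name, source_name) == 0) {
      m->entries[i].ir_value = ir_value;
      m->entries[i].value = value;
      return 0;
    }
  }
  if (m->count == MAX_VARS) {
    return -1;
  }
  m->entries[m->count].source_name = source_name;
  m->entries[m->count].ir_value = ir_value;
  m->entries[m->count].value = value;
  m->count++;
  return 0;
}

/* Look up the current binding for a source variable, or NULL if none. */
const struct binding *var_map_lookup(const struct var_map *m,
                                     const char *source_name) {
  for (size_t i = 0; i < m->count; i++) {
    if (strcmp(m->entries[i].source_name, source_name) == 0) {
      return &m->entries[i];
    }
  }
  return NULL;
}
```

With a table like this, an executor could rebind y each time it reaches a debug record for it and, when reporting an error, look up y to print its current value rather than an anonymous IR value. Whether the eventual KLEE support works anything like this is not stated in the issue.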

Workaround

While it's not the same as a real mapping of variables using debug info, you can get a modestly better view if your compiler names IR values based on source-level constructs. With Clang, you can add -fno-discard-value-names to achieve this, which gives something like the following...

define i32 @example(i32 %n) local_unnamed_addr #0 {
entry:
  %cmp7.not = icmp eq i32 %n, 0
  br i1 %cmp7.not, label %for.cond.cleanup, label %for.cond.cleanup.loopexit

for.cond.cleanup.loopexit:                        ; preds = %entry
  %0 = add i32 %n, -1
  %1 = add i32 %n, 4
  %2 = mul i32 %0, %1
  %3 = add i32 %2, %n
  %4 = add i32 %3, 4
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  %y.0.lcssa = phi i32 [ 0, %entry ], [ %4, %for.cond.cleanup.loopexit ]
  ret i32 %y.0.lcssa
}

...where some of the IR values (such as %n for the function argument) appear with their source-level names. To be clear, this only changes the names. An unoptimised version would also have a %y IR value for the source-level variable y, but that value was removed by the optimiser, so we no longer see that name here. Source-level variables move through numerous IR values and memory locations during computation, so this value-naming workaround is not enough to follow source-level program state.

J. Ryan Stinnett · Answer 1 · Thu Oct 06 2022 01:16:47 GMT+0800 (China Standard Time)

I am currently working on this source-level support in KLEE as part of my ongoing research. I hope to eventually contribute it back here once it's ready for general use.

MartinNowack · Answer 2 · Thu Oct 06 2022 17:57:52 GMT+0800 (China Standard Time)

@jryans That sounds super interesting.

Just to clarify, KLEE supports debug information as long as your bitcode is compiled with it: for example, clang-13 -O1 -g -c -emit-llvm emits debug information as part of the IR, so stack traces will contain the correct file/line(/column) information.

But I guess you are focusing more on the variable names? Do you plan to utilise the llvm.dbg.* intrinsics (https://llvm.org/docs/SourceLevelDebugging.html#format-common-intrinsics) in a more sophisticated way and map them to specific variables?

Sounds great and useful! 😄

J. Ryan Stinnett · Answer 3 · Thu Oct 06 2022 18:19:16 GMT+0800 (China Standard Time)

> Just to clarify, KLEE supports debug information as long as your bitcode is compiled with it: for example, clang-13 -O1 -g -c -emit-llvm emits debug information as part of the IR, so stack traces will contain the correct file/line(/column) information.

Ah of course, I forgot about this use of debug info when writing up the issue. 😅 I have edited my original post to acknowledge this existing support as part of stack trace reporting, so hopefully that will avoid any confusion. 🙂

> But I guess you are focusing more on the variable names? Do you plan to utilise the llvm.dbg.* intrinsics (https://llvm.org/docs/SourceLevelDebugging.html#format-common-intrinsics) in a more sophisticated way and map them to specific variables?

Yes, exactly. Glad to hear it sounds useful! 😄