HPCToolkit / hpctoolkit

HPCToolkit performance tools: measurement and analysis components

Using LD_AUDIT to wrap cuLaunchKernel does not always work on ufront

Jokeren opened this issue

We bind cuLaunchKernel to its wrapper using la_symbind64 so that the wrapper is called before the actual function. However, this approach works on gpu.cs but not on ufront: on ufront, the wrapper is called only about 1 time in ~20. In other words, most of the time only the wrappee is called.
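
For concreteness, here is a minimal sketch of this kind of audit library, assuming a hypothetical wrapper named cuLaunchKernel_wrapper with a placeholder signature (this is just the shape of the technique, not our actual code). The dynamic linker reports each symbol binding to la_symbind64, and returning a different address redirects the binding to the wrapper:

#define _GNU_SOURCE
#include <link.h>      /* rtld-audit interface: la_version, la_symbind64, LAV_CURRENT */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* placeholder signature; the real cuLaunchKernel takes CUfunction, grid/block
   dimensions, shared memory size, CUstream, and the parameter arrays */
typedef int (*cuLaunchKernel_fn)(void *f, unsigned gx, unsigned gy, unsigned gz,
                                 unsigned bx, unsigned by, unsigned bz,
                                 unsigned shmem, void *stream,
                                 void **params, void **extra);

static cuLaunchKernel_fn real_cuLaunchKernel;

/* hypothetical wrapper: do the measurement work, then forward to the real function */
static int cuLaunchKernel_wrapper(void *f, unsigned gx, unsigned gy, unsigned gz,
                                  unsigned bx, unsigned by, unsigned bz,
                                  unsigned shmem, void *stream,
                                  void **params, void **extra) {
  fprintf(stderr, "cuLaunchKernel intercepted\n");
  return real_cuLaunchKernel(f, gx, gy, gz, bx, by, bz, shmem, stream, params, extra);
}

/* handshake required by the auditing interface */
unsigned int la_version(unsigned int version) {
  (void)version;
  return LAV_CURRENT;
}

/* audit bindings from and to every loaded object */
unsigned int la_objopen(struct link_map *map, Lmid_t lmid, uintptr_t *cookie) {
  (void)map; (void)lmid; (void)cookie;
  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
}

/* called when a symbol binding is resolved; substituting an address redirects the caller */
uintptr_t la_symbind64(Elf64_Sym *sym, unsigned int ndx,
                       uintptr_t *refcook, uintptr_t *defcook,
                       unsigned int *flags, const char *symname) {
  (void)ndx; (void)refcook; (void)defcook; (void)flags;
  if (strcmp(symname, "cuLaunchKernel") == 0) {
    real_cuLaunchKernel = (cuLaunchKernel_fn)sym->st_value;   /* the real entry point */
    return (uintptr_t)cuLaunchKernel_wrapper;                 /* bind callers to the wrapper */
  }
  return sym->st_value;
}

Built as a shared object (gcc -shared -fPIC) and activated through the LD_AUDIT environment variable, this redirects every audited binding of cuLaunchKernel to the wrapper, which is exactly the mechanism that is failing on ufront.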

I've tinkered with this issue for a few hours. It looks like the way libcudart (shared or static) obtains a pointer to cuLaunchKernel on ufront.cs is outside of the usual linker path; from the assembly it looks like it's stashed in a static variable.

Which leaves the question of why gpu.cs works. @Jokeren if you have bits set up over there, can you run the app with LD_DEBUG=bindings and search for cuLaunchKernel? On ufront.cs I only get a match here:

   2772970:     binding file /usr/local/cuda-11.4/lib64/libcublasLt.so.11 [0] to /lib64/libdl.so.2 [0]: normal symbol `dlsym' [GLIBC_2.2.5]
   2772970:     binding file /lib64/libcuda.so.1 [0] to /lib64/libcuda.so.1 [0]: normal symbol `cuDriverGetVersion'
... more similar binding messages for other cu* functions ...
   2772970:     binding file /lib64/libcuda.so.1 [0] to /lib64/libcuda.so.1 [0]: normal symbol `cuLaunchKernel'

I've investigated these: they are (1) obtained via a dlsym in libcublasLt, and (2) properly notified and overridden. If there is another match on gpu.cs, it might be that the CUDA versions are slightly different.
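
For illustration, binding (1) amounts to the client library doing something like the following (a stand-in, not libcublasLt's actual code; the soname libcuda.so.1 is assumed). The point is that even a dlsym-obtained address is reported to the audit library, so it can still be overridden:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
  /* roughly what libcublasLt does: look the driver entry point up by hand */
  void *h = dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
  if (h == NULL) {
    fprintf(stderr, "dlopen: %s\n", dlerror());
    return 1;
  }

  /* with the audit library loaded, this resolution is reported to la_symbind64,
     so the address returned here is the (possibly overridden) wrapper address */
  void *p = dlsym(h, "cuLaunchKernel");
  printf("cuLaunchKernel resolved to %p\n", p);

  dlclose(h);
  return 0;
}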

Regardless, this is probably an indication that cuLaunchKernel is not always interceptable between CUDA libraries. I noticed that cudaLaunchKernel is the external function that calls cuLaunchKernel; is there a reason that won't work as a wrapper target?


cudaLaunchKernel does not have a CUfunction among its arguments, which is what we need in order to extract the function_id and module_id used in hpcrun.
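
For reference, the difference is visible in the two signatures. A driver-API wrapper receives the CUfunction handle as an explicit argument, while cudaLaunchKernel(const void *func, dim3 gridDim, dim3 blockDim, void **args, size_t sharedMem, cudaStream_t stream) only receives a host-side function pointer, so the handle hpcrun needs is simply not present at that level. A sketch (wrapper name and forwarding pointer are hypothetical):

#include <stdio.h>
#include <cuda.h>   /* driver API: CUresult, CUfunction, CUstream */

/* forwarding pointer to the real cuLaunchKernel, assumed to be filled in by
   the interposition machinery (e.g. the audit library) */
static CUresult (*real_cuLaunchKernel)(CUfunction, unsigned int, unsigned int, unsigned int,
                                       unsigned int, unsigned int, unsigned int,
                                       unsigned int, CUstream, void **, void **);

/* the CUfunction handle arrives right here, so it can be mapped to the
   function_id/module_id that hpcrun keeps for this kernel */
CUresult cuLaunchKernel_wrapper(CUfunction f,
                                unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ,
                                unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ,
                                unsigned int sharedMemBytes, CUstream hStream,
                                void **kernelParams, void **extra) {
  fprintf(stderr, "launching CUfunction %p\n", (void *)f);
  return real_cuLaunchKernel(f, gridDimX, gridDimY, gridDimZ,
                             blockDimX, blockDimY, blockDimZ,
                             sharedMemBytes, hStream, kernelParams, extra);
}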

@Jokeren @blue42u I think this is caused by some defensive programming on the NVIDIA side to prevent other people from overriding (or reverse engineering in general) their code.

Below is the disassembly of cuLaunchKernel in libcuda.so on ufront:

00000000001fab10 <cuLaunchKernel@@Base>:
  1fab10:       81 3d be e2 54 01 00    cmpl   $0x321cba00,0x154e2be(%rip)        # 1748dd8 <cudbgEnableLaunchBlocking@@Base+0xf20>
  1fab17:       ba 1c 32 
  1fab1a:       74 14                   je     1fab30 <cuLaunchKernel@@Base+0x20>
  1fab1c:       ff 25 d6 ec 43 01       jmpq   *0x143ecd6(%rip)        # 16397f8 <cudbgApiInit@@Base+0x120da78>
  1fab22:       0f 1f 40 00             nopl   0x0(%rax)
  1fab26:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  1fab2d:       00 00 00 
  1fab30:       b8 04 00 00 00          mov    $0x4,%eax
  1fab35:       c3                      retq   

You can see that the function itself does nothing but tail-call through a function pointer. If the NVIDIA libraries internally use only this function pointer, then wrapping the public cuLaunchKernel will do nothing.
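
Rewritten as pseudo-C, the stub amounts to roughly the following (purely illustrative; names invented, not NVIDIA's source). A wrapper interposed on the exported symbol only catches callers that go through that symbol; anything that calls through the internal pointer directly is never seen:

/* internal dispatch pointer: the *0x143ecd6(%rip) slot at 0x16397f8,
   filled in elsewhere in the driver */
typedef int (*launch_fn)(void *launch_args);
static launch_fn internal_launch;

/* guard word compared against the magic 0x321cba00 at the top of the stub */
static int launch_guard;

/* stand-in for the exported cuLaunchKernel@@Base stub */
int exported_launch(void *launch_args) {
  if (launch_guard == 0x321cba00)
    return 4;                            /* the early-out path that just returns 4 */
  return internal_launch(launch_args);   /* tail call through the internal pointer */
}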

However, on gpu.cs.rice.edu, cuLaunchKernel itself actually does the work:

00000000002412e0 <cuLaunchKernel@@Base>:
  2412e0:       41 55                   push   %r13
  2412e2:       41 54                   push   %r12
  2412e4:       55                      push   %rbp
  2412e5:       53                      push   %rbx
  2412e6:       48 81 ec e8 00 00 00    sub    $0xe8,%rsp
  2412ed:       81 3d e1 82 46 01 00    cmpl   $0x321cba00,0x14682e1(%rip)        # 16a95d8 <cudbgEnableLaunchBlocking@@Base+0xea0>
  2412f4:       ba 1c 32 
  2412f7:       c7 44 24 10 e7 03 00    movl   $0x3e7,0x10(%rsp)
  2412fe:       00 
  2412ff:       48 c7 44 24 20 00 00    movq   $0x0,0x20(%rsp)
  241306:       00 00 
  241308:       48 c7 44 24 18 00 00    movq   $0x0,0x18(%rsp)
  24130f:       00 00 
  241311:       0f 84 29 02 00 00       je     241540 <cuLaunchKernel@@Base+0x260>
  241317:       8b 05 4f 6b 46 01       mov    0x1466b4f(%rip),%eax        # 16a7e6c <cudbgReportDriverApiErrorFlags@@Base+0x1dbc>
  24131d:       41 89 cd                mov    %ecx,%r13d
  241320:       41 89 d4                mov    %edx,%r12d
  241323:       89 f3                   mov    %esi,%ebx
  241325:       48 89 fd                mov    %rdi,%rbp
  241328:       85 c0                   test   %eax,%eax
  24132a:       75 54                   jne    241380 <cuLaunchKernel@@Base+0xa0>
... (many more instructions follow)

So, wrapping it on gpu.cs will work.

@mxz297 Thanks for reminding me that the two systems actually have different CUDA drivers. It makes sense to me now.

It seems like the only way to wrap such a function-pointer invocation is to use instrumentation?

@Jokeren It requires a little more binary analysis before we can do normal instrumentation. For the libcuda.so on ufront, we need to know the tail-call target. This is not too hard to do manually, as I found there is a relocation entry for it.

After getting the tail-call target, we can do the normal function entry & exit instrumentation so that it behaves like function wrapping.
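
As a rough sketch of that first step (recovering the tail-call target at run time): the slot offset observed in the disassembly above (the *0x143ecd6(%rip) operand resolves to 0x16397f8 inside libcuda.so) can be rebased against the library's load address and dereferenced; since a relocation entry fills the slot at load time, the value read is the real dispatch target. The offset is valid only for that particular driver build, the soname libcuda.so.1 is assumed, and this is an illustration rather than HPCToolkit code:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <stdint.h>
#include <stdio.h>

#define CU_LAUNCH_DISPATCH_SLOT 0x16397f8UL   /* build-specific, taken from the disassembly */

int main(void) {
  void *h = dlopen("libcuda.so.1", RTLD_NOW);
  if (h == NULL) {
    fprintf(stderr, "dlopen: %s\n", dlerror());
    return 1;
  }

  /* find the load base of libcuda.so so the link-time address can be rebased */
  struct link_map *map = NULL;
  if (dlinfo(h, RTLD_DI_LINKMAP, &map) != 0 || map == NULL) {
    fprintf(stderr, "dlinfo: %s\n", dlerror());
    return 1;
  }

  uintptr_t base = (uintptr_t)map->l_addr;
  void *target = *(void **)(base + CU_LAUNCH_DISPATCH_SLOT);
  printf("cuLaunchKernel stub dispatches to %p (libcuda.so loaded at %p)\n",
         target, (void *)base);

  /* entry/exit instrumentation would then be placed at 'target' so that it
     behaves like a wrapper around the real implementation */
  return 0;
}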

Got it. Yet another tricky thing is that libcuda.so is a system driver library; I don't think we want to instrument it directly since it's shared among all users. One possibility would be to instrument libcuda.so and use it inside a container.

For now, I'll use gpu.cs. Let's have a quick discussion in the group meeting or sometime later.


We can definitely instrument libcuda.so. We can either use dynamic instrumentation (which only impacts the libcuda.so in a particular run), or use binary rewriting to generate a new libcuda.so, put it in your own path, and preload the rewritten libcuda.so.


Sure.