Using LD_AUDIT to wrap cuLaunchKernel does not always work on ufront
Jokeren opened this issue
We bind cuLaunchKernel to its wrapper using la_symbind64 so that the wrapper is called before the actual function. However, this approach works on gpu.cs but not on ufront: there, the wrapper is called only about 1 out of 20 times. In other words, most of the time only the wrappee is called.
I've tinkered with this issue for a few hours. It looks like the way libcudart (shared or static) obtains a pointer to cuLaunchKernel on ufront.cs is outside the usual linker path: from the assembly, it appears to be stashed in a static variable.
Which leaves the question of why gpu.cs works. @Jokeren, if you have things set up over there, can you run the app with LD_DEBUG=bindings and search for cuLaunchKernel? On ufront.cs I only get a match here:
2772970: binding file /usr/local/cuda-11.4/lib64/libcublasLt.so.11 [0] to /lib64/libdl.so.2 [0]: normal symbol `dlsym' [GLIBC_2.2.5]
2772970: binding file /lib64/libcuda.so.1 [0] to /lib64/libcuda.so.1 [0]: normal symbol `cuDriverGetVersion'
... more similar binding messages for other cu* functions ...
2772970: binding file /lib64/libcuda.so.1 [0] to /lib64/libcuda.so.1 [0]: normal symbol `cuLaunchKernel'
I've investigated these; they are (1) obtained via a dlsym in libcublasLt and (2) properly notified and overridden. If there is another match on gpu.cs, it might be that the CUDA versions are slightly different.
Regardless, this is probably an indication that cuLaunchKernel is not always interceptable between CUDA libraries. I noticed cudaLaunchKernel is the external function that calls cuLaunchKernel; is there a reason that won't work as a wrapper target?
cudaLaunchKernel does not take a CUfunction argument, which is what we need to extract function_id and module_id from for use in hpcrun.
@Jokeren @blue42u I think this is caused by some defensive programming on the NVIDIA side to prevent other people from overriding (or reverse engineering in general) their code.
Below is the disassembly of cuLaunchKernel in libcuda.so on ufront:
00000000001fab10 <cuLaunchKernel@@Base>:
1fab10: 81 3d be e2 54 01 00 cmpl $0x321cba00,0x154e2be(%rip) # 1748dd8 <cudbgEnableLaunchBlocking@@Base+0xf20>
1fab17: ba 1c 32
1fab1a: 74 14 je 1fab30 <cuLaunchKernel@@Base+0x20>
1fab1c: ff 25 d6 ec 43 01 jmpq *0x143ecd6(%rip) # 16397f8 <cudbgApiInit@@Base+0x120da78>
1fab22: 0f 1f 40 00 nopl 0x0(%rax)
1fab26: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
1fab2d: 00 00 00
1fab30: b8 04 00 00 00 mov $0x4,%eax
1fab35: c3 retq
You can see that the function itself does nothing but tail-call a function pointer. If the NVIDIA libraries internally use only this function pointer, then wrapping the public cuLaunchKernel will do nothing.
However, on gpu.cs.rice.edu, cuLaunchKernel itself actually does the work:
00000000002412e0 <cuLaunchKernel@@Base>:
2412e0: 41 55 push %r13
2412e2: 41 54 push %r12
2412e4: 55 push %rbp
2412e5: 53 push %rbx
2412e6: 48 81 ec e8 00 00 00 sub $0xe8,%rsp
2412ed: 81 3d e1 82 46 01 00 cmpl $0x321cba00,0x14682e1(%rip) # 16a95d8 <cudbgEnableLaunchBlocking@@Base+0xea0>
2412f4: ba 1c 32
2412f7: c7 44 24 10 e7 03 00 movl $0x3e7,0x10(%rsp)
2412fe: 00
2412ff: 48 c7 44 24 20 00 00 movq $0x0,0x20(%rsp)
241306: 00 00
241308: 48 c7 44 24 18 00 00 movq $0x0,0x18(%rsp)
24130f: 00 00
241311: 0f 84 29 02 00 00 je 241540 <cuLaunchKernel@@Base+0x260>
241317: 8b 05 4f 6b 46 01 mov 0x1466b4f(%rip),%eax # 16a7e6c <cudbgReportDriverApiErrorFlags@@Base+0x1dbc>
24131d: 41 89 cd mov %ecx,%r13d
241320: 41 89 d4 mov %edx,%r12d
241323: 89 f3 mov %esi,%ebx
241325: 48 89 fd mov %rdi,%rbp
241328: 85 c0 test %eax,%eax
24132a: 75 54 jne 241380 <cuLaunchKernel@@Base+0xa0>
... (many more instructions follow) ...
So wrapping it on gpu.cs will work.
@mxz297 Thanks for reminding me that the two systems actually have different CUDA drivers. It makes sense to me now.
Seems like the only way to wrap such a function-pointer invocation is instrumentation?
@Jokeren It requires a bit more binary analysis before we can do normal instrumentation. For the libcuda.so on ufront, we need to know the tail-call target. This is not too hard to do manually, as I found there is a relocation entry for it.
After getting the tail-call target, we can do the normal function entry & exit instrumentation to behave like function wrapping.
Got you. Yet another tricky thing is that libcuda.so is a system driver; I don't think we want to instrument it directly, as it's shared among all users. One potential approach is to instrument and use libcuda.so in a container.
For now, I'll use gpu.cs. Let's have a quick discussion in the group meeting or sometime later.
We definitely can instrument libcuda.so. We can either use dynamic instrumentation (which only affects the libcuda.so in a particular run) or use binary rewriting to generate a new libcuda.so, put it in your own path, and preload the rewritten library.
Sure.