ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

Full symbols in a libfabric stack trace?

mwheinz opened this issue

Hey, guys,

I’m trying to track down what shows up as a PSM3 error report but which I suspect is an NCCL bug. To do that, I’m trying to get a symbolic stack trace of the executable when I call abort() inside PSM3, but simply adding --enable-debug to the libfabric configure doesn’t seem to work.

Any ideas? The current configure I'm using is:

./autogen.sh && ./configure --prefix=${HOME} --enable-debug --with-cuda=/usr/local/cuda-11.6 --enable-cuda-dlopen --enable-only --enable-psm3

Sometimes I find I need to explicitly set CFLAGS="-g -O0" to fully enable the gdb-able build.
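For reference, a minimal sketch of passing those flags on the configure line, reusing the options from the original post. The readelf check and the library location under ${HOME}/lib are assumptions derived from the --prefix above, not something stated in this thread:

```shell
# Pass debug-friendly CFLAGS explicitly; autoconf accepts VAR=value arguments.
./autogen.sh && ./configure --prefix=${HOME} --enable-debug \
    --with-cuda=/usr/local/cuda-11.6 --enable-cuda-dlopen \
    --enable-only --enable-psm3 CFLAGS="-g -O0"
make && make install

# Check that the installed library carries DWARF debug sections.
readelf -S "${HOME}/lib/libfabric.so.1" | grep debug_info
```

If the grep prints nothing, the library was built or installed without debug info and the flags did not take effect.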

I'll give that a try. Thanks.

That's weird. What is the output of grep ^CFLAGS Makefile?

CFLAGS = -g -O0 -Wall -Wundef -Wpointer-arith -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fstack-protector-strong -fvisibility=hidden -Wall -Wundef -Wpointer-arith

However, the backtrace still looks like this:

[octo2:3400787] Signal code:  (-6)
[octo2:3400787] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7fbd602d3b20]
[octo2:3400787] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fbd5f7a937f]
[octo2:3400787] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fbd5f793db5]
[octo2:3400787] [ 3] /home/mwheinz/lib/libfabric.so.1(+0x9e33a)[0x7fbcec83c33a]
[octo2:3400787] [ 4] /home/mwheinz/lib/libfabric.so.1(+0x9eab7)[0x7fbcec83cab7]
[octo2:3400787] [ 5] /home/mwheinz/lib/libfabric.so.1(+0x9fa59)[0x7fbcec83da59]
[octo2:3400787] [ 6] /home/mwheinz/aws-ofi-nccl/lib/libnccl-net.so(+0x48d3)[0x7fbcd01448d3]
[octo2:3400787] [ 7] /home/mwheinz/horovod/0.22.1-ompi-4.1.3-cuda-ofi-nccl/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x177eb4)[0x7fbcfe7b7eb4]

This looks like a glibc backtrace. You can resolve the addresses with addr2line or eu-addr2line; I wrote a script for our project to automate that. Alternatively, use libbacktrace, which resolves function names and line numbers automatically (unless the debug symbols are stripped).
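The script mentioned above is not shown in this thread. As a rough sketch of the same idea, the following shell functions pull the library path and the module-relative offset out of each `path(+0xOFFSET)[0xADDRESS]` frame and hand them to binutils addr2line. The function names parse_frame and resolve_backtrace are hypothetical, and the sketch assumes the libraries are still at the paths printed in the trace and were built with -g:

```shell
#!/bin/sh
# Print "<path> <offset>" for a frame like
#   ... /path/to/lib.so(+0x9e33a)[0x7fbcec83c33a]
# and print nothing for frames that glibc already symbolized,
# such as /lib64/libc.so.6(abort+0x127)[...].
parse_frame() {
    printf '%s\n' "$1" | sed -nE \
        's/.*[[:space:]]([^[:space:](]+)\(\+(0x[0-9a-fA-F]+)\)\[0x[0-9a-fA-F]+\].*/\1 \2/p'
}

# Read a backtrace on stdin; echo every line, and under each
# unsymbolized frame print "function at file:line" from addr2line.
resolve_backtrace() {
    while IFS= read -r line; do
        printf '%s\n' "$line"
        frame=$(parse_frame "$line")
        if [ -n "$frame" ]; then
            path=${frame% *}      # text before the last space
            offset=${frame##* }   # text after the last space
            printf '    '
            # -f: function name, -C: demangle, -p: one line per address
            addr2line -f -C -p -e "$path" "$offset"
        fi
    done
}
```

After sourcing the file, a saved trace could be processed with `resolve_backtrace < trace.txt`, where trace.txt is a hypothetical file holding the backtrace lines. The offset in parentheses is relative to the start of the shared object, which is the form addr2line expects for a shared library.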

@aakefbs - I still can't figure out why abort() didn't produce the function names, but addr2line worked perfectly. Thanks.
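One possible explanation, offered as an assumption and not confirmed anywhere in this thread: glibc's backtrace_symbols() takes names from the dynamic symbol table, not from the debug info. Static functions and functions hidden by -fvisibility=hidden (which appears in the CFLAGS above) would therefore show up only as offsets, whether or not -g is set. A quick way to check, assuming GNU nm and the library path from the backtrace:

```shell
lib=/home/mwheinz/lib/libfabric.so.1

# Symbols visible to the dynamic linker (what backtrace_symbols() can name).
nm -D --defined-only "$lib" | wc -l

# Symbols in the regular symbol table (also available to addr2line and gdb).
nm --defined-only "$lib" | wc -l
```

If the second count is much larger than the first, most internal functions are not exported, which would be consistent with addr2line succeeding where the in-process backtrace did not.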