ib_write_bw --use_cuda leads to system deadlock
antonywei opened this issue · comments
Client (mlx5 NIC):
./ib_write_bw -d mlx5_0 -i 1 --use_cuda=0 server_ip_address -a
Server (mlx5 NIC):
./ib_write_bw -d mlx5_0 -i 1 --use_cuda=0
When I press Ctrl+C to kill the process, the whole system crashes and reports a system deadlock.
This does not happen if we don't use the --use_cuda parameter.
Can you copy the crash dump here?
It seems the system crashed before the core dump files could be written. Maybe the reason is that ib_write_bw does not release its GPU resources when there is a problem (for example, an RNR error), and CUDA and the kernel do not release these resources either, which leads to the system crash.
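To make the suspicion above concrete, here is a minimal sketch of the cleanup pattern I would expect on Ctrl+C. It is not taken from the ib_write_bw source. The `resources` struct, `release_resources`, `install_sigint_handler` and `run_until_interrupted` are hypothetical names, the two flags only stand in for a memory registration and a GPU buffer, and the release order (registration first, then the GPU buffer) is an assumption rather than something confirmed in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Set by the signal handler; the handler does nothing else. */
static volatile sig_atomic_t stop_requested = 0;

static void on_sigint(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Hypothetical stand-ins for the real resources. */
struct resources {
    bool mr_registered;   /* stands in for a memory region registration */
    bool gpu_buf_alloc;   /* stands in for a CUDA device buffer */
    int  release_step;    /* counter used to record the release order */
    int  mr_released_at;  /* step at which the registration was dropped */
    int  gpu_released_at; /* step at which the GPU buffer was freed */
};

/* Assumed order: drop the registration first, then free the GPU buffer. */
static void release_resources(struct resources *r)
{
    assert(r != NULL);
    if (r->mr_registered) {
        /* real code would deregister here, e.g. ibv_dereg_mr() */
        r->mr_registered = false;
        r->mr_released_at = ++r->release_step;
    }
    if (r->gpu_buf_alloc) {
        /* real code would free the device buffer here, e.g. cuMemFree() */
        r->gpu_buf_alloc = false;
        r->gpu_released_at = ++r->release_step;
    }
}

/* Install the SIGINT handler; returns 0 on success, -1 on failure. */
static int install_sigint_handler(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGINT, &sa, NULL);
}

/* Loop until finished or interrupted, then always release resources.
 * Returns the number of iterations completed. */
static int run_until_interrupted(struct resources *r, int max_iters)
{
    int done = 0;
    while (done < max_iters && !stop_requested) {
        done++; /* placeholder for posting and polling work */
    }
    release_resources(r);
    return done;
}
```

The point of the sketch is only that the handler sets a flag and the main loop performs the release on its way out, so an interrupted run still goes through the same teardown path as a normal one. Whether the crash here is actually caused by a skipped teardown is still a guess.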
I tried to reproduce it with loopback, and it didn't reproduce.
I pressed Ctrl+C both while traffic was passing and while the GPU buffer was being allocated.
Can you tell me at exactly what point you killed the process?
Closing the issue. Please re-open if it reproduces.