Unable to find libcudart.so (1.7.1)
kwohlfahrt opened this issue
When running `nccl-tests`, I see the following error:
nccl-tests-worker-0:39:45 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.1-aws
nccl-tests-worker-0:39:45 [0] nccl_net_ofi_cuda_init:39 NCCL WARN NET/OFI Failed to find CUDA Runtime library: libcudart.so: cannot open shared object file: No such file or directory
The base image `nvidia/cuda:12-runtime-ubuntu22.04` only includes the file `/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so.12`, but not `libcudart.so`. Adding the symlink as below works around the issue, but it would be good if this worked out of the box, by looking for `libcudart.so.*`, if that is possible?
RUN ln -s /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so.12 /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so
Thank you for the bug report.
Ugh; that's unexpectedly ugly of Nvidia's packaging. We can likely do better than we are doing right now. We've always used the cudart interface; on my todo list has been to figure out how to migrate to the same cuda interface that NCCL itself uses (so that we have a better likelihood of both opening libraries from the same install of CUDA). I notice that they don't check for versioned libraries, perhaps because the libcuda.so symlink is always installed.
Anyway, I'll leave this bug report open until we either move to following what NCCL does with API usage or we add code to look for versioned libraries if the libcudart.so symlink is not found. This won't make it into the 1.7.2 release, but we will try to get it done in the near future.
I see a similar issue with `libcuda.so`, but `nccl-tests` appears to run without issue, even with this error:
libfabric:125:1693170751::core:core:cuda_hmem_dl_init():394<warn> Failed to dlopen libcuda.so
libfabric:125:1693170751::core:core:ofi_hmem_init():418<warn> Failed to initialize hmem iface FI_HMEM_CUDA: No data available
Interestingly, when I try the same symlink workaround (linking from `/usr/local/cuda-12.0/compat/libcuda.so`), `nccl-tests` fails with:
libfabric:124:1693170462::core:core:cuda_hmem_verify_devices():578<warn> Failed to perform cudaGetDeviceCount: cudaErrorSystemDriverMismatch:system has unsupported display driver / cuda driver combination
libfabric:124:1693170462::core:core:ofi_hmem_init():418<warn> Failed to initialize hmem iface FI_HMEM_CUDA: Input/output error
nccl-tests-worker-1: Test CUDA failure common.cu:892 'system has unsupported display driver / cuda driver combination'
.. nccl-tests-worker-1 pid 124: Test failure common.cu:842
`/usr/local/cuda-12.0/compat/libcuda.so` is probably not the one you want to link to. Usually there's a host-side `libcuda.so` provided with the NVIDIA driver stack. On my Ubuntu 20 AMI, for instance, there is something like this:
lrwxrwxrwx 1 root root 12 Apr 6 08:53 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 20 Apr 6 08:53 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.525.85.12
-rwxr-xr-x 1 root root 29863848 Apr 6 08:53 /usr/lib/x86_64-linux-gnu/libcuda.so.525.85.12
lrwxrwxrwx 1 root root 28 Apr 6 08:53 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1 -> libcudadebugger.so.525.85.12
-rwxr-xr-x 1 root root 10490248 Apr 6 08:53 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.85.12
I'm a little surprised that NVIDIA ships a container with the `libcuda.so` symlink missing, mostly because NCCL calls `dlopen("libcuda.so")` without any versioning logic, so if that call fails, NCCL isn't going to work.
Our Libfabric team rightly pointed out that today Libfabric requires both `libcuda.so` and `libcudart.so`, so even if we change the plugin to use `libcuda.so` interfaces instead of `libcudart.so` interfaces, EFA will still be broken, just in a different way. So adding the `libcudart.so` symlink will still be necessary even if we change the plugin (on EFA, anyway).
Ah, I must have checked for the libraries on a container running without a GPU, so the library wasn't injected.
I can see the following files present when creating a container with a GPU, but `libcuda.so` is still absent; only `libcuda.so.1` is present:
$ find / -name "libcuda.so*"
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.535.54.03
/usr/local/cuda-12.0/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-12.0/compat/libcuda.so.1
/usr/local/cuda-12.0/compat/libcuda.so
/usr/local/cuda-12.0/compat/libcuda.so.525.105.17
But this is the correct file to symlink, as follows:
RUN ln -s /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so
In fact, this has resolved a nice little performance mystery. Before applying this symlink, I was seeing a maximum of 20 GB/s on P4 instances with `nccl-tests`; after, I see the expected performance of >40 GB/s:
#                                                           out-of-place                        in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                                (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        4096          1024     float     sum      -1    600.9    0.01    0.01       0    596.7    0.01    0.01       0
       16384          4096     float     sum      -1    601.8    0.03    0.05       0    603.7    0.03    0.05       0
       65536         16384     float     sum      -1   1945.3    0.03    0.06       0   1023.7    0.06    0.12       0
      262144         65536     float     sum      -1   4487.0    0.06    0.11       0   4720.4    0.06    0.10       0
     1048576        262144     float     sum      -1   3588.0    0.29    0.55       0   4071.1    0.26    0.48       0
     4194304       1048576     float     sum      -1   7742.0    0.54    1.02       0   7140.1    0.59    1.10       0
    16777216       4194304     float     sum      -1   5805.4    2.89    5.42       0   4841.5    3.47    6.50       0
    67108864      16777216     float     sum      -1   6199.9   10.82   20.30       0   7473.1    8.98   16.84       0
   268435456      67108864     float     sum      -1    13050   20.57   38.57       0    13254   20.25   37.97       0
  1073741824     268435456     float     sum      -1    47648   22.54   42.25       0    46427   23.13   43.36       0
I'll also add the following context, in case it is useful to anyone else. Before the patch, I was seeing channel info like `NCCL INFO Channel 01/0 : 6[a01c0] -> 14[a01c0] [receive] via NET/AWS Libfabric/2`, and four channels:
NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
NCCL INFO Channel 01/04 : 0 3 1 4 10 15 14 13 8 11 9 12 2 7 6 5
NCCL INFO Channel 02/04 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
NCCL INFO Channel 03/04 : 0 3 1 4 10 15 14 13 8 11 9 12 2 7 6 5
Now, I see channel info like `NCCL INFO Channel 05/0 : 3[201d0] -> 10[201c0] [send] via NET/AWS Libfabric/1/GDRDMA` (note the `/GDRDMA` at the end), and eight channels:
NCCL INFO Channel 00/08 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
NCCL INFO Channel 01/08 : 0 3 10 15 14 13 12 9 8 11 2 7 6 5 4 1
NCCL INFO Channel 02/08 : 0 7 6 5 12 11 10 9 8 15 14 13 4 3 2 1
NCCL INFO Channel 03/08 : 0 5 4 7 14 11 10 9 8 13 12 15 6 3 2 1
NCCL INFO Channel 04/08 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
NCCL INFO Channel 05/08 : 0 3 10 15 14 13 12 9 8 11 2 7 6 5 4 1
NCCL INFO Channel 06/08 : 0 7 6 5 12 11 10 9 8 15 14 13 4 3 2 1
NCCL INFO Channel 07/08 : 0 5 4 7 14 11 10 9 8 13 12 15 6 3 2 1
Interestingly, the missing `libcuda.so` hasn't caused any trouble for our real workloads; PyTorch appears to use GPUs without problems.
As of plugin 1.8.0, we use the driver API (`libcuda.so`) instead of the runtime API (`libcudart.so`), and load it in a way that matches NCCL's behavior. This should resolve the issues with CUDA runtime versioning.
Please create a new issue if you experience problems with the latest plugin release.