aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.

Unable to find libcudart.so (1.7.1)

kwohlfahrt opened this issue

When running nccl-tests, I see the following error:

nccl-tests-worker-0:39:45 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.1-aws
nccl-tests-worker-0:39:45 [0] nccl_net_ofi_cuda_init:39 NCCL WARN NET/OFI Failed to find CUDA Runtime library: libcudart.so: cannot open shared object file: No such file or directory

The base image nvidia/cuda:12-runtime-ubuntu22.04 only includes /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so.12, not the unversioned libcudart.so. Adding the symlink below works around the issue, but it would be good if this worked out of the box, e.g. by falling back to libcudart.so.* when the unversioned name is not found, if that is possible.

RUN ln -s /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so.12 /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so
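
If you would rather not hard-code the version, a slightly more generic variant of the same workaround is sketched below: it links the unversioned name to whichever versioned runtime library the image ships. The CUDA 12.0 path is carried over from the image above, so adjust it for other tags.

# sketch only: point libcudart.so at the first versioned copy found in the image
RUN set -eux; \
    libdir=/usr/local/cuda-12.0/targets/x86_64-linux/lib; \
    ln -sf "$(ls "$libdir"/libcudart.so.* | head -n 1)" "$libdir/libcudart.so"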

Thank you for the bug report.

Ugh; that's unexpectedly ugly packaging on Nvidia's part. We can likely do better than we are doing right now. We've always used the cudart interface; it has been on my todo list to figure out how to migrate to the same CUDA interface that NCCL itself uses (so that the plugin and NCCL are more likely to open libraries from the same CUDA install). I notice that NCCL doesn't check for versioned libraries, perhaps because the libcuda.so symlink is always installed.

Anyway, I'll leave this bug report open until we either move to following what NCCL does with API usage or we add code to look for versioned libraries if the libcudart.so symlink is not found. This won't make it into the 1.7.2 release, but we will try to get it done in the near future.

I see a similar problem with libcuda.so, though nccl-tests appears to run without issue despite the warnings:

libfabric:125:1693170751::core:core:cuda_hmem_dl_init():394<warn> Failed to dlopen libcuda.so
libfabric:125:1693170751::core:core:ofi_hmem_init():418<warn> Failed to initialize hmem iface FI_HMEM_CUDA: No data available

Interestingly, when I try the same symlink workaround (linking from /usr/local/cuda-12.0/compat/libcuda.so), nccl-tests fails with:

libfabric:124:1693170462::core:core:cuda_hmem_verify_devices():578<warn> Failed to perform cudaGetDeviceCount: cudaErrorSystemDriverMismatch:system has unsupported display driver / cuda driver combination
libfabric:124:1693170462::core:core:ofi_hmem_init():418<warn> Failed to initialize hmem iface FI_HMEM_CUDA: Input/output error
nccl-tests-worker-1: Test CUDA failure common.cu:892 'system has unsupported display driver / cuda driver combination'
 .. nccl-tests-worker-1 pid 124: Test failure common.cu:842
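
For anyone hitting the same thing, a quick way to see which CUDA library names the dynamic loader can actually resolve inside the container (my own diagnostic, not something libfabric or the plugin runs) is:

# lists cached entries such as libcuda.so.1 and libcudart.so.12; the unversioned
# names only show up once the symlinks exist and ldconfig has been re-run
ldconfig -p | grep -E 'libcuda(rt)?\.so'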

The /usr/local/cuda-12.0/compat/libcuda.so is probably not the one you want to link to. Usually there's a host-side libcuda.so provided with the nvidia driver stack. On my Ubuntu 20 AMI, for instance, there is something like this:

lrwxrwxrwx 1 root root       12 Apr  6 08:53 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Apr  6 08:53 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.525.85.12
-rwxr-xr-x 1 root root 29863848 Apr  6 08:53 /usr/lib/x86_64-linux-gnu/libcuda.so.525.85.12
lrwxrwxrwx 1 root root       28 Apr  6 08:53 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1 -> libcudadebugger.so.525.85.12
-rwxr-xr-x 1 root root 10490248 Apr  6 08:53 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.85.12

I'm a little surprised that Nvidia has a container with the libcuda.so symlink missing, mostly because NCCL calls dlopen(libcuda.so) without any versioning code, so if that doesn't work, NCCL isn't going to work.

Our Libfabric team rightly pointed out that today Libfabric requires both libcuda.so and libcudart.so, so even if we change the plugin to use libcuda.so interfaces instead of libcudart.so interfaces, EFA will still be broken, just in a different way. So adding the libcudart.so symlink is still going to be necessary even if we change the plugin (on EFA, anyway).

Ah, I must have checked for the libraries on a container running without a GPU, so the library wasn't injected.

I can see the following files when creating a container with a GPU, but in the driver location (/usr/lib/x86_64-linux-gnu) libcuda.so is still absent; only libcuda.so.1 is present:

$ find / -name "libcuda.so*"
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.535.54.03
/usr/local/cuda-12.0/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-12.0/compat/libcuda.so.1
/usr/local/cuda-12.0/compat/libcuda.so
/usr/local/cuda-12.0/compat/libcuda.so.525.105.17

But /usr/lib/x86_64-linux-gnu/libcuda.so.1 is the correct file to symlink to, as follows:

RUN ln -s /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so

In fact, this has resolved a nice little performance mystery. Before applying this symlink, I was seeing a maximum of 20GB/s on P4 instances with nccl-tests, and after, I see the expected performance of >40GB/s.

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        4096          1024     float     sum      -1    600.9    0.01    0.01      0    596.7    0.01    0.01      0
       16384          4096     float     sum      -1    601.8    0.03    0.05      0    603.7    0.03    0.05      0
       65536         16384     float     sum      -1   1945.3    0.03    0.06      0   1023.7    0.06    0.12      0
      262144         65536     float     sum      -1   4487.0    0.06    0.11      0   4720.4    0.06    0.10      0
     1048576        262144     float     sum      -1   3588.0    0.29    0.55      0   4071.1    0.26    0.48      0
     4194304       1048576     float     sum      -1   7742.0    0.54    1.02      0   7140.1    0.59    1.10      0
    16777216       4194304     float     sum      -1   5805.4    2.89    5.42      0   4841.5    3.47    6.50      0
    67108864      16777216     float     sum      -1   6199.9   10.82   20.30      0   7473.1    8.98   16.84      0
   268435456      67108864     float     sum      -1    13050   20.57   38.57      0    13254   20.25   37.97      0
  1073741824     268435456     float     sum      -1    47648   22.54   42.25      0    46427   23.13   43.36      0

I'll also add the following context, in case it is useful to anyone else. Before the symlink fix, I was seeing channel info like NCCL INFO Channel 01/0 : 6[a01c0] -> 14[a01c0] [receive] via NET/AWS Libfabric/2, and four channels:

NCCL INFO Channel 00/04 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
NCCL INFO Channel 01/04 :    0   3   1   4  10  15  14  13   8  11   9  12   2   7   6   5
NCCL INFO Channel 02/04 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
NCCL INFO Channel 03/04 :    0   3   1   4  10  15  14  13   8  11   9  12   2   7   6   5

Now, I see channel info like NCCL INFO Channel 05/0 : 3[201d0] -> 10[201c0] [send] via NET/AWS Libfabric/1/GDRDMA (note the /GDRDMA at the end), and eight channels:

NCCL INFO Channel 00/08 :    0   7   6   5   4   3   2   1   8  15  14  13  12  11  10   9
NCCL INFO Channel 01/08 :    0   3  10  15  14  13  12   9   8  11   2   7   6   5   4   1
NCCL INFO Channel 02/08 :    0   7   6   5  12  11  10   9   8  15  14  13   4   3   2   1
NCCL INFO Channel 03/08 :    0   5   4   7  14  11  10   9   8  13  12  15   6   3   2   1
NCCL INFO Channel 04/08 :    0   7   6   5   4   3   2   1   8  15  14  13  12  11  10   9
NCCL INFO Channel 05/08 :    0   3  10  15  14  13  12   9   8  11   2   7   6   5   4   1
NCCL INFO Channel 06/08 :    0   7   6   5  12  11  10   9   8  15  14  13   4   3   2   1
NCCL INFO Channel 07/08 :    0   5   4   7  14  11  10   9   8  13  12  15   6   3   2   1
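
For anyone wanting to confirm the same thing on their own cluster, the sketch below greps a nccl-tests run for the GDRDMA marker. It assumes an Open MPI launch of all_reduce_perf from the default nccl-tests build directory across 2 nodes with 8 GPUs each (matching the 16-rank run above); 'hosts' is a placeholder hostfile, so adapt the command to your launcher.

# NCCL_DEBUG=INFO makes NCCL log the transport chosen for each channel
mpirun -np 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 2>&1 | grep GDRDMA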

Interestingly, the missing libcuda.so hasn't caused any trouble for our real workloads; PyTorch appears to use the GPUs without problems.

As of plugin 1.8.0, we use the driver API (libcuda.so) instead of the runtime API (libcudart.so), and load it in a way that matches NCCL's behavior. This should resolve the issues with CUDA runtime versioning.

Please create a new issue if you experience problems with the latest plugin release.