NVIDIA / cuDecomp

An Adaptive Pencil Decomposition Library for NVIDIA GPUs

Home Page: https://nvidia.github.io/cuDecomp/

Building on Snellius HPC

chowland opened this issue

Hi, thanks for sharing this library. I'm trying to build it on the Dutch national HPC Snellius, but I have run into trouble with the compilation. The lib stage of the Makefile appears to complete without problems, but once it moves on to the tests stage, it prints out many errors about not being able to find the libraries.

From the example config files, I believe I am pointing all the necessary variables to the right places, but the tests seem unable to find any of the NVIDIA libraries such as nccl.

My config file is as below, and here is my log file from the make command. Please let me know if you can see an obvious fix or have any suggestions.

# Having run
# module load 2022
# module load foss/2022a
# module load NVHPC/22.7
# NVHPC_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/2022
NVHPC_HOME=${EBROOTNVHPC}/Linux_x86_64/22.7

# Required variables to define
# MPICXX=mpicxx
# MPIF90=mpifort
CUDA_HOME=${NVHPC_HOME}/cuda
MPI_HOME=${NVHPC_HOME}/comm_libs/hpcx/latest/ompi
MPICXX=${MPI_HOME}/bin/mpicxx
MPIF90=${MPI_HOME}/bin/mpifort
NCCL_HOME=${NVHPC_HOME}/comm_libs/nccl
CUFFT_HOME=${NVHPC_HOME}/math_libs
CUTENSOR_HOME=${NVHPC_HOME}/math_libs
CUDACXX_HOME=${CUDA_HOME}

# Optional variables
CUDA_CC_LIST=61
BUILD_FORTRAN=1
ENABLE_NVTX=1
ENABLE_NVSHMEM=1
NVSHMEM_HOME=${NVHPC_HOME}/comm_libs/nvshmem

Hi @chowland, thanks for the interest in cuDecomp!

I have posted #6 which should address the cannot open source file "nccl.h" errors in your log.

For the remaining linking issues, could you post what your LD_LIBRARY_PATH is at compile time? I am wondering if there are some missing paths there.
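One way to answer this question is to print each LD_LIBRARY_PATH entry on its own line and mark the entries that do not exist on disk. This is a minimal sketch; the helper name `list_ld_paths` is hypothetical and not part of cuDecomp.

```shell
# Hypothetical helper: list every LD_LIBRARY_PATH entry, one per line,
# and mark whether the directory actually exists.
list_ld_paths() {
    printf '%s\n' "${LD_LIBRARY_PATH:-}" | tr ':' '\n' | while IFS= read -r dir; do
        # Skip empty entries (e.g. from a leading or trailing colon).
        [ -z "${dir}" ] && continue
        if [ -d "${dir}" ]; then
            echo "ok      ${dir}"
        else
            echo "missing ${dir}"
        fi
    done
}

list_ld_paths
```

Running it in the same shell session as `make`, after the modules have been loaded, shows what the build actually sees.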

Thanks for the fast response @romerojosh. You're right, the nccl.h errors have now gone in the new branch, but the linking issues are persisting. I can see that loading the NVHPC/22.7 module on Snellius only prepends

/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/compilers/lib

to LD_LIBRARY_PATH, so I guess it's not seeing any of the other libraries. Can you point me to the other directories I would need to add? (I got rid of some of the linking issues by adding library folders manually to LD_LIBRARY_PATH, but seemingly didn't catch all of them.)

Ok, glad that the PR fixes at least the one error.

For the linking issues, I think you should add at least these paths to the LD_LIBRARY_PATH:

/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/comm_libs/nccl/lib
/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/comm_libs/nvshmem/lib
/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/math_libs/lib64

If that doesn't fix it completely, then also try adding the NVHPC CUDA toolkit directory:

/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/cuda/lib64
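The paths listed above can be added in one step before running make. The sketch below assumes NVHPC_HOME points at the versioned NVHPC directory, as in the config file earlier in the thread; the fallback value is the Snellius path quoted above and will differ on other systems.

```shell
# Assumption: NVHPC_HOME is the versioned NVHPC directory (as in the config file).
# The fallback below is the Snellius location quoted in this thread.
NVHPC_HOME="${NVHPC_HOME:-/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7}"

for dir in \
    "${NVHPC_HOME}/comm_libs/nccl/lib" \
    "${NVHPC_HOME}/comm_libs/nvshmem/lib" \
    "${NVHPC_HOME}/math_libs/lib64" \
    "${NVHPC_HOME}/cuda/lib64"
do
    # Prepend each directory only if it is not already on the path.
    case ":${LD_LIBRARY_PATH:-}:" in
        *":${dir}:"*) ;;
        *) LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
    esac
done
export LD_LIBRARY_PATH
```

The snippet has to be sourced, or pasted into the shell that runs make, so that the exported variable is visible to the build.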

Thanks for all your help, I've now managed a successful build. For reference, I had to add quite a hacky workaround to get it to find libcuda and libnvidia-ml, since these were only located in $NVHPC_HOME/cuda/lib64/stubs as libXXXX.so and the linker was looking for libXXXX.so.1. Do you know if this is standard for an NVHPC install or if this would be a system-specific issue?

Hi @chowland, thanks for the update!

Pedro Costa, who has access to this system, helped me sort this out offline. It seems some of the linking issues you encountered are due to compiling cuDecomp on a system without a CUDA driver installed. This is the source of the linking issues with libcuda.so and libnvidia-ml. To handle this use case, I needed to add links to the CUDA stub libraries in the test/example compilation commands. I updated #5 to include these changes.

Could you try that updated branch with the suggested LD_LIBRARY_PATH changes in my previous comment and see if that works for you (without any hacky workarounds 😄 )?
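The idea of linking against the CUDA stub libraries on a node without a driver can be sketched as follows. This is an illustration only, not the change made in the PR: the variable names DRIVER_LDFLAGS and STUBS_DIR are hypothetical, and the CUDA_HOME fallback is an assumed default install location.

```shell
# Assumption: CUDA_HOME is set as in the config file; /usr/local/cuda is only a fallback.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
STUBS_DIR="${CUDA_HOME}/lib64/stubs"

# If the dynamic linker cache knows about a driver library, link normally.
# Otherwise (e.g. a login node without a CUDA driver) add the stubs directory
# so that -lcuda and -lnvidia-ml can be resolved at link time.
if ldconfig -p 2>/dev/null | grep -q 'libcuda\.so\.1'; then
    DRIVER_LDFLAGS="-lcuda -lnvidia-ml"
else
    DRIVER_LDFLAGS="-L${STUBS_DIR} -lcuda -lnvidia-ml"
fi

echo "${DRIVER_LDFLAGS}"
```

The stubs only satisfy the linker. At run time the real driver libraries on the GPU nodes are still required.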

I thought that might not be the most straightforward way of doing it! Just tried with the updated PR and it works perfectly. I might get in touch with the HPC service desk to suggest adding the libraries to LD_LIBRARY_PATH when the module is loaded, but from the cuDecomp side, I'm happy for this issue to be closed. Thanks so much!

Thanks a lot for the rapid feedback. Will close this issue when I merge the PR.

Feel free to reach out again if you have other issues integrating the library in your project.