Building on Snellius HPC

chowland opened this issue 2 years ago

Hi, thanks for sharing this library. I'm trying to build it on the Dutch national HPC Snellius, but I have run into trouble with the compilation. The lib stage of the Makefile appears to complete without problems, but once it moves on to the tests stage, it prints out many errors about not being able to find the libraries.

From the example config files, I believe I am pointing all the necessary variables to the right places, but the tests seem unable to find any of the NVIDIA libraries such as nccl.

My config file is as below, and here is my log file from the make command. Please let me know if you can see an obvious fix or have any suggestions.

# Having run
# module load 2022
# module load foss/2022a
# module load NVHPC/22.7
# NVHPC_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/2022
NVHPC_HOME=${EBROOTNVHPC}/Linux_x86_64/22.7

# Required variables to define
# MPICXX=mpicxx
# MPIF90=mpifort
CUDA_HOME=${NVHPC_HOME}/cuda
MPI_HOME=${NVHPC_HOME}/comm_libs/hpcx/latest/ompi
MPICXX=${MPI_HOME}/bin/mpicxx
MPIF90=${MPI_HOME}/bin/mpifort
NCCL_HOME=${NVHPC_HOME}/comm_libs/nccl
CUFFT_HOME=${NVHPC_HOME}/math_libs
CUTENSOR_HOME=${NVHPC_HOME}/math_libs
CUDACXX_HOME=${CUDA_HOME}

# Optional variables
CUDA_CC_LIST=61
BUILD_FORTRAN=1
ENABLE_NVTX=1
ENABLE_NVSHMEM=1
NVSHMEM_HOME=${NVHPC_HOME}/comm_libs/nvshmem

romerojosh · Answer 1 · Wed Dec 14 2022 23:54:32 GMT+0800 (China Standard Time)

Hi @chowland, thanks for the interest in cuDecomp!

I have posted #6 which should address the cannot open source file "nccl.h" errors in your log.

For the remaining linking issues, could you post what your LD_LIBRARY_PATH is at compile time? I am wondering if there are some missing paths there.

Chris Howland · Answer 2 · Thu Dec 15 2022 00:29:07 GMT+0800 (China Standard Time)

Thanks for the fast response @romerojosh. You're right, the nccl.h errors have now gone in the new branch, but the linking issues are persisting. I can see that loading the NVHPC/22.7 module on Snellius only prepends

/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/compilers/lib

to LD_LIBRARY_PATH, so I guess it's not seeing any of the other libraries. Can you point me to the other directories I would need to add? (I got rid of some of the linking issues by adding library folders manually to LD_LIBRARY_PATH, but seemingly didn't catch all of them.)
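
One way to see which libraries are still missing is to search each LD_LIBRARY_PATH entry for them. The following is a minimal sketch, not part of cuDecomp: the helper name `check_lib` is hypothetical, and the library names (libnccl.so, libcufft.so, libcutensor.so) are assumed from the NCCL_HOME, CUFFT_HOME and CUTENSOR_HOME variables in the config above.

```shell
# check_lib is a hypothetical helper, not something shipped with cuDecomp or NVHPC.
# It reports whether a file starting with the given name exists in any
# directory listed in LD_LIBRARY_PATH.
check_lib() {
    local lib="$1" dir
    local IFS=':'
    for dir in ${LD_LIBRARY_PATH:-}; do
        if [ -n "$dir" ] && ls "$dir"/"$lib"* >/dev/null 2>&1; then
            echo "found $lib in $dir"
            return 0
        fi
    done
    echo "missing $lib"
    return 1
}

# Library names assumed from the *_HOME variables in the config file.
for lib in libnccl.so libcufft.so libcutensor.so; do
    check_lib "$lib" || true
done
```

Any library reported as missing points to a directory that still needs to be added to LD_LIBRARY_PATH.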

romerojosh · Answer 3 · Thu Dec 15 2022 00:53:01 GMT+0800 (China Standard Time)

Ok, glad that the PR fixes at least the one error.

For the linking issues, I think you should add at least these paths to the LD_LIBRARY_PATH:

/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/comm_libs/nccl/lib
/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/comm_libs/nvshmem/lib
/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/math_libs/lib64

If that doesn't fix it completely, then also try adding the NVHPC CUDA toolkit directory:

/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/cuda/lib64
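
As a sketch, the paths above could be prepended in one go before running make. This assumes the NVHPC/22.7 module is loaded so that EBROOTNVHPC is set; the fallback value is simply the Snellius install prefix quoted in this thread.

```shell
# Assumption: EBROOTNVHPC is set by "module load NVHPC/22.7".
# The fallback is the install prefix quoted earlier in this thread.
NVHPC_HOME="${EBROOTNVHPC:-/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7}/Linux_x86_64/22.7"

# Prepend each library directory, avoiding a trailing ':' when the
# variable starts out empty.
for dir in \
    "${NVHPC_HOME}/comm_libs/nccl/lib" \
    "${NVHPC_HOME}/comm_libs/nvshmem/lib" \
    "${NVHPC_HOME}/math_libs/lib64"; do
    LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
done

# Optional, only if link errors remain: the NVHPC CUDA toolkit directory.
# LD_LIBRARY_PATH="${NVHPC_HOME}/cuda/lib64:${LD_LIBRARY_PATH}"

export LD_LIBRARY_PATH
```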

Chris Howland · Answer 4 · Thu Dec 15 2022 01:34:45 GMT+0800 (China Standard Time)

Thanks for all your help, I've now managed a successful build. For reference, I had to add quite a hacky workaround to get it to find libcuda and libnvidia-ml, since these were only located in $NVHPC_HOME/cuda/lib64/stubs as libXXXX.so and the linker was looking for libXXXX.so.1. Do you know if this is standard for an NVHPC install or if this would be a system-specific issue?

romerojosh · Answer 5 · Thu Dec 15 2022 01:38:47 GMT+0800 (China Standard Time)

Hi @chowland, thanks for the update!

Pedro Costa, who has access to this system, helped me sort this out offline. It seems some of the linking issues you encountered are due to compiling cuDecomp on a system without a CUDA driver installed. This is the source of the linking issues with libcuda.so and libnvidia-ml. To handle this use case, I needed to add links to the CUDA stub libraries in the test/example compilation commands. I updated #5 to include these changes.
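
For illustration only, linking against the stub libraries on a node without a driver generally looks like the sketch below. This is not the actual change in the PR: the variable names STUB_DIR and STUB_LDFLAGS are hypothetical, and the fallback CUDA_HOME is an assumption for systems where it is not already set.

```shell
# Assumption: CUDA_HOME points at the CUDA toolkit, as in the config file
# at the top of this thread. The fallback path is only a placeholder.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"

# The stubs directory holds link-time placeholders for driver libraries
# (libcuda.so, libnvidia-ml.so) that are absent when no driver is installed.
STUB_DIR="${CUDA_HOME}/lib64/stubs"

# Hypothetical variable name: linker flags to append to a test/example
# link command so that -lcuda and -lnvidia-ml resolve at build time.
STUB_LDFLAGS="-L${STUB_DIR} -lcuda -lnvidia-ml"

echo "${STUB_LDFLAGS}"
```

The stubs are meant for link time only, so the stubs directory should not be added to LD_LIBRARY_PATH; at run time the real driver libraries on the GPU nodes are used instead.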

Could you try that updated branch with the suggested LD_LIBRARY_PATH changes in my previous comment and see if that works for you (without any hacky workarounds 😄 )?

Chris Howland · Answer 6 · Thu Dec 15 2022 01:58:07 GMT+0800 (China Standard Time)

I thought that might not be the most straightforward way of doing it! Just tried with the updated PR and it works perfectly. I might get in touch with the HPC service desk to suggest adding the libraries to LD_LIBRARY_PATH when the module is loaded, but from the cuDecomp side, I'm happy for this issue to be closed. Thanks so much!

romerojosh · Answer 7 · Thu Dec 15 2022 02:02:34 GMT+0800 (China Standard Time)

Thanks a lot for the rapid feedback. Will close this issue when I merge the PR.

Feel free to reach out again if you have other issues integrating the library in your project.