Building on Snellius HPC
chowland opened this issue · comments
Hi, thanks for sharing this library. I'm trying to build it on the Dutch national HPC Snellius, but I have run into trouble with the compilation. The lib stage of the Makefile appears to complete without problems, but once it moves on to the tests stage, it prints out many errors about not being able to find the libraries.
From the example config files, I believe I am pointing all the necessary variables to the right places, but the tests seem unable to find any of the NVIDIA libraries such as nccl.
My config file is as below, and here is my log file from the make command. Please let me know if you can see an obvious fix or have any suggestions.
# Having run
# module load 2022
# module load foss/2022a
# module load NVHPC/22.7
# NVHPC_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/2022
NVHPC_HOME=${EBROOTNVHPC}/Linux_x86_64/22.7
# Required variables to define
# MPICXX=mpicxx
# MPIF90=mpifort
CUDA_HOME=${NVHPC_HOME}/cuda
MPI_HOME=${NVHPC_HOME}/comm_libs/hpcx/latest/ompi
MPICXX=${MPI_HOME}/bin/mpicxx
MPIF90=${MPI_HOME}/bin/mpifort
NCCL_HOME=${NVHPC_HOME}/comm_libs/nccl
CUFFT_HOME=${NVHPC_HOME}/math_libs
CUTENSOR_HOME=${NVHPC_HOME}/math_libs
CUDACXX_HOME=${CUDA_HOME}
# Optional variables
CUDA_CC_LIST=61
BUILD_FORTRAN=1
ENABLE_NVTX=1
ENABLE_NVSHMEM=1
NVSHMEM_HOME=${NVHPC_HOME}/comm_libs/nvshmem
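Before running make, it can help to confirm that each directory named in a config like the one above actually exists on the system. The following is a minimal sketch only: check_dirs is a hypothetical helper and not part of cuDecomp, and the lib and lib64 subdirectories in the usage comment are assumptions based on the NVHPC layout discussed in this thread.

```shell
# Hypothetical helper (not part of cuDecomp): report which of the given
# directories exist. Returns 0 if all exist, 1 if any is missing.
check_dirs() {
  missing=0
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      echo "ok: $dir"
    else
      echo "missing: $dir"
      missing=1
    fi
  done
  return "$missing"
}

# Example use, assuming the config variables are also set in the current shell:
# check_dirs "$CUDA_HOME" "$MPI_HOME" "$NCCL_HOME/lib" "$CUFFT_HOME/lib64" "$NVSHMEM_HOME/lib"
```

A missing directory here would point to a wrong *_HOME value rather than to a linker search path problem.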
Thanks for the fast response @romerojosh. You're right, the nccl.h errors have now gone in the new branch, but the linking issues are persisting. I can see that loading the NVHPC/22.7 module on Snellius only prepends /sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/compilers/lib to LD_LIBRARY_PATH so I guess it's not seeing any of the other libraries. Can you point me to the other directories I would need to add? (I got rid of some of the linking issues by adding library folders manually to LD_LIBRARY_PATH, but seemingly didn't catch all of them)
Ok, glad that the PR fixes at least the one error.
For the linking issues, I think you should add at least these paths to the LD_LIBRARY_PATH:
/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/comm_libs/nccl/lib
/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/comm_libs/nvshmem/lib
/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/math_libs/lib64
If that doesn't fix it completely, then also try adding the NVHPC CUDA toolkit directory:
/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7/Linux_x86_64/22.7/cuda/lib64
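As a sketch, the paths above could be prepended in a job script or shell session as follows. This assumes that EBROOTNVHPC is set by the NVHPC/22.7 module and resolves to the Snellius install prefix quoted above; prepend_ld_path and NVHPC_ROOT are illustrative names, not part of cuDecomp or the module system.

```shell
# Assumption: EBROOTNVHPC is set by the NVHPC/22.7 module. The fallback is the
# Snellius install prefix taken from the paths listed in this thread.
NVHPC_ROOT="${EBROOTNVHPC:-/sw/arch/RHEL8/EB_production/2022/software/NVHPC/22.7}/Linux_x86_64/22.7"

# Illustrative helper: prepend a directory to LD_LIBRARY_PATH unless it is
# already listed there.
prepend_ld_path() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*) ;;  # already present, nothing to do
    *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
}

prepend_ld_path "$NVHPC_ROOT/cuda/lib64"
prepend_ld_path "$NVHPC_ROOT/math_libs/lib64"
prepend_ld_path "$NVHPC_ROOT/comm_libs/nvshmem/lib"
prepend_ld_path "$NVHPC_ROOT/comm_libs/nccl/lib"
export LD_LIBRARY_PATH
```

Skipping directories that are already listed keeps the variable from growing each time the snippet is sourced.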
Thanks for all your help, I've now managed a successful build. For reference, I had to add quite a hacky workaround to get it to find libcuda and libnvidia-ml, since these were only located in $NVHPC_HOME/cuda/lib64/stubs as libXXXX.so and the linker was looking for libXXXX.so.1. Do you know if this is standard for an NVHPC install or if this would be a system-specific issue?
Hi @chowland, thanks for the update!
Pedro Costa, who has access to this system, helped me sort this out offline. It seems some of the linking issues you encountered are due to compiling cuDecomp on a system without a CUDA driver installed. This is the source of the linking issues with libcuda.so and libnvidia-ml. To handle this use case, I needed to add links to CUDA stub libraries in the test/example compilation commands. I updated #5 to include these changes.
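For readers hitting the same problem elsewhere, linking against the stub libraries generally amounts to adding the stubs directory to the linker search path at build time. The sketch below only illustrates that idea and is not the actual change made in the PR; stub_ldflags is a hypothetical helper, and the lib64/stubs location is taken from the path reported earlier in this thread.

```shell
# Hypothetical helper (not taken from the PR): build linker flags that resolve
# libcuda and libnvidia-ml from the CUDA stubs directory at link time.
# Argument: the CUDA toolkit root, e.g. the CUDA_HOME value from the config.
stub_ldflags() {
  printf '%s\n' "-L$1/lib64/stubs -lcuda -lnvidia-ml"
}

# Illustrative use in a link command (file names are placeholders):
# "$MPICXX" example.o $(stub_ldflags "$CUDA_HOME") -o example
```

The stubs are only meant to satisfy the linker on a machine without a driver. At run time on a GPU node, the real libcuda.so.1 and libnvidia-ml.so.1 come from the installed driver.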
Could you try that updated branch with the suggested LD_LIBRARY_PATH changes in my previous comment and see if that works for you (without any hacky workarounds 😄 )?
I thought that might not be the most straightforward way of doing it! Just tried with the updated PR and it works perfectly. I might get in touch with the HPC service desk to suggest adding the libraries to LD_LIBRARY_PATH when the module is loaded, but from the cuDecomp side, I'm happy for this issue to be closed. Thanks so much.
Thanks a lot for the rapid feedback. Will close this issue when I merge the PR.
Feel free to reach out again if you have other issues integrating the library in your project.