eth-cscs / stackinator

Home Page: https://eth-cscs.github.io/stackinator/

Utopia stack with Trilinos+CUDA: cudaErrorUnsupportedPtxVersion

edopao opened this issue · comments

commented

I have built a software stack for Utopia on Clariden using this recipe:
https://github.com/edopao/utopia-recipe/blob/ede5c35792e12c4e8a0c46918846dbc543e5665d/environments.yaml

This recipe enables the CUDA variant on all packages, with cuda_arch=80. Here is the concretisation result for Trilinos:

==> Concretized cuda@11.8                                                                                                                                              
 -   3xr57ku  cuda@11.8.0%gcc@11.3.0~allow-unsupported-compilers~dev build_system=generic arch=linux-sles15-zen3

==> Concretized trilinos@13.4.0+amesos2+belos~epetra+intrepid2+mumps+nox+openmp+shards+suite-sparse+superlu-dist cxxstd=17                                             
 -   o23zzjq  trilinos@13.4.0%gcc@11.3.0~adelus~adios2~amesos+amesos2+anasazi~aztec~basker+belos~boost~chaco~complex+cuda~cuda_rdc~debug~dtk~epetra~epetraext~epetraextbtf~epetraextexperimental~epetraextgraphreorderings~exodus+explicit_template_instantiation~float+fortran~gtest~hdf5~hypre~ifpack+ifpack2~intrepid+intrepid2~ipo~isorropia+kokkos~mesquite~minitensor~ml+mpi+muelu+mumps+nox+openmp~panzer~phalanx~piro~python~rocm~rocm_rdc~rol~rythmos+sacado~scorec+shards+shared~shylu~stk~stokhos~stratimikos~strumpack+suite-sparse~superlu+superlu-dist~teko~tempus~thyra+tpetra~trilinoscouplings~uvm+wrapper~x11~zoltan~zoltan2 build_system=cmake build_type=RelWithDebInfo cuda_arch=80 cxxstd=17 gotype=long_long arch=linux-sles15-zen3
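For reference, a stackinator recipe that requests CUDA everywhere might look roughly like the following (field names follow stackinator's environments.yaml schema; the package list and versions here are illustrative, see the linked recipe for the actual file):

```yaml
# Illustrative fragment of a stackinator environments.yaml;
# the real recipe is in the utopia-recipe repository linked above.
gcc-env:
  compiler:
    - toolchain: gcc
      spec: gcc@11.3
  mpi:
    spec: cray-mpich
    gpu: cuda
  specs:
    - cuda@11.8
    - trilinos@13.4.0 +cuda cuda_arch=80
  variants:              # applied to every package in the environment
    - +cuda
    - cuda_arch=80
```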

After building Utopia in the above user environment, I get a CUDA runtime error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorUnsupportedPtxVersion): the provided PTX was compiled with an unsupported toolchain. /tmp/epaone/spack-stage/spack-stage-trilinos-13.4.0-o23zzjqfcj6fo55x4rqqvihjdklmo6dv/spack-src/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:151

Here is the output of the ldd command for reference:

$ ldd utopia_test | grep cuda
	libcudart.so.11.0 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cuda-11.8.0-3xr57kuw4q4cw53rscdnvqyjorpqamnp/lib64/libcudart.so.11.0 (0x00007fc4a0630000)
	libnvToolsExt.so.1 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cuda-11.8.0-3xr57kuw4q4cw53rscdnvqyjorpqamnp/lib64/libnvToolsExt.so.1 (0x00007fc4a0426000)
	libcufft.so.10 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cuda-11.8.0-3xr57kuw4q4cw53rscdnvqyjorpqamnp/lib64/libcufft.so.10 (0x00007fc48f54b000)
	libcublas.so.11 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cuda-11.8.0-3xr57kuw4q4cw53rscdnvqyjorpqamnp/lib64/libcublas.so.11 (0x00007fc4898ed000)
	libcusparse.so.11 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cuda-11.8.0-3xr57kuw4q4cw53rscdnvqyjorpqamnp/lib64/libcusparse.so.11 (0x00007fc478bf5000)
	libcusolver.so.11 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cuda-11.8.0-3xr57kuw4q4cw53rscdnvqyjorpqamnp/lib64/libcusolver.so.11 (0x00007fc46693d000)
	libcurand.so.10 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cuda-11.8.0-3xr57kuw4q4cw53rscdnvqyjorpqamnp/lib64/libcurand.so.10 (0x00007fc460061000)
	libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fc45e831000)
	libmpi_gtl_cuda.so.0 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cray-mpich-8.1.24-gcc-fwf2cccra3y3lxkzw7kvqjyvwfipin4i/lib/libmpi_gtl_cuda.so.0 (0x00007fc45b7ba000)
	libcublasLt.so.11 => /user-environment/linux-sles15-zen3/gcc-11.3.0/cuda-11.8.0-3xr57kuw4q4cw53rscdnvqyjorpqamnp/lib64/libcublasLt.so.11 (0x00007fc411be9000)

Hello @edopao, did you ever resolve this issue?

commented

No, the issue is still there. I tried again today and noticed that the default CUDA architecture in the generated nvcc_wrapper is incorrect:
/user-environment/linux-sles15-zen3/gcc-11.3.0/trilinos-13.4.0-o23zzjqfcj6fo55x4rqqvihjdklmo6dv/bin/nvcc_wrapper
default_arch="sm_35"

In a local Trilinos installation on Daint, I see the correct default for the target GPU architecture, since this script is generated when Trilinos is built on the target node.

Is there a way to configure the Trilinos build system to use "sm_80"?

commented

Yes, that should be done by the variants +cuda cuda_arch=80, which appear to be picked up in the concretisation step:

==> Concretized trilinos@13.4.0+amesos2+belos~epetra+intrepid2+mumps+nox+openmp+shards+suite-sparse+superlu-dist cxxstd=17                                             
 -   o23zzjq  trilinos@13.4.0%gcc@11.3.0~adelus~adios2~amesos+amesos2+anasazi~aztec~basker+belos~boost~chaco~complex+cuda~cuda_rdc~debug~dtk~epetra~epetraext~epetraextbtf~epetraextexperimental~epetraextgraphreorderings~exodus+explicit_template_instantiation~float+fortran~gtest~hdf5~hypre~ifpack+ifpack2~intrepid+intrepid2~ipo~isorropia+kokkos~mesquite~minitensor~ml+mpi+muelu+mumps+nox+openmp~panzer~phalanx~piro~python~rocm~rocm_rdc~rol~rythmos+sacado~scorec+shards+shared~shylu~stk~stokhos~stratimikos~strumpack+suite-sparse~superlu+superlu-dist~teko~tempus~thyra+tpetra~trilinoscouplings~uvm+wrapper~x11~zoltan~zoltan2 build_system=cmake build_type=RelWithDebInfo cuda_arch=80 cxxstd=17 gotype=long_long arch=linux-sles15-zen3

It is very strange that it does not take effect.

commented

The last comment I wrote is probably not relevant. When I use nvcc_wrapper from /user-environment, I see that the correct cuda_arch is set on the compile line:
/user-environment/.../nvcc_wrapper ... -arch=sm_80 ... myfile.cpp
That -arch flag should override whatever default architecture is set in nvcc_wrapper.

commented

If I compile a simple CUDA program with nvcc_wrapper, it works:

$ module use /user-environment/modules/
$ module load cuda trilinos
$ which nvcc_wrapper
/user-environment/linux-sles15-zen3/gcc-11.3.0/trilinos-13.4.0-o23zzjqfcj6fo55x4rqqvihjdklmo6dv/bin/nvcc_wrapper
$ nvcc_wrapper hello.cu -o hello -arch sm_80
$ ./hello 
Hello World from GPU!
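For completeness, the hello.cu used in the test above was just a minimal kernel launch along these lines (my own sketch, not taken from any repository):

```cuda
#include <cstdio>

// Minimal kernel: a single device thread prints a message.
__global__ void hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    hello<<<1, 1>>>();
    // Synchronize so the device printf is flushed before exit; this is
    // also the call where cudaErrorUnsupportedPtxVersion would surface.
    cudaDeviceSynchronize();
    return 0;
}
```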

So this is probably not an issue to be handled here; we can close it.

commented

I found a solution to this issue. The CUDA driver installed on clariden/hohgant only supports CUDA up to version 11.6:

$ srun -N1 --partition=nvgpu nvidia-smi
Fri Jun 16 14:45:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

The utopia-recipe environment on the master branch specifies CUDA 11.8, which was inspired by the Stackinator examples. CUDA 11.8 is needed to support Hopper GPUs, but the clariden and hohgant nodes only have Ampere GPUs, which explains why the driver installed on these nodes supports only CUDA 11.6.
I have created a system-hohgant branch in the utopia-recipe repository to build a user environment with CUDA 11.6. This image works fine; no PTX version mismatch is observed.
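A quick way to confirm this kind of mismatch from inside an environment is to compare the driver's supported CUDA version with the runtime the application was built against. A small diagnostic along these lines (my own sketch, compiled with the environment's nvcc) would have flagged the problem up front:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVersion);  // CUDA runtime the application was built against
    printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 1000) / 10,
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
    if (runtimeVersion > driverVersion) {
        // e.g. runtime 11.8 vs driver 11.6 on clariden: JIT-compiling PTX
        // produced by the newer toolkit fails with cudaErrorUnsupportedPtxVersion.
        printf("runtime is newer than the driver: PTX JIT may fail\n");
    }
    return 0;
}
```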

commented

Adding a reference from https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-considerations:

Applications using PTX will see runtime issues
Applications that compile device code to PTX will not work on older drivers. If the application requires PTX then admins have to upgrade the installed driver.
PTX Developers should refer to the CUDA Compatibility Developers Guide and PTX programming guide in the CUDA C++ Programming Guide for details on this limitation.