marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository

Home Page: https://marian-nmt.github.io

Can't build or run marian after a libcublas10 update

rihardsk opened this issue

Bug description

After updating to libcublas10 version 10.2.3.254-1, Marian can no longer locate libcublas.so.10 on its own:

$ ldd marian-decoder
	linux-vdso.so.1 (0x00007fffab1f4000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f73a6071000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f73a5e6d000)
	libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f73a1e0c000)
	libcusparse.so.10 => /usr/local/cuda/lib64/libcusparse.so.10 (0x00007f739ab85000)
	libcublas.so.10 => not found
	libtcmalloc_minimal.so.4 => /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 (0x00007f739a93a000)
	libcrypto.so.1.1 => /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007f739a46f000)
	libboost_system.so.1.65.1 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.65.1 (0x00007f739a26a000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f739a04b000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f7399c3e000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f73998a0000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f7399688000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7399297000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f73ab9bd000)

On a system with libcublas10 version 10.2.2.89-1, everything's fine:

$ ldd marian-decoder
	linux-vdso.so.1 (0x00007ffe433d0000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f97d97b7000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f97d95b3000)
	libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f97d5552000)
	libcusparse.so.10 => /usr/local/cuda/lib64/libcusparse.so.10 (0x00007f97ce2cb000)
	libcublas.so.10 => /usr/lib/x86_64-linux-gnu/libcublas.so.10 (0x00007f97ca015000)
	libtcmalloc_minimal.so.4 => /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 (0x00007f97c9dca000)
	libcrypto.so.1.1 => /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007f97c98ff000)
	libboost_system.so.1.65.1 => /usr/lib/x86_64-linux-gnu/libboost_system.so.1.65.1 (0x00007f97c96fa000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f97c94db000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f97c9152000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f97c8db4000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f97c8b9c000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f97c87ab000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f97df103000)
	libcublasLt.so.10 => /usr/lib/x86_64-linux-gnu/libcublasLt.so.10 (0x00007f97c6918000)

This appears to be caused by a change in where the libcublas10 package installs its files:

10.2.3.254-1:

dpkg -L libcublas10
/.
/usr
/usr/local
/usr/local/cuda-10.2
/usr/local/cuda-10.2/targets
/usr/local/cuda-10.2/targets/x86_64-linux
/usr/local/cuda-10.2/targets/x86_64-linux/lib
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcublas.so.10.2.3.254
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcublasLt.so.10.2.3.254
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvblas.so.10.2.3.254
/usr/share
/usr/share/doc
/usr/share/doc/libcublas10
/usr/share/doc/libcublas10/changelog.Debian.gz
/usr/share/doc/libcublas10/copyright
/usr/local/cuda-10.2/lib64
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcublas.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcublasLt.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvblas.so.10

10.2.2.89-1:

dpkg -L libcublas10
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/libcublas10
/usr/share/doc/libcublas10/changelog.Debian.gz
/usr/lib
/usr/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu/libcublasLt.so.10.2.2.89
/usr/lib/x86_64-linux-gnu/libnvblas.so.10.2.2.89
/usr/lib/x86_64-linux-gnu/libcublas.so.10.2.2.89
/usr/lib/x86_64-linux-gnu/libnvblas.so.10
/usr/lib/x86_64-linux-gnu/libcublas.so.10
/usr/lib/x86_64-linux-gnu/libcublasLt.so.10

The newer version no longer installs anything under /usr/lib/x86_64-linux-gnu. All of the libcublas.so files now live under /usr/local/cuda-10.2/targets/x86_64-linux/lib, which is not one of the default shared library search paths.
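For anyone who wants to check the same thing on their own machine, here is a rough sketch. `find_lib_dirs` is a hypothetical helper name introduced for illustration, not something shipped with Marian or CUDA; it only lists the directories under a given root that contain a given soname. The second command asks the dynamic loader cache whether it knows about libcublas at all.

```shell
# Hypothetical helper: list the directories below ROOT that contain SONAME
# (as a regular file or as a symlink).
find_lib_dirs() {
    soname="$1"
    root="$2"
    # "|| true" keeps unreadable subdirectories from failing the whole pipeline.
    { find "$root" -name "$soname" 2>/dev/null || true; } |
        while IFS= read -r path; do dirname "$path"; done |
        sort -u
}

# Where did the package put libcublas.so.10?
find_lib_dirs libcublas.so.10 /usr/local

# Does the dynamic loader cache know about it? (ldconfig may live in /sbin.)
ldconfig -p 2>/dev/null | grep libcublas || echo "libcublas is not in the ld.so cache"
```

If the first command prints only a directory under /usr/local/cuda-10.2 and the second prints the fallback message, that matches the `not found` line in the ldd output above.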

This affects building as well. Running cmake on the latest master on the system with libcublas10 version 10.2.3.254-1 gives:

$ cmake ..
-- The CXX compiler identification is GNU 7.5.0
-- The C compiler identification is GNU 7.5.0
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Project name: marian
-- Project version: v1.10.25;+ab6b8260
Submodule 'examples' (https://github.com/marian-nmt/marian-examples) registered for path 'examples'
Submodule 'regression-tests' (https://github.com/marian-nmt/marian-regression-tests) registered for path 'regression-tests'
Submodule 'src/3rd_party/fbgemm' (https://github.com/marian-nmt/FBGEMM) registered for path 'src/3rd_party/fbgemm'
Submodule 'src/3rd_party/intgemm' (https://github.com/marian-nmt/intgemm/) registered for path 'src/3rd_party/intgemm'
Submodule 'src/3rd_party/nccl' (https://github.com/marian-nmt/nccl) registered for path 'src/3rd_party/nccl'
Submodule 'src/3rd_party/sentencepiece' (https://github.com/marian-nmt/sentencepiece) registered for path 'src/3rd_party/sentencepiece'
Submodule 'src/3rd_party/simple-websocket-server' (https://github.com/marian-nmt/Simple-WebSocket-Server) registered for path 'src/3rd_party/simple-websocket-server'
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/examples'...
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/regression-tests'...
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src/3rd_party/fbgemm'...
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src/3rd_party/intgemm'...
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src/3rd_party/nccl'...
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src/3rd_party/sentencepiece'...
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src/3rd_party/simple-websocket-server'...
Submodule path 'examples': checked out '6d5921cc7de91f4e915b59e9c52c9a76c4e99b00'
Submodule path 'regression-tests': checked out '32a2f7960d8cc48d6c90cbb5d03fbb42eb923d3d'
Submodule path 'src/3rd_party/fbgemm': checked out '6f45243cb8ab7d7ab921af18d313ae97144618b8'
Submodule 'third_party/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'src/3rd_party/fbgemm/third_party/asmjit'
Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'src/3rd_party/fbgemm/third_party/cpuinfo'
Submodule 'third_party/googletest' (https://github.com/google/googletest) registered for path 'src/3rd_party/fbgemm/third_party/googletest'
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src/3rd_party/fbgemm/third_party/asmjit'...
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src/3rd_party/fbgemm/third_party/cpuinfo'...
Cloning into '/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src/3rd_party/fbgemm/third_party/googletest'...
Submodule path 'src/3rd_party/fbgemm/third_party/asmjit': checked out '4da474ac9aa2689e88d5e40a2f37628f302d7e3c'
Submodule path 'src/3rd_party/fbgemm/third_party/cpuinfo': checked out 'd5e37adf1406cf899d7d9ec1d317c47506ccb970'
Submodule path 'src/3rd_party/fbgemm/third_party/googletest': checked out '0fc5466dbb9e623029b1ada539717d10bd45e99e'
Submodule path 'src/3rd_party/intgemm': checked out '8abde25b13c3ab210c0dec8e23f4944e3953812d'
Submodule path 'src/3rd_party/nccl': checked out '5dcf7751494f9d04057bfc6b4a2b64611bc12253'
Submodule path 'src/3rd_party/sentencepiece': checked out 'c307b874deb5ea896db8f93506e173353e66d4d3'
Submodule path 'src/3rd_party/simple-websocket-server': checked out '1d7e84aeb3f1ebdc78f6965d79ad3ca3003789fe'
CMake Warning at CMakeLists.txt:74 (message):
  CMAKE_BUILD_TYPE not set; setting to Release


-- Building with -march=native and intrinsics will be chosen automatically by the compiler to match the current machine.
-- Checking support for CPU intrinsics
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found suitable version "10.1", minimum required is "9.0") 
-- Compiling code for Pascal GPUs
-- Compiling code for Volta GPUs
-- Compiling code for Turing GPUs
-- Found CUDA libraries: /usr/local/cuda/lib64/libcurand.so;/usr/local/cuda/lib64/libcusparse.so;CUDA_cublas_LIBRARY-NOTFOUND
-- Found Tcmalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
-- Found MKL: -Wl,--start-group;/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a;/opt/intel/mkl/lib/intel64/libmkl_sequential.a;/opt/intel/mkl/lib/intel64/libmkl_core.a;-Wl,--end-group  
CMake Warning at src/3rd_party/intgemm/CMakeLists.txt:33 (message):
  Not building AVX512VNNI-based multiplication because your compiler is
  too old.

  For details rerun cmake with --debug-trycompile then try to build in
  compile_tests/CMakeFiles/CMakeTmp.


-- VERSION: 0.1.94
-- Found TCMalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.13") found components:  doxygen dot 
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDA_cublas_LIBRARY (ADVANCED)
    linked by target "marian" in directory /home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/src

-- Configuring incomplete, errors occurred!
See also "/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/build/CMakeFiles/CMakeOutput.log".
See also "/home/TILDE.LV/rihards.krislauks/prog/cpp/marian-ld-test/build/CMakeFiles/CMakeError.log".

I know I can work around this by setting LD_LIBRARY_PATH, but I'm curious what the proper solution is supposed to be here. It's weird that libcublas10 is now packaged in a way that avoids all of the default shared library search paths.
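For reference, the LD_LIBRARY_PATH workaround could look roughly like the sketch below. `prepend_ld_path` is a hypothetical helper name; the directory is the one reported by `dpkg -L` for 10.2.3.254-1. This only affects the current shell session and the processes started from it. A persistent alternative, which I have not verified on this setup, would be to list the same directory in a file under /etc/ld.so.conf.d and run ldconfig as root.

```shell
# Hypothetical helper: put DIR at the front of LD_LIBRARY_PATH unless it is
# already listed, without leaving a stray leading or trailing colon.
prepend_ld_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Directory taken from the dpkg -L listing for libcublas10 10.2.3.254-1.
prepend_ld_path /usr/local/cuda-10.2/targets/x86_64-linux/lib
echo "$LD_LIBRARY_PATH"
```

After this, `ldd marian-decoder` run from the same shell should be able to resolve libcublas.so.10, assuming the directory really contains it.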

How to reproduce

Update libcublas10 to version 10.2.3.254-1 and try to run marian-decoder or build the project.

Setting LD_LIBRARY_PATH actually doesn't help when running cmake (which makes sense). Currently, I'm unable to build Marian with the updated libcublas10 library. Is there a way to show cmake where to look for cublas?
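One thing that might be worth trying, as an untested sketch: `CUDA_cublas_LIBRARY` is the cache variable named in the cmake error above, and cache variables can generally be set on the command line with `-D`. The helper below (`cublas_cmake_flag` is a hypothetical name) builds such a flag from whichever libcublas file exists in a given directory. Whether Marian's CMake files accept the result is an assumption. Note also that the log reports CUDA 10.1 at /usr/local/cuda while the package installs under /usr/local/cuda-10.2, so the toolkit root that cmake picks up (`CUDA_TOOLKIT_ROOT_DIR` in CMake's FindCUDA module) may be part of the picture.

```shell
# Hypothetical helper: print a -D flag pointing CUDA_cublas_LIBRARY at the
# first libcublas file found in DIR; return 1 if there is none.
cublas_cmake_flag() {
    dir="$1"
    for name in libcublas.so libcublas.so.10; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "-DCUDA_cublas_LIBRARY=$dir/$name"
            return 0
        fi
    done
    return 1
}

# Directory taken from the dpkg -L listing for libcublas10 10.2.3.254-1.
libdir=/usr/local/cuda-10.2/targets/x86_64-linux/lib
flag=$(cublas_cmake_flag "$libdir") || true
if [ -n "$flag" ]; then
    echo "cmake .. $flag"   # print the command instead of running it
else
    echo "no libcublas found in $libdir"
fi
```

The script only prints the cmake command so it can be inspected first; run the printed line from the build directory to actually try it.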