marian-nmt / marian

Fast Neural Machine Translation in C++

Home Page: https://marian-nmt.github.io

CUDA error (illegal memory access) and loss being nan when training big transformer model

jorgtied opened this issue

Bug description

Training breaks with

[2021-05-12 10:19:19] [training] skipping 250846-th update due to loss being nan
[2021-05-12 10:19:19] Error: CUDA error 700 'an illegal memory access was encountered' - /users/tiedeman/projappl/install/marian/src/tensors/gpu/cuda_helpers.h:67: cudaMemcpy(dest, start, (end - start) * sizeof(T), cudaMemcpyDefault)
[2021-05-12 10:19:19] Error: Aborted from void CudaCopy(const T*, const T*, T*) [with T = unsigned int] in /users/tiedeman/projappl/install/marian/src/tensors/gpu/cuda_helpers.h:67

when training big transformer models on NVIDIA V100 GPUs.
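To see how many updates were skipped before the crash and where the first CUDA error was reported, the log can be scanned with a short script. This is only a sketch: the two regular expressions are assumptions modelled on the log lines quoted above, and `summarize_log` is a hypothetical helper, not part of Marian.

```python
import re

# Patterns modelled on the log lines quoted in this report (assumption:
# other Marian versions may word these messages differently).
NAN_SKIP = re.compile(r"\[training\] skipping (\d+)-th update due to loss being nan")
CUDA_ERR = re.compile(r"CUDA error (\d+) '([^']*)' - (\S+):(\d+)")

def summarize_log(lines):
    """Return (list of skipped update numbers, first CUDA error or None)."""
    skipped = []
    first_error = None
    for line in lines:
        match = NAN_SKIP.search(line)
        if match:
            skipped.append(int(match.group(1)))
            continue
        match = CUDA_ERR.search(line)
        if match and first_error is None:
            first_error = {
                "code": int(match.group(1)),
                "message": match.group(2),
                "file": match.group(3),
                "line": int(match.group(4)),
            }
    return skipped, first_error

# Two lines copied from the log above.
sample = [
    "[2021-05-12 10:19:19] [training] skipping 250846-th update due to loss being nan",
    "[2021-05-12 10:19:19] Error: CUDA error 700 'an illegal memory access was encountered' - "
    "/users/tiedeman/projappl/install/marian/src/tensors/gpu/cuda_helpers.h:67: "
    "cudaMemcpy(dest, start, (end - start) * sizeof(T), cudaMemcpyDefault)",
]
skipped, err = summarize_log(sample)
print(skipped, err)
```

Run against the full training log (for example via `summarize_log(open(path))`), this shows whether the NaN updates form one consecutive run directly before the CUDA error or are spread over the training.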

How to reproduce

My command line:

marian --guided-alignment /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/train/opus.spm32k-spm32k.src-trg.alg.gz --early-stopping 10 --valid-freq 10000 --valid-sets /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/val/Tatoeba-dev.src.spm32k /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/val/Tatoeba-dev.trg.spm32k --valid-metrics perplexity --valid-mini-batch 16 --valid-log /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.transformer-big-align.valid1.log --beam-size 12 --normalize 1 --allow-unk --overwrite --keep-best --model /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.transformer-big-align.model1.npz --train-sets /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/train/opus.src.clean.spm32k.gz /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/train/opus.trg.clean.spm32k.gz --max-length 500 --vocabs /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.vocab.yml /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.vocab.yml --mini-batch-fit -w 24000 --maxi-batch 500 --save-freq 10000 --disp-freq 10000 --log /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.transformer-big-align.train1.log --type transformer --enc-depth 12 --dec-depth 6 --dim-emb 1024 --transformer-heads 16 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --fp16 --tied-embeddings-all --devices 0 1 2 3 --sync-sgd --seed 1111 --sqlite --tempdir /run/nvme/job_5803302/data --exponential-smoothing

Size of training data: ca. 45 million sentence pairs. Training works fine with smaller transformer models on the same data set.
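Because the command line above is long, it can help to break it into a flag-to-values mapping when comparing this run against the smaller models that train fine. The following is a minimal sketch using only the Python standard library; `parse_options` and `is_flag` are hypothetical helpers, and the command string is a shortened subset of the options from the command above.

```python
import shlex

def is_flag(token):
    """Treat a token as a flag if it starts with '-' and is not a number."""
    if not token.startswith("-"):
        return False
    try:
        float(token)  # negative numbers are values, not flags
    except ValueError:
        return True
    return False

def parse_options(command):
    """Group a command line into {flag: [values]}; bare flags map to []."""
    tokens = shlex.split(command)
    options = {}
    current = None
    for token in tokens[1:]:  # skip the program name
        if is_flag(token):
            current = token
            options[current] = []
        elif current is not None:
            options[current].append(token)
    return options

# Subset of the options used in this report.
command = (
    "marian --type transformer --enc-depth 12 --dec-depth 6 --dim-emb 1024 "
    "--transformer-heads 16 --learn-rate 0.0003 --optimizer-params 0.9 0.98 1e-09 "
    "--clip-norm 5 --fp16 --mini-batch-fit -w 24000 --devices 0 1 2 3 --sync-sgd"
)
opts = parse_options(command)
for flag, values in opts.items():
    print(flag, values)
```

Parsing two command lines this way and comparing the resulting dictionaries makes it easy to list exactly which settings differ between a failing and a working run.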

Context

  • Marian version (output of --version):
    v1.10.0 6f6d484 2021-02-06 15:35:16 -0800

  • Build configuration (output of --build-info all):

AVX2_FOUND=true
AVX512_FOUND=true
AVX_FOUND=true
BUILD_ARCH=native
CMAKE_ADDR2LINE=/usr/bin/addr2line
CMAKE_AR=/usr/bin/ar
CMAKE_BUILD_TYPE=Release
CMAKE_COLOR_MAKEFILE=ON
CMAKE_CXX_COMPILER=/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin/g++
CMAKE_CXX_COMPILER_AR=/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin/gcc-ar
CMAKE_CXX_COMPILER_RANLIB=/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin/gcc-ranlib
CMAKE_CXX_FLAGS=-std=c++11 -pthread -Wl,--no-as-needed -fPIC -Wno-unused-result  -march=native  -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -DUSE_SENTENCEPIECE -DCUDA_FOUND -DUSE_NCCL -DMKL_ILP64 -m64
CMAKE_CXX_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_CXX_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_CXX_FLAGS_RELEASE=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_CXX_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_C_COMPILER=/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin/gcc
CMAKE_C_COMPILER_AR=/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin/gcc-ar
CMAKE_C_COMPILER_RANLIB=/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin/gcc-ranlib
CMAKE_C_FLAGS=-pthread -Wl,--no-as-needed -fPIC -Wno-unused-result  -march=native  -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -DMKL_ILP64 -m64
CMAKE_C_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_C_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_C_FLAGS_RELEASE=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_C_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_DLLTOOL=CMAKE_DLLTOOL-NOTFOUND
CMAKE_EXE_LINKER_FLAGS=-static-libgcc -static-libstdc++
CMAKE_INSTALL_BINDIR=bin
CMAKE_INSTALL_DATAROOTDIR=share
CMAKE_INSTALL_INCLUDEDIR=include
CMAKE_INSTALL_LIBDIR=lib64
CMAKE_INSTALL_LIBEXECDIR=libexec
CMAKE_INSTALL_LOCALSTATEDIR=var
CMAKE_INSTALL_OLDINCLUDEDIR=/usr/include
CMAKE_INSTALL_PREFIX=/users/tiedeman/projappl
CMAKE_INSTALL_SBINDIR=sbin
CMAKE_INSTALL_SHAREDSTATEDIR=com
CMAKE_INSTALL_SYSCONFDIR=etc
CMAKE_LINKER=/usr/bin/ld
CMAKE_MAKE_PROGRAM=/usr/bin/gmake
CMAKE_NM=/usr/bin/nm
CMAKE_OBJCOPY=/usr/bin/objcopy
CMAKE_OBJDUMP=/usr/bin/objdump
CMAKE_RANLIB=/usr/bin/ranlib
CMAKE_READELF=/usr/bin/readelf
CMAKE_SKIP_INSTALL_RPATH=NO
CMAKE_SKIP_RPATH=NO
CMAKE_STRIP=/usr/bin/strip
CMAKE_VERBOSE_MAKEFILE=FALSE
COMPILE_CPU=on
COMPILE_CUDA=ON
COMPILE_CUDA_SM35=ON
COMPILE_CUDA_SM50=ON
COMPILE_CUDA_SM60=ON
COMPILE_CUDA_SM70=ON
COMPILE_CUDA_SM75=ON
COMPILE_CUDA_SM80=ON
COMPILE_EXAMPLES=OFF
COMPILE_LIBRARY_ONLY=OFF
COMPILE_SERVER=OFF
COMPILE_TESTS=OFF
CUDA_64_BIT_DEVICE_CODE=ON
CUDA_ATTACH_VS_BUILD_RULE_TO_CUDA_FILE=ON
CUDA_BUILD_CUBIN=OFF
CUDA_BUILD_EMULATION=OFF
CUDA_CUDART_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcudart_static.a
CUDA_CUDA_LIBRARY=CUDA_CUDA_LIBRARY-NOTFOUND
CUDA_HOST_COMPILATION_CPP=ON
CUDA_HOST_COMPILER=/appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin/gcc
CUDA_NVCC_EXECUTABLE=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/bin/nvcc
CUDA_NVCC_FLAGS=-DUSE_SENTENCEPIECE -DCUDA_FOUND -DUSE_NCCL --default-stream per-thread -O3 -g --use_fast_math -arch=sm_35 -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -ccbin /appl/spack/install-tree/gcc-4.8.5/gcc-9.1.0-vpjht2/bin/gcc -std=c++11 -Xcompiler -fPIC -Xcompiler -Wno-unused-result -Xcompiler -Wno-deprecated -Xcompiler -Wno-pragmas -Xcompiler -Wno-unused-value -Xcompiler -Werror -Xcompiler -msse2 -Xcompiler -msse3 -Xcompiler -msse4.1 -Xcompiler -msse4.2 -Xcompiler -mavx -Xcompiler -mavx2 -Xcompiler -mavx512f
CUDA_OpenCL_LIBRARY=CUDA_OpenCL_LIBRARY-NOTFOUND
CUDA_PROPAGATE_HOST_FLAGS=OFF
CUDA_SDK_ROOT_DIR=CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_SEPARABLE_COMPILATION=OFF
CUDA_TOOLKIT_INCLUDE=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/include
CUDA_TOOLKIT_ROOT_DIR=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2
CUDA_USE_STATIC_CUDA_RUNTIME=ON
CUDA_VERBOSE_BUILD=OFF
CUDA_VERSION=11.1
CUDA_cublasLt_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcublasLt_static.a
CUDA_cublas_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcublas_static.a
CUDA_cudadevrt_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcudadevrt.a
CUDA_cudart_static_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcudart_static.a
CUDA_cufft_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcufft_static.a
CUDA_culibos_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libculibos.a
CUDA_cupti_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/extras/CUPTI/lib64/libcupti_static.a
CUDA_curand_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcurand_static.a
CUDA_cusolver_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcusolver_static.a
CUDA_cusparse_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libcusparse_static.a
CUDA_nppc_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppc_static.a
CUDA_nppial_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppial_static.a
CUDA_nppicc_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppicc_static.a
CUDA_nppidei_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppidei_static.a
CUDA_nppif_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppif_static.a
CUDA_nppig_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppig_static.a
CUDA_nppim_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppim_static.a
CUDA_nppist_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppist_static.a
CUDA_nppisu_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppisu_static.a
CUDA_nppitc_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnppitc_static.a
CUDA_npps_LIBRARY=/appl/spack/install-tree/gcc-9.1.0/cuda-11.1.0-vvfuk2/lib64/libnpps_static.a
CUDA_nvToolsExt_LIBRARY=CUDA_nvToolsExt_LIBRARY-NOTFOUND
CUDA_rt_LIBRARY=/usr/lib64/librt.a
DOXYGEN_DOT_EXECUTABLE=DOXYGEN_DOT_EXECUTABLE-NOTFOUND
DOXYGEN_EXECUTABLE=/usr/bin/doxygen
GENERATE_MARIAN_INSTALL_TARGETS=OFF
GIT_EXECUTABLE=/usr/bin/git
INTEL_ROOT=/opt/intel
INTGEMM_DONT_BUILD_TESTS=ON
MKL_CORE_LIBRARY=/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_core.a
MKL_INCLUDE_DIR=/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/include
MKL_INCLUDE_DIRS=/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/include
MKL_INTERFACE_LIBRARY=/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_intel_ilp64.a
MKL_LIBRARIES=-Wl,--start-group /appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_intel_ilp64.a /appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_sequential.a /appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_core.a -Wl,--end-group
MKL_ROOT=/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl
MKL_SEQUENTIAL_LAYER_LIBRARY=/appl/opt/cluster_studio_xe2019/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_sequential.a
MPIEXEC_EXECUTABLE=/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/bin/mpiexec
MPIEXEC_MAX_NUMPROCS=40
MPIEXEC_NUMPROC_FLAG=-n
MPI_CXX_COMPILER=/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/bin/mpicxx
MPI_CXX_COMPILER_INCLUDE_DIRS=/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include/openmpi /appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include /appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include/openmpi/opal/mca/event/libevent2022/libevent /appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include/openmpi/opal/mca/event/libevent2022/libevent/include /appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include
MPI_CXX_COMPILE_OPTIONS=-pthread
MPI_CXX_HEADER_DIR=/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include
MPI_CXX_LIB_NAMES=mpi
MPI_CXX_LINK_FLAGS=-Wl,-rpath -Wl,/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/lib -L/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/lib -pthread
MPI_CXX_SKIP_MPICXX=FALSE
MPI_C_COMPILER=/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/bin/mpicc
MPI_C_COMPILER_INCLUDE_DIRS=/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include/openmpi /appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include/openmpi/opal/mca/hwloc/hwloc201/hwloc/include /appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include/openmpi/opal/mca/event/libevent2022/libevent /appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include/openmpi/opal/mca/event/libevent2022/libevent/include /appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include
MPI_C_COMPILE_OPTIONS=-pthread
MPI_C_HEADER_DIR=/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/include
MPI_C_LIB_NAMES=mpi
MPI_C_LINK_FLAGS=-Wl,-rpath -Wl,/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/lib -L/appl/spack/install-tree/gcc-9.1.0/hpcx-mpi-2.4.0-dnpuei/lib -pthread
MPI_mpi_LIBRARY=/lib64/libexempi.so
Protobuf_INCLUDE_DIR=/projappl/project_2001194/usr/include
Protobuf_LIBRARY=/projappl/project_2001194/usr/lib/libprotobuf.so
Protobuf_LITE_LIBRARY_DEBUG=Protobuf_LITE_LIBRARY_DEBUG-NOTFOUND
Protobuf_LITE_LIBRARY_RELEASE=/projappl/project_2001194/usr/lib/libprotobuf-lite.so
Protobuf_PROTOC_EXECUTABLE=/projappl/project_2001194/usr/bin/protoc
Protobuf_PROTOC_LIBRARY_DEBUG=Protobuf_PROTOC_LIBRARY_DEBUG-NOTFOUND
Protobuf_PROTOC_LIBRARY_RELEASE=/projappl/project_2001194/usr/lib/libprotobuf.so
SPM_BUILD_TEST=OFF
SPM_COVERAGE=OFF
SPM_ENABLE_NFKC_COMPILE=OFF
SPM_ENABLE_SHARED=OFF
SPM_ENABLE_TCMALLOC=ON
SPM_ENABLE_TENSORFLOW_SHARED=OFF
SPM_NO_THREADLOCAL=OFF
SPM_TCMALLOC_STATIC=on
SPM_USE_BUILTIN_PROTOBUF=off
SQLITE_ENABLE_ASSERT_HANDLER=OFF
SQLITE_ENABLE_COLUMN_METADATA=ON
SQLITE_USE_LEGACY_STRUCT=OFF
SSE2_FOUND=true
SSE3_FOUND=true
SSE4_1_FOUND=true
SSE4_2_FOUND=true
SSSE3_FOUND=true
TCMALLOC_LIB=/projappl/project_2001194/lib
Tcmalloc_INCLUDE_DIR=/users/tiedeman/projappl/usr/include
Tcmalloc_LIBRARY=/users/tiedeman/projappl/usr/lib/libtcmalloc_minimal.a
Tcmalloc_ROOT=/projappl/project_2001194/usr
USE_APPLE_ACCELERATE=OFF
USE_CCACHE=OFF
USE_CUDNN=OFF
USE_DOXYGEN=ON
USE_FBGEMM=OFF
USE_MKL=ON
USE_MPI=on
USE_NCCL=ON
USE_OPENMP=OFF
USE_SENTENCEPIECE=on
USE_STATIC_LIBS=on
  • Log file (excerpt from the training log):
...
[2021-05-12 10:08:32] Loading model from /users/tiedeman/research/Opus-MT-train/work-tatoeba/fin-eng/opus.spm32k-spm32k.transformer-big-align.model1.npz
[2021-05-12 10:08:35] [memory] Reserving 906 MB, device cpu0
[2021-05-12 10:08:35] [memory] Reserving 226 MB, device gpu0
[2021-05-12 10:08:35] [memory] Reserving 226 MB, device gpu1
[2021-05-12 10:08:35] [memory] Reserving 226 MB, device gpu2
[2021-05-12 10:08:35] [memory] Reserving 226 MB, device gpu3
[2021-05-12 10:19:17] [training] skipping 250843-th update due to loss being nan
[2021-05-12 10:19:18] [training] skipping 250844-th update due to loss being nan
[2021-05-12 10:19:18] [training] skipping 250845-th update due to loss being nan
[2021-05-12 10:19:19] [training] skipping 250846-th update due to loss being nan
[2021-05-12 10:19:19] Error: CUDA error 700 'an illegal memory access was encountered' - /users/tiedeman/projappl/install/marian/src/tensors/gpu/cuda_helpers.h:67: cudaMemcpy(dest, start, (end - start) * sizeof(T), cudaMemcpyDefault)
[2021-05-12 10:19:19] Error: Aborted from void CudaCopy(const T*, const T*, T*) [with T = unsigned int] in /users/tiedeman/projappl/install/marian/src/tensors/gpu/cuda_helpers.h:67
CUDA error 700 'an illegal memory access was encountered' - /users/tiedeman/projappl/install/marian/src/tensors/gpu/algorithm.cu:54: cudaStreamSynchronize(0)
Aborted from void marian::gpu::fill(marian::Ptr<marian::Backend>, T*, T*, T) [with T = float; marian::Ptr<marian::Backend> = std::shared_ptr<marian::Backend>] in /users/tiedeman/projappl/install/marian/src/tensors/gpu/algorithm.cu:54
CUDA error 700 'an illegal memory access was encountered' - /users/tiedeman/projappl/install/marian/src/tensors/gpu/algorithm.cu:54: cudaStreamSynchronize(0)
Aborted from void marian::gpu::fill(marian::Ptr<marian::Backend>, T*, T*, T) [with T = float; marian::Ptr<marian::Backend> = std::shared_ptr<marian::Backend>] in /users/tiedeman/projappl/install/marian/src/tensors/gpu/algorithm.cu:54
CUDA error 700 'an illegal memory access was encountered' - /users/tiedeman/projappl/install/marian/src/tensors/gpu/algorithm.cu:54: cudaStreamSynchronize(0)
Aborted from void marian::gpu::fill(marian::Ptr<marian::Backend>, T*, T*, T) [with T = float; marian::Ptr<marian::Backend> = std::shared_ptr<marian::Backend>] in /users/tiedeman/projappl/install/marian/src/tensors/gpu/algorithm.cu:54

[CALL STACK]
[0x1b0e567]         void marian::gpu::  fill  <float>(std::shared_ptr<marian::Backend>,  float*,  float*,  float) + 0x627
[0x1389e3d]         void marian::TensorBase::  set  <float>(float)     + 0x35d
[0x157a7ca]                                                           
[0x157e463]         marian::inits::LambdaInit::  apply  (IntrusivePtr<marian::TensorBase>) + 0x33
[0x157443f]         marian::ConstantNode::  init  ()                   + 0x3f
[0x1565b2d]         marian::ExpressionGraph::  forward  (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&,  bool) + 0x5d
[0x15672f5]         marian::ExpressionGraph::  forwardNext  ()         + 0x2c5
[0x1734548]                                                           
[0x17cb2b4]         marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}::  operator()  () const + 0x54
[0x17cbdb0]         std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x20
[0x12e032b]         std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x1b
[0x7f694073d20b]                                                       + 0x620b
[0x17c0e38]         std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x108
[0x12e1d27]         std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x157
[0x504ab20]                                                           
[0x7f694073eea5]                                                       + 0x7ea5
[0x7f69401658cd]    clone                                              + 0x6d


[CALL STACK]
[0x1b0e567]         void marian::gpu::  fill  <float>(std::shared_ptr<marian::Backend>,  float*,  float*,  float) + 0x627
[0x1389e3d]         void marian::TensorBase::  set  <float>(float)     + 0x35d
[0x157a7ca]                                                           
[0x157e463]         marian::inits::LambdaInit::  apply  (IntrusivePtr<marian::TensorBase>) + 0x33
[0x157443f]         marian::ConstantNode::  init  ()                   + 0x3f
[0x1565b2d]         marian::ExpressionGraph::  forward  (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&,  bool) + 0x5d
[0x15672f5]         marian::ExpressionGraph::  forwardNext  ()         + 0x2c5
[0x1734548]                                                           
[0x17cb2b4]         marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}::  operator()  () const + 0x54
[0x17cbdb0]         std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x20
[0x12e032b]         std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x1b
[0x7f694073d20b]                                                       + 0x620b
[0x17c0e38]         std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x108
[0x12e1d27]         std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x157
[0x504ab20]                                                           
[0x7f694073eea5]                                                       + 0x7ea5
[0x7f69401658cd]    clone                                              + 0x6d


[CALL STACK]
[0x1b0e567]         void marian::gpu::  fill  <float>(std::shared_ptr<marian::Backend>,  float*,  float*,  float) + 0x627
[0x1389e3d]         void marian::TensorBase::  set  <float>(float)     + 0x35d
[0x157a7ca]                                                           
[0x157e463]         marian::inits::LambdaInit::  apply  (IntrusivePtr<marian::TensorBase>) + 0x33
[0x157443f]         marian::ConstantNode::  init  ()                   + 0x3f
[0x1565b2d]         marian::ExpressionGraph::  forward  (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&,  bool) + 0x5d
[0x15672f5]         marian::ExpressionGraph::  forwardNext  ()         + 0x2c5
[0x1734548]                                                           
[0x17cb2b4]         marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}::  operator()  () const + 0x54
[0x17cbdb0]         std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x20
[0x12e032b]         std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x1b
[0x7f694073d20b]                                                       + 0x620b
[0x17c0e38]         std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x108
[0x12e1d27]         std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x157
[0x504ab20]                                                           
[0x7f694073eea5]                                                       + 0x7ea5
[0x7f69401658cd]    clone                                              + 0x6d


[CALL STACK]
[0x1afe617]         void  CudaCopy  <unsigned int>(unsigned int const*,  unsigned int const*,  unsigned int*) + 0x3f7
[0x1aff02e]         void marian::gpu::  copy  <unsigned int>(std::shared_ptr<marian::Backend>,  unsigned int const*,  unsigned int const*,  unsigned int*) + 0x45e
[0x1587e63]         void marian::TensorBase::  set  <unsigned int>(unsigned int const*,  unsigned int const*) + 0x6f3
[0x15880f0]         std::_Function_handler<void (IntrusivePtr<marian::TensorBase>),marian::inits::fromVector<unsigned int>(std::vector<unsigned int,std::allocator<unsigned int>> const&)::{lambda(IntrusivePtr<marian::TensorBase>)#1}>::  _M_invoke  (std::_Any_data const&,  IntrusivePtr<marian::TensorBase>&&) + 0x20
[0x1581137]         marian::inits::LambdaInitConvert::  apply  (IntrusivePtr<marian::TensorBase>) + 0x67
[0x157443f]         marian::ConstantNode::  init  ()                   + 0x3f
[0x1565b2d]         marian::ExpressionGraph::  forward  (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&,  bool) + 0x5d
[0x15672f5]         marian::ExpressionGraph::  forwardNext  ()         + 0x2c5
[0x1734548]                                                           
[0x17cb2b4]         marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1}::  operator()  () const + 0x54
[0x17cbdb0]         std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x20
[0x12e032b]         std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x1b
[0x7f694073d20b]                                                       + 0x620b
[0x17c0e38]         std::_Function_handler<void (),marian::ThreadPool::enqueue<std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&>(std::function<void (unsigned long,unsigned long,unsigned long)> const&,unsigned long&,unsigned long&,unsigned long&)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x108
[0x12e1d27]         std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x157
[0x504ab20]                                                           
[0x7f694073eea5]                                                       + 0x7ea5
[0x7f69401658cd]    clone                                              + 0x6d
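The log above contains four call stacks: three that start in marian::gpu::fill<float> and one that starts in CudaCopy<unsigned int>. When a log holds many such stacks, grouping them by their top frame gives a quick overview. The sketch below is a hypothetical helper (`top_frames` is not part of Marian) and assumes each stack is introduced by a `[CALL STACK]` line as in this log; the sample is abbreviated to the first two frames of three of the stacks.

```python
from collections import Counter

def top_frames(log_text):
    """Count call stacks by their first frame, ignoring the address column."""
    counts = Counter()
    # Everything before the first marker is ordinary log output.
    for block in log_text.split("[CALL STACK]")[1:]:
        for line in block.splitlines():
            line = line.strip()
            if line:
                # Drop the leading "[0x...]" address and collapse the padding.
                frame = " ".join(line.split()[1:])
                counts[frame] += 1
                break
    return counts

# Abbreviated sample: first two frames of three stacks from the log above.
sample_log = """
[CALL STACK]
[0x1b0e567]         void marian::gpu::  fill  <float>(std::shared_ptr<marian::Backend>,  float*,  float*,  float) + 0x627
[0x1389e3d]         void marian::TensorBase::  set  <float>(float)     + 0x35d

[CALL STACK]
[0x1b0e567]         void marian::gpu::  fill  <float>(std::shared_ptr<marian::Backend>,  float*,  float*,  float) + 0x627
[0x1389e3d]         void marian::TensorBase::  set  <float>(float)     + 0x35d

[CALL STACK]
[0x1afe617]         void  CudaCopy  <unsigned int>(unsigned int const*,  unsigned int const*,  unsigned int*) + 0x3f7
[0x1aff02e]         void marian::gpu::  copy  <unsigned int>(std::shared_ptr<marian::Backend>,  unsigned int const*,  unsigned int const*,  unsigned int*) + 0x45e
"""
counts = top_frames(sample_log)
for frame, n in counts.most_common():
    print(n, frame)
```

All four stacks in this report pass through marian::ConstantNode::init and marian::ExpressionGraph::forward, so the reporting functions are likely where an earlier fault surfaced rather than where it originated; grouping by top frame only summarizes where the errors were reported.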