dmlc / MXNet.jl

MXNet Julia Package - flexible and efficient deep learning in Julia

Compiler bus error during build

davidssmith opened this issue

I'm getting a compiler bus error during build, and I really don't know where to start debugging it. I have tried nuking the package directory and rebuilding, and I've tried a different compiler, but I got bus errors on both, so I'm thinking it's an MXNet issue.

g++ -std=c++11 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -O3 -DNDEBUG=1 -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/mshadow/ -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/dmlc-core/include -fPIC -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/nnvm/include -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/dlpack/include -Iinclude -funroll-loops -Wno-unused-variable -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -DINTERFACE64 -msse3 -I/opt/easybuild/software/Core/CUDA/8.0.61/include -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=0 -fopenmp -DMSHADOW_USE_CUDNN=1 -DMXNET_USE_LAPACK -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/cub -DMXNET_USE_LIBJPEG_TURBO=0 -MMD -c src/operator/contrib/multibox_detection.cc -o build/src/operator/contrib/multibox_detection.o
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/deformable_convolution.o] Error 4
make: *** Waiting for unfinished jobs....
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/dequantize.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/deformable_psroi_pooling.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/multibox_detection.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/count_sketch.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/ifft.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/ctc_loss.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/fft.o] Error 4
================================================[ ERROR: MXNet ]=================================================

LoadError: failed process: Process(`make -j8 USE_BLAS=openblas 'MSHADOW_LDFLAGS=-lm /gpfs22/home/dss/julia-d386e40c17/bin/../lib/julia/libopenblas64_.so'`, ProcessExited(2)) [2]
while loading /gpfs22/home/dss/.julia/v0.6/MXNet/deps/build.jl, in expression starting on line 81

=================================================================================================================

shell> g++ --version
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

gcc 4.4 is quite old.

> I've tried a different compiler

Which one have you tried?
gcc 5.x and 6.x work for me.

I don't have root, so I can only load modules on this cluster. I have gcc 5 in my path:

[dss@gpu0025 ~]$ which gcc
/opt/easybuild/software/Core/GCCcore/5.4.0/bin/gcc
[dss@gpu0025 ~]$ gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[dss@gpu0025 ~]$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

shell> gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

julia> ENV["PATH"]
"/opt/easybuild/software/Compiler/GCC/5.4.0-2.26/LLVM/3.9.0/bin:/opt/easybuild/software/Compiler/GCC/5.4.0-2.26/git/2.12.2/bin:/opt/easybuild/software/Compiler/GCC/5.4.0-2.26/Perl/5.24.0/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/gettext/0.19.8/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/ncurses/6.0/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/libxml2/2.9.4/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/XZ/5.2.2/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/expat/2.2.0/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/cURL/7.49.1/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/binutils/2.26/bin:/opt/easybuild/software/Core/GCCcore/5.4.0/bin:/opt/easybuild/software/Core/CUDA/8.0.61:/opt/easybuild/software/Core/CUDA/8.0.61/bin:/usr/scheduler/slurm/sbin:/usr/scheduler/slurm/bin:/usr/lpp/mmfs/bin:/usr/local/bin:/usr/local/common/bin:/usr/bin:/bin:/usr/scheduler/slurm/sbin:/usr/scheduler/slurm/bin:/usr/lpp/mmfs/bin:/usr/local/bin:/usr/local/common/bin:/usr/bin:/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/var/cfengine/bin:/home/dss/bin:/var/cfengine/bin"

But the build script is not finding it and is instead using /usr/bin/gcc, which is version 4.4.7.

How can I override the gcc used by the build?
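One way to see why a different gcc is picked up is to list every gcc on the search path in lookup order, together with its version. The following is a minimal bash sketch, not something from the build script; `list_on_path` is a hypothetical helper name, and running it both from a login shell and from Julia's shell mode may show where the two environments differ.

```shell
#!/usr/bin/env bash
# Sketch: show every executable with a given name on $PATH, in lookup order.
# list_on_path is a hypothetical helper, not part of MXNet or Julia.

list_on_path() {
    local name=$1 dir
    local IFS=:                      # split $PATH on colons (bash word splitting)
    for dir in $PATH; do
        # keep regular executables only, skip directories with the same name
        [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ] && printf '%s\n' "$dir/$name"
    done
    return 0
}

# Print each gcc found and the first line of its version banner.
list_on_path gcc | while IFS= read -r compiler; do
    printf '%s: %s\n' "$compiler" "$("$compiler" --version 2>/dev/null | head -n 1)"
done
```

The first line printed is the compiler a plain `gcc` call would resolve to in that shell.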

You can change CC and CXX in this file:

~/.julia/v0.6/MXNet/deps/src/mxnet/make/config.mk

Try setting them to /opt/easybuild/software/Core/GCCcore/5.4.0/bin/gcc and /opt/easybuild/software/Core/GCCcore/5.4.0/bin/g++, respectively.

I will add a patch later so that users can configure this from Julia's REPL.
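The edit described above could be scripted roughly as follows. This is a sketch under assumptions: it assumes config.mk declares the compilers on lines of the form `export CC = gcc`, and `set_make_var` and the `MXNET_CONFIG` variable are hypothetical names introduced only for this example.

```shell
#!/usr/bin/env bash
# Sketch: point the MXNet build at a specific compiler by editing make/config.mk.
# Assumption: the file contains lines such as "export CC = gcc" and "export CXX = g++".

# set_make_var FILE NAME VALUE  (hypothetical helper, not part of MXNet)
# Rewrites an existing "NAME = ..." or "export NAME = ..." line, or appends one.
set_make_var() {
    local file=$1 name=$2 value=$3
    if grep -Eq "^(export[[:space:]]+)?${name}[[:space:]]*=" "$file"; then
        # keep the optional "export " prefix (\1) and replace the value; a .bak copy is left behind
        sed -E -i.bak "s|^(export[[:space:]]+)?${name}[[:space:]]*=.*|\1${name} = ${value}|" "$file"
    else
        printf 'export %s = %s\n' "$name" "$value" >> "$file"
    fi
}

# MXNET_CONFIG is a hypothetical override; the default is the path named in this thread.
config="${MXNET_CONFIG:-$HOME/.julia/v0.6/MXNet/deps/src/mxnet/make/config.mk}"
gcc_bin=/opt/easybuild/software/Core/GCCcore/5.4.0/bin

if [ -f "$config" ]; then
    set_make_var "$config" CC  "$gcc_bin/gcc"
    set_make_var "$config" CXX "$gcc_bin/g++"
    # show the resulting compiler lines
    grep -E '^(export[[:space:]]+)?(CC|CXX)[[:space:]]*=' "$config"
fi
```

The script does nothing if the file is missing, and the original is kept as config.mk.bak so the change is easy to undo.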

It compiles now, but when I start Julia and run using MXNet at the REPL, the REPL crashes. I'm looking into it to make sure I applied the change correctly.

Here is the beginning of the error message, just in case it helps.

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using MXNet

signal (11): Segmentation fault
while loading no file, in expression starting on line 0
free at /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (unknown line)
_ZN5mxnet2op12OperatorTuneIlE8demangleB5cxx11EPKc at /home/dss/.julia/v0.6/MXNet/deps/usr/lib/libmxnet.so (unknown line)
unknown function (ip: 0x7f18050c58bd)
unknown function (ip: 0x7f184d9f9ad9)
unknown function (ip: 0x7f184d9f9bea)
unknown function (ip: 0x7f184d9febf5)
_dl_catch_error at /build/glibc-itYbWN/glibc-2.26/elf/dl-error-skeleton.c:198
unknown function (ip: 0x7f184d9fe148)
dlopen_doit at /build/glibc-itYbWN/glibc-2.26/dlfcn/dlopen.c:66
_dl_catch_error at /build/glibc-itYbWN/glibc-2.26/elf/dl-error-skeleton.c:198
_dlerror_run at /build/glibc-itYbWN/glibc-2.26/dlfcn/dlerror.c:163
__dlopen at /build/glibc-itYbWN/glibc-2.26/dlfcn/dlopen.c:87
jl_load_dynamic_library_ at /buildworker/worker/package_linux64/build/src/dlload.c:189
jl_get_library at /buildworker/worker/package_linux64/build/src/runtime_ccall.cpp:159
emit_a_ccall at /buildworker/worker/package_linux64/build/src/ccall.cpp:2074
emit_ccall at /buildworker/worker/package_linux64/build/src/ccall.cpp:1899
emit_expr at /buildworker/worker/package_linux64/build/src/codegen.cpp:4156
emit_assignment at /buildworker/worker/package_linux64/build/src/codegen.cpp:3853 [inlined]
emit_expr at /buildworker/worker/package_linux64/build/src/codegen.cpp:4159
emit_stmtpos at /buildworker/worker/package_linux64/build/src/codegen.cpp:4064 [inlined]
emit_function at /buildworker/worker/package_linux64/build/src/codegen.cpp:6248
jl_compile_linfo at /buildworker/worker/package_linux64/build/src/codegen.cpp:1256
emit_invoke at /buildworker/worker/package_linux64/build/src/codegen.cpp:3400 [inlined]
emit_expr at /buildworker/worker/package_linux64/build/src/codegen.cpp:4135
emit_stmtpos at /buildworker/worker/package_linux64/build/src/codegen.cpp:4064 [inlined]
emit_function at /buildworker/worker/package_linux64/build/src/codegen.cpp:6248
jl_compile_linfo at /buildworker/worker/package_linux64/build/src/codegen.cpp:1256

Oh, jemalloc. I ran into a similar issue on Arch Linux.
You can try disabling it in ~/.julia/v0.6/MXNet/deps/src/mxnet/make/config.mk
(note that it's not ~/.julia/v0.6/MXNet/deps/src/mxnet/config.mk; that file will be overwritten by build.jl).

Set USE_JEMALLOC to 0.
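A rough sketch of that change, plus a check that a rebuilt library no longer depends on jemalloc, might look like this. It assumes config.mk contains a line such as `USE_JEMALLOC = 1`; `disable_jemalloc`, `links_jemalloc`, and the `MXNET_CONFIG` and `MXNET_LIB` variables are hypothetical names used only here. The library path is the one that appears in the backtrace above.

```shell
#!/usr/bin/env bash
# Sketch: turn off jemalloc in make/config.mk and check the built library.
# Assumption: the file has a line of the form "USE_JEMALLOC = 1".

# disable_jemalloc FILE  (hypothetical helper): set USE_JEMALLOC to 0, keeping a .bak copy
disable_jemalloc() {
    sed -E -i.bak 's/^([[:space:]]*USE_JEMALLOC[[:space:]]*=).*/\1 0/' "$1"
}

# links_jemalloc LIB  (hypothetical helper): succeeds if ldd lists jemalloc as a dependency
links_jemalloc() {
    ldd "$1" 2>/dev/null | grep -q jemalloc
}

# Hypothetical overrides; the defaults are the paths that appear in this thread.
config="${MXNET_CONFIG:-$HOME/.julia/v0.6/MXNet/deps/src/mxnet/make/config.mk}"
lib="${MXNET_LIB:-$HOME/.julia/v0.6/MXNet/deps/usr/lib/libmxnet.so}"

if [ -f "$config" ]; then
    disable_jemalloc "$config"
    grep -E '^[[:space:]]*USE_JEMALLOC' "$config"
fi

if [ -f "$lib" ]; then
    if links_jemalloc "$lib"; then
        echo "libmxnet.so still links against jemalloc; a rebuild is needed"
    else
        echo "libmxnet.so does not link against jemalloc"
    fi
fi
```

The library only reflects the new setting after the package is rebuilt, so the ldd check is meant to be run again after the build finishes.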

Success! I wasn't able to change that file without git complaining that I need to stash, so I uninstalled jemalloc and was able to compile. Now all but one test passes. I doubt it is related, but I'm including the error message in case it is.

SymbolicNode Test: Error During Test                                  
  Got an exception of type MXNet.mx.MXError outside of a @test        
  Cannot find argument 'a', Possible Arguments:                       
  ----------------                                                    
  kernel : Shape(tuple), required                                     
      Convolution kernel size: (w,), (h, w) or (d, h, w)              
  stride : Shape(tuple), optional, default=[]                         
      Convolution stride: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
  dilate : Shape(tuple), optional, default=[]                         
      Convolution dilate: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
  pad : Shape(tuple), optional, default=[]                            
      Zero pad for convolution: (w,), (h, w) or (d, h, w). Defaults to no padding.
  num_filter : int (non-negative), required                           
      Convolution filter(channel) number                              
  num_group : int (non-negative), optional, default=1                 
      Number of group partitions.                                     
  workspace : long (non-negative), optional, default=1024             
      Maximum temporary workspace allowed (MB) in convolution.This parameter has two usages. When CUDNN is not used, it determines the effective batch size of the convolution kernel. When CUDNN is used, it controls the maximum temporary storage used for tuning the best CUDNN kernel when `limited_workspace` strategy is used.
  no_bias : boolean, optional, default=0
      Whether to disable bias parameter.
  cudnn_tune : {None, 'fastest', 'limited_workspace', 'off'},optional, default='None'
      Whether to pick convolution algo by running performance test.
  cudnn_off : boolean, optional, default=0
      Turn off cudnn for this layer.
  layout : {None, 'NCDHW', 'NCHW', 'NCW', 'NDHWC', 'NHWC'},optional, default='None'
      Set layout for input, output and weight. Empty for
      default layout: NCW for 1d, NCHW for 2d and NCDHW for 3d.
  , in operator Convolution(name="", a="a", kernel="(1, 1)", num_filter="1")
  Stacktrace:                      
   [1] macro expansion at /home/dss/.julia/v0.6/MXNet/src/base.jl:77 [inlined]
   [2] set_attr(::MXNet.mx.SymbolicNode, ::Symbol, ::String) at /home/dss/.julia/v0.6/MXNet/src/symbolic-node.jl:232
   [3] _create_atomic_symbol at /home/dss/.julia/v0.6/MXNet/src/symbolic-node.jl:825 [inlined]
   [4] #Convolution#5492(::Array{Any,1}, ::Function, ::Type{MXNet.mx.SymbolicNode}, ::MXNet.mx.SymbolicNode, ::Vararg{MXNet.mx.SymbolicNode,N} where N) at /home/dss/.julia/v0.6/MXNet/src/symbolic-node.jl:903
   [5] (::MXNet.mx.#kw##Convolution)(::Array{Any,1}, ::MXNet.mx.#Convolution, ::Type{MXNet.mx.SymbolicNode}, ::MXNet.mx.SymbolicNode, ::Vararg{MXNet.mx.SymbolicNode,N} where N) at ./<missing>:0
   [6] #Convolution#5496(::Array{Any,1}, ::Function, ::MXNet.mx.SymbolicNode, ::Vararg{MXNet.mx.SymbolicNode,N} where N) at /home/dss/.julia/v0.6/MXNet/src/symbolic-node.jl:924
   [7] (::MXNet.mx.#kw##Convolution)(::Array{Any,1}, ::MXNet.mx.#Convolution, ::MXNet.mx.SymbolicNode) at ./<missing>:0
   [8] test_attrs() at /home/dss/.julia/v0.6/MXNet/test/unittest/symbolic-node.jl:140
   [9] macro expansion at /home/dss/.julia/v0.6/MXNet/test/unittest/symbolic-node.jl:535 [inlined]
   [10] macro expansion at ./test.jl:860 [inlined]
   [11] anonymous at ./<missing>:?
   [12] include_from_node1(::String) at ./loading.jl:576
   [13] include(::String) at ./sysimg.jl:14
   [14] collect_to!(::Array{Module,1}, ::Base.Generator{Array{String,1},##3#6{String}}, ::Int64, ::Int64) at ./array.jl:508
   [15] _collect(::Array{String,1}, ::Base.Generator{Array{String,1},##3#6{String}}, ::Base.EltypeUnknown, ::Base.HasShape) at ./array.jl:489
   [16] test_dir(::String) at /home/dss/.julia/v0.6/MXNet/test/runtests.jl:9
   [17] macro expansion at /home/dss/.julia/v0.6/MXNet/test/runtests.jl:18 [inlined]
   [18] macro expansion at ./test.jl:860 [inlined]
   [19] anonymous at ./<missing>:?
   [20] include_from_node1(::String) at ./loading.jl:576
   [21] include(::String) at ./sysimg.jl:14

That error comes from recent upstream changes.
See apache/mxnet#9677 (comment).