SymbioticLab / Salus

Fine-grained GPU sharing primitives

Segmentation fault inside GPU computation

Aetf opened this issue · comments

commented

On the donot-clone-this branch.

The executor segfaults inside the Multiply kernel (this is the cwise multiplication).

Steps to reproduce

  1. Launch the executor
  2. Run the unit test: test_ops_tf.py TestBasicOps.test_multiply

Expected

The test passes.

Actual

Segmentation fault inside the computation kernel.

Analysis

The log shows that both inputs are correctly allocated on the GPU (their addresses have a different prefix than the CPU allocations). The error is raised on the CPU side, so it is probably not inside the GPU code.

However, the segfault happens deep inside the computation, in the Eigen code used by TensorFlow. It may be due to passing a complex object across the dynamic library boundary, with some fields of the object zeroed out or containing garbage values. This needs more investigation.

  • Add a stack dump here
  • Test if any other kernels have a similar problem
  • Implement a simple CUDA kernel and try to run that. See if that works.

Update

Other kernels like matmul and conv2d work well, so the problem is within the multiply kernel.

When I force CPU-only execution (snippet below), I also get the segmentation fault. Does it work on your side?

ExecutionEngine::schedule(ITask *t) {
    trySchedule(t, DeviceType::CPU);
}

commented

Yes, it works on my side. Could you try the latest master? I added an environment variable to control the scheduling behavior: EXEC_SCHED_USE_GPU, see #6. I also pushed a few commits to tensorflow-rpcdev, so remember to update that, too.
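For what it's worth, the toggle is just read from the environment at scheduling time; a minimal sketch of the idea (the enum and function name here are illustrative, not the executor's actual code; the real wiring is in #6):

#include <cstdlib>
#include <cstring>

// Illustrative sketch only; the real logic lives in the executor (see #6).
// DeviceType here stands in for the executor's own device enum.
enum class DeviceType { CPU, GPU };

DeviceType preferredDevice()
{
    const char *v = std::getenv("EXEC_SCHED_USE_GPU");
    // Unset, "0", or "false" selects CPU; anything else selects GPU.
    if (!v || std::strcmp(v, "0") == 0 || std::strcmp(v, "false") == 0)
        return DeviceType::CPU;
    return DeviceType::GPU;
}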

Regarding your email:

At TfSession.cpp:226
It should be if (m_graph==NULL){} instead of if(!m_graph) {}

m_graph is a std::unique_ptr; converting it to bool, as in if (m_graph) or if (!m_graph), has the same effect as checking whether it is empty. See the documentation. I use this style all over the place in my code, so it is unlikely to be the problem.
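A standalone illustration of that equivalence (not the executor code, just standard library behaviour):

#include <cassert>
#include <memory>

int main()
{
    std::unique_ptr<int> p;   // empty, owns nothing
    assert(!p);               // contextual conversion to bool yields false
    assert(p == nullptr);     // the same check, spelled out explicitly

    p.reset(new int(42));
    assert(p);                // now true
    assert(p != nullptr);
    return 0;
}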

I'm not in my office currently. I'll attach a stack trace of the crash tomorrow. It happens later, after the initialization.

PS: in C++11 and later, nullptr is preferred over NULL to represent a null pointer; there are some informative SO questions on this. Using nullptr is also consistent with the rest of the code base.
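The usual example from those questions is overload resolution, where NULL (an integer constant) and nullptr behave differently; a minimal sketch:

#include <iostream>

void f(int)   { std::cout << "f(int)\n"; }
void f(int *) { std::cout << "f(int*)\n"; }

int main()
{
    f(0);        // calls f(int)
    // f(NULL);  // ambiguous or calls f(int), depending on how NULL is defined
    f(nullptr);  // unambiguously calls f(int*)
    return 0;
}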

commented

Here's the stack trace for the first few frames of the crash. As you can see in frame 5, the crash happens exactly when device->Compute() is called, here.

At this point, initialization is already done; it was performed while handling an executor.RunGraphRequest received earlier.

#0  0x00003fffb06eb26c in std::_Function_handler<void (long, long), Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<int, int>, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const> const> const, Eigen::ThreadPoolDevice, true>::run(Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<int, int>, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const> const> const&, Eigen::ThreadPoolDevice const&)::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&) () from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#1  0x00003fffaf03a1d8 in Eigen::ThreadPoolDevice::parallelFor(long, Eigen::TensorOpCost const&, std::function<long (long)>, std::function<void (long, long)>) const () from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#2  0x00003fffb06fada0 in tensorflow::BinaryOp<Eigen::ThreadPoolDevice, tensorflow::functor::mul<int> >::Compute(tensorflow::OpKernelContext*) ()
   from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#3  0x00003fffb0e62e4c in tensorflow::BaseGPUDevice::ComputeHelper (this=0x3fff718d3000, op_kernel=0x3bffb00023b0, context=0x3bffb0002b40)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:389
#4  0x00003fffb0e63298 in tensorflow::BaseGPUDevice::Compute (this=0x3fff718d3000, op_kernel=0x3bffb00023b0, context=0x3bffb0002b40)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:331
#5  0x0000000010121acc in TFRunTask::run (this=0x10b9c1b0) at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/oplibraries/tfoplibrary.cpp:272
#6  0x000000001009318c in ITask::run<executor::RunResponse> (this=0x10b9c1b0)
    at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/oplibraries/ioplibrary.h:54
#7  0x000000001008a270 in q::promise<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> > > ExecutionEngine::enqueue<executor::RunResponse>(std::unique_ptr<ITask, std::default_delete<ITask> >&&)::{lambda(auto:1, auto:2)#1}::operator()<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> >, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, {lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()()::{lambda(std::default_delete<executor::RunResponse>&&)#1}, q::remove_rvalue_reference<std::default_delete<executor::RunResponse> >::type {lambda(auto:1, auto:2)#1}::operator()<std::default_delete<executor::RunResponse>, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const, std::shared_ptr<q::queue> const&)::{lambda(auto:1, auto:2)#1}&&::operator()()::{lambda(auto:1)#2}> (__closure=0x3fff7fffe318, resolve=..., reject=...)
    at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/execution/executionengine.h:60
commented

I'm attaching the full log here.

mulcrash.tar.gz

On my side, the segmentation fault does not happen after I changed that line.
Perhaps we should create a file that records the configuration, dependencies, and compiler versions for both of us, so we can keep track of this bug.

Btw, can you share how you log the stack trace? My log just stops at the segmentation fault.

commented

I just attach gdb when starting the executor: gdb executor.

Then when it crashes, it will drop you to the gdb console, where you can type

(gdb) bt

to get the full trace. I'm curious to see the stack trace on your side; it seems like a different crash.
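In full, a session looks roughly like this (thread apply all bt dumps every thread, which can help when the crash is in a worker thread):

$ gdb executor
(gdb) run
... the executor runs until the crash drops you back to the prompt ...
(gdb) bt
(gdb) thread apply all bt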

I'm compiling tensorflow and the executor using gcc 5.4.0, with the latest versions of the dependencies (including Boost 1.64.0).

I got a different log. It seems my GDB does not work well, even though I'm using the latest version, GDB 8.0.

[2017-07-06 22:29:15.455] [5027] [console] [T] ==============================================================
[2017-07-06 22:29:15.455] [5027] [console] [T] Received identity frame 0: zmq::message_t(len=5, data='006B8B4567')
[2017-07-06 22:29:15.455] [5027] [console] [T] Received identity frame 1: zmq::message_t(len=0, data='')
[2017-07-06 22:29:15.455] [5027] [console] [T] Received evenlop frame: zmq::message_t(len=56, data='0A136578...696F6E30')
[2017-07-06 22:29:15.455] [5027] [console] [T] Received body frame: zmq::message_t(len=47, data='0A1C1207...746F7230')
[2017-07-06 22:29:15.455] [5027] [console] [D] Received request evenlop: EvenlopDef(type='executor.RunRequest', seq=2, recvId='74656E73...3A010101')
[2017-07-06 22:29:15.455] [5027] [console] [D] Received request body byte array size 47
[2017-07-06 22:29:15.455] [5027] [console] [I] Serving executor.RunRequest for oplibrary TENSORFLOW
[2017-07-06 22:29:15.455] [5027] [console] [I] Serving RunRequest with opkernel id _SOURCE
[2017-07-06 22:29:15.455] [5027] [console] [T] Blocking pool on pollin events

Thread 7 "executionengine" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffcebfa700 (LWP 5058)]
Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
0x0000000000483bec in ZmqServer::SenderImpl::sequenceNumber (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x0) at /home/tanle/projects/executor/src/rpcserver/zmqserver.cpp:330
330 return m_seq;
(gdb) bt
Python Exception <type 'exceptions.ImportError'> No module named gdb.frames:
#0 0x0000000000483bec in ZmqServer::SenderImpl::sequenceNumber (this=0x0) at /home/tanle/projects/executor/src/rpcserver/zmqserver.cpp:330
#1 0x00000000004b2892 in TFRunTask::prepare (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x10c0090, dev=...) at /home/tanle/projects/executor/src/oplibraries/tfoplibrary.cpp:235
#2 0x00000000004431c6 in ExecutionEngine::trySchedule (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x7e4f70 ExecutionEngine::instance()::eng, t=0x10c0090, dev=...)
at /home/tanle/projects/executor/src/execution/executionengine.cpp:74
#3 0x000000000044307e in ExecutionEngine::schedule (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x7e4f70 ExecutionEngine::instance()::eng, t=0x10c0090)
at /home/tanle/projects/executor/src/execution/executionengine.cpp:58
#4 0x000000000045ec06 in q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}::operator()<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, {lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()()::{lambda(std::default_deleteexecutor::RunResponse&&)#1}, q::remove_rvalue_reference<std::default_deleteexecutor::RunResponse >::type {lambda(auto:1, auto:2)#1}::operator()<std::default_deleteexecutor::RunResponse, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const, std::shared_ptr<q::queue> const&)::{lambda(auto:1, auto:2)#1}&&::operator()()::{lambda(auto:1)#2}> (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
__closure=0x7fffcebf9cd8,
resolve=..., reject=...) at /home/tanle/projects/executor/src/execution/executionengine.h:54
#5 0x000000000045e944 in q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()() (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
__closure=0x7fffcebf9cc8) at /usr/local/include/q/promise/make.hpp:251
#6 0x000000000047e2ac in q::detail::specific_function<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}&&)::{lambda()#1}, void (), false, void>::operator()() (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x7fffcebf9cc0)
at /usr/local/include/q/function.hpp:176
#7 0x000000000051f9ed in q::detail::any_function<void (), std::integral_constant<bool, false>, std::integral_constant<unsigned long, 128ul>, void>:---Type to continue, or q to quit---
:operator()() (this=0x7fffcebf9cc0) at /home/tanle/projects/q/libs/q/include/q/function.hpp:820
#8 q::threadpool::<lambda()>::<lambda(q::task&&)>::operator() (__closure=, elem=...)
at /home/tanle/projects/q/libs/q/src/threadpool.cpp:153
#9 q::threadpool::<lambda()>::operator() (__closure=, this=, this=)
at /home/tanle/projects/q/libs/q/src/threadpool.cpp:207
#10 q::call_with_args_by_tuple<q::threadpool::start()::<lambda()> > (fn=...) at /home/tanle/projects/q/libs/q/include/q/functional.hpp:821
#11 q::thread::<lambda()>::<lambda()>::operator() (__closure=) at /home/tanle/projects/q/libs/q/include/q/thread.hpp:158
#12 q::call_with_args_by_fun<q::expect<void, true, true> (&)(), q::thread::run(Fn&&, Args&& ...)::<lambda()> mutable [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void]::<lambda()>&> (inner_fn=..., fn=)
at /home/tanle/projects/q/libs/q/include/q/functional.hpp:859
#13 q::thread::<lambda()>::operator() (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
__closure=) at /home/tanle/projects/q/libs/q/include/q/thread.hpp:162
#14 std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()>::_M_invoke<> (this=)
at /usr/include/c++/5/functional:1531
#15 std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()>::operator() (this=)
at /usr/include/c++/5/functional:1520
#16 std::thread::_Impl<std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()> >::_M_run(void) (
this=) at /usr/include/c++/5/thread:115
#17 0x00007fffec2f54a0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
#18 0x00007fffec5ca6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
#19 0x00007fffeba563dd in clone () from /lib/x86_64-linux-gnu/libc.so.6

commented

There must be something wrong with your gdb. Maybe you should reinstall it.

The crash you got is due to a nullptr dereference: in frame 0, ZmqServer::SenderImpl::sequenceNumber (this=0x0), called from frame 1 at /home/tanle/projects/executor/src/oplibraries/tfoplibrary.cpp:235. However, this is unrelated to the computation and should never happen. The sender object is created here as a std::shared_ptr and passed into TFRunTask. Did you modify anything under rpcserver or tfoplibrary.cpp?

I haven't modified those files.

New log

[2017-07-06 22:50:54.328] [15562] [console] [T] ==============================================================
[2017-07-06 22:50:54.328] [15562] [console] [T] Received identity frame 0: zmq::message_t(len=5, data='006B8B4567')
[2017-07-06 22:50:54.328] [15562] [console] [T] Received identity frame 1: zmq::message_t(len=0, data='')
[2017-07-06 22:50:54.328] [15562] [console] [T] Received evenlop frame: zmq::message_t(len=56, data='0A136578...696F6E30')
[2017-07-06 22:50:54.328] [15562] [console] [T] Received body frame: zmq::message_t(len=47, data='0A1C1207...746F7230')
[2017-07-06 22:50:54.328] [15562] [console] [D] Received request evenlop: EvenlopDef(type='executor.RunRequest', seq=2, recvId='74656E73...3A000100')
[2017-07-06 22:50:54.328] [15562] [console] [D] Received request body byte array size 47
[2017-07-06 22:50:54.328] [15562] [console] [I] Serving executor.RunRequest for oplibrary TENSORFLOW
[2017-07-06 22:50:54.328] [15562] [console] [I] Serving RunRequest with opkernel id _SOURCE
[2017-07-06 22:50:54.328] [15562] [console] [T] Blocking pool on pollin events

Thread 8 "executionengine" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffce3f9700 (LWP 15594)]
0x0000000000483bec in ZmqServer::SenderImpl::sequenceNumber (this=0x0) at /home/tanle/projects/executor/src/rpcserver/zmqserver.cpp:330
330 return m_seq;
(gdb) bt
#0 0x0000000000483bec in ZmqServer::SenderImpl::sequenceNumber (this=0x0) at /home/tanle/projects/executor/src/rpcserver/zmqserver.cpp:330
#1 0x00000000004b2892 in TFRunTask::prepare (this=0x10872c0, dev=...) at /home/tanle/projects/executor/src/oplibraries/tfoplibrary.cpp:235
#2 0x00000000004431c6 in ExecutionEngine::trySchedule (this=0x7e4f70 ExecutionEngine::instance()::eng, t=0x10872c0, dev=...)
at /home/tanle/projects/executor/src/execution/executionengine.cpp:74
#3 0x000000000044307e in ExecutionEngine::schedule (this=0x7e4f70 ExecutionEngine::instance()::eng, t=0x10872c0)
at /home/tanle/projects/executor/src/execution/executionengine.cpp:58
#4 0x000000000045ec06 in q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}::operator()<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, {lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()()::{lambda(std::default_deleteexecutor::RunResponse&&)#1}, q::remove_rvalue_reference<std::default_deleteexecutor::RunResponse >::type {lambda(auto:1, auto:2)#1}::operator()<std::default_deleteexecutor::RunResponse, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const, std::shared_ptr<q::queue> const&)::{lambda(auto:1, auto:2)#1}&&::operator()()::{lambda(auto:1)#2}> (__closure=0x7fffce3f8cd8,
resolve=..., reject=...) at /home/tanle/projects/executor/src/execution/executionengine.h:54
#5 0x000000000045e944 in q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()() (__closure=0x7fffce3f8cc8) at /usr/local/include/q/promise/make.hpp:251
#6 0x000000000047e2ac in q::detail::specific_function<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}&&)::{lambda()#1}, void (), false, void>::operator()() (this=0x7fffce3f8cc0)
at /usr/local/include/q/function.hpp:176
#7 0x000000000051f9ed in q::detail::any_function<void (), std::integral_constant<bool, false>, std::integral_constant<unsigned long, 128ul>, void>::operator()() (this=0x7fffce3f8cc0) at /home/tanle/projects/q/libs/q/include/q/function.hpp:820
#8 q::threadpool::<lambda()>::<lambda(q::task&&)>::operator() (__closure=, elem=...)
---Type to continue, or q to quit---
at /home/tanle/projects/q/libs/q/src/threadpool.cpp:153
#9 q::threadpool::<lambda()>::operator() (__closure=, this=, this=)
at /home/tanle/projects/q/libs/q/src/threadpool.cpp:207
#10 q::call_with_args_by_tuple<q::threadpool::start()::<lambda()> > (fn=...) at /home/tanle/projects/q/libs/q/include/q/functional.hpp:821
#11 q::thread::<lambda()>::<lambda()>::operator() (__closure=) at /home/tanle/projects/q/libs/q/include/q/thread.hpp:158
#12 q::call_with_args_by_fun<q::expect<void, true, true> (&)(), q::thread::run(Fn&&, Args&& ...)::<lambda()> mutable [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void]::<lambda()>&> (inner_fn=..., fn=)
at /home/tanle/projects/q/libs/q/include/q/functional.hpp:859
#13 q::thread::<lambda()>::operator() (__closure=) at /home/tanle/projects/q/libs/q/include/q/thread.hpp:162
#14 std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()>::_M_invoke<> (this=)
at /usr/include/c++/5/functional:1531
#15 std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()>::operator() (this=)
at /usr/include/c++/5/functional:1520
#16 std::thread::_Impl<std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()> >::_M_run(void) (
this=) at /usr/include/c++/5/thread:115
#17 0x00007fffec2f54a0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007fffec5ca6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x00007fffeba563dd in clone () from /lib/x86_64-linux-gnu/libc.so.6

commented

Okay, so gdb is fixed now. But the stack trace still shows the same nullptr error.

Could you post the full log as a file?

Here is the log.

issue#1_log.tar.gz

commented

Try make clean and make again. I really can't see what the problem would be.

If you still get the same stack trace, see if you can add some logging to print out the content of m_sender in TFRunTask, basically to check whether it's empty in TFRunTask::TFRunTask. Also check sender in TFOpLibrary::createRunTask. A rough sketch of such a check is below.
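Something like this throwaway helper would do (checkSender is a made-up name; call it from TFRunTask::TFRunTask and TFOpLibrary::createRunTask and watch the output):

#include <iostream>
#include <memory>

// Log whether a shared_ptr is empty at a given call site, plus its use count.
// Intended as temporary debugging output only.
template <typename T>
void checkSender(const char *where, const std::shared_ptr<T> &p)
{
    std::cerr << where << ": sender is " << (p ? "set" : "EMPTY")
              << " (use_count=" << p.use_count() << ")" << std::endl;
}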

I could not find any problem with return m_seq either.

BTW, the "NULL" change is not the right fix after all. I made a mistake when running the test; the TF job stopped before the segfault happened.

commented

return m_seq is fine. The problem is that SenderImpl::sequenceNumber is called through a nullptr (this=0x0, as shown in the gdb stack trace); the question is how that could happen.
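For reference, calling a non-virtual member function through a null pointer is undefined behaviour, but in practice it usually only faults when a member is actually read, which is exactly the return m_seq line in the trace. A standalone illustration (it deliberately crashes; for explanation only):

#include <cstdint>

struct SenderImpl {
    uint64_t sequenceNumber() const { return m_seq; }  // the member load is what faults
    uint64_t m_seq = 0;
};

int main()
{
    SenderImpl *s = nullptr;
    // Undefined behaviour: entering the function often "works" because there is
    // no virtual dispatch, but reading m_seq dereferences the null this pointer
    // and typically raises SIGSEGV, matching "sequenceNumber (this=0x0)" in gdb.
    return static_cast<int>(s->sequenceNumber());
}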

commented

And oddly enough, the same code runs well on my laptop and on the cluster I'm using.

After I cleaned, I get the same log: the segmentation fault at the same place.
These kinds of bugs can happen in C/C++. Btw, can we use the same GCC version for both TF and the executor, since the executor links against the TF libraries?

Should I try the latest code on the master branch?

commented

Yes, of course. You weren't using the latest code?

No, I used master as of Wednesday.

commented

The latest commit is 58ed43f.

As for the compiler, the master branch can be compiled with gcc 5.4.0, so you can use it to compile both TF and the executor.

And if the crash still occurs, do try printing things out like I said, to check when m_sender became empty.

I cannot compile the executor with either 58ed43f or the latest version on master.

/usr/include/c++/5/bits/range_access.h:68:5: error: ‘const class std::unordered_map<executor::OpLibraryType, std::unique_ptr >’ has no member named ‘end’

commented

58ed43f is the latest version.

Did you regenerate the CMake configuration when you changed the compiler?
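In case it helps, the safest way is to wipe the build directory and reconfigure with the compiler pinned explicitly (compiler names and paths are illustrative; adjust to your setup):

rm -rf build && mkdir build && cd build
cmake -DCMAKE_C_COMPILER=gcc-5 -DCMAKE_CXX_COMPILER=g++-5 ..
make -j$(nproc)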

commented

Anyway, I checked other kernels (matmul and conv2d) and they work well, so the problem must be within the multiply kernel.

I deleted the build folder and rebuilt it; it compiles with gcc 5.4.
However, I still get the same segmentation fault as before. Perhaps I will find the bug on my side.

commented

I've found the reason!

#3  0x00003fffb0522274 in tensorflow::BinaryOp<Eigen::ThreadPoolDevice, tensorflow::functor::add<int> >::Compute(tensorflow::OpKernelContext*) ()
   from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#4  0x00003fffb0e42e4c in tensorflow::BaseGPUDevice::ComputeHelper (this=0x3fff718d3010, op_kernel=0x3bffdc000bb0, context=0x3bffdc001740)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:389

In frame 3, the kernel instantiated should be a GPU kernel, i.e. tensorflow::BinaryOp<Eigen::GPUDevice, tensorflow::functor::add<int>>, but a CPU kernel was instantiated instead, hence the crash.

I'm checking the kernel creation logic to see what's going on there.

commented

Here is an additional stack trace, for multiply; it is basically the same thing, which confirms my reasoning above.

#2  0x00003fffb06dada0 in tensorflow::BinaryOp<Eigen::ThreadPoolDevice, tensorflow::functor::mul<int> >::Compute(tensorflow::OpKernelContext*) ()
   from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#3  0x00003fffb0e42e4c in tensorflow::BaseGPUDevice::ComputeHelper (this=0x3fff718d32d0, op_kernel=0x3bffdc000bb0, context=0x3bffdc001740)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:389
commented

The root issue is incorrect handling of output allocation attributes. Currently only GPU computation with the int32 data type is affected; other data types work well.
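For context on why only int32 is affected: TensorFlow keeps int32 tensors in host memory even on GPU devices, so the GPU-device Mul kernel for int32 is registered with HostMemory constraints and runs the Eigen CPU path, roughly like the following (paraphrased from memory of the cwise op registrations in tensorflow/core/kernels; check the tree for the exact form). Presumably, if the output allocation attributes are mishandled, that CPU code ends up touching GPU memory, which is consistent with the crash above.

// Paraphrased from TensorFlow's cwise op registrations (not copied verbatim):
// the GPU-device Mul kernel for int32 pins its inputs and output to host
// memory and reuses the CPU (Eigen::ThreadPoolDevice) functor, which is why
// the trace shows BinaryOp<Eigen::ThreadPoolDevice, mul<int>> under a GPU device.
REGISTER_KERNEL_BUILDER(Name("Mul")
                            .Device(DEVICE_GPU)
                            .HostMemory("x")
                            .HostMemory("y")
                            .HostMemory("z")
                            .TypeConstraint<int32>("T"),
                        BinaryOp<CPUDevice, functor::mul<int32>>);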

I've opened a new issue, #12, to track the problem. Closing this one.