SymbioticLab / Salus

Fine-grained GPU sharing primitives

Segmentation fault inside GPU computation

Aetf opened this issue · comments

commented

On the donot-clone-this branch.

The executor segfaults inside the Multiply kernel (this is the cwise multiplication).

Steps to reproduce

  1. Launch the executor
  2. Run the unit test: test_ops_tf.py TestBasicOps.test_multiply

Expected

The test passes.

Actual

Segmentation fault inside the computation kernel.

Analysis

The log shows that both inputs are correctly allocated on the GPU (their addresses have a different prefix than the CPU allocations). The error is raised on the CPU side, so it is probably not inside the GPU code.

However, the segfault happens deep inside the computation, in the Eigen code used by TensorFlow. It may be due to passing a complex object across the dynamic library boundary, with some fields of the object zeroed out or containing garbage values. This needs more investigation.

  • Add a stack dump here
  • Test if any other kernels have a similar problem
  • Implement a simple CUDA kernel and try to run that. See if that works.

Update

Other kernels like matmul and conv2d work well, so the problem is within the multiply kernel.

When I force CPU-only execution (snippet below), I also get the segmentation fault. Does it work on your side?

ExecutionEngine::schedule(ITask *t) {
    trySchedule(t, DeviceType::CPU);
}

commented

Yes, it works on my side. Could you try the latest master? I added an environment variable to control the scheduling behavior: EXEC_SCHED_USE_GPU, see #6. I also pushed a few commits to tensorflow-rpcdev, so remember to update that, too.
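For what it's worth, the toggle is just read from the environment at scheduling time; a minimal sketch of the idea (the enum and function name here are illustrative, not the executor's actual code; the real wiring is in #6):

#include <cstdlib>
#include <cstring>

// Illustrative sketch only; the real logic lives in the executor (see #6).
// DeviceType here stands in for the executor's own device enum.
enum class DeviceType { CPU, GPU };

DeviceType preferredDevice()
{
    const char *v = std::getenv("EXEC_SCHED_USE_GPU");
    // Unset, "0", or "false" selects CPU; anything else selects GPU.
    if (!v || std::strcmp(v, "0") == 0 || std::strcmp(v, "false") == 0)
        return DeviceType::CPU;
    return DeviceType::GPU;
}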

Regarding your email:

At TfSession.cpp:226
It should be if (m_graph==NULL){} instead of if(!m_graph) {}

m_graph is a std::unique_ptr; converting it to bool, as in if (m_graph) or if (!m_graph), has the same effect as checking whether it is empty. See the documentation. I use this style all over the place in my code, so it is unlikely to be the problem.
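A standalone illustration of that equivalence (not the executor code, just standard library behaviour):

#include <cassert>
#include <memory>

int main()
{
    std::unique_ptr<int> p;   // empty, owns nothing
    assert(!p);               // contextual conversion to bool yields false
    assert(p == nullptr);     // the same check, spelled out explicitly

    p.reset(new int(42));
    assert(p);                // now true
    assert(p != nullptr);
    return 0;
}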

I'm not in my office currently. I'll attach a stack trace of the crash tomorrow. It happens later, after the initialization.

PS: in C++11 and later, nullptr is preferred over NULL to represent a null pointer; there are some informative SO questions on this. Using nullptr is also consistent with the rest of the code base.
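The usual example from those questions is overload resolution, where NULL (an integer constant) and nullptr behave differently; a minimal sketch:

#include <iostream>

void f(int)   { std::cout << "f(int)\n"; }
void f(int *) { std::cout << "f(int*)\n"; }

int main()
{
    f(0);        // calls f(int)
    // f(NULL);  // ambiguous or calls f(int), depending on how NULL is defined
    f(nullptr);  // unambiguously calls f(int*)
    return 0;
}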

commented

Here's the stack trace for the first few frames of the crash. As you can see in frame 5, the crash happens exactly when device->Compute() is called, here.

At this point, initialization is already done; it was performed while handling an executor.RunGraphRequest received earlier.

#0  0x00003fffb06eb26c in std::_Function_handler<void (long, long), Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<int, int>, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const> const> const, Eigen::ThreadPoolDevice, true>::run(Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<int, int>, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const> const> const&, Eigen::ThreadPoolDevice const&)::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&) () from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#1  0x00003fffaf03a1d8 in Eigen::ThreadPoolDevice::parallelFor(long, Eigen::TensorOpCost const&, std::function<long (long)>, std::function<void (long, long)>) const () from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#2  0x00003fffb06fada0 in tensorflow::BinaryOp<Eigen::ThreadPoolDevice, tensorflow::functor::mul<int> >::Compute(tensorflow::OpKernelContext*) ()
   from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#3  0x00003fffb0e62e4c in tensorflow::BaseGPUDevice::ComputeHelper (this=0x3fff718d3000, op_kernel=0x3bffb00023b0, context=0x3bffb0002b40)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:389
#4  0x00003fffb0e63298 in tensorflow::BaseGPUDevice::Compute (this=0x3fff718d3000, op_kernel=0x3bffb00023b0, context=0x3bffb0002b40)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:331
#5  0x0000000010121acc in TFRunTask::run (this=0x10b9c1b0) at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/oplibraries/tfoplibrary.cpp:272
#6  0x000000001009318c in ITask::run<executor::RunResponse> (this=0x10b9c1b0)
    at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/oplibraries/ioplibrary.h:54
#7  0x000000001008a270 in q::promise<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> > > ExecutionEngine::enqueue<executor::RunResponse>(std::unique_ptr<ITask, std::default_delete<ITask> >&&)::{lambda(auto:1, auto:2)#1}::operator()<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> >, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, {lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()()::{lambda(std::default_delete<executor::RunResponse>&&)#1}, q::remove_rvalue_reference<std::default_delete<executor::RunResponse> >::type {lambda(auto:1, auto:2)#1}::operator()<std::default_delete<executor::RunResponse>, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const, std::shared_ptr<q::queue> const&)::{lambda(auto:1, auto:2)#1}&&::operator()()::{lambda(auto:1)#2}> (__closure=0x3fff7fffe318, resolve=..., reject=...)
    at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/execution/executionengine.h:60
commented

I'm attaching the full log here.

mulcrash.tar.gz

On my side, the segmentation fault does not happen after I changed that line.
Perhaps we should create a file that records the configuration, dependencies, and compiler versions for both of us, so we can keep track of this bug.

Btw, can you share how you log the stack trace? My log just stops at the segmentation fault.

commented

I just attach gdb when starting the executor: gdb executor.

Then when it crashes, it will drop you to the gdb console, where you can type

(gdb) bt

to get the full trace. I'm curious to see the stack trace on your side; it seems like a different crash.
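In full, a session looks roughly like this (thread apply all bt dumps every thread, which can help when the crash is in a worker thread):

$ gdb executor
(gdb) run
... the executor runs until the crash drops you back to the prompt ...
(gdb) bt
(gdb) thread apply all bt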

I'm compiling tensorflow and the executor using gcc 5.4.0, with the latest versions of the dependencies (including Boost 1.64.0).

I got a different log. It seems my GDB does not work well, even though I'm using the latest version, GDB 8.0.

[2017-07-06 22:29:15.455] [5027] [console] [T] ==============================================================
[2017-07-06 22:29:15.455] [5027] [console] [T] Received identity frame 0: zmq::message_t(len=5, data='006B8B4567')
[2017-07-06 22:29:15.455] [5027] [console] [T] Received identity frame 1: zmq::message_t(len=0, data='')
[2017-07-06 22:29:15.455] [5027] [console] [T] Received evenlop frame: zmq::message_t(len=56, data='0A136578...696F6E30')
[2017-07-06 22:29:15.455] [5027] [console] [T] Received body frame: zmq::message_t(len=47, data='0A1C1207...746F7230')
[2017-07-06 22:29:15.455] [5027] [console] [D] Received request evenlop: EvenlopDef(type='executor.RunRequest', seq=2, recvId='74656E73...3A010101')
[2017-07-06 22:29:15.455] [5027] [console] [D] Received request body byte array size 47
[2017-07-06 22:29:15.455] [5027] [console] [I] Serving executor.RunRequest for oplibrary TENSORFLOW
[2017-07-06 22:29:15.455] [5027] [console] [I] Serving RunRequest with opkernel id _SOURCE
[2017-07-06 22:29:15.455] [5027] [console] [T] Blocking pool on pollin events

Thread 7 "executionengine" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffcebfa700 (LWP 5058)]
Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
0x0000000000483bec in ZmqServer::SenderImpl::sequenceNumber (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x0) at /home/tanle/projects/executor/src/rpcserver/zmqserver.cpp:330
330 return m_seq;
(gdb) bt
Python Exception <type 'exceptions.ImportError'> No module named gdb.frames:
#0 0x0000000000483bec in ZmqServer::SenderImpl::sequenceNumber (this=0x0) at /home/tanle/projects/executor/src/rpcserver/zmqserver.cpp:330
#1 0x00000000004b2892 in TFRunTask::prepare (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x10c0090, dev=...) at /home/tanle/projects/executor/src/oplibraries/tfoplibrary.cpp:235
#2 0x00000000004431c6 in ExecutionEngine::trySchedule (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x7e4f70 ExecutionEngine::instance()::eng, t=0x10c0090, dev=...)
at /home/tanle/projects/executor/src/execution/executionengine.cpp:74
#3 0x000000000044307e in ExecutionEngine::schedule (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x7e4f70 ExecutionEngine::instance()::eng, t=0x10c0090)
at /home/tanle/projects/executor/src/execution/executionengine.cpp:58
#4 0x000000000045ec06 in q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}::operator()<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, {lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()()::{lambda(std::default_deleteexecutor::RunResponse&&)#1}, q::remove_rvalue_reference<std::default_deleteexecutor::RunResponse >::type {lambda(auto:1, auto:2)#1}::operator()<std::default_deleteexecutor::RunResponse, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const, std::shared_ptr<q::queue> const&)::{lambda(auto:1, auto:2)#1}&&::operator()()::{lambda(auto:1)#2}> (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
__closure=0x7fffcebf9cd8,
resolve=..., reject=...) at /home/tanle/projects/executor/src/execution/executionengine.h:54
#5 0x000000000045e944 in q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()() (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
__closure=0x7fffcebf9cc8) at /usr/local/include/q/promise/make.hpp:251
#6 0x000000000047e2ac in q::detail::specific_function<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}&&)::{lambda()#1}, void (), false, void>::operator()() (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
this=0x7fffcebf9cc0)
at /usr/local/include/q/function.hpp:176
#7 0x000000000051f9ed in q::detail::any_function<void (), std::integral_constant<bool, false>, std::integral_constant<unsigned long, 128ul>, void>:---Type to continue, or q to quit---
:operator()() (this=0x7fffcebf9cc0) at /home/tanle/projects/q/libs/q/include/q/function.hpp:820
#8 q::threadpool::<lambda()>::<lambda(q::task&&)>::operator() (__closure=, elem=...)
at /home/tanle/projects/q/libs/q/src/threadpool.cpp:153
#9 q::threadpool::<lambda()>::operator() (__closure=, this=, this=)
at /home/tanle/projects/q/libs/q/src/threadpool.cpp:207
#10 q::call_with_args_by_tuple<q::threadpool::start()::<lambda()> > (fn=...) at /home/tanle/projects/q/libs/q/include/q/functional.hpp:821
#11 q::thread::<lambda()>::<lambda()>::operator() (__closure=) at /home/tanle/projects/q/libs/q/include/q/thread.hpp:158
#12 q::call_with_args_by_fun<q::expect<void, true, true> (&)(), q::thread::run(Fn&&, Args&& ...)::<lambda()> mutable [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void]::<lambda()>&> (inner_fn=..., fn=)
at /home/tanle/projects/q/libs/q/include/q/functional.hpp:859
#13 q::thread::<lambda()>::operator() (Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
__closure=) at /home/tanle/projects/q/libs/q/include/q/thread.hpp:162
#14 std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()>::_M_invoke<> (this=)
at /usr/include/c++/5/functional:1531
#15 std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()>::operator() (this=)
at /usr/include/c++/5/functional:1520
#16 std::thread::_Impl<std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()> >::_M_run(void) (
this=) at /usr/include/c++/5/thread:115
#17 0x00007fffec2f54a0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
#18 0x00007fffec5ca6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
Python Exception <type 'exceptions.NameError'> Installation error: gdb.execute_unwinders function is missing:
#19 0x00007fffeba563dd in clone () from /lib/x86_64-linux-gnu/libc.so.6

commented

There must be something wrong with your gdb. Maybe you should reinstall it.

The crash you got is due to a nullptr dereference: in frame 0, ZmqServer::SenderImpl::sequenceNumber (this=0x0), called from frame 1 at /home/tanle/projects/executor/src/oplibraries/tfoplibrary.cpp:235. However, this is unrelated to the computation and should never happen. The sender object is created here as a std::shared_ptr and passed into TFRunTask. Did you modify anything under rpcserver or tfoplibrary.cpp?

I haven't modified those files.

New log

[2017-07-06 22:50:54.328] [15562] [console] [T] ==============================================================
[2017-07-06 22:50:54.328] [15562] [console] [T] Received identity frame 0: zmq::message_t(len=5, data='006B8B4567')
[2017-07-06 22:50:54.328] [15562] [console] [T] Received identity frame 1: zmq::message_t(len=0, data='')
[2017-07-06 22:50:54.328] [15562] [console] [T] Received evenlop frame: zmq::message_t(len=56, data='0A136578...696F6E30')
[2017-07-06 22:50:54.328] [15562] [console] [T] Received body frame: zmq::message_t(len=47, data='0A1C1207...746F7230')
[2017-07-06 22:50:54.328] [15562] [console] [D] Received request evenlop: EvenlopDef(type='executor.RunRequest', seq=2, recvId='74656E73...3A000100')
[2017-07-06 22:50:54.328] [15562] [console] [D] Received request body byte array size 47
[2017-07-06 22:50:54.328] [15562] [console] [I] Serving executor.RunRequest for oplibrary TENSORFLOW
[2017-07-06 22:50:54.328] [15562] [console] [I] Serving RunRequest with opkernel id _SOURCE
[2017-07-06 22:50:54.328] [15562] [console] [T] Blocking pool on pollin events

Thread 8 "executionengine" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffce3f9700 (LWP 15594)]
0x0000000000483bec in ZmqServer::SenderImpl::sequenceNumber (this=0x0) at /home/tanle/projects/executor/src/rpcserver/zmqserver.cpp:330
330 return m_seq;
(gdb) bt
#0 0x0000000000483bec in ZmqServer::SenderImpl::sequenceNumber (this=0x0) at /home/tanle/projects/executor/src/rpcserver/zmqserver.cpp:330
#1 0x00000000004b2892 in TFRunTask::prepare (this=0x10872c0, dev=...) at /home/tanle/projects/executor/src/oplibraries/tfoplibrary.cpp:235
#2 0x00000000004431c6 in ExecutionEngine::trySchedule (this=0x7e4f70 ExecutionEngine::instance()::eng, t=0x10872c0, dev=...)
at /home/tanle/projects/executor/src/execution/executionengine.cpp:74
#3 0x000000000044307e in ExecutionEngine::schedule (this=0x7e4f70 ExecutionEngine::instance()::eng, t=0x10872c0)
at /home/tanle/projects/executor/src/execution/executionengine.cpp:58
#4 0x000000000045ec06 in q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}::operator()<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, {lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()()::{lambda(std::default_deleteexecutor::RunResponse&&)#1}, q::remove_rvalue_reference<std::default_deleteexecutor::RunResponse >::type {lambda(auto:1, auto:2)#1}::operator()<std::default_deleteexecutor::RunResponse, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const, std::shared_ptr<q::queue> const&)::{lambda(auto:1, auto:2)#1}&&::operator()()::{lambda(auto:1)#2}> (__closure=0x7fffce3f8cd8,
resolve=..., reject=...) at /home/tanle/projects/executor/src/execution/executionengine.h:54
#5 0x000000000045e944 in q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()() (__closure=0x7fffce3f8cc8) at /usr/local/include/q/promise/make.hpp:251
#6 0x000000000047e2ac in q::detail::specific_function<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse >, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, q::promise<std::unique_ptr<executor::RunResponse, std::default_deleteexecutor::RunResponse > > ExecutionEngine::enqueueexecutor::RunResponse(std::unique_ptr<ITask, std::default_delete >&&)::{lambda(auto:1, auto:2)#1}&&)::{lambda()#1}, void (), false, void>::operator()() (this=0x7fffce3f8cc0)
at /usr/local/include/q/function.hpp:176
#7 0x000000000051f9ed in q::detail::any_function<void (), std::integral_constant<bool, false>, std::integral_constant<unsigned long, 128ul>, void>::operator()() (this=0x7fffce3f8cc0) at /home/tanle/projects/q/libs/q/include/q/function.hpp:820
#8 q::threadpool::<lambda()>::<lambda(q::task&&)>::operator() (__closure=, elem=...)
---Type to continue, or q to quit---
at /home/tanle/projects/q/libs/q/src/threadpool.cpp:153
#9 q::threadpool::<lambda()>::operator() (__closure=, this=, this=)
at /home/tanle/projects/q/libs/q/src/threadpool.cpp:207
#10 q::call_with_args_by_tuple<q::threadpool::start()::<lambda()> > (fn=...) at /home/tanle/projects/q/libs/q/include/q/functional.hpp:821
#11 q::thread::<lambda()>::<lambda()>::operator() (__closure=) at /home/tanle/projects/q/libs/q/include/q/thread.hpp:158
#12 q::call_with_args_by_fun<q::expect<void, true, true> (&)(), q::thread::run(Fn&&, Args&& ...)::<lambda()> mutable [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void]::<lambda()>&> (inner_fn=..., fn=)
at /home/tanle/projects/q/libs/q/include/q/functional.hpp:859
#13 q::thread::<lambda()>::operator() (__closure=) at /home/tanle/projects/q/libs/q/include/q/thread.hpp:162
#14 std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()>::_M_invoke<> (this=)
at /usr/include/c++/5/functional:1531
#15 std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()>::operator() (this=)
at /usr/include/c++/5/functional:1520
#16 std::thread::_Impl<std::_Bind_simple<q::thread::run(Fn&&, Args&& ...) [with Fn = q::threadpool::start()::<lambda()>; Args = {}; Ret = void; typename std::enable_if<std::is_same<typename q::function_traits::result_type, Ret>::value>::type = void]::<lambda()>()> >::_M_run(void) (
this=) at /usr/include/c++/5/thread:115
#17 0x00007fffec2f54a0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00007fffec5ca6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x00007fffeba563dd in clone () from /lib/x86_64-linux-gnu/libc.so.6

commented

Okay, so gdb is fixed now. But the stack trace still shows the same nullptr error.

Could you post the full log as a file?

Here is the log.

issue#1_log.tar.gz

commented

Try make clean and make again. I really can't see what the problem would be.

If you still get the same stack trace, see if you can add some logging to print out the content of m_sender in TFRunTask, basically to check whether it's empty in TFRunTask::TFRunTask. Also check sender in TFOpLibrary::createRunTask. A rough sketch of such a check is below.
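Something like this throwaway helper would do (checkSender is a made-up name; call it from TFRunTask::TFRunTask and TFOpLibrary::createRunTask and watch the output):

#include <iostream>
#include <memory>

// Log whether a shared_ptr is empty at a given call site, plus its use count.
// Intended as temporary debugging output only.
template <typename T>
void checkSender(const char *where, const std::shared_ptr<T> &p)
{
    std::cerr << where << ": sender is " << (p ? "set" : "EMPTY")
              << " (use_count=" << p.use_count() << ")" << std::endl;
}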

I could not find any problem with return m_seq either.

BTW, the "NULL" change is not the right fix after all. I made a mistake when running the test; the TF job stopped before the segfault happened.

commented

return m_seq is fine. The problem is that SenderImpl::sequenceNumber is called through a nullptr (this=0x0, as shown in the gdb stack trace); the question is how that could happen.
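For reference, calling a non-virtual member function through a null pointer is undefined behaviour, but in practice it usually only faults when a member is actually read, which is exactly the return m_seq line in the trace. A standalone illustration (it deliberately crashes; for explanation only):

#include <cstdint>

struct SenderImpl {
    uint64_t sequenceNumber() const { return m_seq; }  // the member load is what faults
    uint64_t m_seq = 0;
};

int main()
{
    SenderImpl *s = nullptr;
    // Undefined behaviour: entering the function often "works" because there is
    // no virtual dispatch, but reading m_seq dereferences the null this pointer
    // and typically raises SIGSEGV, matching "sequenceNumber (this=0x0)" in gdb.
    return static_cast<int>(s->sequenceNumber());
}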

commented

And oddly enough, the same code runs well on my laptop and on the cluster I'm using.

After I cleaned, I get the same log: the segmentation fault at the same place.
These kinds of bugs can happen in C/C++. Btw, can we use the same GCC version for both TF and the executor, since the executor links against the TF libraries?

Should I try the latest code on the master branch?

commented

Yes, of course. You weren't using the latest code?

No, I used master as of Wednesday.

commented

The latest commit is 58ed43f.

As for the compiler, the master branch can be compiled with gcc 5.4.0, so you can use it to compile both TF and the executor.

And if the crash still occurs, do try printing things out like I said, to check when m_sender became empty.

I cannot compile the executor with either 58ed43f or the latest version on master.

/usr/include/c++/5/bits/range_access.h:68:5: error: ‘const class std::unordered_map<executor::OpLibraryType, std::unique_ptr >’ has no member named ‘end’

commented

58ed43f is the latest version.

Did you regenerate the CMake configuration when you changed the compiler?
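In case it helps, the safest way is to wipe the build directory and reconfigure with the compiler pinned explicitly (compiler names and paths are illustrative; adjust to your setup):

rm -rf build && mkdir build && cd build
cmake -DCMAKE_C_COMPILER=gcc-5 -DCMAKE_CXX_COMPILER=g++-5 ..
make -j$(nproc)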

commented

Anyway, I checked other kernels (matmul and conv2d) and they work well, so the problem must be within the multiply kernel.

I deleted the build folder and rebuilt it; it compiles with gcc 5.4.
However, I still get the same segmentation fault as before. Perhaps I will find the bug on my side.

commented

I've found the reason!

#3  0x00003fffb0522274 in tensorflow::BinaryOp<Eigen::ThreadPoolDevice, tensorflow::functor::add<int> >::Compute(tensorflow::OpKernelContext*) ()
   from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#4  0x00003fffb0e42e4c in tensorflow::BaseGPUDevice::ComputeHelper (this=0x3fff718d3010, op_kernel=0x3bffdc000bb0, context=0x3bffdc001740)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:389

In frame 3, the kernel instantiated should be a GPU kernel, i.e. tensorflow::BinaryOp<Eigen::GPUDevice, tensorflow::functor::add<int>>, but a CPU kernel was instantiated instead, hence the crash.

I'm checking the kernel creation logic to see what's going on there.

commented

Here is an additional stack trace, for multiply; it is basically the same thing, which confirms my reasoning above.

#2  0x00003fffb06dada0 in tensorflow::BinaryOp<Eigen::ThreadPoolDevice, tensorflow::functor::mul<int> >::Compute(tensorflow::OpKernelContext*) ()
   from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#3  0x00003fffb0e42e4c in tensorflow::BaseGPUDevice::ComputeHelper (this=0x3fff718d32d0, op_kernel=0x3bffdc000bb0, context=0x3bffdc001740)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:389
commented

The root issue is incorrect handling of output allocation attributes. Currently only GPU computation with the int32 data type is affected; other data types work well.
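For context on why only int32 is affected: TensorFlow keeps int32 tensors in host memory even on GPU devices, so the GPU-device Mul kernel for int32 is registered with HostMemory constraints and runs the Eigen CPU path, roughly like the following (paraphrased from memory of the cwise op registrations in tensorflow/core/kernels; check the tree for the exact form). Presumably, if the output allocation attributes are mishandled, that CPU code ends up touching GPU memory, which is consistent with the crash above.

// Paraphrased from TensorFlow's cwise op registrations (not copied verbatim):
// the GPU-device Mul kernel for int32 pins its inputs and output to host
// memory and reuses the CPU (Eigen::ThreadPoolDevice) functor, which is why
// the trace shows BinaryOp<Eigen::ThreadPoolDevice, mul<int>> under a GPU device.
REGISTER_KERNEL_BUILDER(Name("Mul")
                            .Device(DEVICE_GPU)
                            .HostMemory("x")
                            .HostMemory("y")
                            .HostMemory("z")
                            .TypeConstraint<int32>("T"),
                        BinaryOp<CPUDevice, functor::mul<int32>>);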

I've opened a new issue, #12, to track the problem. Closing this one.