Segmentation Fault - sender becomes nullptr in ITask::prepare
lenhattan86 opened this issue · comments
Version:
Procedures:
- Build & install both TensorFlow (for CUDA) & Executor with gcc 5.4
- Run Executor (using gdb to get the full log)
- Run the test job: python test_ops_tf.py TestBasicOps.test_multiply
- Observe the segmentation fault
Expected:
no segmentation fault
Actual result:
Segmentation fault
Full log: log.tar.gz
Root cause: passing a shared_ptr to a promise-returning function such as q::promise RpcServerCore::dispatch may cause a nullptr error.
Fixed as below.
// [issue#11] passing shared_ptr "sender" may cause a nullptr segmentation fault
// as dispatch is executed in an asynchronous manner.
//auto f = m_pLogic->dispatch(sender, *pEvenlop, *pRequest);
auto f = m_pLogic->dispatch(std::make_shared<SenderImpl>(*this, pEvenlop->seq(), std::move(identities)), *pEvenlop, *pRequest);
I doubt this is the root cause of the crash. The body of dispatch is executed synchronously, and within it a task is created with sender moved in. The sender is created only a few lines above (in step 1 at line 279), so it cannot be nullptr at that point.
Your fix also introduces another issue. Because of std::move, identities is moved into sender at line 279, and after that identities is empty. So you are passing empty identities to the SenderImpl constructor, which will cause some reply messages not to be received by TF.
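The moved-from problem described above can be reproduced in isolation. The sketch below is not the Executor code: FakeSender, buildTwoSenders, and the identity strings are hypothetical stand-ins, and it assumes identities is a std::vector, whose move constructor is guaranteed to leave the source empty.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for SenderImpl: it only stores the routing identities.
struct FakeSender {
    explicit FakeSender(std::vector<std::string> ids) : identities(std::move(ids)) {}
    std::vector<std::string> identities;
};

// Builds two senders from the same vector, moving it both times.
// Returns how many identities the first and the second sender ended up with.
std::pair<std::size_t, std::size_t> buildTwoSenders()
{
    std::vector<std::string> identities{"client-a", "client-b"};

    // First construction (like the original sender): takes ownership of the contents.
    auto first = std::make_shared<FakeSender>(std::move(identities));

    // Second construction (like the proposed fix): identities is already moved-from,
    // so this sender receives an empty vector.
    auto second = std::make_shared<FakeSender>(std::move(identities));

    return {first->identities.size(), second->identities.size()};
}
```

Calling buildTwoSenders() should show the first sender holding both identities and the second holding none, which is the situation the proposed fix would create for the second SenderImpl.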
Note that sender is a shared_ptr, which is reference counted, so passing it around by value among threads is fine.
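As a rough illustration of the reference-counting point, the sketch below copies a shared_ptr into a worker thread and then drops the caller's copy. NamedSender and useSenderOnWorker are hypothetical names chosen for this example and do not appear in the Executor code base.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <thread>

// Hypothetical stand-in for the sender object.
struct NamedSender {
    std::string name;
};

// Hands a by-value copy of the shared_ptr to a worker thread, releases the
// caller's reference, and returns what the worker read from the object.
std::string useSenderOnWorker()
{
    auto sender = std::make_shared<NamedSender>(NamedSender{"seq-1"});
    std::string seen;

    // Capturing sender by value copies the shared_ptr, so the worker owns
    // its own reference to the object.
    std::thread worker([sender, &seen] {
        seen = sender->name;
    });

    // The caller gives up its reference; the worker's copy keeps the object alive.
    sender.reset();

    worker.join();  // after join, reading seen is safe
    return seen;
}
```

The object is destroyed only when the last copy goes away, so the worker never observes a dangling pointer. Note that only the reference count is synchronized; concurrent writes to the pointed-to object would still need their own locking.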
@lenhattan86 I added a few assertions. Please build and run the latest commit, and post the stack trace for the crash.
Please refer to the attached log.
Interestingly, the TF job does not send any message to Executor when I run "python test_ops_tf.py TestBasicOps.test_multiply_int32".
Instead, I ran "python test_ops_tf.py TestBasicOps.test_noop" to get the log. It should be the nullptr one.
I just fixed a wrong assertion in the code. Could you update, rebuild, and rerun?
The log is not fully flushed when it crashes. You can run p logging::logger->flush() in gdb after the crash to flush the log. I need to know the exact op kernel that is running when the crash happens.