SymbioticLab / Salus

Fine-grained GPU sharing primitives

Segmentation Fault - sender becomes nullptr in ITask::prepare

lenhattan86 opened this issue

Version:

Procedures:

  1. Build & install both Tensorflow (for CUDA) & Executor by gcc 5.4

  2. Run the Executor (under gdb to capture the full log)

  3. Run the test job: python test_ops_tf.py TestBasicOps.test_multiply

  4. Observe the segmentation fault

Expected:
no segmentation fault

Actual result:
Segmentation fault

Full log: log.tar.gz

Root cause: passing a shared_ptr into a promise-returning function such as q::promise RpcServerCore::dispatch may cause a nullptr error.

Fixed as below:
// [issue#11] passing shared_ptr "sender" may cause a nullptr segmentation fault
// as dispatch is executed in an asynchronous manner.
//auto f = m_pLogic->dispatch(sender, *pEvenlop, *pRequest);
auto f = m_pLogic->dispatch(std::make_shared<SenderImpl>(*this, pEvenlop->seq(), std::move(identities)), *pEvenlop, *pRequest);

This does not occur at revision 33ce74e, so I rolled the fix back.

This one does occur from time to time. It happened again when I tested issue #8 at rev. 33ce74e.
I put the fix back.

commented

I doubt this is the root cause of the crash. The body of dispatch is executed synchronously, and inside it a task is created with sender moved in. The sender is created only a few lines above (in step 1 at line 279), so it won't be nullptr anyway.
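To make that concrete, here is a rough sketch of the flow being described (an assumed reconstruction, not verbatim Salus source; names follow the snippet above):

// Sketch of the described call site; the surrounding structure is an assumption.
// Step 1 (around line 279): sender is freshly constructed from the request's
// identity frames, so it cannot be nullptr at this point.
auto sender = std::make_shared<SenderImpl>(*this, pEvenlop->seq(), std::move(identities));

// Step 2: the body of dispatch runs synchronously on this thread and moves
// sender into the task it creates; only the task's later execution is deferred
// through the returned q::promise.
auto f = m_pLogic->dispatch(sender, *pEvenlop, *pRequest);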

Your fix actually creates another issue. Because of std::move, the identity is moved into sender at line 279, after which identity is empty. So you are passing an empty identity to the SenderImpl constructor, which will cause some reply messages to never be received by TF.
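A minimal standalone sketch of the moved-from problem (hypothetical stand-in types only, not the actual Salus classes):

#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the real identity type used for reply routing; Salus's may differ.
using Identity = std::vector<std::string>;

struct FakeSender {
    Identity identity;
    explicit FakeSender(Identity id) : identity(std::move(id)) {}
};

int main() {
    Identity identities{"frame-0", "frame-1"};

    // First construction: identities is moved into the sender.
    FakeSender first(std::move(identities));
    assert(first.identity.size() == 2);

    // Constructing a second sender from the same moved-from variable hands it
    // an (in practice) empty identity, so replies routed through it are lost.
    FakeSender second(std::move(identities));
    std::cout << "second identity frames: " << second.identity.size() << '\n'; // 0 in practice
}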

commented

Note that sender is a shared_ptr, which is reference counted, so passing it around by value among threads is fine.
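A minimal sketch of that point (illustrative only, not Salus code): each by-value copy of a shared_ptr bumps the reference count, so the pointee stays alive until the last copy, including one held by another thread, is destroyed.

// Build with -std=c++14 -pthread.
#include <cassert>
#include <memory>
#include <string>
#include <thread>

int main() {
    auto sender = std::make_shared<std::string>("reply-route");

    // Capture a copy by value: the reference count becomes 2, so the object
    // survives even after the original handle is released below.
    std::thread worker([copy = sender]() {
        assert(copy != nullptr);          // the copy keeps the object alive
        assert(*copy == "reply-route");
    });

    sender.reset();   // drop the original; the worker's copy still owns it
    worker.join();
}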

commented

@lenhattan86 I added a few assertions. Please build and run the latest commit, and post the stack trace for the crash.

Please refer to the attached log.

issue#11_assertion.zip

Interestingly, the TF job does not send any message to the Executor when I run "python test_ops_tf.py TestBasicOps.test_multiply_int32".
Instead, I ran "python test_ops_tf.py TestBasicOps.test_noop" to get the log. It should be the nullptr one.

commented

I just fixed a wrong assertion in the code. Could you update, rebuild, and rerun?

The log is not fully flushed when the process crashes. You can run p logging::logger->flush() in gdb after the crash to flush it. I need to know the exact op kernel that is running when the crash happens.