SymbioticLab / Salus

Fine-grained GPU sharing primitives

Segmentation Fault - sender becomes nullptr in ITask::prepare

lenhattan86 opened this issue

Version:

Procedures:

  1. Build & install both Tensorflow (for CUDA) & Executor by gcc 5.4

  2. Run the Executor (under gdb to capture the full log)

  3. Run the test job: python test_ops_tf.py TestBasicOps.test_multiply

  4. Observe the segmentation fault

Expected:
no segmentation fault

Actual result:
Segmentation fault

Full log: log.tar.gz

Root cause: passing a shared_ptr into a promise-returning function such as q::promise RpcServerCore::dispatch may cause a nullptr error.

Fixed as below:
// [issue#11] passing shared_ptr "sender" may cause a nullptr segmentation fault
// as dispatch is executed in an asynchronous manner.
//auto f = m_pLogic->dispatch(sender, *pEvenlop, *pRequest);
auto f = m_pLogic->dispatch(std::make_shared<SenderImpl>(*this, pEvenlop->seq(), std::move(identities)), *pEvenlop, *pRequest);

This does not occur at revision 33ce74e, so I rolled the fix back.

This one does occur from time to time. It happened again when I tested issue #8 at rev. 33ce74e.
I put the fix back.

commented

I doubt this is the root cause of the crash. The body of dispatch is executed synchronously, and inside it a task is created with sender moved in. The sender is created only a few lines above (in step 1 at line 279), so it won't be nullptr anyway.
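To make that concrete, here is a rough sketch of the flow being described (an assumed reconstruction, not verbatim Salus source; names follow the snippet above):

// Sketch of the described call site; the surrounding structure is an assumption.
// Step 1 (around line 279): sender is freshly constructed from the request's
// identity frames, so it cannot be nullptr at this point.
auto sender = std::make_shared<SenderImpl>(*this, pEvenlop->seq(), std::move(identities));

// Step 2: the body of dispatch runs synchronously on this thread and moves
// sender into the task it creates; only the task's later execution is deferred
// through the returned q::promise.
auto f = m_pLogic->dispatch(sender, *pEvenlop, *pRequest);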

Your fix actually creates another issue. Because of std::move, the identity is moved into sender at line 279, after which identity is empty. So you are passing an empty identity to the SenderImpl constructor, which will cause some reply messages to never be received by TF.
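A minimal standalone sketch of the moved-from problem (hypothetical stand-in types only, not the actual Salus classes):

#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for the real identity type used for reply routing; Salus's may differ.
using Identity = std::vector<std::string>;

struct FakeSender {
    Identity identity;
    explicit FakeSender(Identity id) : identity(std::move(id)) {}
};

int main() {
    Identity identities{"frame-0", "frame-1"};

    // First construction: identities is moved into the sender.
    FakeSender first(std::move(identities));
    assert(first.identity.size() == 2);

    // Constructing a second sender from the same moved-from variable hands it
    // an (in practice) empty identity, so replies routed through it are lost.
    FakeSender second(std::move(identities));
    std::cout << "second identity frames: " << second.identity.size() << '\n'; // 0 in practice
}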

commented

Note that sender is a shared_ptr, which is reference counted, so passing it around by value among threads is fine.
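A minimal sketch of that point (illustrative only, not Salus code): each by-value copy of a shared_ptr bumps the reference count, so the pointee stays alive until the last copy, including one held by another thread, is destroyed.

// Build with -std=c++14 -pthread.
#include <cassert>
#include <memory>
#include <string>
#include <thread>

int main() {
    auto sender = std::make_shared<std::string>("reply-route");

    // Capture a copy by value: the reference count becomes 2, so the object
    // survives even after the original handle is released below.
    std::thread worker([copy = sender]() {
        assert(copy != nullptr);          // the copy keeps the object alive
        assert(*copy == "reply-route");
    });

    sender.reset();   // drop the original; the worker's copy still owns it
    worker.join();
}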

commented

@lenhattan86 I added a few assertions. Please build and run the latest commit, and post the stack trace for the crash.

Please refer to the attached log.

issue#11_assertion.zip

Interestingly, the TF job does not send any message to the Executor when I run "python test_ops_tf.py TestBasicOps.test_multiply_int32".
Instead, I ran "python test_ops_tf.py TestBasicOps.test_noop" to get the log. It should be the nullptr one.

commented

I just fixed a wrong assertion in the code. Could you update, rebuild, and rerun?

The log is not fully flushed when the process crashes. You can run p logging::logger->flush() in gdb after the crash to flush it. I need to know the exact op kernel that is running when the crash happens.