SymbioticLab / Salus

Fine-grained GPU sharing primitives

Correctly handle output allocation attributes

Aetf opened this issue · comments

commented

Currently the executor assumes that a TF op kernel accepts inputs and produces outputs on the device the op runs on, without taking the allocation attributes into account. This causes problems for GPU computation of int32, which registers a special kernel whose inputs and outputs are all explicitly placed on the host (i.e. CPU).
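
For context, this is roughly how TensorFlow registers the GPU kernel for int32 Mul (paraphrased from tensorflow/core/kernels/cwise_op_mul_1.cc; the exact form may differ slightly in this build): the kernel body runs on the CPU device and every input/output is pinned to host memory, which matches the BinaryOp<ThreadPoolDevice, mul<int>> frame in the stack trace below.

```cpp
// Approximate registration of the int32 Mul kernel for DEVICE_GPU
// (paraphrased). The kernel body is BinaryOp<CPUDevice, ...> and all
// tensors are declared HostMemory, so they live in CPU memory even
// though the op is placed on the GPU.
REGISTER_KERNEL_BUILDER(Name("Mul")
                            .Device(DEVICE_GPU)
                            .HostMemory("x")
                            .HostMemory("y")
                            .HostMemory("z")
                            .TypeConstraint<int32>("T"),
                        BinaryOp<CPUDevice, functor::mul<int32>>);
```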

Thus a memory access error is triggered when:

  1. A HostConstantOp creates its tensor on the CPU, but the allocation attribute registered by the executor still points at the GPU (one way to derive the correct attributes is sketched after this list).
  2. A BinaryOp<CPUDevice> then tries to access that memory and fails. It is still not entirely clear why it fails this way; in theory it should work, so there must be something else I'm missing.
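
One way to handle item 1 would be for the executor to derive the allocator attributes for each output from the kernel's declared memory types instead of assuming device memory. Below is a minimal sketch using stock TensorFlow APIs; OutputAttrs is a hypothetical helper name, not existing Salus code.

```cpp
#include <vector>

#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/types.h"

// Hypothetical helper: build per-output AllocatorAttributes from the
// kernel's declared memory types, so host-pinned outputs (e.g. int32
// on GPU) get allocated with the host allocator.
std::vector<tensorflow::AllocatorAttributes> OutputAttrs(
    const tensorflow::OpKernel &kernel)
{
    std::vector<tensorflow::AllocatorAttributes> attrs(kernel.num_outputs());
    const auto &mtypes = kernel.output_memory_types();
    for (int i = 0; i < kernel.num_outputs(); ++i) {
        if (mtypes[i] == tensorflow::HOST_MEMORY) {
            // The kernel declared this output as host-resident.
            attrs[i].set_on_host(true);
        }
    }
    return attrs;
}
```

The same check would apply on the input side via input_memory_types(), and the resulting attributes would be passed to Device::GetAllocator() when materializing the tensors.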

Steps to reproduce

  1. launch executor
  2. export EXEC_SCHED_USE_GPU=1
  3. run the test case: python test_ops_tf.py TestBasicOps.test_multiply_int32

Expected

Test passes

Actual

Segmentation fault in kernel computation.

Logs:

  • Executor log clearly showing the instantiation of the kernels.
  • Stack trace of the crash:
#0  0x00003fffb06eb26c in std::_Function_handler<void (long, long), Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<int, int>, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const> const> const, Eigen::ThreadPoolDevice, true>::run(Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<int, int>, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<int const, 1, 1, long>, 16, Eigen::MakePointer> const> const> const&, Eigen::ThreadPoolDevice const&)::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&) () from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#1  0x00003fffaf03a1d8 in Eigen::ThreadPoolDevice::parallelFor(long, Eigen::TensorOpCost const&, std::function<long (long)>, std::function<void (long, long)>) const () from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#2  0x00003fffb06fada0 in tensorflow::BinaryOp<Eigen::ThreadPoolDevice, tensorflow::functor::mul<int> >::Compute(tensorflow::OpKernelContext*) ()
   from /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/tensorflow-rpcdev/bazel-bin/tensorflow/libtensorflow_kernels.so
#3  0x00003fffb0e62e4c in tensorflow::BaseGPUDevice::ComputeHelper (this=0x3fff718d3000, op_kernel=0x3bffb00023b0, context=0x3bffb0002b40)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:389
#4  0x00003fffb0e63298 in tensorflow::BaseGPUDevice::Compute (this=0x3fff718d3000, op_kernel=0x3bffb00023b0, context=0x3bffb0002b40)
    at tensorflow/core/common_runtime/gpu/gpu_device.cc:331
#5  0x0000000010121acc in TFRunTask::run (this=0x10b9c1b0) at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/oplibraries/tfoplibrary.cpp:272
#6  0x000000001009318c in ITask::run<executor::RunResponse> (this=0x10b9c1b0)
    at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/oplibraries/ioplibrary.h:54
#7  0x000000001008a270 in q::promise<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> > > ExecutionEngine::enqueue<executor::RunResponse>(std::unique_ptr<ITask, std::default_delete<ITask> >&&)::{lambda(auto:1, auto:2)#1}::operator()<q::promise<q::remove_rvalue_reference<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> > >::type> q::make_promise_of<std::unique_ptr<executor::RunResponse, std::default_delete<executor::RunResponse> >, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const&, {lambda(auto:1, auto:2)#1}&&)::{lambda()#1}::operator()()::{lambda(std::default_delete<executor::RunResponse>&&)#1}, q::remove_rvalue_reference<std::default_delete<executor::RunResponse> >::type {lambda(auto:1, auto:2)#1}::operator()<std::default_delete<executor::RunResponse>, {lambda(auto:1, auto:2)#1}>(std::shared_ptr<q::queue> const, std::shared_ptr<q::queue> const&)::{lambda(auto:1, auto:2)#1}&&::operator()()::{lambda(auto:1)#2}> (__closure=0x3fff7fffe318, resolve=..., reject=...)
    at /gpfs/gpfs0/groups/chowdhury/peifeng/buildbed/executor/src/execution/executionengine.h:60

There's also some background info at #1.

commented

This is exactly the same issue as #14. Closing this one.