tensorflow / serving

A flexible, high-performance serving system for machine learning models

Home Page: https://www.tensorflow.org/serving

Segmentation fault with tcmalloc and std memory allocators

tomzx opened this issue

Bug Report

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.1 LTS (running in GKE)
  • TensorFlow Serving installed from (source or binary): binary (from tensorflow/serving:2.11.0 docker image)
  • TensorFlow Serving version: 2.11.0

Describe the problem

When running TensorFlow Serving with a custom memory allocator (tcmalloc), the server crashes with a segmentation fault after some time in the event loop, generally in less than 1 minute as long as there is load.

Similar issues (std::bad_alloc) were present in TensorFlow Serving starting with 2.9 when using tcmalloc.

The issue is not present in 2.8.3.

Exact Steps to Reproduce

No reproduction steps at this time.

Source code / logs

Here are five backtraces generated when the segmentation fault occurs, each from a separate crash.

#0  0x00007fbea81dbe93 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int) () from /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
#1  0x00007fbea81dc1fe in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
#2  0x000055a1440aa6a2 in dnnl_primitive_desc_destroy ()
#3  0x000055a13b8db316 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#4  0x000055a13ec9658e in std::_Sp_counted_ptr<dnnl::inner_product_forward::primitive_desc*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
#5  0x000055a13b8db316 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#6  0x000055a1401d3dbb in tensorflow::MklDnnMatMulFwdPrimitive<float, float, float, float, float>::Setup(tensorflow::MklDnnMatMulFwdParams const&) ()
#7  0x000055a1401d5b5a in tensorflow::MklDnnMatMulFwdPrimitiveFactory<float, float, float, float, float>::Get(tensorflow::MklDnnMatMulFwdParams const&, bool) ()
#8  0x000055a1401d6e18 in tensorflow::MklFusedMatMulOp<Eigen::ThreadPoolDevice, float, true>::Compute(tensorflow::OpKernelContext*) ()
#9  0x000055a142530ac8 in tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
#10 0x000055a142585820 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) ()
#11 0x000055a14258698c in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) ()
#12 0x000055a148210621 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) ()
#13 0x000055a14820e573 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#14 0x000055a14803e8a5 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) ()
#15 0x00007fbea7cedb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007fbea7d7fa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

#0  0x000055640ec953aa in tensorflow::MklDnnMatMulFwdPrimitive<float, float, float, float, float>::Execute(float const*, float const*, float const*, float*, void*, std::shared_ptr<dnnl::stream>) ()
#1  0x000055640ec9b7f5 in tensorflow::MklFusedMatMulOp<Eigen::ThreadPoolDevice, float, true>::Compute(tensorflow::OpKernelContext*) ()
#2  0x0000556410ff4ac8 in tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
#3  0x0000556411049820 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) ()
#4  0x000055641104a98c in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) ()
#5  0x0000556416cd4621 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) ()
#6  0x0000556416cd2573 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#7  0x0000556416b028a5 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) ()
#8  0x00007f04b9a8eb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9  0x00007f04b9b20a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

#0  0x00007f2c6c1d0eeb in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int) () from /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
#1  0x00007f2c6c1d11fe in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
#2  0x000055e25aac33f6 in Eigen::internal::TensorBlockScratchAllocator<Eigen::ThreadPoolDevice>::~TensorBlockScratchAllocator() ()
#3  0x000055e25d2f1734 in Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorBroadcastingOp<Eigen::IndexList<long, Eigen::type2index<1l> > const, Eigen::TensorReshapingOp<Eigen::IndexList<Eigen::type2index<1l>, long> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const> const> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const> const> const, Eigen::ThreadPoolDevice, true, (Eigen::internal::TiledEvaluation)1>::run(Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorBroadcastingOp<Eigen::IndexList<long, Eigen::type2index<1l> > const, Eigen::TensorReshapingOp<Eigen::IndexList<Eigen::type2index<1l>, long> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const> const> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const> const> const&, Eigen::ThreadPoolDevice const&) ()
#4  0x000055e25d318523 in tensorflow::BinaryOp<Eigen::ThreadPoolDevice, tensorflow::functor::mul<float> >::Compute(tensorflow::OpKernelContext*) ()
#5  0x000055e260ccaac8 in tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
#6  0x000055e260d1f820 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) ()
#7  0x000055e260d2098c in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) ()
#8  0x000055e2669aa621 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) ()
#9  0x000055e2669a8573 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#10 0x000055e2667d88a5 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) ()
#11 0x00007f2c6bce2b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007f2c6bd74a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

#0  0x00007f40588a4ebb in tc_memalign () from /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
#1  0x00007f40588a4fda in tc_posix_memalign () from /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
#2  0x0000562df89f5cf4 in tsl::port::AlignedMalloc(unsigned long, int) ()
#3  0x0000562df856d688 in tensorflow::Tensor::Tensor(tsl::Allocator*, tensorflow::DataType, tensorflow::TensorShape const&) ()
#4  0x0000562dee05fcf8 in tensorflow::Tensor& std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >::emplace_back<tensorflow::DataType, tensorflow::TensorShape>(tensorflow::DataType&&, tensorflow::TensorShape&&) ()
#5  0x0000562df84e1972 in tensorflow::example::FastParseExample(tensorflow::example::FastParseExampleConfig const&, absl::lts_20220623::Span<tsl::tstring const>, absl::lts_20220623::Span<tsl::tstring const>, tsl::thread::ThreadPool*, tensorflow::example::Result*) ()
#6  0x0000562dedc562a6 in tensorflow::ParseExampleOp::Compute(tensorflow::OpKernelContext*) ()
#7  0x0000562df2d05ac8 in tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
#8  0x0000562df2d5a820 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) ()
#9  0x0000562df2d5b98c in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) ()
#10 0x0000562df89e5621 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) ()
#11 0x0000562df89e3573 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#12 0x0000562df88138a5 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) ()
#13 0x00007f40583afb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007f4058441a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

#0  0x00007fbf95329717 in tc_newarray () from /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
#1  0x0000562785fcf579 in std::unordered_map<int, dnnl::memory, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, dnnl::memory> > >& std::vector<std::unordered_map<int, dnnl::memory, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, dnnl::memory> > >, std::allocator<std::unordered_map<int, dnnl::memory, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, dnnl::memory> > > > >::emplace_back<std::unordered_map<int, dnnl::memory, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, dnnl::memory> > > >(std::unordered_map<int, dnnl::memory, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, dnnl::memory> > >&&) ()
#2  0x0000562787a31bb6 in tensorflow::MklDnnMatMulFwdPrimitive<float, float, float, float, float>::Setup(tensorflow::MklDnnMatMulFwdParams const&) ()
#3  0x0000562787a33b5a in tensorflow::MklDnnMatMulFwdPrimitiveFactory<float, float, float, float, float>::Get(tensorflow::MklDnnMatMulFwdParams const&, bool) ()
#4  0x0000562787a34e18 in tensorflow::MklFusedMatMulOp<Eigen::ThreadPoolDevice, float, true>::Compute(tensorflow::OpKernelContext*) ()
#5  0x0000562789d8eac8 in tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
#6  0x0000562789de3820 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) ()
#7  0x0000562789de498c in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) ()
#8  0x000056278fa6e621 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) ()
#9  0x000056278fa6c573 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#10 0x000056278f89c8a5 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) ()
#11 0x00007fbf94e34b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007fbf94ec6a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
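
For triage, it can help to split concatenated gdb output like the above into individual traces and look at where each one actually faulted. The following is a minimal sketch, not part of the original report: the function names `split_backtraces` and `crash_site` and the three coarse labels are hypothetical, and the substring checks are a heuristic based only on the symbols visible in these traces.

```python
import re

# Matches a gdb frame line: "#<n>  [0x<addr> in ]<rest of frame>".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.*)$")


def split_backtraces(text):
    """Split concatenated gdb output into one list of frames per trace.

    A new trace starts whenever the frame number resets to 0.
    Lines that do not look like gdb frames are ignored.
    """
    traces = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue
        number, rest = int(match.group(1)), match.group(2)
        if number == 0 or not traces:
            traces.append([])
        traces[-1].append(rest)
    return traces


def crash_site(trace):
    """Return a coarse, heuristic label for frame #0 of one trace."""
    top = trace[0]
    if "libtcmalloc" in top or top.startswith(("tc_", "tcmalloc::")):
        return "allocator"
    if "Mkl" in top or "dnnl" in top:
        return "onednn"
    return "other"
```

Calling `[crash_site(t) for t in split_backtraces(log_text)]` on a saved log gives one label per crash. Faults that land inside the allocator from unrelated call paths are often a symptom of heap corruption that happened earlier, so frame #0 alone does not necessarily identify the culprit.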

@tomzx,

A similar issue, #2048, was reported for a std::bad_alloc error and has been fixed.
Can you please try the following builds and let us know if the issue persists: 2.9.3, 2.10.1 and 2.11.0?

If the issue persists, please share the exact steps and commands to reproduce it so we can replicate it on our end. Thank you!

@singhniraj08 2.9.3, 2.10.1 and 2.11.0 all crash with a segmentation fault. 2.8.4 works fine.

We have two fairly similar models: one does not crash, while the other does. We're trying to identify the differences between them in the hope of finding the culprit.

I've also tested without tcmalloc. I get a core dump too, but I get the following errors before the crash (no backtrace this time, sorry):

2.9.3
free(): invalid next size (fast)
2022-12-09 21:50:24.204337: F external/org_tensorflow/tensorflow/core/framework/tensor.cc:729] Check failed: IsAligned() ptr = 0x20
corrupted double-linked list

2.10.1
munmap_chunk(): invalid pointer
malloc(): invalid size (unsorted)
free(): invalid next size (fast)
malloc(): unaligned tcache chunk detected

@tomzx,

Please share the code of the model that crashes and the commands to reproduce the issue so we can replicate it on our end. Thank you!

Closing this due to inactivity. Please take a look at the answers provided above, and feel free to reopen and post your comments if you still have questions. Thank you!

2.13.0 seems to work properly again, albeit with worse inference performance.
Setting TF_ENABLE_ONEDNN_OPTS=0 brings back the performance we originally had under TensorFlow Serving 2.8.
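
As an illustration of how the variable might be passed to the container, here is a minimal sketch that builds a `docker run` command line with TF_ENABLE_ONEDNN_OPTS set. The helper name `docker_run_command` is hypothetical, and the thread does not say how the variable was actually set (in GKE it would more likely go in the pod spec). Model mounts, ports and any allocator preload are left out and would go in `extra_args`.

```python
import shlex


def docker_run_command(image, onednn_enabled=False, extra_args=()):
    """Build a `docker run` argv that sets TF_ENABLE_ONEDNN_OPTS in the container.

    `extra_args` is a placeholder for volume mounts, ports and similar options.
    """
    flag = "1" if onednn_enabled else "0"
    return [
        "docker", "run", "--rm",
        "-e", f"TF_ENABLE_ONEDNN_OPTS={flag}",
        *extra_args,
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(docker_run_command("tensorflow/serving:2.13.0")))
```

Building the argument list separately from running it keeps the sketch free of side effects; the list could be handed to `subprocess.run` once the remaining options are filled in.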

@tomzx Why did TF_ENABLE_ONEDNN_OPTS=0 solve the issue?