tensorflow / serving

A flexible, high-performance serving system for machine learning models

Home Page: https://www.tensorflow.org/serving

TFServing 2.10.0 crashes when slicing tensor

battuzz opened this issue

Bug Report

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • TensorFlow Serving installed from (source or binary): docker image
  • TensorFlow Serving version: 2.10.0

Describe the problem

Loading a model that contains a slice() operation over a tensor immediately crashes the server with a std::bad_alloc:

2022-09-19 14:23:00.234069: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-09-19 14:23:00.234113: I tensorflow_serving/model_servers/server_core.cc:594]  (Re-)adding model: mymodel
2022-09-19 14:23:00.402487: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: mymodel version: 1}
2022-09-19 14:23:00.402531: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mymodel version: 1}
2022-09-19 14:23:00.402561: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mymodel version: 1}
2022-09-19 14:23:00.403523: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /models/mymodel/1
2022-09-19 14:23:00.405195: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
2022-09-19 14:23:00.405232: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /models/mymodel/1
2022-09-19 14:23:00.405965: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-19 14:23:00.424755: I external/org_tensorflow/tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-09-19 14:23:00.425703: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/usr/bin/tf_serving_entrypoint.sh: line 3:     8 Aborted                 tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"

Exact Steps to Reproduce

I generated the following model:

import tensorflow as tf

def predict(x):
    # Full-range slice over the last axis (start, stop, and step are all None).
    return x[..., slice(None, None, None)]

module = tf.Module()
module.predict = tf.function(predict, input_signature=[tf.TensorSpec(name='x', dtype=tf.float64, shape=(None, 2))])
tf.saved_model.save(module, '<saved_model_location>', signatures={'predict': module.predict})
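
The full-range slice above is effectively an identity, but it still lowers to a StridedSlice node in the traced graph rather than being optimized away at trace time. A minimal client-side check (a sketch, reusing the module defined above and assuming TF 2.x):

# List the op types in the traced graph; a 'StridedSlice' entry should show up.
graph = module.predict.get_concrete_function().graph
print([op.type for op in graph.get_operations()])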

I then started the server from the Docker image, using the following models.config:

model_config_list {
  config {
    name: 'mymodel'
    base_path: '/models/mymodel'
    model_platform: 'tensorflow'
    model_version_policy {
      all {}
    }
  }
}

and ran the container with:

docker run --rm --name mytfserving -t -p 9500:8500 -p 9501:8501 -v <my_saved_model_location>:/models tensorflow/serving:2.10.0 --model_config_file=/models/models.config
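
On versions where the server survives loading, the model state can be checked through the REST port mapped above (9501 on the host). A minimal sketch, assuming the standard /v1/models/<name> status endpoint; with 2.10.0 the container has already exited, so this request only fails with a connection error:

import json
import urllib.request

# Query the model status endpoint of the REST API (host port 9501 as mapped above).
with urllib.request.urlopen('http://localhost:9501/v1/models/mymodel') as resp:
    print(json.load(resp))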

Additional information

I also experimented with previous versions of TFServing:

  • TFServing 2.9.2: at first it loads the model and reaches the READY state, but on the first inference call (a request like the sketch after this list) the server crashes with the same error message
  • TFServing 2.8 or lower: the model loads without problems and everything works
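
For completeness, the "first call" above was an ordinary predict request against the running container; a minimal sketch, assuming the standard /v1/models/<name>:predict REST endpoint and the 9501 port mapping from the docker command:

import json
import urllib.request

# One predict request over REST; on 2.9.2 this first call crashes the server.
req = urllib.request.Request(
    'http://localhost:9501/v1/models/mymodel:predict',
    data=json.dumps({'instances': [[1.0, 2.0]]}).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))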

Loading the model from Python and running inference works fine:

import tensorflow as tf
model = tf.saved_model.load('<saved_model_location>')

model.signatures['predict'](tf.constant([[1., 2.]], dtype=tf.float64))  # --> returns {'output_0': <tf.Tensor: shape=(1, 2), dtype=float64, numpy=array([[1., 2.]])>}

@battuzz,

This issue has already been reported by other users. As a workaround, you can downgrade TF Serving to 2.8.2.
Please close this issue and follow the #2048 thread.

This doesn't seem to be exactly the same issue, though.
I'll wait for a while, and once #2048 is resolved I'll check whether this one gets fixed as well.

Hi,

If possible, could you share the stack trace of the crash by following those instructions?

M.

Using TFServing 2.10.0 with the above model, the server crashes immediately while trying to load the model (so the curl request was not needed).

Here's the stack trace:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fcd7b51e7f1 in __GI_abort () at abort.c:79
#2  0x00007fcd7bb73957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fcd7bb79ae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fcd7bb79b21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fcd7bb79d54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fcd7bba2012 in std::__throw_bad_alloc() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x000055cd7b9fc3e3 in void absl::lts_20220623::inlined_vector_internal::Storage<long, 4ul, std::allocator<long> >::Resize<absl::lts_20220623::inlined_vector_internal::DefaultValueAdapter<std::allocator<long> > >(absl::lts_20220623::inlined_vector_internal::DefaultValueAdapter<std::allocator<long> >, unsigned long) ()
#8  0x000055cd8707ffc2 in tensorflow::ValidateStridedSliceOp(tensorflow::Tensor const*, tensorflow::Tensor const*, tensorflow::Tensor const&, tensorflow::PartialTensorShape const&, int, int, int, int, int, tensorflow::PartialTensorShape*, tensorflow::PartialTensorShape*, bool*, bool*, bool*, absl::lts_20220623::InlinedVector<long, 4ul, std::allocator<long> >*, absl::lts_20220623::InlinedVector<long, 4ul, std::allocator<long> >*, absl::lts_20220623::InlinedVector<long, 4ul, std::allocator<long> >*, tensorflow::StridedSliceShapeSpec*) ()
#9  0x000055cd81847db7 in tensorflow::{lambda(tensorflow::shape_inference::InferenceContext*)#34}::operator()(tensorflow::shape_inference::InferenceContext*) const [clone .isra.747] ()
#10 0x000055cd81848344 in std::_Function_handler<tensorflow::Status (tensorflow::shape_inference::InferenceContext*), tensorflow::{lambda(tensorflow::shape_inference::InferenceContext*)#34}>::_M_invoke(std::_Any_data const&, tensorflow::shape_inference::InferenceContext*&&) ()
#11 0x000055cd870b4322 in tensorflow::shape_inference::InferenceContext::Run(std::function<tensorflow::Status (tensorflow::shape_inference::InferenceContext*)> const&) ()
#12 0x000055cd81c8ffd1 in mlir::tfg::InferReturnTypeComponentsForTFOp(llvm::Optional<mlir::Location>, mlir::Operation*, mlir::ValueRange, long, llvm::function_ref<mlir::Attribute (mlir::Value)>, llvm::function_ref<tensorflow::shape_inference::ShapeHandle (tensorflow::shape_inference::InferenceContext&, mlir::OpResult)>, llvm::function_ref<mlir::Type (int)>, llvm::function_ref<tensorflow::Status (mlir::Operation*, llvm::StringRef, tensorflow::OpRegistrationData const*, bool, google::protobuf::Map<std::string, tensorflow::AttrValue>*)>, llvm::SmallVectorImpl<mlir::ShapedTypeComponents>&) ()
#13 0x000055cd81c890e6 in mlir::tfg::ShapeInference::runOnOperation()::{lambda(mlir::Operation*)#3}::operator()(mlir::Operation*) const ()
#14 0x000055cd81c89734 in mlir::WalkResult llvm::function_ref<mlir::WalkResult (mlir::Operation*)>::callback_fn<mlir::tfg::ShapeInference::runOnOperation()::{lambda(mlir::Operation*)#4}>(long, mlir::Operation*) ()
#15 0x000055cd86cb4faf in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#16 0x000055cd86cb5046 in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#17 0x000055cd86cb5046 in mlir::detail::walk(mlir::Operation*, llvm::function_ref<mlir::WalkResult (mlir::Operation*)>, mlir::WalkOrder) ()
#18 0x000055cd81c89814 in mlir::tfg::ShapeInference::runOnOperation() ()
#19 0x000055cd86c6f3e2 in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) ()
#20 0x000055cd86c6f9e2 in mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*)
    ()
#21 0x000055cd86c704a3 in mlir::PassManager::run(mlir::Operation*) ()
#22 0x000055cd81c21d9e in mlir::tfg::TFGGrapplerOptimizer::Optimize(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem const&, tensorflow::GraphDef*) ()
#23 0x000055cd81a1dbf4 in tensorflow::grappler::MetaOptimizer::RunOptimizer(tensorflow::grappler::GraphOptimizer*, tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem*, tensorflow::GraphDef*, tensorflow::grappler::MetaOptimizer::GraphOptimizationResult*) ()
#24 0x000055cd81a1f43d in tensorflow::grappler::MetaOptimizer::OptimizeGraph(std::vector<std::unique_ptr<tensorflow::grappler::GraphOptimizer, std::default_delete<tensorflow::grappler::GraphOptimizer> >, std::allocator<std::unique_ptr<tensorflow::grappler::GraphOptimizer, std::default_delete<tensorflow::grappler::GraphOptimizer> > > > const&, tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem&&, tensorflow::GraphDef*) ()
#25 0x000055cd81a20be1 in tensorflow::grappler::MetaOptimizer::OptimizeGraph(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem&&, tensorflow::GraphDef*) ()
#26 0x000055cd81a211c8 in tensorflow::grappler::MetaOptimizer::OptimizeConsumeItem(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem&&, tensorflow::GraphDef*) ()
#27 0x000055cd81a23581 in tensorflow::grappler::RunMetaOptimizer(tensorflow::grappler::GrapplerItem&&, tensorflow::ConfigProto const&, tensorflow::DeviceBase*, tensorflow::grappler::Cluster*, tensorflow::GraphDef*) ()

Thanks for the stack trace.

This looks like the same problem as the one solved in this commit (or at least, it is the same failing function). To confirm whether this is the case, wait for the next TF Serving nightly build, compile TF Serving nightly from source, or use a pre-compiled version of TF Serving nightly.

@achoum I tried the precompiled version of 2.11-nightly (the one with TF DF) and it works, thank you!
Do you have an estimate of when the fix will land in the next release? Will it also be backported to TFServing 2.9 and 2.10?

We ran into the same issue raised by @battuzz. At the moment we can still use TF 2.8.X, but we would like to switch to a newer version in the coming weeks.
Unfortunately, we cannot upgrade to 2.11 or use a nightly release in our environments.
It would be much appreciated if the fix could also be made available for TF 2.9.X.
Do you think that would be feasible?

TensorFlow 2.9 and 2.10 were patched [1, 2]. I'll update this issue if/when the corresponding TensorFlow Serving releases are also patched.

@battuzz,

TF Serving 2.11.0 has been released. Please try the new release and let us know whether your issue has been resolved. Thank you!

I tested with

  • 2.9.3
  • 2.10.1
  • 2.11.0

and the issue is fixed in all of them.

Thank you!
Andrea