tensorflow / serving

A flexible, high-performance serving system for machine learning models

Home Page: https://www.tensorflow.org/serving


details of inter_op and intra_op parallelism threads

mehransi opened this issue · comments

TensorFlow Serving exposes two configuration parameters for CPU utilization, tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism, and tuning them can have a great impact on model server performance (throughput, latency). I could not find good documentation for them on the TensorFlow Serving website. Can you please provide a detailed definition of these parameters? I saw that here you defined tensorflow_inter_op_parallelism as a thread pool for independent ops. For those of us who are not ML engineers, some questions arise:

  1. What are independent ops?
  2. Is there a way to identify what operations a model has (for example, ResNet50)? I know intra_op_parallelism can be used to parallelize an operation like matrix multiplication, but what independent operations exist to utilize the inter_op_parallelism thread pool?
  3. How are these thread pools related to rest_api_num_threads? Are the thread pools shared between the different requests going to the model server?

@mehransi,

For documentation on tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism, you can refer to the TF config.proto file and the General Best Practices doc.
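
As far as I can tell, these two Serving flags populate the intra_op_parallelism_threads and inter_op_parallelism_threads fields of the session ConfigProto. If you want to experiment with the same knobs in plain TensorFlow (outside of Serving), a minimal sketch looks like this; the thread counts here are arbitrary examples:

```python
import tensorflow as tf

# Equivalent settings in plain TF 2.x; these must be called before any ops run.
tf.config.threading.set_intra_op_parallelism_threads(4)  # threads *within* one op
tf.config.threading.set_inter_op_parallelism_threads(2)  # ops running concurrently

# The same fields on the TF1-style session ConfigProto, which (as far as I
# understand) is what the Serving flags ultimately set.
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=4,
    inter_op_parallelism_threads=2,
)
print(config)
```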

Answering your questions below.

  1. Independent ops are operations that are independent in your TensorFlow graph: because there is no directed path between them in the dataflow graph, TensorFlow will attempt to run them concurrently, using a thread pool with inter_op_parallelism_threads threads. (The first sketch after this list illustrates this.) For more details, refer here.

  2. You can search for the ResNet50 architecture to identify the model's operations; the second sketch after this list shows one way to list them programmatically. Please refer here.

  3. rest_api_num_threads is the number of threads used for HTTP/REST API processing. If not set, it is chosen automatically based on the number of CPUs. Please refer here.
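
To make item 1 concrete, here is a small sketch in plain TF 2.x (not Serving; shapes and thread counts are made up). The two matmuls have no data dependency on each other, so the runtime is free to schedule them concurrently on the inter-op pool, while each matmul is itself split across the intra-op pool:

```python
import tensorflow as tf

# Same knobs as the Serving flags, but set directly in TF 2.x.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(2)

@tf.function
def independent_work(a, b, c, d):
    # x and y have no directed path between them in the dataflow graph,
    # so the runtime may run them concurrently (inter-op parallelism).
    x = tf.matmul(a, b)  # each matmul is itself parallelized internally (intra-op)
    y = tf.matmul(c, d)
    # The sum depends on both x and y, so it has to wait for them.
    return x + y

m = tf.random.normal([1024, 1024])
print(independent_work(m, m, m, m).shape)
```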
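
For item 2, one way to see exactly which ops a traced model contains is to inspect its concrete function's graph. A sketch using the stock Keras ResNet50 (weights skipped to avoid the download):

```python
import collections
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # weights=None skips the download

# Trace the model into a concrete graph and count the op types it contains.
fn = tf.function(model).get_concrete_function(
    tf.TensorSpec([1, 224, 224, 3], tf.float32))
op_types = collections.Counter(op.type for op in fn.graph.get_operations())

for op_type, count in op_types.most_common(10):
    print(op_type, count)  # e.g. Conv2D, Relu, FusedBatchNormV3, ...
```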

Thanks @singhniraj08.
For the third question: there are multiple threads accepting inference requests, so is the inter_op_parallelism thread pool shared between them (between all requests to the server), or does each REST API thread create a separate inter_op_parallelism thread pool?

Hi all,
Thanks for the explanations above. I am also confused about the behavior of inter_op_parallelism.
I tested different settings of inter and intra on a machine with 32 cores:

  • intra=32, inter=32: all 32 cores are used (as expected).
  • intra=2 (or 4, ...), inter=32: the number of cores used matches intra.
  • BUT if intra=1, inter=32: all 32 cores are fully used again.

Could you explain why inter behaves differently when intra=1?
Best,
Peini

@peiniliu
I think intra=1 does not mean we have an intra_op_parallelism thread pool of size 1; it likely means the intra_op_parallelism thread pool is disabled, so the processing of ops is not handed off to a separate pool.
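
One way to sanity-check this outside of Serving is to reproduce the same settings with the plain tf.config.threading API and watch CPU usage while a batch of independent ops runs. A rough sketch (the thread counts mirror the experiment above; this does not go through the model server, so it only approximates Serving's behavior):

```python
import time
import tensorflow as tf

# Mirror the intra=1, inter=32 setting from the experiment above.
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(32)

@tf.function
def burn():
    # 32 matmuls with no dependencies between them: with inter=32 they can
    # all be scheduled concurrently, so core usage can climb even when intra=1.
    mats = [tf.random.normal([2048, 2048]) for _ in range(32)]
    return tf.add_n([tf.matmul(m, m) for m in mats])

start = time.time()
burn()
print("elapsed:", time.time() - start)  # watch CPU usage (e.g. with htop) while this runs
```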

@mehransi @peiniliu,

As your question is not a bug/performance/feature request, I would recommend opening this issue in the TensorFlow Forum, as there is a larger community there. Thanks!

Closing this due to inactivity. Please take a look at the answers provided above, and feel free to reopen and post your comments (if you still have queries on this). Thank you!