tensorflow / serving

A flexible, high-performance serving system for machine learning models

Home Page: https://www.tensorflow.org/serving


details of inter_op and intra_op parallelism threads

mehransi opened this issue · comments

TensorFlow Serving exposes two configuration parameters for CPU utilization, tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism, and tuning them can have a great impact on model server performance (throughput, latency). I could not find good documentation for them on the TensorFlow Serving website. Can you please provide a detailed definition of these parameters? I saw that here you defined tensorflow_inter_op_parallelism as a thread pool for independent ops. For those of us who are not ML engineers, some questions arise:

  1. What are independent ops?
  2. Is there a way to identify what operations a model has (for example, ResNet50)? I know intra_op_parallelism can be used to parallelize an operation like matrix multiplication, but what independent operations exist to utilize the inter_op_parallelism thread pool?
  3. How are these thread pools related to rest_api_num_threads? Are the thread pools shared between the different requests going to the model server?

@mehransi,

For documentation on tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism, you can refer to the TF config.proto file and the General Best Practices doc.
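
As far as I can tell, these two Serving flags populate the intra_op_parallelism_threads and inter_op_parallelism_threads fields of the session ConfigProto. If you want to experiment with the same knobs in plain TensorFlow (outside of Serving), a minimal sketch looks like this; the thread counts here are arbitrary examples:

```python
import tensorflow as tf

# Equivalent settings in plain TF 2.x; these must be called before any ops run.
tf.config.threading.set_intra_op_parallelism_threads(4)  # threads *within* one op
tf.config.threading.set_inter_op_parallelism_threads(2)  # ops running concurrently

# The same fields on the TF1-style session ConfigProto, which (as far as I
# understand) is what the Serving flags ultimately set.
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=4,
    inter_op_parallelism_threads=2,
)
print(config)
```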

Answering your questions below.

  1. Independent ops are operations that are independent in your TensorFlow graph: because there is no directed path between them in the dataflow graph, TensorFlow will attempt to run them concurrently, using a thread pool with inter_op_parallelism_threads threads. (The first sketch after this list illustrates this.) For more details, refer here.

  2. You can search for the ResNet50 architecture to identify the model's operations; the second sketch after this list shows one way to list them programmatically. Please refer here.

  3. rest_api_num_threads is the number of threads used for HTTP/REST API processing. If not set, it is chosen automatically based on the number of CPUs. Please refer here.
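
To make item 1 concrete, here is a small sketch in plain TF 2.x (not Serving; shapes and thread counts are made up). The two matmuls have no data dependency on each other, so the runtime is free to schedule them concurrently on the inter-op pool, while each matmul is itself split across the intra-op pool:

```python
import tensorflow as tf

# Same knobs as the Serving flags, but set directly in TF 2.x.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(2)

@tf.function
def independent_work(a, b, c, d):
    # x and y have no directed path between them in the dataflow graph,
    # so the runtime may run them concurrently (inter-op parallelism).
    x = tf.matmul(a, b)  # each matmul is itself parallelized internally (intra-op)
    y = tf.matmul(c, d)
    # The sum depends on both x and y, so it has to wait for them.
    return x + y

m = tf.random.normal([1024, 1024])
print(independent_work(m, m, m, m).shape)
```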
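
For item 2, one way to see exactly which ops a traced model contains is to inspect its concrete function's graph. A sketch using the stock Keras ResNet50 (weights skipped to avoid the download):

```python
import collections
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # weights=None skips the download

# Trace the model into a concrete graph and count the op types it contains.
fn = tf.function(model).get_concrete_function(
    tf.TensorSpec([1, 224, 224, 3], tf.float32))
op_types = collections.Counter(op.type for op in fn.graph.get_operations())

for op_type, count in op_types.most_common(10):
    print(op_type, count)  # e.g. Conv2D, Relu, FusedBatchNormV3, ...
```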

Thanks @singhniraj08.
For the third question: there are multiple threads accepting inference requests, so is the inter_op_parallelism thread pool shared between them (between all requests to the server), or does each REST API thread create a separate inter_op_parallelism thread pool?

Hi all,
Thanks for the explanations above. I am also confused about the behavior of inter_op_parallelism.
I tested different settings of inter and intra on a machine with 32 cores:

  • intra=32, inter=32: all 32 cores are used (as expected).
  • intra=2 (or 4, ...), inter=32: the number of cores used matches intra.
  • BUT if intra=1, inter=32: all 32 cores are fully used again.

Could you explain why inter behaves differently when intra=1?
Best,
Peini

@peiniliu
I think intra=1 does not mean we have an intra_op_parallelism thread pool of size 1; it likely means the intra_op_parallelism thread pool is disabled, so the processing of ops is not handed off to a separate pool.
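
One way to sanity-check this outside of Serving is to reproduce the same settings with the plain tf.config.threading API and watch CPU usage while a batch of independent ops runs. A rough sketch (the thread counts mirror the experiment above; this does not go through the model server, so it only approximates Serving's behavior):

```python
import time
import tensorflow as tf

# Mirror the intra=1, inter=32 setting from the experiment above.
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(32)

@tf.function
def burn():
    # 32 matmuls with no dependencies between them: with inter=32 they can
    # all be scheduled concurrently, so core usage can climb even when intra=1.
    mats = [tf.random.normal([2048, 2048]) for _ in range(32)]
    return tf.add_n([tf.matmul(m, m) for m in mats])

start = time.time()
burn()
print("elapsed:", time.time() - start)  # watch CPU usage (e.g. with htop) while this runs
```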

@mehransi @peiniliu,

As your question is not a bug/performance/feature request, I would recommend opening this issue in the TensorFlow Forum, as there is a larger community there. Thanks!

Closing this due to inactivity. Please take a look at the answers provided above, and feel free to reopen and post your comments (if you still have queries on this). Thank you!