triton-inference-server / fastertransformer_backend

Questions about different intra-node settings for fastertransformer_backend and FasterTransformer

YJHMITWEB opened this issue

Hi, I am wondering why, in FasterTransformer, intra-node GPUs are bound at the process level, while in fastertransformer_backend they are bound at the thread level. Since their source code is the same, why does the intra-node binding differ?

Multi-process is more flexible and stable because it works for both multi-GPU and multi-node deployments.
But in the Triton server, we want multiple model instances to share the same model, and hence we need to use multi-threading.
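For intuition, here is a minimal C++ sketch (hypothetical types, not FasterTransformer's real classes) of the difference: with process-level binding, each rank loads its own weights in its own address space, while with thread-level binding the instance threads can all point at one copy loaded once in the shared process.

```cpp
// A minimal sketch (hypothetical types, not FasterTransformer's real classes)
// of the two binding schemes discussed above.

#include <memory>
#include <thread>
#include <vector>

struct ModelWeights { /* device buffers holding one GPU's weight shard */ };

// Process-level binding (standalone FasterTransformer): each MPI rank is a
// separate process with its own address space, so every rank loads its own
// shard. Because ranks are plain processes, they can also live on different
// nodes, which is what makes this scheme flexible for multi-node runs.
ModelWeights load_weights_for_rank(int /*mpi_rank*/) {
    return ModelWeights{};           // each process pays the load cost itself
}

// Thread-level binding (fastertransformer_backend): every model instance is a
// thread inside the single Triton server process, so all instances can keep a
// pointer to one ModelWeights object instead of loading it again.
void instance_thread(std::shared_ptr<const ModelWeights> shared) {
    (void)shared;                    // serve requests using the shared weights
}

int main() {
    auto weights = std::make_shared<const ModelWeights>();   // loaded once

    std::vector<std::thread> instances;
    for (int i = 0; i < 2; ++i)                 // two instances, same weights
        instances.emplace_back(instance_thread, weights);
    for (auto& t : instances) t.join();
}
```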

Hi @byshiue,

Thanks for the reply. I am a little bit confused here, though. When tensor parallelism is enabled, FasterTransformer expects it to happen intra-node. For example, if each node has 2 GPUs and we set tensor_parallel=2, then when the model is loaded the weights are sliced into two parts, and each GPU loads one part. In that case, what do you mean by "we want multiple model instances to share the same model, and hence we need to use multi-threading"? Here, each thread is responsible for different weights.

Is my understanding correct?
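For concreteness, here is an illustrative sketch of the slicing described above (a hypothetical helper, not FasterTransformer's actual loading code): a weight matrix is split by columns so that each of the two tensor-parallel GPUs keeps only its own half.

```cpp
// Illustrative sketch only (hypothetical helper, not FasterTransformer code):
// with tensor_parallel=2, a weight matrix can be split so that each of the
// two GPUs holds one half, e.g. a column slice.

#include <cstddef>
#include <vector>

// Return the column slice of a row-major [rows, cols] matrix that belongs to
// tensor-parallel rank `tp_rank` out of `tp_size` ranks.
std::vector<float> slice_columns(const std::vector<float>& full,
                                 std::size_t rows, std::size_t cols,
                                 int tp_rank, int tp_size) {
    const std::size_t cols_per_rank = cols / tp_size;
    const std::size_t col_begin     = tp_rank * cols_per_rank;

    std::vector<float> shard(rows * cols_per_rank);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols_per_rank; ++c)
            shard[r * cols_per_rank + c] = full[r * cols + col_begin + c];
    return shard;  // each GPU would copy only its own shard to device memory
}
```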

Multi-instance is independent of TP.
It is simpler to demonstrate on a single GPU. Assume we have a single GPU: we create a GPT model on that GPU, and then create 2 model instances based on the GPT model. These two instances can then handle different requests while sharing the same weights.
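As a rough illustration (hypothetical class names, not the backend's actual API), the "one GPT model, two instances" setup looks like this:

```cpp
// Hypothetical sketch of the single-GPU example: one model, two instances.

#include <cstdio>
#include <memory>
#include <string>

struct GptModel {
    // In the real backend, the weights are loaded onto the GPU once, here.
};

class ModelInstance {
public:
    explicit ModelInstance(std::shared_ptr<const GptModel> model)
        : model_(std::move(model)) {}     // shares the weights; owns only its
                                          // own per-request buffers
    void handle(const std::string& request) const {
        std::printf("serving: %s\n", request.c_str());
    }
private:
    std::shared_ptr<const GptModel> model_;
};

int main() {
    auto gpt = std::make_shared<const GptModel>();  // created once on the GPU

    ModelInstance a(gpt), b(gpt);   // two instances built on the same model
    a.handle("request #1");         // each instance can serve a different
    b.handle("request #2");         // request while reusing the same weights
}
```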

Oh, I see. I totally get it now: multiple instances basically serve to handle different requests. Thanks for the explanation!