triton-inference-server / fastertransformer_backend

Questions about different intra-node settings for fastertransformer_backend and FasterTransformer

YJHMITWEB opened this issue

Hi, I am wondering why, in FasterTransformer, intra-node GPUs are bound at the process level, while in fastertransformer_backend they are bound at the thread level. Since their source code is the same, why does the intra-node binding differ?

Multi-process is more flexible and stable because it works for both multi-GPU and multi-node deployments.
But in the Triton server, we want multiple model instances to share the same model, and hence we need to use multi-threading.
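For intuition, here is a minimal C++ sketch (hypothetical types, not FasterTransformer's real classes) of the difference: with process-level binding, each rank loads its own weights in its own address space, while with thread-level binding the instance threads can all point at one copy loaded once in the shared process.

```cpp
// A minimal sketch (hypothetical types, not FasterTransformer's real classes)
// of the two binding schemes discussed above.

#include <memory>
#include <thread>
#include <vector>

struct ModelWeights { /* device buffers holding one GPU's weight shard */ };

// Process-level binding (standalone FasterTransformer): each MPI rank is a
// separate process with its own address space, so every rank loads its own
// shard. Because ranks are plain processes, they can also live on different
// nodes, which is what makes this scheme flexible for multi-node runs.
ModelWeights load_weights_for_rank(int /*mpi_rank*/) {
    return ModelWeights{};           // each process pays the load cost itself
}

// Thread-level binding (fastertransformer_backend): every model instance is a
// thread inside the single Triton server process, so all instances can keep a
// pointer to one ModelWeights object instead of loading it again.
void instance_thread(std::shared_ptr<const ModelWeights> shared) {
    (void)shared;                    // serve requests using the shared weights
}

int main() {
    auto weights = std::make_shared<const ModelWeights>();   // loaded once

    std::vector<std::thread> instances;
    for (int i = 0; i < 2; ++i)                 // two instances, same weights
        instances.emplace_back(instance_thread, weights);
    for (auto& t : instances) t.join();
}
```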

Hi @byshiue,

Thanks for the reply. I am a little bit confused here, though. When tensor parallelism is enabled, FasterTransformer expects it to happen intra-node. For example, if each node has 2 GPUs and we set tensor_parallel=2, then when the model is loaded the weights are sliced into two parts, and each GPU loads one part. In that case, what do you mean by "we want multiple model instances to share the same model, and hence we need to use multi-threading"? Here, each thread is responsible for different weights.

Is my understanding correct?
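For concreteness, here is an illustrative sketch of the slicing described above (a hypothetical helper, not FasterTransformer's actual loading code): a weight matrix is split by columns so that each of the two tensor-parallel GPUs keeps only its own half.

```cpp
// Illustrative sketch only (hypothetical helper, not FasterTransformer code):
// with tensor_parallel=2, a weight matrix can be split so that each of the
// two GPUs holds one half, e.g. a column slice.

#include <cstddef>
#include <vector>

// Return the column slice of a row-major [rows, cols] matrix that belongs to
// tensor-parallel rank `tp_rank` out of `tp_size` ranks.
std::vector<float> slice_columns(const std::vector<float>& full,
                                 std::size_t rows, std::size_t cols,
                                 int tp_rank, int tp_size) {
    const std::size_t cols_per_rank = cols / tp_size;
    const std::size_t col_begin     = tp_rank * cols_per_rank;

    std::vector<float> shard(rows * cols_per_rank);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols_per_rank; ++c)
            shard[r * cols_per_rank + c] = full[r * cols + col_begin + c];
    return shard;  // each GPU would copy only its own shard to device memory
}
```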

Multi-instance is independent of TP.
It is simpler to demonstrate on a single GPU. Assume we have a single GPU: we create a GPT model on that GPU, and then create 2 model instances based on the GPT model. These two instances can then handle different requests while sharing the same weights.
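As a rough illustration (hypothetical class names, not the backend's actual API), the "one GPT model, two instances" setup looks like this:

```cpp
// Hypothetical sketch of the single-GPU example: one model, two instances.

#include <cstdio>
#include <memory>
#include <string>

struct GptModel {
    // In the real backend, the weights are loaded onto the GPU once, here.
};

class ModelInstance {
public:
    explicit ModelInstance(std::shared_ptr<const GptModel> model)
        : model_(std::move(model)) {}     // shares the weights; owns only its
                                          // own per-request buffers
    void handle(const std::string& request) const {
        std::printf("serving: %s\n", request.c_str());
    }
private:
    std::shared_ptr<const GptModel> model_;
};

int main() {
    auto gpt = std::make_shared<const GptModel>();  // created once on the GPU

    ModelInstance a(gpt), b(gpt);   // two instances built on the same model
    a.handle("request #1");         // each instance can serve a different
    b.handle("request #2");         // request while reusing the same weights
}
```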

Oh, I see. I totally get it now: multiple instances basically serve to handle different requests. Thanks for the explanation!