OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

Deploying with model and tensor parallelism

subhalingamd opened this issue · comments

Hello,

Could you provide guidance on implementing model and tensor parallelism in a deployment setting, such as with NVIDIA Triton Inference Server?

While it worked when running in script mode, as described here, I'm uncertain how to make the process listen for requests and perform inference as they arrive.

I haven't built a service with this feature yet, but the idea is to create the service as in the normal mode while keeping all of the server implementation in the main process by checking:

if ctranslate2.MpiInfo.getCurRank() == 0:
    ...

Then, instead of running the service with a plain command, you run it with mpirun. How to set up the network depends on your goal and the environment you are deploying to; follow the MPI guidance for that.
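
A minimal sketch of how that could look, under assumptions of my own: the script name, process count, and model path below are placeholders, and (as the later replies clarify) the generation call itself still has to run in every rank, not only in rank 0.

# launched with something like: mpirun -np 2 python3 serve.py
import ctranslate2

# "model_dir" is a placeholder path
generator = ctranslate2.Generator("model_dir", device="cuda", tensor_parallel=True)

if ctranslate2.MpiInfo.getCurRank() == 0:
    # only the first rank hosts the server / request-handling logic
    ...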

Hi @minhthuc2502, thanks for the quick response.

This would require calling the inference function inside the if ctranslate2.MpiInfo.getCurRank() == 0 block, right? In that case, I think the other processes never receive the data. I tried this, but it just keeps waiting.

I am adding a very simplified workflow/pseudo-code below.

import ctranslate2
import transformers

generator = ctranslate2.Generator(..., tensor_parallel=True)
tokenizer = transformers.AutoTokenizer.from_pretrained(...)

def inference(text):
    start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = generator.generate_batch([start_tokens], ...)
    return results

if ctranslate2.MpiInfo.getCurRank() == 0:
    ...

    text = ...
    results = inference(text)

    ...

Things work fine when this is run as a script without the if ctranslate2.MpiInfo.getCurRank() == 0 block.

Please let me know if this is different from what you had suggested. Thanks.

You have to run the inference function in all processes and then only take the result from the first process:

import ctranslate2
import transformers

generator = ctranslate2.Generator(..., tensor_parallel=True)
tokenizer = transformers.AutoTokenizer.from_pretrained(...)

def inference(text):
    start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = generator.generate_batch([start_tokens], ...)
    return results

result = inference(text)
if ctranslate2.MpiInfo.getCurRank() == 0:
    ...

    # handle the result

    ...
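
To tie this back to the original serving question, here is a rough sketch of one way a long-running loop could get the request text to every rank before the collective generate_batch call. This is only an illustration under assumptions of my own: mpi4py is assumed to share the MPI runtime started by mpirun, and receive_request / send_response are hypothetical helpers standing in for whatever the serving framework (e.g. Triton) provides.

import ctranslate2
import transformers
from mpi4py import MPI  # assumption: reuses the MPI runtime launched by mpirun

comm = MPI.COMM_WORLD

# placeholder model path and tokenizer name
generator = ctranslate2.Generator("model_dir", device="cuda", tensor_parallel=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("model_name")

def inference(text):
    start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    return generator.generate_batch([start_tokens])

while True:
    text = None
    if ctranslate2.MpiInfo.getCurRank() == 0:
        text = receive_request()  # hypothetical helper: pull the next request
    # every rank needs the same input, so broadcast it from rank 0
    text = comm.bcast(text, root=0)
    results = inference(text)  # runs in all processes
    if ctranslate2.MpiInfo.getCurRank() == 0:
        send_response(results)  # hypothetical helper: return the result to the client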

Thank you for the assistance. I'm closing the issue since I've managed to get it working.