OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

Deploying with model and tensor parallelism

subhalingamd opened this issue · comments

Hello,

Could you provide guidance on implementing model and tensor parallelism in a deployment setting, such as with NVIDIA Triton Inference Server?

While it worked when running in script mode, as described here, I'm uncertain how to make the process listen for requests and perform inference as they arrive.

I haven't built a service with this feature yet, but the idea is to create the service as in the normal mode while keeping all of the server implementation in the main process by checking:

if ctranslate2.MpiInfo.getCurRank() == 0:
    ...

Then, instead of running the service with a plain command, you run it with mpirun. How to set up the network depends on your goal and the environment you are deploying to; follow the MPI guidance for that.
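
A minimal sketch of how that could look, under assumptions of my own: the script name, process count, and model path below are placeholders, and (as the later replies clarify) the generation call itself still has to run in every rank, not only in rank 0.

# launched with something like: mpirun -np 2 python3 serve.py
import ctranslate2

# "model_dir" is a placeholder path
generator = ctranslate2.Generator("model_dir", device="cuda", tensor_parallel=True)

if ctranslate2.MpiInfo.getCurRank() == 0:
    # only the first rank hosts the server / request-handling logic
    ...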

Hi @minhthuc2502, thanks for the quick response.

This would require calling the inference function inside the if ctranslate2.MpiInfo.getCurRank() == 0 block, right? In that case, I think the other processes never receive the data. I tried this, but it just keeps waiting.

I am adding a very simplified workflow/pseudo-code below.

import ctranslate2
import transformers

generator = ctranslate2.Generator(..., tensor_parallel=True)
tokenizer = transformers.AutoTokenizer.from_pretrained(...)

def inference(text):
    start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = generator.generate_batch([start_tokens], ...)
    return results

if ctranslate2.MpiInfo.getCurRank() == 0:
    ...

    text = ...
    results = inference(text)

    ...

Things work fine when this is run as a script without the if ctranslate2.MpiInfo.getCurRank() == 0 block.

Please let me know if this is different from what you had suggested. Thanks.

You have to run the inference function in all processes and then only take the result from the first process:

import ctranslate2
import transformers

generator = ctranslate2.Generator(..., tensor_parallel=True)
tokenizer = transformers.AutoTokenizer.from_pretrained(...)

def inference(text):
    start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = generator.generate_batch([start_tokens], ...)
    return results

result = inference(text)
if ctranslate2.MpiInfo.getCurRank() == 0:
    ...

    # handle the result

    ...
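
To tie this back to the original serving question, here is a rough sketch of one way a long-running loop could get the request text to every rank before the collective generate_batch call. This is only an illustration under assumptions of my own: mpi4py is assumed to share the MPI runtime started by mpirun, and receive_request / send_response are hypothetical helpers standing in for whatever the serving framework (e.g. Triton) provides.

import ctranslate2
import transformers
from mpi4py import MPI  # assumption: reuses the MPI runtime launched by mpirun

comm = MPI.COMM_WORLD

# placeholder model path and tokenizer name
generator = ctranslate2.Generator("model_dir", device="cuda", tensor_parallel=True)
tokenizer = transformers.AutoTokenizer.from_pretrained("model_name")

def inference(text):
    start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    return generator.generate_batch([start_tokens])

while True:
    text = None
    if ctranslate2.MpiInfo.getCurRank() == 0:
        text = receive_request()  # hypothetical helper: pull the next request
    # every rank needs the same input, so broadcast it from rank 0
    text = comm.bcast(text, root=0)
    results = inference(text)  # runs in all processes
    if ctranslate2.MpiInfo.getCurRank() == 0:
        send_response(results)  # hypothetical helper: return the result to the client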

Thank you for the assistance. I'm closing the issue since I've managed to get it working.