lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

[Usage] How to run inference with multiple GPUs on a single machine? Is batch inference possible?

ee2110 opened this issue

Hi, I have a problem using Vicuna-13b-v1.3 for inference on multiple GPUs. Could anyone please provide an example of code for multi-GPU inference without the CLI? Also, is it possible to do batch inference (e.g., pass in a list of prompts and get back a list of answers)?

I have tried setting num_gpus=2, but it seems to still compute on a single GPU instead of two.
Here is the code:

import torch

from fastchat.model import load_model, get_conversation_template


class Vicuna():
    def __init__(self):
        print('Initialize Vicuna...')
        # num_gpus=2 asks FastChat to shard the weights across two GPUs
        self.model, self.tokenizer = load_model(
            'lmsys/vicuna-13b-v1.3',
            device='cuda',
            num_gpus=2,
        )

    @torch.inference_mode()
    def respond(self, input_msg):
        conv = get_conversation_template('lmsys/vicuna-13b-v1.3')
        conv.append_message(conv.roles[0], input_msg)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()

        input_ids = self.tokenizer([prompt]).input_ids
        output_ids = self.model.generate(
            torch.as_tensor(input_ids).cuda(),
            do_sample=True,
            temperature=0.001,
            repetition_penalty=1.0,
            max_new_tokens=512,
        )

        output_ids = output_ids[0][len(input_ids[0]) :]
        outputs = self.tokenizer.decode(
            output_ids, skip_special_tokens=True, spaces_between_special_tokens=False
        )
        return outputs


def main():
    vicuna_model = Vicuna()
    answer = vicuna_model.respond("Who are you?")
    print(answer)


if __name__ == "__main__":
    main()
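For reference, the batch half of the question can be handled without the CLI by padding a list of prompts and running them through model.generate in one call. The sketch below reuses load_model and get_conversation_template; it is not an official FastChat batch API, and the max_gpu_memory value and pad-token handling are illustrative assumptions. Capping per-GPU memory is also what typically makes the 13B weights spread across both GPUs instead of fitting entirely on the first one.

import torch

from fastchat.model import load_model, get_conversation_template

# Load once, sharded over two GPUs; the 13GiB cap is an example value.
model, tokenizer = load_model(
    'lmsys/vicuna-13b-v1.3',
    device='cuda',
    num_gpus=2,
    max_gpu_memory='13GiB',
)

# Llama-based tokenizers ship without a pad token; left padding keeps the
# generated tokens aligned at the end of every sequence in the batch.
tokenizer.padding_side = 'left'
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token


def build_prompt(message):
    conv = get_conversation_template('lmsys/vicuna-13b-v1.3')
    conv.append_message(conv.roles[0], message)
    conv.append_message(conv.roles[1], None)
    return conv.get_prompt()


@torch.inference_mode()
def respond_batch(messages, max_new_tokens=512):
    prompts = [build_prompt(m) for m in messages]
    inputs = tokenizer(prompts, return_tensors='pt', padding=True).to('cuda')
    output_ids = model.generate(
        **inputs,
        do_sample=False,  # temperature close to 0 is effectively greedy anyway
        max_new_tokens=max_new_tokens,
    )
    # Drop the (padded) prompt tokens and keep only the newly generated part.
    new_tokens = output_ids[:, inputs.input_ids.shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)


print(respond_batch(["Who are you?", "What is FastChat?"]))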

I run a few models on multiple GPUs. Here's my bash script to launch an API:

export CUDA_VISIBLE_DEVICES=0,1
python3 -m fastchat.serve.model_worker \
    --num-gpus 2 \
    --model-path /path/to/model \
    --host host.example.com \
    --port 50000 \
    --worker-address https://host.example.com:50000 \
    --controller-address https://controller.example.com
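
In case it helps, one way to query that worker from Python is through FastChat's OpenAI-compatible server (python3 -m fastchat.serve.openai_api_server, pointed at the same controller). A minimal client sketch; the host, port, and model name below are placeholders that depend on how the server and worker were started:

import requests

# Assumes the OpenAI-compatible server listens on localhost:8000 and the
# worker registered the model under the name "vicuna-13b-v1.3".
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "vicuna-13b-v1.3",
        "messages": [{"role": "user", "content": "Who are you?"}],
        "temperature": 0.0,
        "max_tokens": 512,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])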

Same for me:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python3 $FASTCHAT/fastchat/serve/model_worker.py \
    --controller $FASTCHAT_CONTROLLER:$FASTCHAT_CONTROLLER_PORT \
    --port 31029 \
    --worker http://$(hostname):31029 \
    --num-gpus 8 \
    --model-path models/Mixtral-8x22B-v0.1

vLLM also handles multi-GPU inference just fine; SGLang doesn't.
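
For completeness, vLLM can also be driven directly for multi-GPU batch inference, outside the FastChat serving stack. A rough sketch; tensor_parallel_size and the sampling settings are illustrative, and the Vicuna conversation template should still be applied to each prompt as in the code above:

from vllm import LLM, SamplingParams

# Shard the model over two GPUs and generate for a whole batch of prompts.
llm = LLM(model="lmsys/vicuna-13b-v1.3", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = ["Who are you?", "What is FastChat?"]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)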