[Usage] How to run inference with multiple GPUs on a single machine? Is batch inference possible?
ee2110 commented
Hi, I have a problem using Vicuna-13b-v1.3 for inference on multiple GPUs. Could anyone please provide an example of code for multi-GPU inference without the CLI? Also, is it possible to do batch inference (e.g., input a list of prompts and get back a list of answers)?
I have tried setting num_gpus=2, but it seems the model still computes on a single GPU instead of two.
Here is the code:
import torch
from fastchat.model import load_model, get_conversation_template


class Vicuna:
    def __init__(self):
        print('Initialize Vicuna...')
        self.model, self.tokenizer = load_model(
            'lmsys/vicuna-13b-v1.3',
            device='cuda',
            num_gpus=2,
        )

    @torch.inference_mode()
    def respond(self, input_msg):
        # Build the Vicuna prompt from the conversation template
        conv = get_conversation_template('lmsys/vicuna-13b-v1.3')
        conv.append_message(conv.roles[0], input_msg)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()

        input_ids = self.tokenizer([prompt]).input_ids
        output_ids = self.model.generate(
            torch.as_tensor(input_ids).cuda(),
            do_sample=True,
            temperature=0.001,
            repetition_penalty=1.0,
            max_new_tokens=512,
        )
        # Strip the prompt tokens, keep only the generated continuation
        output_ids = output_ids[0][len(input_ids[0]):]
        outputs = self.tokenizer.decode(
            output_ids, skip_special_tokens=True, spaces_between_special_tokens=False
        )
        return outputs


def main():
    vicuna_model = Vicuna()
    answer = vicuna_model.respond("Who are you?")
    print(answer)


if __name__ == "__main__":
    main()
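On the batch-inference part of the question: load_model returns an ordinary Hugging Face model, so a list of prompts can be padded into one tensor and passed to generate in a single call. Below is a minimal sketch, not an official FastChat batching API; the method name respond_batch is illustrative, and it assumes a decoder-only model, so the tokenizer is switched to left padding and the EOS token is reused as the pad token:

    @torch.inference_mode()
    def respond_batch(self, input_msgs):
        # Decoder-only models should be padded on the left so that
        # generation continues right after each prompt.
        self.tokenizer.padding_side = 'left'
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        prompts = []
        for msg in input_msgs:
            conv = get_conversation_template('lmsys/vicuna-13b-v1.3')
            conv.append_message(conv.roles[0], msg)
            conv.append_message(conv.roles[1], None)
            prompts.append(conv.get_prompt())

        inputs = self.tokenizer(prompts, return_tensors='pt', padding=True).to('cuda')
        output_ids = self.model.generate(
            **inputs,
            do_sample=True,
            temperature=0.001,
            repetition_penalty=1.0,
            max_new_tokens=512,
        )
        # Drop the (padded) prompt tokens from each row before decoding
        new_tokens = output_ids[:, inputs.input_ids.shape[1]:]
        return self.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)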
Brandon Biggs commented
I run a few models on multiple GPUs. Here's my bash script to launch an API:
export CUDA_VISIBLE_DEVICES=0,1
python3 -m fastchat.serve.model_worker \
--num-gpus 2 \
--model-path /path/to/model \
--host host.example.com \
--port 50000 \
--worker-address https://host.example.com:50000 \
--controller-address https://controller.example.com
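The worker registers itself with the controller, and FastChat's OpenAI-compatible server (fastchat.serve.openai_api_server) can then front the whole setup. A minimal sketch of querying it with requests; the API-server URL, its default port 8000, and the model name 'vicuna-13b-v1.3' are assumptions and depend on how you launched the server:

import requests

# Assumes fastchat.serve.openai_api_server is running on the controller
# host at its default port 8000; adjust the URL and model name as needed.
resp = requests.post(
    'https://controller.example.com:8000/v1/chat/completions',
    json={
        'model': 'vicuna-13b-v1.3',
        'messages': [{'role': 'user', 'content': 'Who are you?'}],
        'max_tokens': 512,
    },
)
print(resp.json()['choices'][0]['message']['content'])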
Alexandre Strube commented
Same for me:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python3 $FASTCHAT/fastchat/serve/model_worker.py \
--controller $FASTCHAT_CONTROLLER:$FASTCHAT_CONTROLLER_PORT \
--port 31029 \
--worker http://$(hostname):31029 \
--num-gpus 8 \
--model-path models/Mixtral-8x22B-v0.1
vLLM also works fine with multiple GPUs; SGLang doesn't.
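For reference, a minimal sketch of multi-GPU batch inference directly through vLLM's Python API, which shards the model across GPUs via tensor_parallel_size and accepts a list of prompts in one generate call (the model path and prompts here are just examples):

from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across that many GPUs.
llm = LLM(model='lmsys/vicuna-13b-v1.3', tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=512)

# vLLM batches natively: pass a list of prompts, get a list of outputs.
outputs = llm.generate(['Who are you?', 'What is FastChat?'], params)
for out in outputs:
    print(out.outputs[0].text)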