Model will be loaded on different devices when using multiple GPUs.
baichuanzhou opened this issue · comments
It appears that model weights are loaded onto different GPUs when num_processes
is set to more than one, which causes the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Here's my command to launch:
accelerate launch --num_processes=2 -m lmms_eval --model llava --model_args pretrained="xxx,conv_template=xxx" --tasks gqa,vqav2,scienceqa,textvqa --batch_size 1 --log_samples --log_samples_suffix xxx --output_path ./logs/
I found a temporary fix by installing a previous version:
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git@bf4c78b7e405e2ca29bf76f579371382fec3dd02
Multi-GPU inference works fine in that version.
May I ask at which line of inference this error occurred?
Sorry for the delay.
Here is one error message:
[lmms_eval/models/llava.py:386] ERROR Error Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution) in generating.
You might also want to try setting device_map=auto in your model_args when running with multiple processes:
--model_args pretrained=xxx,conv_template=xxx,device_map=auto
Setting device_map=auto didn't do the trick. Here's my command:
srun -p xxx --gres=gpu:4 accelerate launch --num_processes=4 --main_process_port 19500 -m lmms_eval --model llava --model_args pretrained="xxx,conv_template=xxx,device_map=auto" --task textvqa_val,vizwiz_vqa_val,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_hermes2_llama3_merged_data_v1.1_anyres_tune_vit --output_path ./logs/ #
I noticed one difference in the logger output between evaluation with v0.1.2
and with bf4c78b7e405e2ca29bf76f579371382fec3dd02:

v0.1.2:
[lmms_eval/models/llava.py:124] INFO Using single device: cuda

bf4c78b7e405e2ca29bf76f579371382fec3dd02:
[lmms_eval/models/llava.py:104] INFO Using 4 devices with data parallelism

Line 104 appears to be here.
Sorry, my bad.
You should set device_map=""
when using multiple processes, and device_map=auto
only when num_processes=1.
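The rule above can be sketched as a small helper. This is just an illustration of the convention described in this thread, not code from lmms-eval; the function name pick_device_map is hypothetical:

```python
def pick_device_map(num_processes: int) -> str:
    """Choose a device_map value per the rule discussed above (illustrative only).

    With multiple processes, accelerate assigns one GPU per process, so
    device_map should stay empty ("") and each process pins its model to
    its own device. With a single process, "auto" lets the model shard
    across all visible GPUs instead.
    """
    return "auto" if num_processes == 1 else ""


# Example: building the --model_args string for a 4-process launch.
device_map = pick_device_map(4)
model_args = f"pretrained=xxx,conv_template=xxx,device_map={device_map}"
print(model_args)  # device_map is empty for the multi-process case
```

Mixing the two (device_map=auto together with num_processes>1) is what spreads one model's weights across several GPUs and triggers the "Expected all tensors to be on the same device" error.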
Thanks. Now it works!