Model will be loaded on different devices when using multiple GPUs.
baichuanzhou opened this issue · comments
It appears that model weights are loaded onto different GPUs when num_processes
is set to more than one, which causes the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Here's my command to launch:
accelerate launch --num_processes=2 -m lmms_eval --model llava --model_args pretrained="xxx,conv_template=xxx" --tasks gqa,vqav2,scienceqa,textvqa --batch_size 1 --log_samples --log_samples_suffix xxx --output_path ./logs/
I found a temporary fix by installing a previous version:
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git@bf4c78b7e405e2ca29bf76f579371382fec3dd02
Multi-GPU inference works fine in that version.
May I ask at which line of inference this error occurred?
Sorry for the delay.
Here is one error message:
[lmms_eval/models/llava.py:386] ERROR Error Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution) in generating.
You might also want to try setting device_map=auto in your model_args when running with multiple processes:
--model_args pretrained=xxx,conv_template=xxx,device_map=auto
Setting device_map=auto didn't do the trick. Here's my command:
srun -p xxx --gres=gpu:4 accelerate launch --num_processes=4 --main_process_port 19500 -m lmms_eval --model llava --model_args pretrained="xxx,conv_template=xxx,device_map=auto" --task textvqa_val,vizwiz_vqa_val,mmbench_en --batch_size 1 --log_samples --log_samples_suffix llava_hermes2_llama3_merged_data_v1.1_anyres_tune_vit --output_path ./logs/ #
I noticed one difference in the logger output between evaluation with v0.1.2
and with bf4c78b7e405e2ca29bf76f579371382fec3dd02:

v0.1.2:
[lmms_eval/models/llava.py:124] INFO Using single device: cuda

bf4c78b7e405e2ca29bf76f579371382fec3dd02:
[lmms_eval/models/llava.py:104] INFO Using 4 devices with data parallelism

Line 104 appears to be here.
Sorry, my bad.
You should set device_map=""
when using multiple processes, and device_map=auto
only when num_processes=1.
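The rule above can be sketched as a small helper. This is just an illustration of the convention described in this thread, not code from lmms-eval; the function name pick_device_map is hypothetical:

```python
def pick_device_map(num_processes: int) -> str:
    """Choose a device_map value per the rule discussed above (illustrative only).

    With multiple processes, accelerate assigns one GPU per process, so
    device_map should stay empty ("") and each process pins its model to
    its own device. With a single process, "auto" lets the model shard
    across all visible GPUs instead.
    """
    return "auto" if num_processes == 1 else ""


# Example: building the --model_args string for a 4-process launch.
device_map = pick_device_map(4)
model_args = f"pretrained=xxx,conv_template=xxx,device_map={device_map}"
print(model_args)  # device_map is empty for the multi-process case
```

Mixing the two (device_map=auto together with num_processes>1) is what spreads one model's weights across several GPUs and triggers the "Expected all tensors to be on the same device" error.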
Thanks. Now it works!