Process hangs when using `tensor_parallel_size` and `data_parallel_size` together
harshakokel opened this issue · comments
Hello,

I noticed that my process hangs at `results = ray.get(object_refs)` when I use `data_parallel_size` as well as `tensor_parallel_size` for vllm models. For example, this call would hang:
`lm_eval --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=2 --tasks arc_easy --output ./trial/ --log_samples --limit 10`

These would not:

`lm_eval --model vllm --model_args pretrained=gpt2,data_parallel_size=1,tensor_parallel_size=2 --tasks arc_easy --output ./trial/ --log_samples --limit 10`

`lm_eval --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=1 --tasks arc_easy --output ./trial/ --log_samples --limit 10`
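For context, the harness's data-parallel path fans requests out across replicas and then blocks on a gather step; the hang reported above occurs at that gather. Below is a minimal, runnable sketch of the same fan-out/gather pattern. It is an illustration only: the real code uses Ray workers and `ray.get(object_refs)`, while this sketch substitutes a thread pool, and `run_shard` is a hypothetical stand-in for one replica's `llm.generate` call.

```python
# Sketch of the data-parallel fan-out/gather pattern (illustrative only;
# lm-eval's vllm backend uses Ray actors and ray.get, not a thread pool).
from concurrent.futures import ThreadPoolExecutor


def run_shard(shard):
    # Hypothetical stand-in for one DP replica running llm.generate()
    # on its slice of the requests.
    return [f"gen:{prompt}" for prompt in shard]


prompts = ["p0", "p1", "p2", "p3"]
data_parallel_size = 2

# Split the requests across DP replicas (round-robin here for simplicity).
shards = [prompts[i::data_parallel_size] for i in range(data_parallel_size)]

with ThreadPoolExecutor(max_workers=data_parallel_size) as pool:
    futures = [pool.submit(run_shard, s) for s in shards]
    # The reported hang occurs at this blocking gather step
    # (ray.get(object_refs) in the real code).
    results = [f.result() for f in futures]

# Flatten the per-replica outputs back into one list.
flat = [out for shard_out in results for out in shard_out]
print(flat)
```

If any one replica deadlocks (e.g. during distributed initialization when TP and DP are combined), the gather never returns, which matches the symptom described above.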
Does anyone else face a similar problem?
Hi! What version of vLLM are you running with?
@baberabb has observed problems like this before with later versions of vllm (>v0.3.3, I believe).
I am on vllm `0.3.2`. Is this a vllm problem? Should I be raising an issue on that repo?
Hey. Have you tried caching the weights by running with DP=1 until they are downloaded? I found it prone to hang with DP otherwise.
Yes, the weights are cached. The process is hanging after `llm.generate` returns results.
Hmm, it's working for me with `0.3.2`. Have you tried running on a fresh virtual environment?
Just tried it on a separate server in a new env and still face the same issue. What version of ray do you have? Mine is `ray==2.10.0`.
Probably the latest one. I installed it with `pip install -e ".[vllm]"` on runpod with 4 GPUs.