EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

Process hangs when using `tensor_parallel_size` and `data_parallel_size` together

harshakokel opened this issue · comments

Hello,

I noticed that my process hangs at `results = ray.get(object_refs)` when I use `data_parallel_size` together with `tensor_parallel_size` for vllm models.

For example, this call would hang.

lm_eval --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10

These would not.

lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=1,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10
lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=1 --tasks arc_easy --output ./trial/  --log_samples --limit 10
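For reference, the data-parallel path boils down to roughly the following pattern (a minimal sketch, not the harness's actual code; the model and prompts are placeholders): each DP replica is a Ray task that builds its own vLLM engine with the given `tensor_parallel_size`, and the driver then blocks on `ray.get`.

```python
# Minimal sketch of the DP-over-Ray pattern (illustrative only, not the
# harness's actual implementation).
import ray
from vllm import LLM, SamplingParams


@ray.remote
def run_replica(prompts, tp_size):
    # Each data-parallel replica builds its own engine; with tp_size > 1 the
    # engine itself also spawns Ray workers for tensor parallelism.
    llm = LLM(model="gpt2", tensor_parallel_size=tp_size)
    return llm.generate(prompts, SamplingParams(max_tokens=32))


ray.init()
shards = [["Hello"], ["World"]]  # requests split across 2 DP replicas
object_refs = [run_replica.remote(shard, 2) for shard in shards]
results = ray.get(object_refs)  # this is where the hang is observed
```

With `tensor_parallel_size=1` the inner engine does not need its own Ray worker group, which may be why only the first command hangs.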

Does anyone else face a similar problem?

Hi! What version of vLLM are you running with?

@baberabb has observed problems like this before with later versions of vLLM (>v0.3.3, I believe).

I am on vllm 0.3.2.

Is this a vllm problem? Should I be raising an issue on that repo?

Hey. Have you tried caching the weights by running with DP=1 until they are downloaded? I found it prone to hang with DP otherwise.
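If you just want to warm the cache without the harness, something like this should also work (assuming the model is pulled from the Hugging Face Hub; `snapshot_download` is just an alternative to a DP=1 warm-up run):

```python
# Pre-populate the local Hugging Face cache that vLLM reads from
# (alternative to a DP=1 warm-up run; assumes a Hub-hosted model).
from huggingface_hub import snapshot_download

snapshot_download("gpt2")
```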

Yes, the weights are cached. The process is hanging after llm.generate returns results.

Hmm. It's working for me with 0.3.2. Have you tried running in a fresh virtual environment?

Just tried it on a separate server with a new env and I still face the same issue. What version of ray do you have? Mine is ray==2.10.0.

Probably the latest one. I installed it with `pip install -e ".[vllm]"` on RunPod with 4 GPUs.
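If it helps pin things down, the exact versions can be printed directly (both packages expose `__version__`):

```python
# Print the installed ray and vllm versions.
import ray
import vllm

print("ray:", ray.__version__)
print("vllm:", vllm.__version__)
```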