EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval

Home Page: https://lmms-lab.github.io/lmms-eval-blog/


Issues when running evaluation with multiple processes

simplelifetime opened this issue · comments

Below is the error message.

[lmms_eval/models/llava.py:374] ERROR Error Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:4! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution) in generating

Although errors are encountered, the program still carries on, but the results are all N/A or 0 on MME. What might be causing this? I am hoping it can run data parallel on 8 GPUs.

May I ask what command you used to launch the script? When there is an error, we simply append a dummy empty string as the answer, so that sample will not receive any score.

You either set num_processes > 1 or device_map=auto. I don't think you can set num_processes > 1 and device_map=auto at the same time.
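In other words, the two multi-GPU setups are mutually exclusive. A sketch of the two launch modes (the model path and task here mirror the thread; whether device_map is forwarded through --model_args depends on your lmms-eval version, so treat that as an assumption):

```shell
# Option A: data parallel — one full model replica per GPU.
# Use num_processes > 1 and do NOT set device_map.
accelerate launch --num_processes=8 -m lmms_eval \
  --model llava \
  --model_args pretrained="liuhaotian/llava-v1.5-7b" \
  --tasks mme --batch_size 1

# Option B: model parallel — a single replica sharded across all GPUs.
# Use num_processes=1 with device_map=auto. Do not combine with Option A.
accelerate launch --num_processes=1 -m lmms_eval \
  --model llava \
  --model_args pretrained="liuhaotian/llava-v1.5-7b",device_map=auto \
  --tasks mme --batch_size 1
```

Mixing the two (num_processes > 1 together with device_map=auto) is what scatters layers across devices and produces the "Expected all tensors to be on the same device" error above.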

Yeah, I set num_processes = 8. I didn't change any code when I ran the evaluation. Do I need to manually change the code?

Maybe you can refer to #31 or #12

@simplelifetime can you run it now? I have the same issue.

accelerate launch --num_processes=8 -m lmms_eval --model llava --model_args pretrained="liuhaotian/llava-v1.5-7b" --tasks mme --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme --output_path ./logs/

with the pip freeze output:

absl-py==2.1.0
accelerate==0.21.0
aiofiles==23.2.1
aiohttp==3.9.5
aiosignal==1.3.1
altair==5.3.0
annotated-types==0.6.0
anyio==4.3.0
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.2.0
bitsandbytes==0.43.1
black==24.1.0
certifi==2024.2.2
cfgv==3.4.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
contourpy==1.2.1
cycler==0.12.1
DataProperty==1.0.1
datasets==2.16.1
deepspeed==0.12.6
dill==0.3.7
distlib==0.3.8
distro==1.9.0
docker-pycreds==0.4.0
einops==0.6.1
einops-exts==0.0.4
et-xmlfile==1.1.0
evaluate==0.4.1
exceptiongroup==1.2.0
fastapi==0.110.1
ffmpy==0.3.2
filelock==3.13.4
flash-attn==2.5.7
fonttools==4.51.0
frozenlist==1.4.1
fsspec==2023.10.0
gitdb==4.0.11
GitPython==3.1.43
gradio==4.16.0
gradio_client==0.8.1
h11==0.14.0
hf_transfer==0.1.6
hjson==3.1.0
httpcore==0.17.3
httpx==0.24.0
huggingface-hub==0.22.2
identify==2.5.35
idna==3.7
importlib_resources==6.4.0
Jinja2==3.1.3
joblib==1.4.0
jsonlines==4.0.0
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
Levenshtein==0.25.1
-e git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874#egg=llava
lmms_eval==0.1.2
lxml==5.2.1
markdown-it-py==3.0.0
markdown2==2.4.13
MarkupSafe==2.1.5
matplotlib==3.8.4
mbstrdecoder==1.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
mypy-extensions==1.0.0
networkx==3.3
ninja==1.11.1.1
nltk==3.8.1
nodeenv==1.8.0
numexpr==2.10.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.1.105
openai==1.23.1
openpyxl==3.1.2
orjson==3.10.0
packaging==24.0
pandas==2.2.2
pathspec==0.12.1
pathvalidate==3.2.0
peft==0.10.0
pillow==10.3.0
platformdirs==4.2.0
portalocker==2.8.2
pre-commit==3.7.0
protobuf==4.25.3
psutil==5.9.8
py-cpuinfo==9.0.0
pyarrow==15.0.2
pyarrow-hotfix==0.6
pybind11==2.12.0
pycocoevalcap==1.2
pycocotools==2.0.7
pydantic==2.7.0
pydantic_core==2.18.1
pydub==0.25.1
Pygments==2.17.2
pynvml==11.5.0
pyparsing==3.1.2
pytablewriter==1.2.0
python-dateutil==2.9.0.post0
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
rapidfuzz==3.8.1
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
responses==0.18.0
rich==13.7.1
rouge-score==0.1.2
rpds-py==0.18.0
ruff==0.3.7
sacrebleu==2.4.2
safetensors==0.4.2
scikit-learn==1.2.2
scipy==1.13.0
semantic-version==2.10.0
sentencepiece==0.1.99
sentry-sdk==1.45.0
setproctitle==1.3.3
shellingham==1.5.4
shortuuid==1.0.13
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sqlitedict==2.1.0
starlette==0.37.2
svgwrite==1.4.3
sympy==1.12
tabledata==1.3.3
tabulate==0.9.0
tcolorpy==0.1.4
tenacity==8.2.3
threadpoolctl==3.4.0
tiktoken==0.6.0
timm==0.6.13
tokenizers==0.15.1
tomli==2.0.1
tomlkit==0.12.0
toolz==0.12.1
torch==2.1.2
torchaudio==2.1.2+cu121
torchvision==0.16.2
tqdm==4.66.2
tqdm-multiprocess==0.0.11
transformers==4.37.2
transformers-stream-generator==0.0.5
triton==2.1.0
typepy==1.3.2
typer==0.12.3
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.29.0
virtualenv==20.25.3
wandb==0.16.6
wavedrom==2.0.3.post3
websockets==11.0.3
xxhash==3.4.1
yarl==1.9.4
zstandard==0.22.0

@gongysh2004 Not sure if you found a better solution, but setting device_map="" here seems to work for me.
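If the llava wrapper accepts device_map through --model_args (an assumption; verify against your installed lmms-eval version), the same workaround may work without editing the source:

```shell
# Assumed workaround: pass an empty device_map so Accelerate's per-process
# GPU placement is used instead of HF's automatic cross-device sharding.
accelerate launch --num_processes=8 -m lmms_eval \
  --model llava \
  --model_args pretrained="liuhaotian/llava-v1.5-7b",device_map="" \
  --tasks mme --batch_size 1 \
  --log_samples --log_samples_suffix llava_v1.5_mme --output_path ./logs/
```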