Failed to run Qwen2 multi-GPU inference with Accelerate

Question

pillowsofwind opened this issue 2 months ago

Hello,

I am trying to use accelerate==0.32.1 to speed up batch inference on GPUs.
However, I get the following error when running on multiple GPUs:

...
  File "/data/conda_envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/conda_envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/conda_envs/inference/lib/python3.9/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/data/conda_envs/inference/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1235, in forward
    logits = self.lm_head(hidden_states)
  File "/data/conda_envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/conda_envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/conda_envs/inference/lib/python3.9/site-packages/accelerate/hooks.py", line 169, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/data/conda_envs/inference/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [64,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [66,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
...

It seems to be an issue with accelerate/hooks.py, but I can't tell what the exact bug is.

My code works fine using a single GPU.
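
The `index out of bounds` assertions in the trace come from a CUDA kernel, and CUDA reports kernel errors asynchronously, so the Python frame shown (`lm_head`) may not be the operation that actually failed. Here is a minimal sketch for narrowing this down, assuming the script can be rerun: set `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized, and check that every token ID falls inside the embedding table. The helper name `find_out_of_range_ids` and the sample values are hypothetical and not taken from the original code.

```python
import os

# Must be set before CUDA is initialized (safest: before importing torch),
# so that kernel errors surface at the call that caused them.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_out_of_range_ids(token_ids, vocab_size):
    """Return (position, id) pairs whose id is outside [0, vocab_size)."""
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if not 0 <= tok < vocab_size
    ]


# Hypothetical values for illustration only.
sample_ids = [1, 5, 12, 9]
bad = find_out_of_range_ids(sample_ids, vocab_size=10)
print(bad)
```

If every ID is in range and the same inputs succeed on a single GPU, the inputs are probably not the cause, which points toward the multi-GPU setup instead.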

Ren Xuancheng · Answer 1 · Tue Jul 23 2024 19:28:57 GMT+0800 (China Standard Time)

Hi, similar issues have been reported, and this one is likely caused by the NVIDIA driver. Please search the existing issues first.
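
When comparing against existing reports, it helps to have the driver and library versions at hand. This is a rough sketch, assuming `nvidia-smi` is on the PATH and that `torch` may or may not be importable in the current environment. The function name `collect_gpu_env` is hypothetical.

```python
import shutil
import subprocess


def collect_gpu_env():
    """Gather driver and library versions to include in a bug report."""
    info = {"driver": None, "torch": None, "cuda": None, "gpu_count": 0}

    # nvidia-smi prints one driver version line per GPU; keep the first.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        lines = result.stdout.split()
        if result.returncode == 0 and lines:
            info["driver"] = lines[0]

    try:
        import torch
    except ImportError:
        return info  # torch is not installed; report the driver only

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["gpu_count"] = torch.cuda.device_count()
    return info


print(collect_gpu_env())
```

Including this output in a report makes it easier to tell whether an existing issue describes the same driver and CUDA combination.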

Rongwu Xu · Answer 2 · Wed Jul 24 2024 12:00:02 GMT+0800 (China Standard Time)

Thanks for replying. I found #331, so you may close this now.