Horovod + Deepspeed : Device mismatch error
PurvangL opened this issue
Purvang Lapsiwala commented
Environment:
Machine Info : 8xA100 (80G)
- Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
- Framework version: 1.12.1+cu113
- Horovod version: 0.28.1
- MPI version: 3.1.5
- CUDA version:
- NCCL version:
- Python version: 3.8.10
- Spark / PySpark version:
- Ray version:
- OS and version: Ubuntu 20.04
- GCC version:
- CMake version:
Checklist:
- Did you search issues to find if somebody asked this question before? Yes
- If your question is about hang, did you read this doc?
- If your question is about docker, did you read this doc?
- Did you check if your question is answered in the troubleshooting guide?
Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "sc2.py", line 178, in <module>
[1,1]<stderr>: outputs = model(input_ids=d['input_ids'],attention_mask=d['attention_mask'])
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
[1,1]<stderr>: return forward_call(*input, **kwargs)
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[1,1]<stderr>: ret_val = func(*args, **kwargs)
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1842, in forward
[1,1]<stderr>: loss = self.module(*inputs, **kwargs)
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
[1,1]<stderr>: result = forward_call(*input, **kwargs)
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
[1,1]<stderr>: outputs = self.model(
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
[1,1]<stderr>: result = forward_call(*input, **kwargs)
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py", line 1027, in forward
[1,1]<stderr>: inputs_embeds = self.embed_tokens(input_ids)
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
[1,1]<stderr>: result = forward_call(*input, **kwargs)
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 158, in forward
[1,1]<stderr>: return F.embedding(
[1,1]<stderr>: File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2199, in embedding
[1,1]<stderr>: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[1,1]<stderr>:RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument index in method wrapper__index_select)
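The traceback shows the embedding weight living on `cuda:1` while `input_ids` (the `index` argument to `torch.embedding`) is still on CPU, i.e. the batch `d` was never moved to the model's device before the forward call. A minimal sketch of the usual fix, assuming `d` is a dict of tensors as in `sc2.py` (the shapes below are stand-ins, and the sketch falls back to CPU when no GPU is available):

```python
import torch

# Pick the device the model lives on; under Horovod each rank would
# typically use its local GPU (e.g. cuda:1 for rank 1).
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

# Stand-in batch on CPU, mimicking the dict passed to model(...) in sc2.py.
d = {
    "input_ids": torch.randint(0, 1000, (2, 8)),
    "attention_mask": torch.ones(2, 8, dtype=torch.long),
}

# Move every tensor in the batch to the model's device before the forward
# call, so torch.embedding sees weight and index on the same device.
d = {k: v.to(device) for k, v in d.items()}
```

With Horovod it is also common to pin each process to one GPU early on via `torch.cuda.set_device(hvd.local_rank())`, so that `.to("cuda")` resolves to the right device on every rank.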
Environment setup
Docker : horovod/horovod:latest
pip install datasets evaluate accelerate==0.25.0 transformers==4.37.0 deepspeed==0.13.1
pip install git+https://github.com/aicrumb/datasettokenizer -q
I am not sure whether the script itself is correct; I am still in the process of getting it to work.
Let me know if you need any additional information.