horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

Horovod + DeepSpeed: Device mismatch error

PurvangL opened this issue

Environment:

Machine Info : 8xA100 (80G)

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
  2. Framework version: 1.12.1+cu113
  3. Horovod version: 0.28.1
  4. MPI version: 3.1.5
  5. CUDA version:
  6. NCCL version:
  7. Python version: 3.8.10
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 20.04
  11. GCC version:
  12. CMake version:

Checklist:

  1. Did you search issues to find if somebody asked this question before? Yes
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.

[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "sc2.py", line 178, in <module>
[1,1]<stderr>:    outputs = model(input_ids=d['input_ids'],attention_mask=d['attention_mask'])
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
[1,1]<stderr>:    return forward_call(*input, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[1,1]<stderr>:    ret_val = func(*args, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1842, in forward
[1,1]<stderr>:    loss = self.module(*inputs, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
[1,1]<stderr>:    result = forward_call(*input, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
[1,1]<stderr>:    outputs = self.model(
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
[1,1]<stderr>:    result = forward_call(*input, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/transformers/models/llama/modeling_llama.py", line 1027, in forward
[1,1]<stderr>:    inputs_embeds = self.embed_tokens(input_ids)
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
[1,1]<stderr>:    result = forward_call(*input, **kwargs)
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 158, in forward
[1,1]<stderr>:    return F.embedding(
[1,1]<stderr>:  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2199, in embedding
[1,1]<stderr>:    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[1,1]<stderr>:RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument index in method wrapper__index_select)
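The traceback shows that the embedding weights live on `cuda:1` (the GPU pinned to this Horovod rank) while `input_ids` is still on the CPU. A common fix is to move every tensor in the batch to the process's local device before calling `forward()`. The sketch below illustrates the idea; the batch dict `d` and the `LOCAL_RANK` fallback are hypothetical stand-ins (under Horovod the rank would come from `hvd.local_rank()` after `hvd.init()`):

```python
import os
import torch

# Under Horovod this would be hvd.local_rank(); the env-var fallback
# here is a stand-in so the sketch runs without an MPI launcher.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = (
    torch.device("cuda", local_rank)
    if torch.cuda.is_available()
    else torch.device("cpu")
)

# Hypothetical tokenizer output: tensors are created on the CPU.
d = {
    "input_ids": torch.randint(0, 100, (2, 8)),
    "attention_mask": torch.ones(2, 8, dtype=torch.long),
}

# Move each input tensor to the model's device before the forward pass,
# so embed_tokens(input_ids) sees weights and indices on the same device.
d = {k: v.to(device) for k, v in d.items()}
```

With this in place, `model(input_ids=d['input_ids'], attention_mask=d['attention_mask'])` receives inputs on the same device as the DeepSpeed-wrapped model, which should resolve the `index_select` device mismatch.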

Environment setup

Docker: horovod/horovod:latest
pip install datasets evaluate accelerate==0.25.0 transformers==4.37.0 deepspeed==0.13.1
pip install git+https://github.com/aicrumb/datasettokenizer -q

Script

I am not sure whether the script itself is correct; I am still in the process of getting it to work.
Let me know if you need any additional information.