Horovod missing ranks
YihuaXuCn opened this issue
YihuaXuCn commented
Environments:
- Pytorch version: 2.0.1
- Horovod version: 0.28.1
- OpenMPI version: 4.1.5 via conda
- CUDA version: 11.6
- NCCL version: 2.14.3
- Python version: 3.8.13
- OS and version: Ubuntu 20.04.4 LTS x86_64
Horovod prints the following warning:
[0] [2023-06-26 07:23:58.477178: W /tmp/pip-install-6w4n_ndu/horovod_4c00129df9d64c2680a27be18c9b88cd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[0] Missing ranks:
[0] 0: [broadcast.assign_conv_block_modules.0.0.bias, broadcast.assign_conv_block_modules.0.0.weight, broadcast.assign_conv_first_modules.0.bias, broadcast.assign_conv_first_modules.0.weight, broadcast.assign_conv_last_modules.0.bias, broadcast.assign_conv_last_modules.0.weight ...]
[0] 1: [broadcast.assign_conv_block_modules.0.0.bias, broadcast.assign_conv_block_modules.0.0.weight, broadcast.assign_conv_first_modules.0.bias, broadcast.assign_conv_first_modules.0.weight, broadcast.assign_conv_last_modules.0.bias, broadcast.assign_conv_last_modules.0.weight ...]
[0] [2023-06-26 07:24:58.477371: W /tmp/pip-install-6w4n_ndu/horovod_4c00129df9d6
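This warning appears because Horovod collectives are synchronous across ranks: an allreduce or broadcast only completes once every rank has submitted the same tensor. The toy script below (plain Python threads, no Horovod or PyTorch involved; `WORLD_SIZE`, `run`, and the 0.5 s timeout are made-up illustration values, with a `threading.Barrier` standing in for a collective op) mimics what the stall inspector is reporting: when a subset of ranks never submits, the ranks that did submit stall.

```python
import threading

# Toy model (plain Python, no Horovod): a collective op completes only when
# every rank submits it. If some ranks never call the op, the rest wait
# until a timeout; Horovod's stall inspector reports the absent ranks as
# "missing ranks" after 60 seconds.
WORLD_SIZE = 4

def run(skip_ranks=()):
    barrier = threading.Barrier(WORLD_SIZE)  # stand-in for a collective op
    results = {}

    def rank_worker(rank):
        if rank in skip_ranks:
            return  # this rank never submits the tensor
        try:
            barrier.wait(timeout=0.5)  # stand-in for broadcast/allreduce
            results[rank] = "done"
        except threading.BrokenBarrierError:
            results[rank] = "stalled"  # the absent ranks are "missing"

    threads = [threading.Thread(target=rank_worker, args=(r,))
               for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# All ranks submit: the collective completes on every rank.
print(run())
# Ranks 2 and 3 never submit: ranks 0 and 1 stall, mirroring the log above.
print(run(skip_ranks=(2, 3)))
```

In the real log, ranks 0 and 1 submitted the `broadcast.*` tensors and are waiting on the other two processes, which matches only two GPUs showing activity below.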
After I press Ctrl+C to terminate the program, it shows:
Press Ctrl-C again to force abort
[0] hvd initial done
[0] hvd pin GPU done
[0] hvd limit num of CPU threads done
I used nvidia-smi to check GPU usage; only two of the four GPUs are working.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:01:00.0 Off | Off |
| 30% 34C P2 71W / 300W | 663MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:41:00.0 Off | Off |
| 30% 31C P2 76W / 300W | 663MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A6000 Off | 00000000:81:00.0 Off | Off |
| 30% 36C P5 72W / 300W | 2MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A6000 Off | 00000000:C1:00.0 Off | Off |
| 30% 33C P0 70W / 300W | 2MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1562444 C python 661MiB |
| 1 N/A N/A 1562445 C python 661MiB |
+-----------------------------------------------------------------------------+
I can run Horovod's official PyTorch example correctly and get the final results, so the environment should be fine. I then mimicked the steps and structure of that example in my own project. This is a piece of code from my main function:
def main(args):
    assert args.bmname is not None, "bmname can't be None"
    rnd = 42

    # Horovod: initialize library.
    hvd.init()
    torch.manual_seed(rnd)  # attention
    print("hvd initial done")

    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(rnd)
    print("hvd pin GPU done")
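For context, here is a hedged sketch of how the official Horovod PyTorch example typically continues after this point. `MyModel`, `MyDataset`, the learning rate, and the batch size are placeholders, not taken from the issue's code; the Horovod calls themselves (`broadcast_parameters`, `broadcast_optimizer_state`, `DistributedOptimizer`) are the standard API. Every rank must reach each of these collective calls, or the stall inspector reports the ranks that did as waiting on the rest.

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = MyModel().cuda()       # placeholder model
dataset = MyDataset()          # placeholder dataset
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Shard the data so each rank sees a distinct subset.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

# Start every rank from rank 0's weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Wrap the optimizer so gradients are allreduced across ranks each step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
```

This fragment is launched per-process (e.g. via horovodrun), so it is not runnable standalone; it is meant only to show the order of the collective calls relative to the snippet above.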
Based on the printed messages, I suspect that hvd.init() is not working correctly, or that I have missed some key steps.
YihuaXuCn commented
I solved the problem. hvd.init() itself was not the issue.