horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai


Horovod missing ranks

YihuaXuCn opened this issue

Environment:

  1. PyTorch version: 2.0.1
  2. Horovod version: 0.28.1
  3. OpenMPI version: 4.1.5 (via conda)
  4. CUDA version: 11.6
  5. NCCL version: 2.14.3
  6. Python version: 3.8.13
  7. OS and version: Ubuntu 20.04.4 LTS x86_64

Horovod warning output:

[0] [2023-06-26 07:23:58.477178: W /tmp/pip-install-6w4n_ndu/horovod_4c00129df9d64c2680a27be18c9b88cd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
[0] Missing ranks:
[0] 0: [broadcast.assign_conv_block_modules.0.0.bias, broadcast.assign_conv_block_modules.0.0.weight, broadcast.assign_conv_first_modules.0.bias, broadcast.assign_conv_first_modules.0.weight, broadcast.assign_conv_last_modules.0.bias, broadcast.assign_conv_last_modules.0.weight ...]
[0] 1: [broadcast.assign_conv_block_modules.0.0.bias, broadcast.assign_conv_block_modules.0.0.weight, broadcast.assign_conv_first_modules.0.bias, broadcast.assign_conv_first_modules.0.weight, broadcast.assign_conv_last_modules.0.bias, broadcast.assign_conv_last_modules.0.weight ...]
[0] [2023-06-26 07:24:58.477371: W /tmp/pip-install-6w4n_ndu/horovod_4c00129df9d6
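
For context, this warning fires when a collective operation (allreduce, allgather, or broadcast) is entered by only a subset of ranks, so the remaining ranks wait indefinitely. A minimal sketch of that failure mode with the Horovod PyTorch API (the guarded call below would itself hang if run, and the tensor names are only illustrative):

import torch
import horovod.torch as hvd

hvd.init()
x = torch.ones(3)

# Correct: every rank submits the same collective in the same order.
y = hvd.allreduce(x, name="x")

# Deadlock-prone: only rank 0 submits this tensor, so the other ranks
# never match it and the stall inspector reports missing ranks.
if hvd.rank() == 0:
    y = hvd.allreduce(x, name="x_guarded")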

After I press Ctrl+C to terminate the program, it shows:

Press Ctrl-C again to force abort
[0] hvd initial done
[0] hvd pin GPU done
[0] hvd limit num of CPU threads done

I used nvidia-smi to check GPU usage; only two of the four GPUs are active:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:01:00.0 Off |                  Off |
| 30%   34C    P2    71W / 300W |    663MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:41:00.0 Off |                  Off |
| 30%   31C    P2    76W / 300W |    663MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    Off  | 00000000:81:00.0 Off |                  Off |
| 30%   36C    P5    72W / 300W |      2MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    Off  | 00000000:C1:00.0 Off |                  Off |
| 30%   33C    P0    70W / 300W |      2MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1562444      C   python                            661MiB |
|    1   N/A  N/A   1562445      C   python                            661MiB |
+-----------------------------------------------------------------------------+
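
Only GPUs 0 and 1 have processes attached, which suggests that only two of the four workers started or reached the GPU-pinning step. A quick sanity check, assuming the job is launched with something like horovodrun -np 4 python check.py:

import horovod.torch as hvd

hvd.init()
# With four workers this should print four lines with local_rank 0-3;
# seeing only two lines means only two processes joined the job.
print(f"rank={hvd.rank()} local_rank={hvd.local_rank()} size={hvd.size()}")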

I can run Horovod's official PyTorch example correctly and get the final results, so the environment should be fine. I then mimicked the steps and structure of that example in my own project. This is a snippet from my main function:

import torch
import horovod.torch as hvd

def main(args):
    assert args.bmname is not None, "bmname can't be None"
    rnd = 42
    # Horovod: initialize the library.
    hvd.init()
    torch.manual_seed(rnd)  # seed the CPU RNG for reproducibility
    print("hvd initial done")
    # Horovod: pin this process to one GPU via its local rank.
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(rnd)
    print("hvd pin GPU done")

Based on the printed output, I suspect that hvd.init() is not working correctly, or that I missed some key step.
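
One way to rule hvd.init() in or out is to query its status directly; a small sketch, assuming hvd.is_initialized() is available (it is in recent Horovod releases):

import horovod.torch as hvd

hvd.init()
# True once initialization has completed on this process.
print("initialized:", hvd.is_initialized())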

I solved the problem; hvd.init() was not the issue.