horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

Unexpected Worker Failure when using Elastic Horovod + Process Sets

Pranavug opened this issue

Environment:

  1. Framework: PyTorch
  2. Framework version: 1.9.0+cu102
  3. Horovod version: 0.28.1
  4. MPI version: N/A
  5. CUDA version: cu102
  6. NCCL version: 2708
  7. Python version: 3.9.18
  8. Spark / PySpark version: N/A
  9. Ray version: N/A
  10. OS and version: Linux SMP x86_64 x86_64 x86_64 GNU/Linux
  11. GCC version: 7.3.1
  12. CMake version: 3.14

Bug report:

import horovod.torch as hvd
import time

worker_1_process_set = hvd.ProcessSet([1])
worker_2_process_set = hvd.ProcessSet([0, 2])

hvd.init(process_sets="dynamic")
hvd.add_process_set(worker_1_process_set)
hvd.add_process_set(worker_2_process_set)

@hvd.elastic.run
def main(state):
    rank = hvd.rank()
    size = hvd.size()

    if rank == 0:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)

    elif rank == 1:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)

    elif rank == 2:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)


if __name__ == '__main__':
    print(f"Initialized with rank {hvd.rank()}", flush=True)

    # Initialize the TorchState
    state = hvd.elastic.TorchState()

    print(f"Running main with rank {hvd.rank()}", flush=True)
    main(state)
    print(f"Finished running main with rank {hvd.rank()}", flush=True)

    print(f"Joined with rank {hvd.rank()}", flush=True)

I am running the code above with Elastic Horovod and the process sets defined there. The command shown at the top of the log below launches all 3 workers on a single node. After I kill one of the processes from a terminal, all the remaining processes are killed as well. If I run the same workflow with the same command but without process sets, killing one process does not terminate the remaining 2 workers.

I expected that, with process sets and Elastic Horovod, one worker failure would not terminate the remaining processes, yet that is what happens in the log below. Without process sets, the remaining workers stay alive as expected. What could be the reason? Is this a bug, or am I missing something in how I use process sets? Please help.
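
For anyone trying to reproduce this: the command in the log below passes --host-discovery-script discover-hosts.sh, but the script's contents are not included in this report. Elastic Horovod expects such a script to be executable and to print one available host per line, optionally as hostname:slots. The following is only a minimal sketch of a stand-in written in Python; the host name "localhost", the slot count of 3, and the helper name format_host_lines are assumptions for illustration, not what the reporter actually used.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for discover-hosts.sh (the real script is not shown)."""


def format_host_lines(slots_by_host):
    """Return one 'hostname:slots' line per host, the format horovodrun reads."""
    return [f"{host}:{slots}" for host, slots in slots_by_host.items()]


if __name__ == "__main__":
    # Assumption: a single local host with 3 slots, matching -np 3 on one node.
    for line in format_host_lines({"localhost": 3}):
        print(line)
```

To use a script like this, it would need to be marked executable and passed to --host-discovery-script in place of discover-hosts.sh.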

Similar issues:

  1. #2484
(horovod-setup) (miniconda3) [pgadikar@ip-10-20-1-15 experiments]$ horovodrun -np 3 --min-np 2 --host-discovery-script discover-hosts.sh --elastic-timeout 5 --network-interfaces eth0,lo python master-child-exp.py
[1]<stdout>:Initialized with rank 1
[1]<stdout>:Running main with rank 1
[2]<stdout>:Initialized with rank 2
[2]<stdout>:Running main with rank 2
[0]<stdout>:Initialized with rank 0
[0]<stdout>:Running main with rank 0
[1]<stdout>:Sleeping for 1 second: 1
[2]<stdout>:Sleeping for 1 second: 2
[0]<stdout>:Sleeping for 1 second: 0
[2]<stderr>:[2024-02-07 04:16:27.910743: E /tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/horovod/common/operations.cc:697] [2]: Horovod background loop uncaught exception: [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:20903: Connection reset by peer
[0]<stderr>:[2024-02-07 04:16:27.910752: E /tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/horovod/common/operations.cc:697] [0]: Horovod background loop uncaught exception: [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:49541: Connection reset by peer
[2]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[0]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[2]<stderr>:  what():  [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:20903: Connection reset by peer
[0]<stderr>:  what():  [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:49541: Connection reset by peer
Process 1 exit with status code 143.
Process 2 exit with status code 134.
Process 0 exit with status code 134.
ERROR:root:failure count == 3 -> stop running
Traceback (most recent call last):
  File "/home/pgadikar/miniconda3/envs/horovod-setup/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 837, in run_commandline
    _run(args)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 825, in _run
    return _run_elastic(args)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 738, in _run_elastic
    return gloo_run_elastic(settings, env, args.run_func if args.run_func else args.command, executable)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/gloo_run.py", line 380, in gloo_run_elastic
    return launch_gloo_elastic(command_or_func, exec_command, settings, env, get_common_interfaces, rendezvous, executable)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/gloo_run.py", line 351, in launch_gloo_elastic
    raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: ip-10-20-1-15.us-east-2.compute.internal[1]
Exit code: 143

(horovod-setup) (miniconda3) [pgadikar@ip-10-20-1-15 experiments]$
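
A note on reading the exit codes in the log above: on Linux, a status above 128 conventionally means the process was terminated by signal number (status - 128). Under that convention, 143 corresponds to SIGTERM (the manually killed worker, rank 1) and 134 corresponds to SIGABRT (ranks 0 and 2, consistent with the "terminate called after throwing an instance of 'gloo::IoException'" lines). The sketch below decodes the three statuses; the helper name describe_exit_status is made up for illustration and is not part of Horovod.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode a shell-style exit status using the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"unknown signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


if __name__ == "__main__":
    # (rank, status) pairs copied from the "Process N exit with status code" lines.
    for rank, status in [(1, 143), (2, 134), (0, 134)]:
        print(f"rank {rank}: {describe_exit_status(status)}")
```

This only interprets the statuses; it does not explain why the surviving ranks abort when process sets are in use.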