Unexpected Worker Failure when using Elastic Horovod + Process Sets
Pranavug opened this issue · comments
Environment:
- Framework: PyTorch
- Framework version: 1.9.0+cu102
- Horovod version: 0.28.1
- MPI version: N/A
- CUDA version: cu102
- NCCL version: 2708
- Python version: 3.9.18
- Spark / PySpark version: N/A
- Ray version: N/A
- OS and version: Linux SMP x86_64 x86_64 x86_64 GNU/Linux
- GCC version: 7.3.1
- CMake version: 3.14
Bug report:
import horovod.torch as hvd
import time

worker_1_process_set = hvd.ProcessSet([1])
worker_2_process_set = hvd.ProcessSet([0, 2])

hvd.init(process_sets="dynamic")
hvd.add_process_set(worker_1_process_set)
hvd.add_process_set(worker_2_process_set)


@hvd.elastic.run
def main(state):
    rank = hvd.rank()
    size = hvd.size()
    if rank == 0:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)
    elif rank == 1:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)
    elif rank == 2:
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)


if __name__ == '__main__':
    print(f"Initialized with rank {hvd.rank()}", flush=True)
    # Initialize the TorchState
    state = hvd.elastic.TorchState()
    print(f"Running main with rank {hvd.rank()}", flush=True)
    main(state)
    print(f"Finished running main with rank {hvd.rank()}", flush=True)
    print(f"Joined with rank {hvd.rank()}", flush=True)
I am running the code above with Elastic Horovod and process sets. I launch all 3 workers on a single node with the horovodrun command shown at the top of the log below. After I kill one of the worker processes from a terminal, all the remaining processes are killed as well. If I repeat the same workflow with the same command but without process sets, killing one process does not terminate the remaining 2 workers.

I expected the same behavior with process sets: one worker failure should not terminate the remaining processes, yet that is what happens in the log below. What could be the reason? Is this a bug, or am I missing something when using process sets?
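To make the comparison between the two cases easier to reproduce, the script could be driven by a single switch so that the exact same file and command are used with and without process sets. The following is only a sketch: the environment variable `USE_PROCESS_SETS` and the helper names `process_sets_enabled`, `init_horovod`, and `run` are hypothetical and not part of Horovod, and it assumes the launcher forwards the variable to the workers. The Horovod calls themselves are the ones already used in the script above.

```python
import os
import time


def process_sets_enabled(env=os.environ):
    """Return True unless USE_PROCESS_SETS (a hypothetical switch) is set to "0"."""
    return env.get("USE_PROCESS_SETS", "1") != "0"


def init_horovod(enabled):
    """Initialize Horovod with or without the two process sets from the report."""
    # Imported lazily so the helper above can be used without Horovod installed.
    import horovod.torch as hvd

    if enabled:
        hvd.init(process_sets="dynamic")
        hvd.add_process_set(hvd.ProcessSet([1]))
        hvd.add_process_set(hvd.ProcessSet([0, 2]))
    else:
        hvd.init()
    return hvd


def run():
    """Run the same sleep loop as the report under hvd.elastic.run."""
    hvd = init_horovod(process_sets_enabled())

    @hvd.elastic.run
    def main(state):
        rank = hvd.rank()
        while True:
            print(f"Sleeping for 1 second: {rank}", flush=True)
            time.sleep(1)

    main(hvd.elastic.TorchState())
```

With a call to `run()` at the end of the script, launching once with `USE_PROCESS_SETS=0` and once without it would isolate process sets as the only difference between the two runs.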
Similar issues:
(horovod-setup) (miniconda3) [pgadikar@ip-10-20-1-15 experiments]$ horovodrun -np 3 --min-np 2 --host-discovery-script discover-hosts.sh --elastic-timeout 5 --network-interfaces eth0,lo python master-child-exp.py
[1]<stdout>:Initialized with rank 1
[1]<stdout>:Running main with rank 1
[2]<stdout>:Initialized with rank 2
[2]<stdout>:Running main with rank 2
[0]<stdout>:Initialized with rank 0
[0]<stdout>:Running main with rank 0
[1]<stdout>:Sleeping for 1 second: 1
[2]<stdout>:Sleeping for 1 second: 2
[0]<stdout>:Sleeping for 1 second: 0
[2]<stderr>:[2024-02-07 04:16:27.910743: E /tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/horovod/common/operations.cc:697] [2]: Horovod background loop uncaught exception: [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:20903: Connection reset by peer
[0]<stderr>:[2024-02-07 04:16:27.910752: E /tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/horovod/common/operations.cc:697] [0]: Horovod background loop uncaught exception: [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:49541: Connection reset by peer
[2]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[0]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[2]<stderr>: what(): [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:20903: Connection reset by peer
[0]<stderr>: what(): [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:49541: Connection reset by peer
Process 1 exit with status code 143.
Process 2 exit with status code 134.
Process 0 exit with status code 134.
ERROR:root:failure count == 3 -> stop running
Traceback (most recent call last):
  File "/home/pgadikar/miniconda3/envs/horovod-setup/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 837, in run_commandline
    _run(args)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 825, in _run
    return _run_elastic(args)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 738, in _run_elastic
    return gloo_run_elastic(settings, env, args.run_func if args.run_func else args.command, executable)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/gloo_run.py", line 380, in gloo_run_elastic
    return launch_gloo_elastic(command_or_func, exec_command, settings, env, get_common_interfaces, rendezvous, executable)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/gloo_run.py", line 351, in launch_gloo_elastic
    raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: ip-10-20-1-15.us-east-2.compute.internal[1]
Exit code: 143
(horovod-setup) (miniconda3) [pgadikar@ip-10-20-1-15 experiments]$
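For reading the exit codes in the log above: assuming they follow the usual shell convention of 128 plus the signal number, 143 corresponds to SIGTERM (15), which matches worker 1 being killed by hand, and 134 corresponds to SIGABRT (6), which is consistent with workers 0 and 2 aborting after the uncaught `gloo::IoException`. A small sketch that decodes them (the helper name `describe_exit_status` is made up for illustration):

```python
import signal


def describe_exit_status(code):
    """Describe an exit status, treating values above 128 as 128 + signal number."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # No signal with that number exists on this platform.
            return f"exit status {code} (no matching signal)"
    return f"exit status {code}"


# Exit codes taken from the log above, keyed by worker rank.
for rank, code in [(1, 143), (2, 134), (0, 134)]:
    print(f"Process {rank}: {describe_exit_status(code)}")
```

Read this way, only worker 1 was terminated externally; the other two crashed on their own when their Gloo connection to the killed peer was reset.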