Launching horovod task function was not successful
Cow-Kite opened this issue
Environment:
- Framework: (TensorFlow, Keras, PyTorch, MXNet) PyTorch
- Framework version: 2.0.1+cu117
- Horovod version: 0.28.1
- MPI version: 4.0.3
- CUDA version:
- NCCL version:
- Python version: 3.8.10
- Spark / PySpark version:
- Ray version:
- OS and version: Ubuntu 20.04
- GCC version: 9.4.0
- CMake version: 3.27.1
Checklist:
- Did you search issues to find if somebody asked this question before? yes
- If your question is about hang, did you read this doc? yes
- If your question is about docker, did you read this doc? yes
- Did you check if your question is answered in the [troubleshooting guide](https://github.com/horovod/horovod/blob/master/docs/troubleshooting.rst)? yes
Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.
Hello,
I get an error when running distributed model training with Horovod. It only happens when I run across multiple nodes; on a single node there is no problem.
The nodes share a file system over NFS.
1. This is the command I ran:
horovodrun -np 4 --mpi -H MN:2,SN01:2 python3 test.py
Error:
Launching horovod task function was not successful:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/task_fn.py", line 66, in <module>
_task_fn(index, num_hosts, driver_addresses, settings)
File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/task_fn.py", line 31, in _task_fn
task.wait_for_initial_registration(settings.start_timeout)
File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/common/service/task_service.py", line 253, in wait_for_initial_registration
timeout.check_time_out_for('tasks to start')
File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/common/util/timeout.py", line 39, in check_time_out_for
raise TimeoutException(
horovod.runner.common.util.timeout.TimeoutException: Timed out waiting for tasks to start. Please check connectivity between servers. You may need to increase the --start-timeout parameter if you have too many servers. Timeout after 30 seconds.
kang@MN:~/nfs$ /usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
2. This is the same command with the --verbose option added:
horovodrun --verbose -np 4 --mpi -H MN:2,SN01:2 python3 test.py
Output:
Filtering local host names.
Remote host found: SN01
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Launched horovod server.
Launching horovod task function: /usr/bin/python3 -m horovod.runner.task_fn gAVLAC4= gAVLAi4= gAWVhwAAAAAAAAB9lCiMAmxvlF2UjAkxMjcuMC4wLjGUTcSWhpRhjAdlbnAxMHMwlF2UjA0xOTIuMTY4LjAuMTAwlE3EloaUYYwHZG9ja2VyMJRdlIwKMTcyLjE3LjAuMZRNxJaGlGGMDHZ4bGFuLmNhbGljb5RdlIwOMTcyLjE4LjE5NS4xMjiUTcSWhpRhdS4= gAWVVgIAAAAAAACMI2hvcm92b2QucnVubmVyLmNvbW1vbi51dGlsLnNldHRpbmdzlIwIU2V0dGluZ3OUk5QpgZR9lCiMCG51bV9wcm9jlEsEjAd2ZXJib3NllEsCjAhzc2hfcG9ydJROjBFzc2hfaWRlbnRpdHlfZmlsZZROjA5leHRyYV9tcGlfYXJnc5ROjAh0Y3BfZmxhZ5SJjAxiaW5kaW5nX2FyZ3OUTowDa2V5lEMgf+O/8DEtrzrMl4uoHQD+0+OGBsA90Sb+TCCd6NBRseWUjA1zdGFydF90aW1lb3V0lIwiaG9yb3ZvZC5ydW5uZXIuY29tbW9uLnV0aWwudGltZW91dJSMB1RpbWVvdXSUk5QpgZR9lCiMCF90aW1lb3V0lEsejAtfdGltZW91dF9hdJRHQdk0lZPTke2MCF9tZXNzYWdllIyhVGltZWQgb3V0IHdhaXRpbmcgZm9yIHthY3Rpdml0eX0uIFBsZWFzZSBjaGVjayBjb25uZWN0aXZpdHkgYmV0d2VlbiBzZXJ2ZXJzLiBZb3UgbWF5IG5lZWQgdG8gaW5jcmVhc2UgdGhlIC0tc3RhcnQtdGltZW91dCBwYXJhbWV0ZXIgaWYgeW91IGhhdmUgdG9vIG1hbnkgc2VydmVycy6UdWKMD291dHB1dF9maWxlbmFtZZROjA1ydW5fZnVuY19tb2RllImMBG5pY3OUTowHZWxhc3RpY5SJjBxwcmVmaXhfb3V0cHV0X3dpdGhfdGltZXN0YW1wlImMBWhvc3RzlIwLTU46MixTTjAxOjKUdWIu
Launching horovod task function: ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no SN01 /usr/bin/python3 -m horovod.runner.task_fn gAVLAS4= gAVLAi4= gAWVhwAAAAAAAAB9lCiMAmxvlF2UjAkxMjcuMC4wLjGUTcSWhpRhjAdlbnAxMHMwlF2UjA0xOTIuMTY4LjAuMTAwlE3EloaUYYwHZG9ja2VyMJRdlIwKMTcyLjE3LjAuMZRNxJaGlGGMDHZ4bGFuLmNhbGljb5RdlIwOMTcyLjE4LjE5NS4xMjiUTcSWhpRhdS4= gAWVVgIAAAAAAACMI2hvcm92b2QucnVubmVyLmNvbW1vbi51dGlsLnNldHRpbmdzlIwIU2V0dGluZ3OUk5QpgZR9lCiMCG51bV9wcm9jlEsEjAd2ZXJib3NllEsCjAhzc2hfcG9ydJROjBFzc2hfaWRlbnRpdHlfZmlsZZROjA5leHRyYV9tcGlfYXJnc5ROjAh0Y3BfZmxhZ5SJjAxiaW5kaW5nX2FyZ3OUTowDa2V5lEMgf+O/8DEtrzrMl4uoHQD+0+OGBsA90Sb+TCCd6NBRseWUjA1zdGFydF90aW1lb3V0lIwiaG9yb3ZvZC5ydW5uZXIuY29tbW9uLnV0aWwudGltZW91dJSMB1RpbWVvdXSUk5QpgZR9lCiMCF90aW1lb3V0lEsejAtfdGltZW91dF9hdJRHQdk0lZPTke2MCF9tZXNzYWdllIyhVGltZWQgb3V0IHdhaXRpbmcgZm9yIHthY3Rpdml0eX0uIFBsZWFzZSBjaGVjayBjb25uZWN0aXZpdHkgYmV0d2VlbiBzZXJ2ZXJzLiBZb3UgbWF5IG5lZWQgdG8gaW5jcmVhc2UgdGhlIC0tc3RhcnQtdGltZW91dCBwYXJhbWV0ZXIgaWYgeW91IGhhdmUgdG9vIG1hbnkgc2VydmVycy6UdWKMD291dHB1dF9maWxlbmFtZZROjA1ydW5fZnVuY19tb2RllImMBG5pY3OUTowHZWxhc3RpY5SJjBxwcmVmaXhfb3V0cHV0X3dpdGhfdGltZXN0YW1wlImMBWhvc3RzlIwLTU46MixTTjAxOjKUdWIu
Attempted to launch horovod task servers.
Waiting for the hosts to acknowledge.
Launching horovod task function was not successful:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/task_fn.py", line 66, in <module>
_task_fn(index, num_hosts, driver_addresses, settings)
File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/task_fn.py", line 31, in _task_fn
task.wait_for_initial_registration(settings.start_timeout)
File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/common/service/task_service.py", line 253, in wait_for_initial_registration
timeout.check_time_out_for('tasks to start')
File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/common/util/timeout.py", line 39, in check_time_out_for
raise TimeoutException(
horovod.runner.common.util.timeout.TimeoutException: Timed out waiting for tasks to start. Please check connectivity between servers. You may need to increase the --start-timeout parameter if you have too many servers. Timeout after 30 seconds.
kang@MN:~/nfs$ /usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
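The long arguments passed to `horovod.runner.task_fn` in the verbose log look like base64-encoded pickles. Assuming that is the encoding, the third argument can be decoded to see which addresses the driver advertises to the task servers. This is a minimal sketch: the helper name `decode_arg` is made up for illustration, and `pickle.loads` should only be run on data from your own log.

```python
import base64
import pickle

# Third argument of the "horovod.runner.task_fn" command, copied from the
# verbose log above (assumed here to hold the driver's addresses).
DRIVER_ADDRESSES_ARG = (
    "gAWVhwAAAAAAAAB9lCiMAmxvlF2UjAkxMjcuMC4wLjGUTcSWhpRh"
    "jAdlbnAxMHMwlF2UjA0xOTIuMTY4LjAuMTAwlE3EloaUYYwH"
    "ZG9ja2VyMJRdlIwKMTcyLjE3LjAuMZRNxJaGlGGM"
    "DHZ4bGFuLmNhbGljb5RdlIwOMTcyLjE4LjE5NS4xMjiUTcSWhpRhdS4="
)


def decode_arg(arg: str):
    """Decode one base64-encoded pickle argument (hypothetical helper).

    Only use this on strings taken from your own logs: unpickling
    untrusted data can execute arbitrary code.
    """
    return pickle.loads(base64.b64decode(arg))


# Maps interface name -> list of (ip, port) pairs.
addresses = decode_arg(DRIVER_ADDRESSES_ARG)
for interface, endpoints in addresses.items():
    for ip, port in endpoints:
        print(f"{interface}: {ip}:{port}")
```

For this log the decoded dictionary lists four interfaces: `lo`, `enp10s0` (192.168.0.100), `docker0` (172.17.0.1) and `vxlan.calico` (172.18.195.128), all on port 38596. Addresses on `lo`, `docker0` and `vxlan.calico` are often not reachable from another host, so they may be worth ruling out first. The fourth argument is the pickled settings object and probably needs Horovod importable to decode.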
SSH connections between the nodes seem to be working fine. Where should I look for the source of the error? Thank you.
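A working SSH login only shows that port 22 is reachable. The timeout message asks to check connectivity between servers, and the task servers may listen on other ports that a firewall could block. The sketch below is one way to probe a TCP port from the other node. The function name `can_connect` is hypothetical, and the host and port have to be replaced with the values from your own run.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (placeholder values, run it on SN01 against the driver node):
# print(can_connect("MN", 12345))
```

If SSH works but this returns False for the port the driver is listening on, a firewall rule or an unroutable interface is a likely candidate.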