horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

Launching horovod task function was not successful

Cow-Kite opened this issue

Environment:

  1. Framework: PyTorch
  2. Framework version: 2.0.1+cu117
  3. Horovod version: 0.28.1
  4. MPI version: 4.0.3
  5. CUDA version:
  6. NCCL version:
  7. Python version: 3.8.10
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 20.04
  11. GCC version: 9.4.0
  12. CMake version: 3.27.1

Checklist:

  1. Did you search issues to find if somebody asked this question before? yes
  2. If your question is about hang, did you read this doc? yes
  3. If your question is about docker, did you read this doc? yes
  4. Did you check if your question is answered in the [troubleshooting guide](https://github.com/horovod/horovod/blob/master/docs/troubleshooting.rst)? yes

Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.

Hello,
While running distributed model training with Horovod, I encountered an error. It occurs only when running across multiple nodes; a single node works without problems.
The nodes share files through NFS.

1. Here is the command I ran:

horovodrun -np 4 --mpi -H MN:2,SN01:2 python3 test.py

Error:

Launching horovod task function was not successful:
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/task_fn.py", line 66, in <module>
    _task_fn(index, num_hosts, driver_addresses, settings)
  File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/task_fn.py", line 31, in _task_fn
    task.wait_for_initial_registration(settings.start_timeout)
  File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/common/service/task_service.py", line 253, in wait_for_initial_registration
    timeout.check_time_out_for('tasks to start')
  File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/common/util/timeout.py", line 39, in check_time_out_for
    raise TimeoutException(
horovod.runner.common.util.timeout.TimeoutException: Timed out waiting for tasks to start. Please check connectivity between servers. You may need to increase the --start-timeout parameter if you have too many servers. Timeout after 30 seconds.

kang@MN:~/nfs$ /usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

2. Here is the same command with the --verbose option added:

horovodrun --verbose -np 4 --mpi -H MN:2,SN01:2 python3 test.py

Output:

Filtering local host names.
Remote host found: SN01
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Launched horovod server.
Launching horovod task function: /usr/bin/python3 -m horovod.runner.task_fn gAVLAC4= gAVLAi4= gAWVhwAAAAAAAAB9lCiMAmxvlF2UjAkxMjcuMC4wLjGUTcSWhpRhjAdlbnAxMHMwlF2UjA0xOTIuMTY4LjAuMTAwlE3EloaUYYwHZG9ja2VyMJRdlIwKMTcyLjE3LjAuMZRNxJaGlGGMDHZ4bGFuLmNhbGljb5RdlIwOMTcyLjE4LjE5NS4xMjiUTcSWhpRhdS4= gAWVVgIAAAAAAACMI2hvcm92b2QucnVubmVyLmNvbW1vbi51dGlsLnNldHRpbmdzlIwIU2V0dGluZ3OUk5QpgZR9lCiMCG51bV9wcm9jlEsEjAd2ZXJib3NllEsCjAhzc2hfcG9ydJROjBFzc2hfaWRlbnRpdHlfZmlsZZROjA5leHRyYV9tcGlfYXJnc5ROjAh0Y3BfZmxhZ5SJjAxiaW5kaW5nX2FyZ3OUTowDa2V5lEMgf+O/8DEtrzrMl4uoHQD+0+OGBsA90Sb+TCCd6NBRseWUjA1zdGFydF90aW1lb3V0lIwiaG9yb3ZvZC5ydW5uZXIuY29tbW9uLnV0aWwudGltZW91dJSMB1RpbWVvdXSUk5QpgZR9lCiMCF90aW1lb3V0lEsejAtfdGltZW91dF9hdJRHQdk0lZPTke2MCF9tZXNzYWdllIyhVGltZWQgb3V0IHdhaXRpbmcgZm9yIHthY3Rpdml0eX0uIFBsZWFzZSBjaGVjayBjb25uZWN0aXZpdHkgYmV0d2VlbiBzZXJ2ZXJzLiBZb3UgbWF5IG5lZWQgdG8gaW5jcmVhc2UgdGhlIC0tc3RhcnQtdGltZW91dCBwYXJhbWV0ZXIgaWYgeW91IGhhdmUgdG9vIG1hbnkgc2VydmVycy6UdWKMD291dHB1dF9maWxlbmFtZZROjA1ydW5fZnVuY19tb2RllImMBG5pY3OUTowHZWxhc3RpY5SJjBxwcmVmaXhfb3V0cHV0X3dpdGhfdGltZXN0YW1wlImMBWhvc3RzlIwLTU46MixTTjAxOjKUdWIu
Launching horovod task function: ssh -o PasswordAuthentication=no -o StrictHostKeyChecking=no SN01    /usr/bin/python3 -m horovod.runner.task_fn gAVLAS4= gAVLAi4= gAWVhwAAAAAAAAB9lCiMAmxvlF2UjAkxMjcuMC4wLjGUTcSWhpRhjAdlbnAxMHMwlF2UjA0xOTIuMTY4LjAuMTAwlE3EloaUYYwHZG9ja2VyMJRdlIwKMTcyLjE3LjAuMZRNxJaGlGGMDHZ4bGFuLmNhbGljb5RdlIwOMTcyLjE4LjE5NS4xMjiUTcSWhpRhdS4= gAWVVgIAAAAAAACMI2hvcm92b2QucnVubmVyLmNvbW1vbi51dGlsLnNldHRpbmdzlIwIU2V0dGluZ3OUk5QpgZR9lCiMCG51bV9wcm9jlEsEjAd2ZXJib3NllEsCjAhzc2hfcG9ydJROjBFzc2hfaWRlbnRpdHlfZmlsZZROjA5leHRyYV9tcGlfYXJnc5ROjAh0Y3BfZmxhZ5SJjAxiaW5kaW5nX2FyZ3OUTowDa2V5lEMgf+O/8DEtrzrMl4uoHQD+0+OGBsA90Sb+TCCd6NBRseWUjA1zdGFydF90aW1lb3V0lIwiaG9yb3ZvZC5ydW5uZXIuY29tbW9uLnV0aWwudGltZW91dJSMB1RpbWVvdXSUk5QpgZR9lCiMCF90aW1lb3V0lEsejAtfdGltZW91dF9hdJRHQdk0lZPTke2MCF9tZXNzYWdllIyhVGltZWQgb3V0IHdhaXRpbmcgZm9yIHthY3Rpdml0eX0uIFBsZWFzZSBjaGVjayBjb25uZWN0aXZpdHkgYmV0d2VlbiBzZXJ2ZXJzLiBZb3UgbWF5IG5lZWQgdG8gaW5jcmVhc2UgdGhlIC0tc3RhcnQtdGltZW91dCBwYXJhbWV0ZXIgaWYgeW91IGhhdmUgdG9vIG1hbnkgc2VydmVycy6UdWKMD291dHB1dF9maWxlbmFtZZROjA1ydW5fZnVuY19tb2RllImMBG5pY3OUTowHZWxhc3RpY5SJjBxwcmVmaXhfb3V0cHV0X3dpdGhfdGltZXN0YW1wlImMBWhvc3RzlIwLTU46MixTTjAxOjKUdWIu
Attempted to launch horovod task servers.
Waiting for the hosts to acknowledge.
Launching horovod task function was not successful:
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/task_fn.py", line 66, in <module>
    _task_fn(index, num_hosts, driver_addresses, settings)
  File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/task_fn.py", line 31, in _task_fn
    task.wait_for_initial_registration(settings.start_timeout)
  File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/common/service/task_service.py", line 253, in wait_for_initial_registration
    timeout.check_time_out_for('tasks to start')
  File "/home/kang/.local/lib/python3.8/site-packages/horovod/runner/common/util/timeout.py", line 39, in check_time_out_for
    raise TimeoutException(
horovod.runner.common.util.timeout.TimeoutException: Timed out waiting for tasks to start. Please check connectivity between servers. You may need to increase the --start-timeout parameter if you have too many servers. Timeout after 30 seconds.

kang@MN:~/nfs$ /usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 5 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
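
The long arguments passed to `horovod.runner.task_fn` in the verbose log look like base64-encoded pickles, and the traceback shows the call order `_task_fn(index, num_hosts, driver_addresses, settings)`, so the third argument is presumably the set of driver addresses the remote task tries to reach. The sketch below decodes that argument for inspection. It is only a diagnostic aid under those assumptions: `NoGlobalsUnpickler`, `decode_addresses`, and `DRIVER_ADDRESSES_ARG` are illustrative names, not Horovod APIs, and the unpickler deliberately refuses to import anything, so it handles plain dicts, lists, tuples, strings, and numbers only.

```python
import base64
import io
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that rebuilds only built-in containers, strings, and numbers."""

    def find_class(self, module, name):
        # A plain address mapping never needs an import, so refuse all of them.
        raise pickle.UnpicklingError(f"unexpected global {module}.{name}")


def decode_addresses(arg):
    """Decode a base64-encoded pickle such as the third task_fn argument."""
    raw = base64.b64decode(arg)
    return NoGlobalsUnpickler(io.BytesIO(raw)).load()


# Third task_fn argument, copied from the verbose log in this report.
DRIVER_ADDRESSES_ARG = (
    "gAWVhwAAAAAAAAB9lCiMAmxvlF2UjAkxMjcuMC4wLjGUTcSW"
    "hpRhjAdlbnAxMHMwlF2UjA0xOTIuMTY4LjAuMTAwlE3EloaU"
    "YYwHZG9ja2VyMJRdlIwKMTcyLjE3LjAuMZRNxJaGlGGMDHZ4"
    "bGFuLmNhbGljb5RdlIwOMTcyLjE4LjE5NS4xMjiUTcSWhpRh"
    "dS4="
)

if __name__ == "__main__":
    # Print each interface name with the (address, port) pairs attached to it.
    for nic, addresses in decode_addresses(DRIVER_ADDRESSES_ARG).items():
        print(nic, addresses)
```

For the log in this report, the decoded mapping lists four interfaces on MN: `lo`, `enp10s0`, `docker0`, and `vxlan.calico`. Loopback, Docker bridge, and overlay addresses are often not reachable from another host, so interface selection may be one place to look; `horovodrun` has an option for restricting the interfaces it uses (check `horovodrun --help` for the exact spelling in your version).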

SSH connections between the nodes work, as the verbose log confirms. Where should I look for the source of this error? Thank you.
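
A successful SSH login only shows that the SSH port is reachable; the timeout message asks about connectivity between servers more generally, and the task services appear to register over other TCP ports. As a rough way to test that separately, the sketch below tries a plain TCP connection to a given host and port. `can_connect` is a hypothetical helper, not part of Horovod, and it assumes something is already listening on the target port (for example, a throwaway listener started on the other node for the test).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

Calling `can_connect("192.168.0.100", port)` from SN01, with `port` set to whatever the test listener on MN is bound to, would show whether traffic on that interface gets through; a firewall that allows SSH but blocks other ports would make this return `False`.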