Add support for Hydra MPI
maxhgerlach opened this issue
Max H. Gerlach commented
Discussed in #2761
Originally posted by dzhwinter October 21, 2020
2020-10-21 21:17:16.484873: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Unknown MPI implementation given in output of mpirun --version:
HYDRA build details:
Version: 3.3.2
Release Date: Tue Nov 12 21:23:16 CST 2019
CC: gcc
CXX: g++
F77: gfortran
F90: gfortran
Configure options: '--disable-option-checking' '--prefix=/usr/local' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/openpa/src -I/root/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/root/mpich-3.3.2/src/mpi/romio/include' 'MPLLIBNAME=mpl'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available: poll select
Unknown MPI implementation given in output of mpirun --version:
HYDRA build details:
Version: 3.3.2
Release Date: Tue Nov 12 21:23:16 CST 2019
CC: gcc
CXX: g++
F77: gfortran
F90: gfortran
Configure options: '--disable-option-checking' '--prefix=/usr/local' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/openpa/src -I/root/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/root/mpich-3.3.2/src/mpi/romio/include' 'MPLLIBNAME=mpl'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available: poll select
Unknown MPI implementation given in output of mpirun --version:
HYDRA build details:
Version: 3.3.2
Release Date: Tue Nov 12 21:23:16 CST 2019
CC: gcc
CXX: g++
F77: gfortran
F90: gfortran
Configure options: '--disable-option-checking' '--prefix=/usr/local' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/openpa/src -I/root/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/root/mpich-3.3.2/src/mpi/romio/include' 'MPLLIBNAME=mpl'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Checkpointing libraries available:
Demux engines available: poll select
Traceback (most recent call last):
File "/dockerdata/anaconda3/bin/horovodrun", line 10, in <module>
sys.exit(run_commandline())
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 723, in run_commandline
_run(args)
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 713, in _run
return _run_static(args)
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 571, in _run_static
_launch_job(args, settings, nics, command)
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 686, in _launch_job
args.verbose)
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 657, in run_controller
mpi_run()
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 678, in mpi_run_fn
mpi_run(settings, nics, env, command)
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/mpi_run.py", line 143, in mpi_run
raise Exception(_MPI_NOT_FOUND_ERROR_MSG)
Exception: horovod does not find an installed MPI.
Choose one of:
1. Install Open MPI 4.0.0+ or IBM Spectrum MPI or MPICH and re-install Horovod (use --no-cache-dir pip option).
2. Run distributed training script using the standard way provided by your MPI distribution (usually mpirun, srun, or jsrun).
3. Use built-in gloo option (horovodrun --gloo ...).
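The log above shows the launcher classifying the MPI installation from the text printed by `mpirun --version`, and that the Hydra banner is not recognized. Below is a minimal sketch of how such a check could be extended to accept Hydra. It is not Horovod's actual code: the function name `detect_mpi_implementation`, the marker table, and the returned labels are hypothetical. Mapping the Hydra banner to MPICH is also an assumption, since other MPI distributions ship Hydra as their process manager too.

```python
from typing import Optional

# Hypothetical marker table (not taken from Horovod): each entry maps a
# substring of the `mpirun --version` output to an illustrative label.
_MARKERS = (
    ("Open MPI", "openmpi"),
    ("OpenRTE", "openmpi"),
    ("IBM Spectrum MPI", "spectrum-mpi"),
    ("MPICH", "mpich"),
    # Assumption: a Hydra banner is treated as an MPICH-style launcher.
    # In the log above, "mpich" appears only in lowercase inside paths,
    # so a case-sensitive "MPICH" check alone would not match it.
    ("HYDRA build details", "mpich"),
)


def detect_mpi_implementation(version_output: str) -> Optional[str]:
    """Return a label for the MPI flavour, or None if it is not recognized."""
    for marker, implementation in _MARKERS:
        if marker in version_output:
            return implementation
    return None


# Shortened copy of the banner from the log above.
hydra_output = (
    "HYDRA build details:\n"
    "    Version: 3.3.2\n"
    "    Process Manager: pmi\n"
)
print(detect_mpi_implementation(hydra_output))  # prints "mpich"
```

A lookup table keeps the detection order explicit and makes it cheap to add another banner later. Whether a Hydra launcher accepts the same command-line flags as the other implementations is a separate question that this sketch does not address.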
Output of `horovodrun --check-build`:
Checking whether extension tensorflow was built.
2020-10-21 21:28:31.314100: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was built.
Checking whether extension torch was built.
Traceback (most recent call last):
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/torch/__init__.py", line 21, in <module>
__file__, 'mpi_lib_v2')
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
ext_name, full_path, ext_env_var
ImportError: Extension horovod.torch has not been built: /dockerdata/anaconda3/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 80, in _target_fn
ext = importlib.import_module('.' + ext_base_name, 'horovod')
File "/dockerdata/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/torch/__init__.py", line 24, in <module>
__file__, 'mpi_lib', '_mpi_lib')
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
ext_name, full_path, ext_env_var
ImportError: Extension horovod.torch has not been built: /dockerdata/anaconda3/lib/python3.7/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Extension torch was NOT built.
Checking whether extension mxnet was built.
Traceback (most recent call last):
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 80, in _target_fn
ext = importlib.import_module('.' + ext_base_name, 'horovod')
File "/dockerdata/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/mxnet/__init__.py", line 19, in <module>
__file__, 'mpi_lib')
File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
ext_name, full_path, ext_env_var
ImportError: Extension horovod.mxnet has not been built: /dockerdata/anaconda3/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error.
Extension mxnet was NOT built.
Checking whether extension tensorflow was built with MPI.
2020-10-21 21:28:33.215982: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was built with MPI.
Checking whether extension tensorflow was built with Gloo.
2020-10-21 21:28:35.069995: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was built with Gloo.
Checking whether extension tensorflow was built with NCCL.
2020-10-21 21:28:36.950945: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was built with NCCL.
Checking whether extension tensorflow was built with DDL.
2020-10-21 21:28:38.789250: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was NOT built with DDL.
Checking whether extension tensorflow was built with CCL.
2020-10-21 21:28:40.626667: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was NOT built with CCL.
Horovod v0.20.3:
Available Frameworks:
[X] TensorFlow
[ ] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo