horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

Add support for Hydra MPI

maxhgerlach opened this issue

Discussed in #2761

Originally posted by dzhwinter October 21, 2020

2020-10-21 21:17:16.484873: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Unknown MPI implementation given in output of mpirun --version:
HYDRA build details:
    Version:                                 3.3.2
    Release Date:                            Tue Nov 12 21:23:16 CST 2019
    CC:                              gcc
    CXX:                             g++
    F77:                             gfortran
    F90:                             gfortran
    Configure options:                       '--disable-option-checking' '--prefix=/usr/local' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/openpa/src -I/root/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/root/mpich-3.3.2/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:
    Demux engines available:                 poll select

Unknown MPI implementation given in output of mpirun --version:
HYDRA build details:
    Version:                                 3.3.2
    Release Date:                            Tue Nov 12 21:23:16 CST 2019
    CC:                              gcc
    CXX:                             g++
    F77:                             gfortran
    F90:                             gfortran
    Configure options:                       '--disable-option-checking' '--prefix=/usr/local' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/openpa/src -I/root/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/root/mpich-3.3.2/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:
    Demux engines available:                 poll select

Unknown MPI implementation given in output of mpirun --version:
HYDRA build details:
    Version:                                 3.3.2
    Release Date:                            Tue Nov 12 21:23:16 CST 2019
    CC:                              gcc
    CXX:                             g++
    F77:                             gfortran
    F90:                             gfortran
    Configure options:                       '--disable-option-checking' '--prefix=/usr/local' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/mpl/include -I/root/mpich-3.3.2/src/openpa/src -I/root/mpich-3.3.2/src/openpa/src -D_REENTRANT -I/root/mpich-3.3.2/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:
    Demux engines available:                 poll select

Traceback (most recent call last):
  File "/dockerdata/anaconda3/bin/horovodrun", line 10, in <module>
    sys.exit(run_commandline())
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 723, in run_commandline
    _run(args)
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 713, in _run
    return _run_static(args)
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 571, in _run_static
    _launch_job(args, settings, nics, command)
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 686, in _launch_job
    args.verbose)
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 657, in run_controller
    mpi_run()
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/launch.py", line 678, in mpi_run_fn
    mpi_run(settings, nics, env, command)
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/runner/mpi_run.py", line 143, in mpi_run
    raise Exception(_MPI_NOT_FOUND_ERROR_MSG)
Exception: horovod does not find an installed MPI.

Choose one of:
1. Install Open MPI 4.0.0+ or IBM Spectrum MPI or MPICH and re-install Horovod (use --no-cache-dir pip option).
2. Run distributed training script using the standard way provided by your MPI distribution (usually mpirun, srun, or jsrun).
3. Use built-in gloo option (horovodrun --gloo ...).
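
For illustration, here is a minimal sketch of how a version check could recognize the output above. It is not Horovod's actual code: the function name `classify_mpi` and the label constants are hypothetical, and the substrings matched for Open MPI and IBM Spectrum MPI are assumptions about what those launchers print. The only string taken from this report is the `HYDRA build details` banner, which MPICH's default process manager prints instead of the word "MPICH". Treating that banner as MPICH is itself an assumption, since other MPI distributions also ship Hydra-based launchers.

```python
import re

# Hypothetical labels for this sketch; not Horovod's internal constants.
OPEN_MPI = "OpenMPI"
SPECTRUM_MPI = "SpectrumMPI"
MPICH = "MPICH"
UNKNOWN = "Unknown"

# Banner printed by the Hydra process manager (taken from the log above).
_HYDRA_BANNER = re.compile(r"^\s*HYDRA build details", re.MULTILINE)


def classify_mpi(version_output: str) -> str:
    """Guess the MPI implementation from the text of `mpirun --version`."""
    # Assumed marker strings for Open MPI and IBM Spectrum MPI.
    if "Open MPI" in version_output or "OpenRTE" in version_output:
        return OPEN_MPI
    if "IBM Spectrum MPI" in version_output:
        return SPECTRUM_MPI
    # Hydra identifies itself by its own name, so check for the banner
    # in addition to the literal word "MPICH" (assumption: Hydra => MPICH).
    if "MPICH" in version_output or _HYDRA_BANNER.search(version_output):
        return MPICH
    return UNKNOWN


if __name__ == "__main__":
    sample = "HYDRA build details:\n    Version:    3.3.2\n"
    print(classify_mpi(sample))
```

With a check along these lines, the Hydra banner would be mapped to a known implementation instead of falling through to the "Unknown MPI implementation" branch. Whether the MPICH launch flags are then correct for every Hydra build would still need to be verified.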

Output of horovodrun --check-build:

Checking whether extension tensorflow was built.
2020-10-21 21:28:31.314100: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was built.
Checking whether extension torch was built.
Traceback (most recent call last):
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/torch/__init__.py", line 21, in <module>
    __file__, 'mpi_lib_v2')
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
    ext_name, full_path, ext_env_var
ImportError: Extension horovod.torch has not been built: /dockerdata/anaconda3/lib/python3.7/site-packages/horovod/torch/mpi_lib_v2.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 80, in _target_fn
    ext = importlib.import_module('.' + ext_base_name, 'horovod')
  File "/dockerdata/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/torch/__init__.py", line 24, in <module>
    __file__, 'mpi_lib', '_mpi_lib')
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
    ext_name, full_path, ext_env_var
ImportError: Extension horovod.torch has not been built: /dockerdata/anaconda3/lib/python3.7/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Extension torch was NOT built.
Checking whether extension mxnet was built.
Traceback (most recent call last):
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 80, in _target_fn
    ext = importlib.import_module('.' + ext_base_name, 'horovod')
  File "/dockerdata/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/mxnet/__init__.py", line 19, in <module>
    __file__, 'mpi_lib')
  File "/dockerdata/anaconda3/lib/python3.7/site-packages/horovod/common/util.py", line 56, in check_extension
    ext_name, full_path, ext_env_var
ImportError: Extension horovod.mxnet has not been built: /dockerdata/anaconda3/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_MXNET=1 to debug the build error.
Extension mxnet was NOT built.
Checking whether extension tensorflow was built with MPI.
2020-10-21 21:28:33.215982: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was built with MPI.
Checking whether extension tensorflow was built with Gloo.
2020-10-21 21:28:35.069995: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was built with Gloo.
Checking whether extension tensorflow was built with NCCL.
2020-10-21 21:28:36.950945: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was built with NCCL.
Checking whether extension tensorflow was built with DDL.
2020-10-21 21:28:38.789250: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was NOT built with DDL.
Checking whether extension tensorflow was built with CCL.
2020-10-21 21:28:40.626667: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Extension tensorflow was NOT built with CCL.

Horovod v0.20.3:

Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo