The program blocks at hvd.init().
divmid opened this issue
divmid commented
Environment:
- Framework: (TensorFlow, Keras, PyTorch, MXNet): TensorFlow
- Framework version: 2.9.2
- Horovod version: 0.28.1
- MPI version: mpirun (Open MPI) 4.1.4
- CUDA version:
- NCCL version:
- Python version: 3.8.10
- Spark / PySpark version:
- Ray version:
- OS and version:
- GCC version:
- CMake version:
Checklist:
- Did you search issues to find if somebody asked this question before?
- If your question is about hang, did you read this doc?
- If your question is about docker, did you read this doc?
- Did you check if your question is answered in the troubleshooting guide?
Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.
1. My physical host is:
[root@bm83 ~]# cat /etc/centos-release
CentOS Linux release 8.1.1911 (Core)
top - 15:59:32 up 345 days, 5:36, 2 users, load average: 3.35, 3.45, 3.31
Tasks: 395 total, 1 running, 380 sleeping, 14 stopped, 0 zombie
%Cpu(s): 5.3 us, 5.9 sy, 0.0 ni, 85.8 id, 1.6 wa, 0.2 hi, 1.1 si, 0.0 st
MiB Mem : 64260.5 total, 383.6 free, 5623.3 used, 58253.6 buff/cache
MiB Swap: 32288.0 total, 25981.3 free, 6306.7 used. 57995.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12652 root 20 0 1838576 363928 183776 S 97.7 0.6 16:13.30 python
12653 root 20 0 1838524 364924 184768 S 97.7 0.6 16:14.91 python
17665 1000 20 0 16.1g 2.1g 25564 S 57.3 3.4 419294:36 java
2. I set up my environment as follows:
[root@bm83 ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
horovod/horovod latest 4f3896dc9b9e 7 months ago 14.3GB
docker run -it -d --privileged --name horovod --network host -v /data/ssh/:/root/.ssh/ -v /data/horovod:/data/ horovod/horovod:latest
docker exec -it horovod /bin/bash
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/#PubkeyAuthentication yes/PubkeyAuthentication yes/' /etc/ssh/sshd_config
sed -i 's/#Port 22/Port 12345/' /etc/ssh/sshd_config
service ssh restart
apt update -y && apt install rsync net-tools vim ncat telnet -y
3. The script I ran is main.py:
import tensorflow as tf
import numpy as np
from tensorflow import keras
import horovod.tensorflow.keras as hvd
print("1111111111111111111")
hvd.init()
print("2222222222222222222")
model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-2.0, 1.0, 4.0, 7.0, 10.0, 13.0], dtype=float)
model.fit(xs, ys, epochs=3000)
if hvd.rank() == 0:
    model.save_weights("adasd.h5")
4. The command I ran is:
root@bm83:/data/QuakeMitchell# export HOROVOD_LOG_LEVEL=trace
root@bm83:/data/QuakeMitchell# mpirun --allow-run-as-root -oversubscribe --mca oob_tcp_include eth0,eth2 --mca btl tcp,self --mca oob tcp -map-by slot --mca plm_rsh_args "-p 12345 -q -o StrictHostKeyChecking=no" -np 2 -H 10.206.74.32:2 python /data/QuakeMitchell/main.py
1111111111111111111
[2024-01-26 07:43:02.115518: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:107] Using MPI to perform controller operations.
[2024-01-26 07:43:02.115573: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:73] Using MPI to perform CPU operations.
[2024-01-26 07:43:02.115589: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.h:51] MPI context enabled.
[2024-01-26 07:43:02.115612: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_controller.h:36] MPI Controller constructed.
1111111111111111111
[2024-01-26 07:43:02.118399: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:107] Using MPI to perform controller operations.
[2024-01-26 07:43:02.118443: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:73] Using MPI to perform CPU operations.
[2024-01-26 07:43:02.118473: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.h:51] MPI context enabled.
[2024-01-26 07:43:02.118503: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_controller.h:36] MPI Controller constructed.
[2024-01-26 07:43:02.185741: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.cc:195] Using MPI_COMM_WORLD as global communicator.
[2024-01-26 07:43:02.185741: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.cc:195] Using MPI_COMM_WORLD as global communicator.
--------------The program blocks at hvd.init()-------------
root@bm83:/data/QuakeMitchell# top
top - 07:28:21 up 345 days, 5:05, 1 user, load average: 2.62, 3.10, 3.13
Tasks: 26 total, 1 running, 11 sleeping, 14 stopped, 0 zombie
%Cpu(s): 5.5 us, 5.7 sy, 0.0 ni, 87.1 id, 0.3 wa, 0.2 hi, 1.1 si, 0.0 st
MiB Mem : 64260.5 total, 308.6 free, 5691.5 used, 58260.4 buff/cache
MiB Swap: 32288.0 total, 26062.5 free, 6225.5 used. 57926.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9134 root 20 0 1842688 367908 183580 S 98.3 0.6 1:43.27 python
9135 root 20 0 1842640 368152 183812 S 97.7 0.6 1:42.90 python
1 root 20 0 4244 0 0 S 0.0 0.0 0:00.04 bash
29 root 20 0 4244 1808 1544 S 0.0 0.0 0:00.15 bash
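One thing that may be worth ruling out (an assumption; the trace above does not confirm it) is Open MPI's TCP transport selecting a non-routed interface such as `lo` or `docker0`, which the Horovod Open MPI docs mention as a possible cause of hangs. The sketch below lists the interfaces visible inside the container and builds a `--mca btl_tcp_if_include` argument from the remaining ones. The helper names `btl_include_args` and `local_interfaces` and the default skip list are hypothetical.

```python
import socket


def btl_include_args(interfaces, skip=("lo", "docker0")):
    """Build an Open MPI `--mca btl_tcp_if_include` argument list.

    `skip` is an assumed default; adjust it to whichever interfaces
    are not routable between your hosts.
    """
    keep = [name for name in interfaces if name not in skip]
    if not keep:
        return []
    return ["--mca", "btl_tcp_if_include", ",".join(keep)]


def local_interfaces():
    """Return the interface names known to the kernel (Unix only)."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    # Print the extra mpirun arguments for this machine's interfaces.
    print(" ".join(btl_include_args(local_interfaces())))
```

The printed arguments could be appended to the `mpirun` command from step 4; whether that changes the behavior here is untested.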
Ata Fatahi commented
Neither your code nor the way you're launching Horovod looks correct. Please follow the Keras example here:
https://github.com/horovod/horovod/blob/master/examples/keras/keras_mnist.py
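As a rough illustration of how the linked example differs from the script in the issue, the sketch below wraps the optimizer in `hvd.DistributedOptimizer`, scales the learning rate by the number of workers, and broadcasts the initial weights from rank 0. It is not a copy of `keras_mnist.py`: the helper names `scaled_lr` and `train` and the base learning rate of 0.01 are placeholders, and it assumes TensorFlow 2.x with Horovod installed. These changes affect training only and would not by themselves explain a hang inside `hvd.init()`.

```python
def scaled_lr(base_lr, world_size):
    """Scale the learning rate linearly with the number of workers."""
    return base_lr * world_size


def train(epochs=3000, base_lr=0.01):
    # Imported lazily so this module loads even without TF/Horovod.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(units=1, input_shape=[1])]
    )

    # Wrap the optimizer so gradients are averaged across workers.
    opt = tf.keras.optimizers.SGD(
        learning_rate=scaled_lr(base_lr, hvd.size())
    )
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss="mean_squared_error")

    xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
    ys = np.array([-2.0, 1.0, 4.0, 7.0, 10.0, 13.0], dtype=float)

    # Start every worker from rank 0's initial weights.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(
        xs, ys, epochs=epochs, callbacks=callbacks,
        verbose=1 if hvd.rank() == 0 else 0,
    )

    # Save from one worker only so ranks do not overwrite each other.
    if hvd.rank() == 0:
        model.save_weights("adasd.h5")
```

To use it, call `train()` at the bottom of `main.py` and launch the script with `horovodrun` or `mpirun` as before.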
Also follow the Horovod MPI docs to see how to run the program using the horovodrun command:
https://github.com/horovod/horovod/blob/master/examples/keras/keras_mnist.py
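For reference, the `mpirun` invocation from the issue might translate to `horovodrun` roughly as sketched below. The `-np`, `-H`, and `-p` (SSH port) options are documented `horovodrun` options; the helper `horovodrun_cmd` is hypothetical, and whether this resolves the hang is untested.

```python
import shlex


def horovodrun_cmd(script, hosts, ssh_port=None):
    """Assemble a horovodrun command line.

    `hosts` maps a host name or IP to the number of slots on it.
    """
    num_proc = sum(hosts.values())
    host_spec = ",".join(f"{host}:{slots}" for host, slots in hosts.items())
    cmd = ["horovodrun", "-np", str(num_proc), "-H", host_spec]
    if ssh_port is not None:
        # sshd in the container was moved to a non-default port.
        cmd += ["-p", str(ssh_port)]
    cmd += ["python", script]
    return cmd


cmd = horovodrun_cmd(
    "/data/QuakeMitchell/main.py", {"10.206.74.32": 2}, ssh_port=12345
)
print(shlex.join(cmd))
# horovodrun -np 2 -H 10.206.74.32:2 -p 12345 python /data/QuakeMitchell/main.py
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run` without shell quoting concerns.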