horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai


Horovod stack trace from Signal 7

ajayvohra2005 opened this issue · comments

Environment:

  1. Framework: TensorFlow
  2. Framework version: 2.12.0
  3. Horovod version: 0.28.1
  4. MPI version: 4.1.4-3 (openmpi40-aws)
  5. CUDA version: 11.8
  6. NCCL version: 2.16.5-1+cuda11.8
  7. Python version: 3.10
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 20.04
  11. GCC version: 9.4.0
  12. CMake version: 3.26.0

Checklist:

  1. Did you search issues to find if somebody asked this question before? Yes, I searched the issues.
  2. If your question is about hang, did you read this doc? It is not about a hang.
  3. If your question is about docker, did you read this doc? No, it is not about Docker.
  4. Did you check if your question is answered in the [troubleshooting guide](https://github.com/horovod/horovod/blob/master/docs/troubleshooting.rst)? Yes

Bug report:

Tue Jun 27 22:49:58 2023[1,3]<stderr>:*** Received signal 7 ***
Tue Jun 27 22:49:58 2023[1,3]<stderr>:*** BEGIN MANGLED STACK TRACE ***
Tue Jun 27 22:49:58 2023[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x1793d31)[0x7fbf52976d31]
Tue Jun 27 22:49:58 2023[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x1793d31)[0x7fbe6c106d31]
Tue Jun 27 22:49:58 2023[1,3]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fbf20ce4090]
Tue Jun 27 22:49:58 2023[1,4]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fc007554090]
Tue Jun 27 22:49:58 2023[1,4]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7fc00769cb41]
Tue Jun 27 22:49:58 2023[1,3]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7fbf20e2cb41]
Tue Jun 27 22:49:58 2023[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(+0x224e35)[0x7fbd556c4e35]
Tue Jun 27 22:49:58 2023[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(+0x21b6ab)[0x7fbd556bb6ab]
Tue Jun 27 22:49:58 2023[1,3]<stderr>:/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fbf20c86609]
Tue Jun 27 22:49:58 2023[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(+0x224e35)[0x7fbe3bf34e35]
Tue Jun 27 22:49:58 2023[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(+0x21b6ab)[0x7fbe3bf2b6ab]
Tue Jun 27 22:49:58 2023[1,4]<stderr>:/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fc0074f6609]
Tue Jun 27 22:49:58 2023[1,3]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fbf20dc0133]
Tue Jun 27 22:49:58 2023[1,3]<stderr>:*** END MANGLED STACK TRACE ***
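Signal 7 on Linux x86-64 is `SIGBUS`. The frames above carry only raw `(+0xOFFSET)` values relative to each shared object, so as a sketch (the `parse_frames` helper below is hypothetical, not part of Horovod) the offsets can be extracted programmatically and then fed to `addr2line` on the affected node to symbolize them:

```python
import re

# Matches frames of the form /path/to/lib.so(+0xOFFSET)[0xADDR] from the
# mangled stack trace; offsets in parentheses are relative to the library.
FRAME_RE = re.compile(r"(?P<lib>/\S+?)\(\+(?P<offset>0x[0-9a-f]+)\)\[0x[0-9a-f]+\]")

def parse_frames(lines):
    """Return (library_path, offset) pairs from mangled stack-trace lines."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append((m.group("lib"), m.group("offset")))
    return frames

trace = [
    "Tue Jun 27 22:49:58 2023[1,3]<stderr>:/usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fbf20ce4090]",
]
print(parse_frames(trace))  # → [('/usr/lib/x86_64-linux-gnu/libc.so.6', '0x43090')]
```

On a machine with the same builds installed, running e.g. `addr2line -f -C -e mpi_lib.cpython-310-x86_64-linux-gnu.so 0x224e35 0x21b6ab` should name the Horovod functions involved in the crash.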

Please describe the erroneous behavior you're observing and steps to reproduce it.

The behavior is unpredictable: training can run for multiple epochs before the stack trace shown above appears. The bug does not occur when training on a single EC2 p3dn.24xlarge instance; it only occurs when training across multiple nodes.