horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

TensorFlow 2 examples won't run with more than 1 GPU

laytonjbgmail opened this issue · comments

Environment:

  1. Framework: TensorFlow
  2. Framework version: tensorflow-2.13.0
  3. Horovod version: horovod-0.28.1
  4. MPI version: openmpi-4.1.2
  5. CUDA version: 11.8
  6. NCCL version: nccl-2.12.12.1
  7. Python version: 3.8.12
  8. Spark / PySpark version: none
  9. Ray version: none
  10. OS and version: Ubuntu 22.04
  11. GCC version: 11.3.0
  12. CMake version: 3.25.0

Checklist:

  1. Did you search issues to find if somebody asked this question before? I did - I didn't find anything useful.
  2. If your question is about hang, did you read this doc? NA
  3. If your question is about docker, did you read this doc? NA
  4. Did you check if your question is answered in the troubleshooting guide? Yes - it's not there.

Bug report:
I tried running the tensorflow2 example from GitHub (https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_keras_mnist.py) with 2 GPUs and got a segfault (signal 11). Running with 1 GPU works correctly. The command line:

mpirun -np 2 -H laytonjb-APEXX-T3-04:1,laytonjb-APEXX-T3-04:2 -bind-to none --map-by slot -x NCCL_DEBUG=INFO python3 ./tensorflow2_keras_mnist.py

I tried removing all options except "-np 2" and "-H ...", and it results in the same error.
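To help isolate whether the raw mpirun invocation is a factor, the same two-process run can also be driven through Horovod's own launcher. This is only a sketch of an equivalent command (it assumes `horovodrun` is on the PATH and the script is in the current directory):

```shell
# Launch the same example through Horovod's wrapper instead of raw mpirun;
# -np 2 requests two processes, -H pins both slots to the local host.
horovodrun -np 2 -H laytonjb-APEXX-T3-04:2 python3 ./tensorflow2_keras_mnist.py
```

If this crashes the same way, the launcher options can be ruled out as the cause.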

The output from the command is below:

$ mpirun -np 2 -H laytonjb-APEXX-T3-04:1,laytonjb-APEXX-T3-04:2 -bind-to none --map-by slot -x NCCL_DEBUG=INFO python3 ./tensorflow2_keras_mnist.py
2023-08-01 13:21:14.848313: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 13:21:14.865791: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 13:21:15.299373: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-01 13:21:15.318233: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-01 13:21:15.868042: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.868481: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.871121: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.871471: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.885529: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.885820: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.886033: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.886238: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.899707: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.900097: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.900401: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:15.900698: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.118472: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.118754: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.118960: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.148485: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.148920: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.149305: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.230149: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.230673: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.230872: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.231060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 47372 MB memory: -> device: 1, name: Quadro RTX 8000, pci bus id: 0000:49:00.0, compute capability: 7.5
2023-08-01 13:21:16.261160: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.261401: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.261599: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-08-01 13:21:16.261793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 46815 MB memory: -> device: 0, name: Quadro RTX 8000, pci bus id: 0000:21:00.0, compute capability: 7.5
Epoch 1/24
2023-08-01 13:21:17.876383: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape insequential/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
2023-08-01 13:21:17.904957: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape insequential/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
2023-08-01 13:21:18.014518: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8902
2023-08-01 13:21:18.044255: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8902
[laytonjb-APEXX-T3-04:79057] *** Process received signal ***
[laytonjb-APEXX-T3-04:79057] Signal: Segmentation fault (11)
[laytonjb-APEXX-T3-04:79057] Signal code: Invalid permissions (2)
[laytonjb-APEXX-T3-04:79057] Failing at address: 0x7f411f371800
[laytonjb-APEXX-T3-04:79058] *** Process received signal ***
[laytonjb-APEXX-T3-04:79058] Signal: Segmentation fault (11)
[laytonjb-APEXX-T3-04:79058] Signal code: Invalid permissions (2)
[laytonjb-APEXX-T3-04:79058] Failing at address: 0x7f5a33372218
[laytonjb-APEXX-T3-04:79057] [ 0] [laytonjb-APEXX-T3-04:79058] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f451c842520]
[laytonjb-APEXX-T3-04:79057] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f5e35242520]
[laytonjb-APEXX-T3-04:79058] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a08c1)[0x7f451c9a08c1]
[laytonjb-APEXX-T3-04:79057] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x1a08c1)[0x7f5e353a08c1]
[laytonjb-APEXX-T3-04:79058] [ 2] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/openmpi/mca_btl_vader.so(+0x39e4)[0x7f5ddc00e9e4]
[laytonjb-APEXX-T3-04:79058] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/openmpi/mca_btl_vader.so(+0x39e4)[0x7f44bf6789e4]
[laytonjb-APEXX-T3-04:79057] [ 3] [ 3] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x45)[0x7f5dcfe43885]
[laytonjb-APEXX-T3-04:79058] [ 4] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x112d)[0x7f5dcfe3557d]
[laytonjb-APEXX-T3-04:79058] [ 5] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x45)[0x7f44bf66b885]
[laytonjb-APEXX-T3-04:79057] [ 4] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x112d)[0x7f44bf65d57d]
[laytonjb-APEXX-T3-04:79057] [ 5] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_ring+0x30b)[0x7f5dde571d9b]
[laytonjb-APEXX-T3-04:79058] [ 6] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x4f)[0x7f5dcf9a008f]
[laytonjb-APEXX-T3-04:79058] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_ring+0x30b)[0x7f44cf854d9b]
[laytonjb-APEXX-T3-04:79057] [ 6] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x4f)[0x7f44bf01908f]
[laytonjb-APEXX-T3-04:79057] [ 7] [ 7] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/libmpi.so.40(PMPI_Allreduce+0x131)[0x7f44cf80d421]
[laytonjb-APEXX-T3-04:79057] [ 8] /home/laytonjb/bin/GCC-11.3/OPENMPI-4.1.5/lib/libmpi.so.40(PMPI_Allreduce+0x131)[0x7f5dde52a421]
[laytonjb-APEXX-T3-04:79058] [ 8] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/horovod/tensorflow/mpi_lib.cpython-39-x86_64-linux-gnu.so(_ZN7horovod6common16MPI_GPUAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x283)[0x7f44c5d22ce3]
[laytonjb-APEXX-T3-04:79057] [ 9] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/horovod/tensorflow/mpi_lib.cpython-39-x86_64-linux-gnu.so(_ZN7horovod6common16MPI_GPUAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x283)[0x7f5dde722ce3]
[laytonjb-APEXX-T3-04:79058] [ 9] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/horovod/tensorflow/mpi_lib.cpython-39-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7f44c5ce5d1d]
[laytonjb-APEXX-T3-04:79057] [10] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/horovod/tensorflow/mpi_lib.cpython-39-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7f5dde6e5d1d]
[laytonjb-APEXX-T3-04:79058] [10] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/horovod/tensorflow/mpi_lib.cpython-39-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7f44c5ce61fc]
[laytonjb-APEXX-T3-04:79057] [11] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/horovod/tensorflow/mpi_lib.cpython-39-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7f5dde6e61fc]
[laytonjb-APEXX-T3-04:79058] [11] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/horovod/tensorflow/mpi_lib.cpython-39-x86_64-linux-gnu.so(+0xb4c23)[0x7f44c5cb4c23]
[laytonjb-APEXX-T3-04:79057] [12] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/horovod/tensorflow/mpi_lib.cpython-39-x86_64-linux-gnu.so(+0xb4c23)[0x7f5dde6b4c23]
[laytonjb-APEXX-T3-04:79058] [12] /home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x1ac3e20)[0x7f5e330c3e20]
[laytonjb-APEXX-T3-04:79058] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f5e35294b43]
[laytonjb-APEXX-T3-04:79058] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f5e35326a00]
[laytonjb-APEXX-T3-04:79058] *** End of error message ***
/home/laytonjb/miniconda3/envs/tf/lib/python3.9/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x1ac3e20)[0x7f451a8c3e20]
[laytonjb-APEXX-T3-04:79057] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f451c894b43]
[laytonjb-APEXX-T3-04:79057] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f451c926a00]
[laytonjb-APEXX-T3-04:79057] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node laytonjb-APEXX-T3-04 exited on signal 11 (Segmentation fault).

In fact, every example in the tensorflow2 examples segfaults when run with more than 1 GPU.
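Since the backtrace dies inside `MPI_GPUAllreduce` → `PMPI_Allreduce` → `mca_btl_vader`, it may be worth confirming which collective backends this Horovod build actually contains, and whether the Open MPI build is CUDA-aware; a Horovod built without NCCL falls back to plain MPI for GPU tensors. A possible diagnostic (assumes `horovodrun` and `ompi_info` are on the PATH):

```shell
# List the frameworks and tensor-op backends (NCCL, MPI, Gloo)
# this Horovod wheel was compiled with:
horovodrun --check-build

# Ask Open MPI whether it was built with CUDA support; an MPI
# allreduce on GPU device pointers needs a CUDA-aware build:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```

If `--check-build` shows no NCCL support and `ompi_info` reports `false`, the segfault inside the vader BTL during an allreduce on GPU buffers would be consistent with that combination.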

I hate to make this report long, but here is the output from "conda list" for the particular virtual environment, if that helps.

# packages in environment at /home/laytonjb/miniconda3/envs/tf:
#
# Name                    Version                   Build    Channel

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.4.0 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
blas 1.0 mkl
brotli 1.0.9 h5eee18b_7
brotli-bin 1.0.9 h5eee18b_7
ca-certificates 2023.7.22 hbcca054_0 conda-forge
cachetools 5.3.1 pypi_0 pypi
certifi 2023.7.22 pypi_0 pypi
charset-normalizer 3.2.0 pypi_0 pypi
cloudpickle 2.2.1 pypi_0 pypi
cmake 3.25.0 pypi_0 pypi
contourpy 1.0.5 py39hdb19cb5_0
cudatoolkit 11.8.0 h6a678d5_0
cycler 0.11.0 pyhd3eb1b0_0
dbus 1.13.18 hb2f20db_0
expat 2.4.9 h6a678d5_0
filelock 3.9.0 pypi_0 pypi
flatbuffers 23.5.26 pypi_0 pypi
fontconfig 2.14.1 h52c9d5c_1
fonttools 4.25.0 pyhd3eb1b0_0
freetype 2.12.1 h4a9f257_0
gast 0.4.0 pypi_0 pypi
giflib 5.2.1 h5eee18b_3
glib 2.69.1 he621ea3_2
google-auth 2.22.0 pypi_0 pypi
google-auth-oauthlib 1.0.0 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.56.2 pypi_0 pypi
gst-plugins-base 1.14.1 h6a678d5_1
gstreamer 1.14.1 h5eee18b_1
h5py 3.9.0 pypi_0 pypi
horovod 0.28.1 pypi_0 pypi
icu 58.2 he6710b0_3
idna 3.4 pypi_0 pypi
importlib-metadata 6.8.0 pypi_0 pypi
importlib_resources 5.2.0 pyhd3eb1b0_1
intel-openmp 2023.1.0 hdb19cb5_46305
jinja2 3.1.2 pypi_0 pypi
jpeg 9e h5eee18b_1
keras 2.13.1 pypi_0 pypi
kiwisolver 1.4.4 py39h6a678d5_0
krb5 1.20.1 h143b758_1
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libbrotlicommon 1.0.9 h5eee18b_7
libbrotlidec 1.0.9 h5eee18b_7
libbrotlienc 1.0.9 h5eee18b_7
libclang 16.0.6 pypi_0 pypi
libdeflate 1.17 h5eee18b_0
libedit 3.1.20221030 h5eee18b_0
libevent 2.1.12 hdbd6064_1
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libllvm10 10.0.1 hbcb73fb_5
libpng 1.6.39 h5eee18b_0
libpq 12.15 hdbd6064_1
libstdcxx-ng 12.3.0 h0f45ef3_0 conda-forge
libtiff 4.5.0 h6a678d5_2
libuuid 1.41.5 h5eee18b_0
libwebp 1.2.4 h11a3e52_1
libwebp-base 1.2.4 h5eee18b_1
libxcb 1.15 h7f8727e_0
libxkbcommon 1.0.1 hfa300c1_0
libxml2 2.9.14 h74e7548_0
libxslt 1.1.35 h4e12654_0
lit 15.0.7 pypi_0 pypi
lz4-c 1.9.4 h6a678d5_0
markdown 3.4.3 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
matplotlib 3.7.1 py39h06a4308_1
matplotlib-base 3.7.1 py39h417a72b_1
mkl 2023.1.0 h6d00ec8_46342
mkl-service 2.4.0 py39h5eee18b_1
mkl_fft 1.3.6 py39h417a72b_1
mkl_random 1.2.2 py39h417a72b_1
mpmath 1.2.1 pypi_0 pypi
munkres 1.1.4 py_0
nccl 2.12.12.1 h0800d71_0 conda-forge
ncurses 6.4 h6a678d5_0
networkx 3.0 pypi_0 pypi
nspr 4.35 h6a678d5_0
nss 3.89.1 h6a678d5_0
numpy 1.24.3 pypi_0 pypi
numpy-base 1.25.0 py39hb5e798b_0
nvidia-cublas-cu11 2022.4.8 pypi_0 pypi
nvidia-cublas-cu117 11.10.1.25 pypi_0 pypi
nvidia-cublas-cu12 12.2.4.5 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.2.128 pypi_0 pypi
nvidia-cudnn-cu11 8.6.0.163 pypi_0 pypi
nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
oauthlib 3.2.2 pypi_0 pypi
openssl 3.0.9 h7f8727e_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 23.1 pypi_0 pypi
pcre 8.45 h295c915_0
pillow 9.3.0 pypi_0 pypi
pip 23.2.1 pypi_0 pypi
ply 3.11 py39h06a4308_0
protobuf 4.23.4 pypi_0 pypi
psutil 5.9.5 pypi_0 pypi
pyasn1 0.5.0 pypi_0 pypi
pyasn1-modules 0.3.0 pypi_0 pypi
pyparsing 3.0.9 py39h06a4308_0
pyqt 5.15.7 py39h6a678d5_1
pyqt5-sip 12.11.0 py39h6a678d5_1
python 3.9.17 h955ad1f_0
python-dateutil 2.8.2 pyhd3eb1b0_0
pyyaml 6.0.1 pypi_0 pypi
qt-main 5.15.2 h327a75a_7
qt-webengine 5.15.9 hd2b0992_4
qtwebkit 5.212 h4eab89a_4
readline 8.2 h5eee18b_0
requests 2.31.0 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rsa 4.9 pypi_0 pypi
setuptools 67.8.0 py39h06a4308_0
sip 6.6.2 py39h6a678d5_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.41.2 h5eee18b_0
sympy 1.11.1 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
tensorboard 2.13.0 pypi_0 pypi
tensorboard-data-server 0.7.1 pypi_0 pypi
tensorflow 2.13.0 pypi_0 pypi
tensorflow-addons 0.21.0 pypi_0 pypi
tensorflow-estimator 2.13.0 pypi_0 pypi
tensorflow-io-gcs-filesystem 0.32.0 pypi_0 pypi
tensorrt 8.6.1 pypi_0 pypi
tensorrt-bindings 8.6.1 pypi_0 pypi
tensorrt-libs 8.6.1 pypi_0 pypi
termcolor 2.3.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
toml 0.10.2 pyhd3eb1b0_0
torch 2.0.1+cu118 pypi_0 pypi
torchaudio 2.0.2+cu118 pypi_0 pypi
torchvision 0.15.2+cu118 pypi_0 pypi
tornado 6.2 py39h5eee18b_0
triton 2.0.0 pypi_0 pypi
typeguard 2.13.3 pypi_0 pypi
typing-extensions 4.5.0 pypi_0 pypi
tzdata 2023c h04d1e81_0
urllib3 1.26.16 pypi_0 pypi
werkzeug 2.3.6 pypi_0 pypi
wheel 0.38.4 py39h06a4308_0
wrapt 1.15.0 pypi_0 pypi
xz 5.4.2 h5eee18b_0
zipp 3.16.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_0