horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

Horovod with TensorFlow crashed

mythZhu opened this issue

Environment:

  1. Framework: TensorFlow
  2. Framework version: 1.15.3
  3. Horovod version: 0.28.1
  4. MPI version: 4.0.1
  5. CUDA version:
  6. NCCL version:
  7. Python version: 3.6.9
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 18.04.6 LTS
  11. GCC version: 7.5.0
  12. CMake version: 3.13.0

Checklist:

  1. Did you search issues to find if somebody asked this question before? Yes
  2. If your question is about hang, did you read this doc? Yes
  3. If your question is about docker, did you read this doc? Yes
  4. Did you check if your question is answered in the troubleshooting guide? Yes

Bug report:
I ran CPU training with Horovod and TensorFlow, launched with Open MPI. Horovod always crashed with one of the following errors when some workers had no data to process and called hvd.join() directly to wait for the other workers.

munmap_chunk(): invalid pointer
[node-0:410078] *** Process received signal ***
[node-0:410078] Signal: Aborted (6)
[node-0:410078] Signal code:  (-6)
[node-0:410078] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f77f0cebf10]
[node-0:410078] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f77f0cebe87]
[node-0:410078] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f77f0ced7f1]
[node-0:410078] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89837)[0x7f77f0d36837]
[node-0:410078] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x908ba)[0x7f77f0d3d8ba]
[node-0:410078] [ 5] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x58c)[0x7f77f0d44e9c]
[node-0:410078] [ 6] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN9__gnu_cxx13new_allocatorIN7horovod6common7RequestEE10deallocateEPS3_m+0x20)[0x7f77d197de04]
[node-0:410078] [ 7] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt16allocator_traitsISaIN7horovod6common7RequestEEE10deallocateERS3_PS2_m+0x2b)[0x7f77d197bac8]
[node-0:410078] [ 8] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt12_Vector_baseIN7horovod6common7RequestESaIS2_EE13_M_deallocateEPS2_m+0x32)[0x7f77d1978de0]
[node-0:410078] [ 9] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt12_Vector_baseIN7horovod6common7RequestESaIS2_EED2Ev+0x52)[0x7f77d1977d66]
[node-0:410078] [10] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6vectorIN7horovod6common7RequestESaIS2_EED1Ev+0x41)[0x7f77d1974e7b]
[node-0:410078] [11] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaISA_EEED1Ev+0x1c)[0x7f77d197f9ca]
[node-0:410078] [12] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN9__gnu_cxx13new_allocatorISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaISC_EEEE7destroyISF_EEvPT_+0x1c)[0x7f77d197f9f6]
[node-0:410078] [13] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt16allocator_traitsISaISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaISB_EEEEE7destroyISE_EEvRSF_PT_+0x23)[0x7f77d197e4ee]
[node-0:410078] [14] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt8__detail16_Hashtable_allocISaINS_10_Hash_nodeISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaISD_EEELb1EEEEE18_M_deallocate_nodeEPSH_+0x6c)[0x7f77d197c34c]
[node-0:410078] [15] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_St6vectorIN7horovod6common7RequestESaISB_EEESaISE_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSG_18_Mod_range_hashingENSG_20_Default_ranged_hashENSG_20_Prime_rehash_policyENSG_17_Hashtable_traitsILb1ELb0ELb1EEEE8_M_eraseEmPNSG_15_Hash_node_baseEPNSG_10_Hash_nodeISE_Lb1EEE+0x12b)[0x7f77d197da9f]
[node-0:410078] [16] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_St6vectorIN7horovod6common7RequestESaISB_EEESaISE_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSG_18_Mod_range_hashingENSG_20_Default_ranged_hashENSG_20_Prime_rehash_policyENSG_17_Hashtable_traitsILb1ELb0ELb1EEEE5eraseENSG_20_Node_const_iteratorISE_Lb0ELb1EEE+0x62)[0x7f77d197b67e]
[node-0:410078] [17] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt10_HashtableINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_St6vectorIN7horovod6common7RequestESaISB_EEESaISE_ENSt8__detail10_Select1stESt8equal_toIS5_ESt4hashIS5_ENSG_18_Mod_range_hashingENSG_20_Default_ranged_hashENSG_20_Prime_rehash_policyENSG_17_Hashtable_traitsILb1ELb0ELb1EEEE5eraseENSG_14_Node_iteratorISE_Lb0ELb1EEE+0x45)[0x7f77d1978609]
[node-0:410078] [18] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorIN7horovod6common7RequestESaIS9_EESt4hashIS5_ESt8equal_toIS5_ESaISt4pairIKS5_SB_EEE5eraseENSt8__detail14_Node_iteratorISI_Lb0ELb1EEE+0x23)[0x7f77d1975711]
[node-0:410078] [19] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller17ConstructResponseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x1cf0)[0x7f77d1970eb4]
[node-0:410078] [20] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller19ComputeResponseListEbRNS0_18HorovodGlobalStateERNS0_10ProcessSetE+0x1c2f)[0x7f77d196e85d]
[node-0:410078] [21] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x10365e)[0x7f77d199d65e]
[node-0:410078] [22] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x102e54)[0x7f77d199ce54]
[node-0:410078] [23] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZSt13__invoke_implIvPFvRN7horovod6common18HorovodGlobalStateEEJSt17reference_wrapperIS2_EEET_St14__invoke_otherOT0_DpOT1_+0x39)[0x7f77d19ae578]
[node-0:410078] [24] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZSt8__invokeIPFvRN7horovod6common18HorovodGlobalStateEEJSt17reference_wrapperIS2_EEENSt15__invoke_resultIT_JDpT0_EE4typeEOS9_DpOSA_+0x4e)[0x7f77d19a9d68]
[node-0:410078] [25] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS4_EEEE9_M_invokeIJLm0ELm1EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE+0x43)[0x7f77d19c39f9]
[node-0:410078] [26] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS4_EEEEclEv+0x2c)[0x7f77d19c399a]
[node-0:410078] [27] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS5_EEEEEE6_M_runEv+0x1c)[0x7f77d19c391e]
[node-0:410078] [28] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44c0)[0x7f73eb5a54c0]
[node-0:410078] [29] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f77f0a956db]
[node-0:410078] *** End of error message ***

OR

free(): invalid next size (normal)
[node-0:391803] *** Process received signal ***
[node-0:391803] Signal: Aborted (6)
[node-0:391803] Signal code:  (-6)
[node-0:391803] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7fd1dc5a7f10]
[node-0:391803] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fd1dc5a7e87]
[node-0:391803] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fd1dc5a97f1]
[node-0:391803] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89837)[0x7fd1dc5f2837]
[node-0:391803] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x908ba)[0x7fd1dc5f98ba]
[node-0:391803] [ 5] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x76d)[0x7fd1dc60107d]
[node-0:391803] [ 6] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd2550)[0x7fd1bd228550]
[node-0:391803] [ 7] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd23ee)[0x7fd1bd2283ee]
[node-0:391803] [ 8] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd2204)[0x7fd1bd228204]
[node-0:391803] [ 9] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd1df7)[0x7fd1bd227df7]
[node-0:391803] [10] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xd189b)[0x7fd1bd22789b]
[node-0:391803] [11] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller17ConstructResponseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x1d20)[0x7fd1bd22cee4]
[node-0:391803] [12] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common10Controller19ComputeResponseListEbRNS0_18HorovodGlobalStateERNS0_10ProcessSetE+0x1c2f)[0x7fd1bd22a85d]
[node-0:391803] [13] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x10365e)[0x7fd1bd25965e]
[node-0:391803] [14] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0x102e54)[0x7fd1bd258e54]
[node-0:391803] [15] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZSt13__invoke_implIvPFvRN7horovod6common18HorovodGlobalStateEEJSt17reference_wrapperIS2_EEET_St14__invoke_otherOT0_DpOT1_+0x39)[0x7fd1bd26a578]
[node-0:391803] [16] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZSt8__invokeIPFvRN7horovod6common18HorovodGlobalStateEEJSt17reference_wrapperIS2_EEENSt15__invoke_resultIT_JDpT0_EE4typeEOS9_DpOSA_+0x4e)[0x7fd1bd265d68]
[node-0:391803] [17] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS4_EEEE9_M_invokeIJLm0ELm1EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE+0x43)[0x7fd1bd27f9f9]
[node-0:391803] [18] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS4_EEEEclEv+0x2c)[0x7fd1bd27f99a]
[node-0:391803] [19] /opt/.sin/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvRN7horovod6common18HorovodGlobalStateEESt17reference_wrapperIS5_EEEEEE6_M_runEv+0x1c)[0x7fd1bd27f91e]
[node-0:391803] [20] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44c0)[0x7fcdd6e614c0]
[node-0:391803] [21] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fd1dc3516db]
[node-0:391803] [22] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fd1dc68a61f]
[node-0:391803] *** End of error message ***
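
For reference, the pattern described in the bug report (some workers get no data and go straight to hvd.join()) can be sketched roughly as below. This is not the actual training script. The helper shard_for_rank, the toy three-item dataset, the single-element allreduce, and the use of TF 1.x graph mode with a session are all assumptions made for illustration. The Horovod part only runs when the script is started under Open MPI's mpirun, which is detected through the OMPI_COMM_WORLD_SIZE environment variable.

```python
import os


def shard_for_rank(items, rank, size):
    """Hypothetical helper: contiguous block sharding of `items` across `size` workers.

    When there are fewer items than workers, the highest ranks receive an
    empty shard, which is the situation described in the report.
    """
    per_rank = -(-len(items) // size)  # ceiling division
    return items[rank * per_rank:(rank + 1) * per_rank]


def main():
    # Imported here so the sharding helper can be used without Horovod installed.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Toy dataset (assumption): 3 items, so with 4 workers rank 3 gets nothing.
    batches = shard_for_rank([[1.0], [2.0], [3.0]], hvd.rank(), hvd.size())

    # TF 1.x graph mode: build the ops once, then run them in a session.
    value = tf.compat.v1.placeholder(tf.float32, shape=[1])
    reduced = hvd.allreduce(value)
    join_op = hvd.join()

    with tf.compat.v1.Session() as sess:
        for batch in batches:
            sess.run(reduced, feed_dict={value: batch})
        # Workers with an empty shard reach this line without any allreduce.
        sess.run(join_op)


# Open MPI's mpirun sets OMPI_COMM_WORLD_SIZE for every process it launches.
if "OMPI_COMM_WORLD_SIZE" in os.environ:
    main()
```

Saved as, say, repro.py (a hypothetical file name), a sketch like this would be launched with something like `mpirun -np 4 python repro.py`. Whether it triggers the same abort has not been verified.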

What's wrong? Thanks for your help in advance!