ctcyang / incubator-mxnet

Segmentation fault in horovod.allreduce() and horovod.broadcast_parameters() functions

TonyTangYu opened this issue.

Description

I installed MXNet and Horovod by building them from source (from here). When I run a simple program to test the Horovod environment, it crashes with a "segmentation fault" error.

Environment info (Required)

----------Python Info----------
('Version      :', '2.7.13')
('Compiler     :', 'GCC 4.4.7 20120313 (Red Hat 4.4.7-1)')
('Build        :', ('default', 'Dec 20 2016 23:09:15'))
('Arch         :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version      :', '9.0.1')
('Directory    :', '/home/anaconda2/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version      :', '1.5.0')
('Directory    :', '/home/horovod/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform     :', 'Linux-3.10.0-327.el7.x86_64-x86_64-with-redhat-7.2-Maipo')
('system       :', 'Linux')
('release      :', '3.10.0-327.el7.x86_64')
('version      :', '#1 SMP Thu Oct 29 17:29:29 EDT 2015')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6

Package used (Python/R/Scala/Julia):
I'm using Python.

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc

MXNet commit hash:
Not provided.

Error Message:

(Paste the complete error message, including stack trace.)
No error message or stack trace is printed; the process just exits with "Segmentation fault".
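Since the crash happens in native code, Python prints no traceback by default. Python's standard `faulthandler` module can dump the Python-level traceback when a SIGSEGV arrives: adding `import faulthandler; faulthandler.enable()` at the top of test.py (or running `python -X faulthandler test.py`) should show which line was executing. `faulthandler` is in the standard library from Python 3.3; for the Python 2.7 interpreter listed above, a backport of the same name is published on PyPI (worth double-checking). The sketch below, which assumes Linux and Python 3.7+, demonstrates the output in a child process; `os.kill(..., SIGSEGV)` is only a stand-in for the crashing `hvd.allreduce(x)` call, and `run_with_faulthandler` is a hypothetical helper.

```python
import signal
import subprocess
import sys

# Child script: faulthandler is enabled first, then a native crash occurs.
# os.kill(..., SIGSEGV) is a stand-in for the crashing hvd.allreduce(x) call.
CHILD_SOURCE = (
    "import faulthandler, os, signal\n"
    "faulthandler.enable()\n"
    "def suspect_call():\n"
    "    os.kill(os.getpid(), signal.SIGSEGV)\n"
    "suspect_call()\n"
)


def run_with_faulthandler(source):
    """Run `source` in a fresh interpreter and return (returncode, stderr)."""
    proc = subprocess.run([sys.executable, "-c", source],
                          capture_output=True, text=True)
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = run_with_faulthandler(CHILD_SOURCE)
    print("return code:", code)
    # stderr holds "Fatal Python error: Segmentation fault" and the traceback
    print(err)
```

The traceback lists `suspect_call` as the innermost frame; in the real script it would point at the Horovod call that crashed.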

Minimum reproducible example

(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
I wrote a simple Horovod test program, test.py, shown below:

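from __future__ import print_function  # so print() behaves the same on Python 2.7

import numpy as np
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()                # initialise Horovod
r = int(hvd.rank())       # rank of this process
print("r:", r)
x = mx.nd.ones((2, 3, 4), dtype=np.float16)
print("x:", x)
y = hvd.allreduce(x)      # the segmentation fault happens in this call
print("y:", y)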
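For reference, assuming `horovod.mxnet.allreduce` averages by default (`average=True` in the Horovod releases of this period) and every rank passes the same tensor, a working run should print `y` as a 2x3x4 array of ones, identical to `x`, whatever the number of processes. With plain `python test.py` there is only one process, so `y` should equal `x` trivially. The stdlib-only sketch below models that expected result while ignoring the float16 dtype; `simulated_allreduce` is a hypothetical helper, not Horovod's implementation.

```python
from typing import List, Sequence


def simulated_allreduce(per_rank: Sequence[Sequence[float]],
                        average: bool = True) -> List[float]:
    """Element-wise sum of one flat tensor per rank, optionally divided by
    the number of ranks. Models the expected result only; NOT Horovod code."""
    size = len(per_rank)
    if size == 0:
        raise ValueError("need at least one rank")
    length = len(per_rank[0])
    if any(len(t) != length for t in per_rank):
        raise ValueError("all ranks must contribute tensors of the same shape")
    totals = [sum(values) for values in zip(*per_rank)]
    return [v / size for v in totals] if average else totals


if __name__ == "__main__":
    # test.py creates mx.nd.ones((2, 3, 4)) on every rank: 24 elements.
    ones = [1.0] * (2 * 3 * 4)
    print(simulated_allreduce([ones] * 4)[:3])                 # [1.0, 1.0, 1.0]
    print(simulated_allreduce([ones] * 4, average=False)[:3])  # [4.0, 4.0, 4.0]
```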

Steps to reproduce

(Paste the commands you ran that produced the error.)
I only ran a single command in the terminal: python test.py

What have you tried to solve it?

I narrowed down where the segmentation fault occurs: it is triggered by the allreduce call. The broadcast_parameters function also causes a segmentation fault.
Could someone help me fix it? Thanks in advance!
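One way to make this localisation repeatable is to run cumulative prefixes of the script, each in its own interpreter, and report the first one whose process is killed by SIGSEGV. The sketch below is a hypothetical helper (`first_crashing_step` and the step list are illustrations, not part of MXNet or Horovod) and assumes Linux and Python 3.7+. The demo steps use `os.kill(..., SIGSEGV)` as a stand-in for the crashing call; for this issue the steps would be test.py up to `hvd.init()`, then up to `hvd.allreduce(x)`, with a separate script for `broadcast_parameters`.

```python
import signal
import subprocess
import sys

# Hypothetical demo steps. Each script is cumulative (it repeats the earlier
# steps) so the suspect call runs in the same state as in the full script.
# os.kill(..., SIGSEGV) stands in for the crashing Horovod call.
DEMO_STEPS = [
    ("imports", "import os, signal"),
    ("setup", "import os, signal\nx = [1.0] * 24"),
    ("suspect call", "import os, signal\nx = [1.0] * 24\n"
                     "os.kill(os.getpid(), signal.SIGSEGV)"),
]


def first_crashing_step(steps):
    """Run each (name, source) pair in a fresh interpreter and return
    (index, name) of the first one killed by SIGSEGV, or None."""
    for index, (name, source) in enumerate(steps):
        proc = subprocess.run([sys.executable, "-c", source],
                              capture_output=True, text=True)
        # A child killed by a signal reports the negated signal number.
        if proc.returncode == -signal.SIGSEGV:
            return index, name
    return None


if __name__ == "__main__":
    print(first_crashing_step(DEMO_STEPS))  # (2, 'suspect call')
```

Running each step in a separate process means a crash in one step cannot hide the result of the others.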