Fix model_save h5py error
Smarker opened this issue
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 105, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 98, in h5py.h5f.create
IOError: Unable to create file (file signature not found)
The error was caused by multiple workers writing to the same h5py model file at the same time and corrupting it. To fix the issue, I guarded the save with a check, if hvd.rank() == 0:, so that only the worker with rank 0 writes the model file.
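For illustration, here is a minimal sketch of the rank-0 guard described above. It is not the actual patch: save_if_rank_zero and fake_save are hypothetical helper names, and the worker ranks are simulated with a loop so the snippet runs without Horovod installed. In a real job each process would pass hvd.rank() (after hvd.init()) and the save function would be the model's own save call.

```python
import os
import tempfile


def save_if_rank_zero(rank, save_fn, path):
    """Call save_fn(path) only on the worker whose rank is 0.

    Returns True if this worker performed the save, False otherwise.
    Hypothetical helper, used here only to illustrate the guard.
    """
    if rank != 0:
        return False
    save_fn(path)
    return True


def fake_save(path):
    # Stand-in for the real model save call; writes a placeholder file.
    with open(path, "w") as f:
        f.write("model weights placeholder")


# Simulate four workers that share one output path. In a real Horovod job,
# each process would pass hvd.rank() instead of a loop index.
out_path = os.path.join(tempfile.mkdtemp(), "model.h5")
writers = [r for r in range(4) if save_if_rank_zero(r, fake_save, out_path)]
print(writers)  # ranks that actually wrote the file
```

Because only one process ever opens the file for writing, the other workers cannot truncate or overwrite it while a save is in progress. In the training script itself, the same idea reduces to wrapping the existing save call in the if hvd.rank() == 0: check.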