Smarker / batchai-benchmark

Distributed training with Batch AI

fix model_save h5py error

Smarker opened this issue

  File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 105, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 98, in h5py.h5f.create
IOError: Unable to create file (file signature not found)

The error was caused by multiple workers writing to the same h5py model file at the same time, which corrupted it. To fix the issue, I added an if hvd.rank() == 0: check so that only the worker with rank 0 saves the model file.
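A minimal sketch of the rank-0 guard described above, not the exact code from this repo. To keep it runnable without Horovod or Keras installed, the rank is passed in as an argument and fake_save is a hypothetical stand-in for the real save call. In an actual training script the rank would presumably come from hvd.rank() and the save function would be something like model.save.

```python
import os
import tempfile


def save_model_on_rank_zero(rank, save_fn, path):
    """Call save_fn(path) only on the worker whose rank is 0.

    Returns True if this worker performed the save, False otherwise.
    """
    if rank == 0:
        save_fn(path)
        return True
    return False


def fake_save(path):
    # Hypothetical stand-in for model.save(path); writes a placeholder file.
    with open(path, "w") as f:
        f.write("model weights placeholder")


# Simulate four workers that all share one output path.
out_path = os.path.join(tempfile.mkdtemp(), "model.h5")
results = [save_model_on_rank_zero(r, fake_save, out_path) for r in range(4)]
print(results)  # [True, False, False, False]
```

Only one worker ever opens the file for writing, so the other workers cannot truncate or overwrite it mid-save.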