Cache is not working with Horovod
geekypathak21 opened this issue
Himanshu Pathak commented
When I try to create a cache with Horovod, I get the error below. I am running on 4 GPUs.
[1,1]<stderr>: return fn(*args)
[1,1]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
[1,1]<stderr>: target_list, run_metadata)
[1,1]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
[1,1]<stderr>: run_metadata)
[1,1]<stderr>:tensorflow.python.framework.errors_impl.AlreadyExistsError: 2 root error(s) found.
[1,1]<stderr>: (0) Already exists: There appears to be a concurrent caching iterator running - cache lockfile already exists ('feature_cache_0.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator. Lockfile contents: Created at: 1629874223
[1,1]<stderr>: [[{{node IteratorGetNext}}]]
[1,1]<stderr>: [[IteratorGetNext/_41]]
[1,1]<stderr>: (1) Already exists: There appears to be a concurrent caching iterator running - cache lockfile already exists ('feature_cache_0.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator. Lockfile contents: Created at: 1629874223
[1,1]<stderr>: [[{{node IteratorGetNext}}]]
[1,1]<stderr>:0 successful operations.
[1,1]<stderr>:0 derived errors ignored.
[1,1]<stderr>:
I think this is because we are running 4 processes with MPI and every process is trying to create its own cache under the same prefix, so they collide on the same lockfile ('feature_cache_0.lockfile').
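If that diagnosis is right, one possible workaround is to give each process its own cache prefix so that no two workers share a lockfile. The snippet below is only a minimal sketch of that idea: `per_rank_cache_path` is a hypothetical helper, not part of TensorFlow or Horovod, and the commented usage assumes the rank comes from `hvd.rank()` and that the dataset is cached with `tf.data.Dataset.cache(filename)`. Whether this matches the change discussed in #3693 is not stated here.

```python
import os


def per_rank_cache_path(prefix: str, rank: int) -> str:
    """Return a cache file prefix that is unique to one worker process.

    Hypothetical helper: appends the worker rank to the file name so that
    concurrent processes do not create the same cache lockfile.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    directory, name = os.path.split(prefix)
    return os.path.join(directory, f"{name}_rank{rank}")


# Assumed usage inside a Horovod training script (not executed here):
#   import horovod.tensorflow as hvd
#   hvd.init()
#   dataset = dataset.cache(per_rank_cache_path("feature_cache", hvd.rank()))
```

The trade-off is that every worker writes its own copy of the cache, so disk usage grows with the number of processes.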
Francis Tyers commented
See #3693