mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

Cache is not working with Horovod

geekypathak21 opened this issue

When I try to create the cache while training with Horovod, I get the following errors. I am running on 4 GPUs.

[1,1]<stderr>:    return fn(*args)
[1,1]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
[1,1]<stderr>:    target_list, run_metadata)
[1,1]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
[1,1]<stderr>:    run_metadata)
[1,1]<stderr>:tensorflow.python.framework.errors_impl.AlreadyExistsError: 2 root error(s) found.
[1,1]<stderr>:  (0) Already exists: There appears to be a concurrent caching iterator running - cache lockfile already exists ('feature_cache_0.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator. Lockfile contents: Created at: 1629874223
[1,1]<stderr>:	 [[{{node IteratorGetNext}}]]
[1,1]<stderr>:	 [[IteratorGetNext/_41]]
[1,1]<stderr>:  (1) Already exists: There appears to be a concurrent caching iterator running - cache lockfile already exists ('feature_cache_0.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator. Lockfile contents: Created at: 1629874223
[1,1]<stderr>:	 [[{{node IteratorGetNext}}]]
[1,1]<stderr>:0 successful operations.
[1,1]<stderr>:0 derived errors ignored.
[1,1]<stderr>: 

I think this is because we are running 4 processes with MPI, and every process tries to create its own cache at the same path, so they collide on the same lockfile.
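To illustrate the suspected cause, here is a minimal stdlib-only Python sketch. It is not DeepSpeech or TensorFlow code. It models the cache lockfile as a file created with `O_CREAT | O_EXCL`, so only the first creator succeeds. When four simulated workers share one cache prefix, only the first one gets the lock, which mirrors the `AlreadyExistsError` in the log. When each worker gets its own prefix, there is no collision. The helpers `try_acquire_lock`, `per_rank_cache_path` and `simulate` are hypothetical. The workers also run sequentially here rather than as real MPI processes.

```python
import os
import tempfile


def try_acquire_lock(lock_path):
    """Atomically create lock_path; return True only if this call created it.

    O_CREAT | O_EXCL makes the open fail when the file already exists,
    i.e. "first writer wins", like the cache lockfile in the error above.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def per_rank_cache_path(base_path, rank):
    """Hypothetical helper: derive a distinct cache prefix for each worker."""
    return f"{base_path}.rank{rank}"


def simulate(num_workers, base_path, per_rank):
    """Return, per simulated worker, whether it managed to take the lock."""
    results = []
    for rank in range(num_workers):
        prefix = per_rank_cache_path(base_path, rank) if per_rank else base_path
        # Lockfile name modeled on 'feature_cache_0.lockfile' from the log.
        results.append(try_acquire_lock(prefix + "_0.lockfile"))
    return results


with tempfile.TemporaryDirectory() as tmp:
    # All 4 workers use the same prefix: only the first gets the lock.
    shared = simulate(4, os.path.join(tmp, "feature_cache"), per_rank=False)
    # Each worker uses its own prefix: every worker gets a lock.
    split = simulate(4, os.path.join(tmp, "split_cache"), per_rank=True)

print(shared)  # [True, False, False, False]
print(split)   # [True, True, True, True]
```

In an actual Horovod run, the rank would presumably come from `hvd.rank()` (`horovod.tensorflow`), and the per-rank path would be what gets passed to `tf.data.Dataset.cache`. The trade-off is potentially one cache copy per process. Another option is to let a single rank build the cache before the others read it, which this sketch does not cover.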