Training gets stuck after some epochs when using Tensorflow with multiprocessing
n-Guard opened this issue
Describe the bug
I'm using Keras/Tensorflow and the training stalls indefinitely after some epochs when I enable multiprocessing.
It happens only when I use LSTM or TimeDistributed layers; Dense and Conv layers alone don't seem to have this problem.
Without ClearML everything works fine.
To reproduce
Start a training with Tensorflow and multiprocessing enabled.
Choose a model with LSTM and/or TimeDistributed layers.
I provided a script; the bug usually appears within the first 100 epochs:
https://gist.github.com/n-Guard/0f5d568cfedb3a22bfa56785e82961ad
Expected behaviour
The training should continue without getting stuck.
Environment
- Server type: self hosted
- ClearML SDK Version:
clearml-agent==1.7.0
clearml==1.14.4
- ClearML Server Version:
1.14.1
- Tensorflow Version:
2.15.0
- Python Version:
3.11
- OS: Linux
Hi @n-Guard! We managed to reproduce this, but it is not clear why it happens. In the meantime, you could try calling the following snippet at the very beginning of your script:
try:
    import multiprocessing
    multiprocessing.set_start_method("spawn")
except Exception:
    pass
What it does: it makes Python use spawn instead of fork when creating new processes, so the state of locks, queues, etc. held by the parent is not copied into the child processes.
Not 100% sure it will help.
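For reference, a minimal sketch showing how to verify that the start method actually took effect. The `force=True` flag is an assumption on my part: it is only needed if something (e.g. an imported library) has already set a start method, in which case the plain `set_start_method("spawn")` above would raise.

```python
import multiprocessing

# Request the "spawn" start method; force=True overrides a method that
# may already have been set earlier in the process.
multiprocessing.set_start_method("spawn", force=True)

# With spawn, child processes start from a fresh interpreter instead of
# inheriting the parent's memory (including held locks and queue state).
print(multiprocessing.get_start_method())  # → spawn
```

Note that `set_start_method` should be called exactly once, as early as possible, and before any pools, queues, or worker processes are created.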