allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

Training gets stuck after some epochs when using Tensorflow with multiprocessing

n-Guard opened this issue · comments

Describe the bug

I'm using Keras/Tensorflow, and training stalls indefinitely after some epochs when I enable multiprocessing.
It happens only when I use LSTM or TimeDistributed layers; Dense and Conv layers alone don't seem to trigger it.
Without ClearML, everything works fine.

To reproduce

Start a training with Tensorflow and multiprocessing enabled.
Choose a model with LSTM and/or TimeDistributed layers.

I provided a script; the bug usually occurs within the first 100 epochs:
https://gist.github.com/n-Guard/0f5d568cfedb3a22bfa56785e82961ad
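For reference, the kind of setup described above can be sketched as follows. This is an illustrative reconstruction, not the actual gist: the model shapes, the RandomSequence generator, and all hyperparameters are made up. Note that in the real reproduction a clearml Task.init() call precedes the training; it is omitted here.

```python
import numpy as np
import tensorflow as tf

# Illustrative data generator; Model.fit only uses worker processes
# when the input is a keras.utils.Sequence.
class RandomSequence(tf.keras.utils.Sequence):
    def __len__(self):
        return 8  # batches per epoch

    def __getitem__(self, idx):
        x = np.random.rand(4, 10, 3).astype("float32")  # (batch, time, features)
        y = np.random.rand(4, 10, 1).astype("float32")
        return x, y

# A model with the layer types that trigger the stall.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, return_sequences=True, input_shape=(10, 3)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])
model.compile(optimizer="adam", loss="mse")

# With ClearML initialized, a run like this reportedly hangs after some
# epochs; without ClearML it completes normally.
model.fit(RandomSequence(), epochs=2, workers=2,
          use_multiprocessing=True, verbose=0)
```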

Expected behaviour

The training should continue without getting stuck.

Environment

  • Server type: self hosted
  • ClearML SDK Version: clearml-agent==1.7.0 clearml==1.14.4
  • ClearML Server Version: 1.14.1
  • Tensorflow Version: 2.15.0
  • Python Version: 3.11
  • OS: Linux

Hi @n-Guard! We managed to reproduce this, though it is not yet clear why it happens. In the meantime, you could try calling the following snippet at the very beginning of your script:

try:
    import multiprocessing
    # Must be called before any other multiprocessing machinery is set up
    multiprocessing.set_start_method("spawn")
except Exception:
    # set_start_method raises RuntimeError if a start method was already set
    pass

What it does: it makes Python use spawn instead of fork when creating new processes, so the state of locks, queues, etc. is not copied into child processes.
Not 100% sure it will help.
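To illustrate the difference, here is a hypothetical stdlib-only demo (not from this thread): with fork the child is a copy of the parent's memory, while with spawn the child starts a fresh interpreter and re-imports the main module, so parent-side state (held locks, queue internals, mutated globals) is not inherited.

```python
import multiprocessing as mp

STATE = "fresh"  # module-level state; a spawned child re-imports and sees this value

def report(queue):
    # Runs in the child: reports which value of STATE the child sees.
    queue.put(STATE)

def child_view(method):
    # Start a child with the given start method and return the STATE it observed.
    ctx = mp.get_context(method)
    queue = ctx.Queue()
    proc = ctx.Process(target=report, args=(queue,))
    proc.start()
    result = queue.get()
    proc.join()
    return result

if __name__ == "__main__":
    STATE = "mutated by parent"
    # A spawned child re-imports the module, so it sees the original value;
    # a forked child inherits the parent's mutated memory.
    print("spawn child sees:", child_view("spawn"))  # fresh
    if "fork" in mp.get_all_start_methods():
        print("fork child sees:", child_view("fork"))  # mutated by parent
```

The same mechanism applies to the internal locks and queues that libraries hold: a forked child can inherit a lock in a "held" state with no owner to release it, which is one common way training loops deadlock.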