allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

Training gets stuck after some epochs when using Tensorflow with multiprocessing

n-Guard opened this issue · comments

Describe the bug

I'm using Keras/Tensorflow, and training stalls indefinitely after some epochs when I enable multiprocessing.
It happens only when I use LSTM or TimeDistributed layers; Dense and Conv layers alone don't seem to trigger it.
Without ClearML, everything works fine.

To reproduce

Start a training with Tensorflow and multiprocessing enabled.
Choose a model with LSTM and/or TimeDistributed layers.

I provided a script; the bug usually occurs within the first 100 epochs:
https://gist.github.com/n-Guard/0f5d568cfedb3a22bfa56785e82961ad
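For reference, the kind of setup described above can be sketched as follows. This is an illustrative reconstruction, not the actual gist: the model shapes, the RandomSequence generator, and all hyperparameters are made up. Note that in the real reproduction a clearml Task.init() call precedes the training; it is omitted here.

```python
import numpy as np
import tensorflow as tf

# Illustrative data generator; Model.fit only uses worker processes
# when the input is a keras.utils.Sequence.
class RandomSequence(tf.keras.utils.Sequence):
    def __len__(self):
        return 8  # batches per epoch

    def __getitem__(self, idx):
        x = np.random.rand(4, 10, 3).astype("float32")  # (batch, time, features)
        y = np.random.rand(4, 10, 1).astype("float32")
        return x, y

# A model with the layer types that trigger the stall.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(16, return_sequences=True, input_shape=(10, 3)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
])
model.compile(optimizer="adam", loss="mse")

# With ClearML initialized, a run like this reportedly hangs after some
# epochs; without ClearML it completes normally.
model.fit(RandomSequence(), epochs=2, workers=2,
          use_multiprocessing=True, verbose=0)
```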

Expected behaviour

The training should continue without getting stuck.

Environment

  • Server type: self hosted
  • ClearML SDK Version: clearml-agent==1.7.0 clearml==1.14.4
  • ClearML Server Version: 1.14.1
  • Tensorflow Version: 2.15.0
  • Python Version: 3.11
  • OS: Linux

Hi @n-Guard! We managed to reproduce this, though it is not yet clear why it happens. In the meantime, you could try calling the following snippet at the very beginning of your script:

try:
    import multiprocessing
    # Must be called before any other multiprocessing machinery is set up
    multiprocessing.set_start_method("spawn")
except Exception:
    # set_start_method raises RuntimeError if a start method was already set
    pass

What it does: it makes Python use spawn instead of fork when creating new processes, so the state of locks, queues, etc. is not copied into child processes.
Not 100% sure it will help.
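To illustrate the difference, here is a hypothetical stdlib-only demo (not from this thread): with fork the child is a copy of the parent's memory, while with spawn the child starts a fresh interpreter and re-imports the main module, so parent-side state (held locks, queue internals, mutated globals) is not inherited.

```python
import multiprocessing as mp

STATE = "fresh"  # module-level state; a spawned child re-imports and sees this value

def report(queue):
    # Runs in the child: reports which value of STATE the child sees.
    queue.put(STATE)

def child_view(method):
    # Start a child with the given start method and return the STATE it observed.
    ctx = mp.get_context(method)
    queue = ctx.Queue()
    proc = ctx.Process(target=report, args=(queue,))
    proc.start()
    result = queue.get()
    proc.join()
    return result

if __name__ == "__main__":
    STATE = "mutated by parent"
    # A spawned child re-imports the module, so it sees the original value;
    # a forked child inherits the parent's mutated memory.
    print("spawn child sees:", child_view("spawn"))  # fresh
    if "fork" in mp.get_all_start_methods():
        print("fork child sees:", child_view("fork"))  # mutated by parent
```

The same mechanism applies to the internal locks and queues that libraries hold: a forked child can inherit a lock in a "held" state with no owner to release it, which is one common way training loops deadlock.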