mlflow / mlflow

Open source platform for the machine learning lifecycle

Home Page: https://mlflow.org

mlflow.tensorflow.MlflowCallback() causes a freeze at exit in _batch_status_check_threadpool.shutdown

JosephPenaQuino opened this issue · comments

Summary

I'm using:

  • Ubuntu 24.04
  • python 3.8.19
  • mlflow 2.14.0
  • tensorflow 2.3.4

When I pass mlflow.tensorflow.MlflowCallback() as a callback to the Keras fit function, my program freezes on exit.
The code below shows how I used the callback:

model.fit(
    train_dataset,
    epochs=1,
    validation_data=test_dataset,
    callbacks=[
        mlflow.tensorflow.MlflowCallback(),
    ],
)

When the program freezes and I press Ctrl+C, Python prints the following traceback:

^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/my/python/venv/lib/python3.8/site-packages/mlflow/utils/async_logging/async_logging_queue.py", line 75, in _at_exit_callback
    self._batch_status_check_threadpool.shutdown(wait=True)
  File "/my/python/venv/lib/python3.8/concurrent/futures/thread.py", line 236, in shutdown
    t.join()
  File "/my/python/venv/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/my/python/venv/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/my/python/venv/lib/python3.8/site-packages/mlflow/utils/async_logging/async_logging_queue.py", line 75, in _at_exit_callback
    self._batch_status_check_threadpool.shutdown(wait=True)
  File "/my/python/venv/lib/python3.8/concurrent/futures/thread.py", line 236, in shutdown
    t.join()
  File "/my/python/venv/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/my/python/venv/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/my/python/venv/lib/python3.8/concurrent/futures/thread.py", line 40, in _python_exit
    t.join()
  File "/my/python/venv/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/my/python/venv/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

I noticed that this problem does not occur with mlflow 2.13.2.

Notes

No response

@JosephPenaQuino thanks for reporting this. Can you provide the full code? Please make sure it imports all the required packages and defines all the variables it needs to run.

This is strange: my recent change should not take effect unless users set the environment variable MLFLOW_ASYNC_LOGGING_WAITING_TIME, which is not yet publicly documented.

self._batch_status_check_threadpool.shutdown(wait=True) implies there are hanging threads that can't be completed.
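
To illustrate that point outside of mlflow, here is a minimal standard-library sketch. It is not mlflow's code: the function name demo_blocking_shutdown and the Event-based stand-in job are assumptions made for the example. It only shows that ThreadPoolExecutor.shutdown(wait=True) does not return while a submitted job is still pending.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def demo_blocking_shutdown(observe_for: float = 0.2) -> dict:
    """Show that shutdown(wait=True) returns only after the pending job ends.

    Hypothetical helper for illustration; not part of mlflow.
    """
    release = threading.Event()
    pool = ThreadPoolExecutor(max_workers=1)
    # Stand-in for a job that cannot complete on its own.
    future = pool.submit(release.wait)

    shutdown_returned = threading.Event()

    def shutdown():
        pool.shutdown(wait=True)  # joins the worker thread
        shutdown_returned.set()

    closer = threading.Thread(target=shutdown)
    closer.start()

    # While the job is pending, shutdown() has not returned.
    blocked = not shutdown_returned.wait(timeout=observe_for)

    # Let the job finish; shutdown() can now complete.
    release.set()
    closer.join(timeout=5)

    return {
        "blocked_while_job_pending": blocked,
        "returned_after_release": shutdown_returned.is_set(),
        "job_result": future.result(timeout=5),
    }


if __name__ == "__main__":
    print(demo_blocking_shutdown())
```

If the stand-in job were never released, the shutdown call would wait indefinitely, which matches the t.join() frame in the traceback above.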

Yes, technically this thread pool shuts down once every job has finished. I don't know why there is a regression; I need to take a closer look at the user's code.
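
One possible way to narrow this down, sketched with the standard library only, is to list the threads that are still alive just before the script ends. The helper name live_threads is hypothetical, and nothing here is specific to mlflow. The daemon flag is reported rather than filtered on because, on Python 3.8, ThreadPoolExecutor workers are daemon threads that are joined from an atexit hook.

```python
import threading


def live_threads():
    """Return (name, is_daemon) for every live thread except the main one.

    Hypothetical diagnostic helper; not part of mlflow.
    """
    main = threading.main_thread()
    return [
        (t.name, t.daemon)
        for t in threading.enumerate()
        if t is not main and t.is_alive()
    ]


if __name__ == "__main__":
    # Call this at the end of the training script, after the last run closes.
    for name, is_daemon in live_threads():
        print(f"still alive: {name} (daemon={is_daemon})")
```

Calling faulthandler.dump_traceback(all_threads=True) at the same point would additionally print where each of those threads is currently waiting.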

I wrote a script that reproduces the same problem:

import tensorflow as tf
import mlflow as mlf

# basic data
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

mlf.set_tracking_uri('http://localhost:8080')
mlf.set_experiment('test')

# basic model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

with mlf.start_run():
    for i in range(2):
        with mlf.start_run(run_name=f'run_{i}', nested=True):
            model.fit(
                x_train,
                y_train,
                epochs=1,
                validation_data=(x_test, y_test),
                callbacks=[
                    mlf.tensorflow.MlflowCallback(),
                ],
            )

I tested with the same Python package versions as before. I also tested with mlflow 2.14.1 and it has the same problem.

@JosephPenaQuino thanks for the code.

I tested with the same Python package versions as before

Can you elaborate on this?

I used:

  • Ubuntu 24.04
  • python 3.8.19

And the following Python packages:

  • mlflow 2.14.0
  • tensorflow 2.3.4

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.