horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

Early Stopping tf.keras Crashes

AllardJM opened this issue

I am using SageMaker and Horovod with TensorFlow Keras. The error I am seeing suggests that when the rank 0 process stops because of early stopping, the other processes keep training and then crash when they try to communicate with the stopped process.

I am using keras.fit() with an EarlyStopping callback that is added only on rank 0:

    if hvd.rank() == 0:
        callbacks.append(EarlyStopping(monitor="val_factorized_top_k/top_10_categorical_accuracy", patience=2, mode='max', verbose=1, restore_best_weights=True, start_from_epoch=1))

......

    tf_model.fit(
        interactions.batch(batchsize),
        epochs=epochs,
        callbacks=callbacks,
        validation_data=val_ds,
        verbose=1 if hvd.rank() == 0 else 0,
    )

How can the early-stopping decision be communicated to the other processes to avoid this? Is there anything else that needs to be done to keep the model in sync when rank 0 runs validation and potentially stops?
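To make the failure mode concrete, here is a minimal sketch in plain Python (no Horovod or TensorFlow calls) that simulates the stop decision on several ranks. The names `PatienceTracker` and `simulate_stop_epochs` are hypothetical and exist only for this illustration, and the patience rule is a simplified stand-in for what Keras `EarlyStopping` does with `mode='max'`. It compares two cases: each rank deciding from its own local metric, and every rank deciding from the same averaged metric. In a real Horovod job the averaging step would presumably be handled by something like `hvd.callbacks.MetricAverageCallback()` placed before `EarlyStopping` in the callback list on every rank, with validation running on every rank; that is an assumption about the setup, not something confirmed in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PatienceTracker:
    """Simplified 'max' mode patience rule (hypothetical helper, not a Keras class)."""
    patience: int = 2
    best: float = float("-inf")
    wait: int = 0

    def should_stop(self, metric: float) -> bool:
        # An improvement resets the counter; otherwise count one stale epoch.
        if metric > self.best:
            self.best = metric
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def simulate_stop_epochs(
    per_rank_metrics: List[List[float]],
    patience: int = 2,
    synced: bool = True,
) -> List[Optional[int]]:
    """Return, per rank, the epoch index at which it stops (None if it never stops).

    synced=True  : every rank sees the mean of the metric over all ranks,
                   standing in for an allreduce-style average.
    synced=False : every rank only sees its own local metric.
    """
    n_ranks = len(per_rank_metrics)
    n_epochs = len(per_rank_metrics[0])
    trackers = [PatienceTracker(patience) for _ in range(n_ranks)]
    stop_epoch: List[Optional[int]] = [None] * n_ranks

    for epoch in range(n_epochs):
        local = [per_rank_metrics[r][epoch] for r in range(n_ranks)]
        if synced:
            mean = sum(local) / n_ranks
            local = [mean] * n_ranks  # identical input on every rank
        for r in range(n_ranks):
            # A rank that has already stopped no longer updates its tracker.
            if stop_epoch[r] is None and trackers[r].should_stop(local[r]):
                stop_epoch[r] = epoch
    return stop_epoch
```

With local metrics `[0.10, 0.30, 0.20, 0.20, 0.20]` on rank 0 and `[0.10, 0.20, 0.25, 0.26, 0.50]` on rank 1, the unsynced case returns `[3, None]`: rank 0 stops while rank 1 keeps going, which is the situation that would leave a collective op waiting on a finished rank. The synced case returns `[3, 3]`, so all ranks leave the training loop at the same epoch.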

Error:
    AlgorithmError: UnknownError: ExitCode 1 ErrorMessage "tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__HorovodAllreduce_device_/job:localhost/replica:0/task:0/device:CPU:0}} Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message. [Op:HorovodAllreduce]
    2024-03-08 02:22:51.376469: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at mpi_ops.cc:497 : UNKNOWN: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
    Traceback (most recent call last)
    File "/usr/local/lib/python3.9/