Tensorflow Saved model not portable with latest tf.keras.optimizers

Question

supercharleszhu opened this issue 3 months ago

Chen Zhu commented 3 months ago

Environment:

Framework: TensorFlow
Framework version: 2.11
Horovod version: 0.28.1
MPI version: N/A
CUDA version: 11.2 (tested in CPU version)
NCCL version: 11.2
Python version: 3.10
Spark / PySpark version: N/A
Ray version: N/A
OS and version: CentOS 7
GCC version: 11.2.0
CMake version:

Checklist:

Did you search issues to find if somebody asked this question before?
If your question is about hang, did you read this doc?
If your question is about docker, did you read this doc?
Did you check if your question is answered in the troubleshooting guide?

Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.

We hit an issue after running TensorFlow training with Horovod, on both CPU and GPU. The resulting SavedModel cannot be loaded outside a Horovod environment because a HorovodAllreduce op appears to be saved into the graph unexpectedly.

To reproduce, run the following script, which trains a simple Keras model taken from a test case and saves it:

# test.py
import horovod.tensorflow as hvd
import tensorflow as tf
import keras
import numpy as np


hvd.init()
initial_lr = 0.1 * hvd.size()
opt = tf.keras.optimizers.Adam()
opt = hvd.DistributedOptimizer(opt)

def linear_multiplier(epoch):
    return epoch

model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(3,)))
model.add(keras.layers.RepeatVector(3))
model.add(keras.layers.ThresholdedReLU(0.5))
model.compile(loss=keras.losses.mean_squared_error,
                optimizer=opt,
                metrics=[keras.metrics.categorical_accuracy],
                experimental_run_tf_function=False)
x = np.random.random((10, 3))
y = np.random.random((10, 3, 2))


train_history = model.fit(x,
                            y,
                            steps_per_epoch=5,
                            epochs=20)

# test that the metrics average is being respected
loss_metrics = train_history.history["loss"]
loss_metrics_tensor = tf.convert_to_tensor(
    loss_metrics, dtype=tf.float32)
expected_loss_metrics_tensor = hvd.broadcast(
    loss_metrics_tensor, root_rank=0)

if hvd.rank() == 0:
    tf.saved_model.save(model, "test_space/hvd_saved_model_2")

Then run python test.py.

Next, load the model in a process where Horovod is not imported:

# test_2.py
import tensorflow as tf
tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_2")

Running python test_2.py then fails with:

Traceback (most recent call last):
  File "/home/chzhu/test_space/test_tf_saved_model.py", line 3, in <module>
    tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_1")
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 828, in load
    result = load_partial(export_dir, None, tags, options)["root"]
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 961, in load_partial
    raise FileNotFoundError(
FileNotFoundError: Op type not registered 'HorovodAllreduce' in binary running on chzhu-ld4.linkedin.biz. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
 You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.

Note: Reverting to Horovod 0.26 or switching to tf.keras.optimizers.legacy resolves this issue, but we would like to use the latest Horovod instead.
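
For anyone comparing configurations (for example the legacy optimizer versus the new one), here is a rough way to check whether an export is affected without loading it. It is only a sketch: it assumes the op type name is stored as plain UTF-8 bytes inside the .pb files of the export directory, which is how protobuf encodes string fields. The helper name find_op_name is hypothetical and not part of TensorFlow or Horovod.

```python
from pathlib import Path


def find_op_name(export_dir, op_name="HorovodAllreduce"):
    """Return names of .pb files under export_dir whose raw bytes contain op_name.

    This is a crude byte-level scan, not a real graph inspection: a hit means
    the string occurs somewhere in the file, which strongly suggests (but does
    not prove) that a node with that op type was serialized.
    """
    root = Path(export_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{export_dir} is not a directory")

    needle = op_name.encode("utf-8")
    hits = []
    # A SavedModel keeps its graph in saved_model.pb; Keras exports may add
    # other .pb files next to it, so scan all of them.
    for pb_file in sorted(root.rglob("*.pb")):
        if needle in pb_file.read_bytes():
            hits.append(pb_file.name)
    return hits
```

Calling find_op_name("test_space/hvd_saved_model_2") on the export from test.py should list saved_model.pb if the Horovod op was written into the graph, and return an empty list for an export that is safe to load without Horovod.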