Tensorflow Saved model not portable with latest tf.keras.optimizers
supercharleszhu opened this issue · comments
Environment:
- Framework: (TensorFlow, Keras, PyTorch, MXNet) TensorFlow
- Framework version: 2.11
- Horovod version: 0.28.1
- MPI version: N/A
- CUDA version: 11.2 (tested in CPU version)
- NCCL version: 11.2
- Python version: 3.10
- Spark / PySpark version: N/A
- Ray version: N/A
- OS and version: CentOS 7
- GCC version: 11.2.0
- CMake version:
Checklist:
- Did you search issues to find if somebody asked this question before?
- If your question is about hang, did you read this doc?
- If your question is about docker, did you read this doc?
- Did you check if your question is answered in the troubleshooting guide?
Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.
We hit an issue after running TensorFlow training with Horovod, on both CPU and GPU. The resulting SavedModel cannot be loaded outside a Horovod environment because a HorovodAllreduce op appears to have been saved into the graph unexpectedly.
To reproduce, run the following script, which trains a simple Keras model from the test case and saves it:
# test.py
import horovod.tensorflow as hvd
import tensorflow as tf
import keras
import numpy as np

hvd.init()
initial_lr = 0.1 * hvd.size()
opt = tf.keras.optimizers.Adam()
opt = hvd.DistributedOptimizer(opt)

def linear_multiplier(epoch):
    return epoch

model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(3,)))
model.add(keras.layers.RepeatVector(3))
model.add(keras.layers.ThresholdedReLU(0.5))
model.compile(loss=keras.losses.mean_squared_error,
              optimizer=opt,
              metrics=[keras.metrics.categorical_accuracy],
              experimental_run_tf_function=False)

x = np.random.random((10, 3))
y = np.random.random((10, 3, 2))
train_history = model.fit(x,
                          y,
                          steps_per_epoch=5,
                          epochs=20)

# test that the metrics average is being respected
loss_metrics = train_history.history["loss"]
loss_metrics_tensor = tf.convert_to_tensor(
    loss_metrics, dtype=tf.float32)
expected_loss_metrics_tensor = hvd.broadcast(
    loss_metrics_tensor, root_rank=0)

if hvd.rank() == 0:
    tf.saved_model.save(model, "test_space/hvd_saved_model_2")
Run it with python test.py, then load the model in a process where Horovod has not been imported:
# test_2.py
import tensorflow as tf
tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_2")
Running python test_2.py fails with:
Traceback (most recent call last):
  File "/home/chzhu/test_space/test_tf_saved_model.py", line 3, in <module>
    tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_1")
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 828, in load
    result = load_partial(export_dir, None, tags, options)["root"]
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 961, in load_partial
    raise FileNotFoundError(
FileNotFoundError: Op type not registered 'HorovodAllreduce' in binary running on chzhu-ld4.linkedin.biz. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.
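For anyone who wants to check whether an exported model is affected without triggering the load error, here is a rough sketch that scans the raw bytes of saved_model.pb for the op name from the message above. It assumes the op type is stored as a plain string in the serialized graph, so treat it as a heuristic rather than a real graph inspection. The helper name saved_model_mentions_op is made up for this example.

```python
from pathlib import Path


def saved_model_mentions_op(export_dir, op_name="HorovodAllreduce"):
    """Heuristic check: does saved_model.pb contain `op_name` as raw bytes?

    This does not parse the protobuf. It only looks for the op name as a
    byte string, so a hit means "probably referenced", not proof.
    """
    pb_path = Path(export_dir) / "saved_model.pb"
    return op_name.encode("utf-8") in pb_path.read_bytes()
```

Calling saved_model_mentions_op("test_space/hvd_saved_model_2") on the directory written by test.py should return True if the Horovod op was serialized into the graph. It needs neither TensorFlow nor Horovod to be installed.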
Note: Reverting to Horovod 0.26 or switching to tf.keras.optimizers.legacy resolves this issue, but we would like to stay on the latest Horovod.
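For reference, the legacy-optimizer workaround from the note would look roughly like the sketch below. It only swaps the optimizer construction in test.py and leaves the rest of the script unchanged. The function name make_legacy_distributed_optimizer is hypothetical. The imports sit inside the function only so the snippet can be imported on a machine without Horovod. This reflects the behaviour we observed on TF 2.11 with Horovod 0.28.1 and has not been checked on other versions.

```python
def make_legacy_distributed_optimizer():
    """Build a Horovod-wrapped Adam using the legacy Keras optimizer API.

    Assumption: TensorFlow 2.11+, where tf.keras.optimizers.legacy exists.
    """
    # Imported lazily so that importing this module does not require
    # Horovod or TensorFlow to be installed.
    import horovod.tensorflow as hvd
    import tensorflow as tf

    hvd.init()
    # Use the legacy Adam instead of the new tf.keras.optimizers.Adam.
    opt = tf.keras.optimizers.legacy.Adam()
    return hvd.DistributedOptimizer(opt)
```

In test.py this would replace the two opt = ... lines, with the returned optimizer passed to model.compile as before.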