Colab problem: continue previous training

olaviinha opened this issue · comments

I am using Colab (w/ %tensorflow_version 1.x) to run it and Google Drive to store all the related data.

It starts training from step 0 every time (along with a bunch of warnings), despite seemingly finding and restoring a previous checkpoint correctly at the beginning.

Has anybody had any luck in continuing previous training in Colab?

Trying to restore saved checkpoints from /<logdir_root>/train/2020-07-20T11-44-41/ ...  Checkpoint found: /<logdir_root>/train/2020-07-20T11-44-41/model.ckpt-1396
  Global step was: 1396
  Restoring... Done.
WARNING:tensorflow:From train.py:289: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:`tf.train.start_queue_runners()` was called when no queue runners were defined. You can safely remove the call to this deprecated function.
files length: 4
2020-07-21 15:07:46.884203: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-21 15:07:47.979702: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
step 0 - loss = 1.931, (20.117 sec/step)
Storing checkpoint to /<logdir_root>/train/2020-07-21T15-07-00 ...WARNING:tensorflow:Issue encountered when serializing variables.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'filter_bias' has type str, but expected one of: int, long, bool
WARNING:tensorflow:Issue encountered when serializing trainable_variables.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'filter_bias' has type str, but expected one of: int, long, bool
step 1 - loss = 1.902, (0.692 sec/step)
step 2 - loss = 1.954, (0.692 sec/step)
step 3 - loss = 1.895, (0.692 sec/step)

@olaviinha Hi, I'm getting the same warning, and not only on Colab.

A quick workaround for restoring the model properly is to set the --logdir argument to the directory where your model was saved, and not to use --restore_from:
--logdir=logdir/train/model_dir
That worked for me: it restores the global step and continues where it left off, in THAT folder.
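
Before resuming, it can help to confirm which step the folder passed to --logdir actually contains. Below is a minimal sketch, assuming the checkpoint files follow the model.ckpt-<step> naming visible in the log above (for example model.ckpt-1396.index). The helper name latest_checkpoint_step is hypothetical and is not part of train.py.

```python
import os
import re

# Assumption: checkpoint files are named model.ckpt-<step>.<suffix>,
# as suggested by "model.ckpt-1396" in the log earlier in this thread.
CKPT_FILE = re.compile(r"^model\.ckpt-(\d+)\.")


def latest_checkpoint_step(logdir):
    """Return the highest step found among checkpoint files in logdir, or None."""
    steps = []
    for name in os.listdir(logdir):
        match = CKPT_FILE.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None
```

If this returns None for the folder you pass to --logdir, there is nothing in that folder to resume from, so training would be expected to start at step 0.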

However, I cannot figure out the warning! Back in April I trained with TF 1.3 and everything was OK. I suspect that between versions 1.3 and 1.15 (the one on Colab) there have been changes to the Saver class, so I'm looking into that.
Did you manage to resolve it?

@ileanna hi! --logdir is an argument of which method?

I am doing

python train.py --data_dir=MY_PATH --logdir = /content/tensorflow-wavenet/logdir/train/2021-01-13T17-47-15/model.ckpt-200

but it won't work...

@nschmidtg Hello! Since --logdir specifies the directory where the training logs are stored, you need to point it to a folder.

So in your case it should be
python train.py --data_dir=MY_PATH --logdir=/content/tensorflow-wavenet/logdir/train/2021-01-13T17-47-15/
that is, without the model.ckpt-200 file, only the folder with the checkpoints.
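
If you tend to copy the full checkpoint path, a small helper can strip the file part before you pass it to --logdir. This is only a sketch under the assumption that checkpoints are named model.ckpt-<step>; to_checkpoint_dir is a hypothetical name and not something train.py provides.

```python
import posixpath
import re

# Assumption: a checkpoint prefix looks like model.ckpt-<step>, optionally
# followed by a suffix such as .index or .meta.
CKPT_PREFIX = re.compile(r"^model\.ckpt-\d+(\..+)?$")


def to_checkpoint_dir(path):
    """Return the containing folder if path points at a checkpoint file, else path unchanged."""
    head, tail = posixpath.split(path.rstrip("/"))
    if CKPT_PREFIX.match(tail):
        return head + "/"
    return path
```

For example, calling it on the path ending in model.ckpt-200 from the question above gives back the 2021-01-13T17-47-15/ folder, while a path that is already a folder is returned as is.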

I hope it works like this!

Thanks! It did work!