03_03_vae_digits_train: TypeError: unsupported format string passed to numpy.ndarray.__format__
jdinkla opened this issue · comments
I am running on Ubuntu 18.04 with Python 3.6.9 and when running 03_03_vae_digits_train I encounter the following error:
```python
vae.train(
    x_train
    , batch_size = BATCH_SIZE
    , epochs = EPOCHS
    , run_folder = RUN_FOLDER
    , print_every_n_batches = PRINT_EVERY_N_BATCHES
    , initial_epoch = INITIAL_EPOCH
)
```
I installed the dependencies with the newest pip via `pip install -r requirements.txt`; no errors occurred, although I also had to install graphviz. BTW, numpy is 1.17.2 as required:
```
$ pip freeze | grep numpy
numpy==1.17.2
```
```log
Epoch 1/200
1874/1875 [============================>.] - ETA: 0s - loss: 58.4866 - reconstruction_loss: 55.2065 - kl_loss: 3.2801
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-a0cdb3ff19b5> in <module>
5 , run_folder = RUN_FOLDER
6 , print_every_n_batches = PRINT_EVERY_N_BATCHES
----> 7 , initial_epoch = INITIAL_EPOCH
8 )
~/GDL_code/models/VAE.py in train(self, x_train, batch_size, epochs, run_folder, print_every_n_batches, initial_epoch, lr_decay)
224 , epochs = epochs
225 , initial_epoch = initial_epoch
--> 226 , callbacks = callbacks_list
227 )
228
~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
64 def _method_wrapper(self, *args, **kwargs):
65 if not self._in_multi_worker_mode(): # pylint: disable=protected-access
---> 66 return method(self, *args, **kwargs)
67
68 # Running inside `run_distribute_coordinator` already.
~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
874 epoch_logs.update(val_logs)
875
--> 876 callbacks.on_epoch_end(epoch, epoch_logs)
877 if self.stop_training:
878 break
~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py in on_epoch_end(self, epoch, logs)
363 logs = self._process_logs(logs)
364 for callback in self.callbacks:
--> 365 callback.on_epoch_end(epoch, logs)
366
367 def on_train_batch_begin(self, batch, logs=None):
~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py in on_epoch_end(self, epoch, logs)
1175 self._save_model(epoch=epoch, logs=logs)
1176 else:
-> 1177 self._save_model(epoch=epoch, logs=logs)
1178 if self.model._in_multi_worker_mode():
1179 # For multi-worker training, back up the weights and current training
~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py in _save_model(self, epoch, logs)
1194 int) or self.epochs_since_last_save >= self.period:
1195 self.epochs_since_last_save = 0
-> 1196 filepath = self._get_file_path(epoch, logs)
1197
1198 try:
~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py in _get_file_path(self, epoch, logs)
1242 # `{mape:.2f}`. A mismatch between logged metrics and the path's
1243 # placeholders can cause formatting to fail.
-> 1244 return self.filepath.format(epoch=epoch + 1, **logs)
1245 except KeyError as e:
1246 raise KeyError('Failed to format this callback filepath: "{}". '
TypeError: unsupported format string passed to numpy.ndarray.__format__
```
On the tensorflow_2 branch.
It works on the master branch!
I changed this line as below.

```diff
- checkpoint_filepath=os.path.join(run_folder, "weights/weights-{epoch:03d}-{loss:.2f}.h5")
+ checkpoint_filepath=os.path.join(run_folder, "weights/weights.h5")
```

Then I can run 03_03_vae_digits_train with no error.
I created a Google Colab notebook based on 03_03_vae_digits_train.
I hope this notebook helps you.
Considering the code around this line:

```python
checkpoint_filepath = os.path.join(run_folder, "weights/weights-{epoch:03d}-{loss:.2f}.h5")
checkpoint1 = ModelCheckpoint(checkpoint_filepath, save_weights_only=True, verbose=1)
checkpoint2 = ModelCheckpoint(os.path.join(run_folder, 'weights/weights.h5'), save_weights_only=True, verbose=1)
```

replacing `"weights/weights-{epoch:03d}-{loss:.2f}.h5"` with `"weights/weights.h5"` is sort of pointless, because checkpoint1 and checkpoint2 would then be exactly the same...
I tried to figure out what exactly caused the problem, but I'm quite unfamiliar with string formatting. I have a rough idea of what `{epoch:03d}-{loss:.2f}` does (inserting a variable `epoch` zero-padded to three digits and a variable `loss` rounded to two decimal places into the string?), but not why it fails. So I'm having the same issue and would be very grateful for a fix. Also on branch tensorflow_2.
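For context, the failure can be reproduced outside Keras entirely: a format spec like `.2f` works for a plain Python float but raises this exact TypeError when the value is a non-scalar NumPy array. This is a minimal sketch, not the repository's code; the filepath string mirrors the checkpoint template from the traceback:

```python
import numpy as np

filepath = "weights/weights-{epoch:03d}-{loss:.2f}.h5"

# A scalar loss formats fine: epoch is zero-padded to 3 digits,
# loss is rendered with 2 decimal places.
print(filepath.format(epoch=5, loss=58.4866))  # weights/weights-005-58.49.h5

# But if the logged loss is a numpy array rather than a scalar, the
# ".2f" spec cannot be applied to it, and format() raises:
# TypeError: unsupported format string passed to numpy.ndarray.__format__
try:
    filepath.format(epoch=5, loss=np.array([55.2, 58.4]))
except TypeError as e:
    print(type(e).__name__, e)
```

So the checkpoint filepath template itself is fine; the problem is that the `loss` value handed to `ModelCheckpoint` is an array instead of a scalar.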
I faced this same problem. As far as I can tell, the error occurs because the return value of the loss function was rewritten in the form of a dictionary. To avoid the error, you can remove the final `{loss:.2f}` placeholder. In my case:

```python
checkpoint_filepath = os.path.join(run_folder, "weights/weights-{epoch:02d}.h5")
```
However, in the module 03_04_vae_digits_analysis I found that the saved .h5 weights are not loaded into the model. Therefore, I save the weights in .ckpt format instead.
Working on the TF2 branch: https://github.com/kubokoHappy/GDL_code_kuboko
Using TF 2.3 with GPU.
The problem is that the loss value is a vector of length batch size, so its mean has to be computed before it is logged.
This fragment:

```python
return {
    "loss": total_loss,
    "reconstruction_loss": reconstruction_loss,
    "kl_loss": kl_loss,
}
```

should be replaced by this:

```python
return {
    "loss": tf.reduce_mean(total_loss),
    "reconstruction_loss": tf.reduce_mean(reconstruction_loss),
    "kl_loss": tf.reduce_mean(kl_loss),
}
```
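To see why reducing to the mean resolves the checkpoint error, here is a NumPy sketch using `np.mean` as a stand-in for `tf.reduce_mean` (the `per_sample_loss` values are made up for illustration): a per-sample loss vector breaks the filepath formatting, while its scalar mean formats cleanly.

```python
import numpy as np

# Hypothetical per-sample losses for a batch of 4 (illustration only).
per_sample_loss = np.array([55.1, 58.9, 60.2, 59.4])

filepath = "weights/weights-{epoch:03d}-{loss:.2f}.h5"

# Passing the raw vector reproduces the issue's TypeError, because
# ".2f" is not a valid format spec for a non-scalar ndarray.
try:
    filepath.format(epoch=1, loss=per_sample_loss)
except TypeError:
    print("vector loss: TypeError")

# Reducing to the mean (np.mean here, tf.reduce_mean in the model)
# yields a scalar, and formatting succeeds.
mean_loss = float(np.mean(per_sample_loss))
print(filepath.format(epoch=1, loss=mean_loss))  # weights/weights-001-58.40.h5
```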