jamt9000 / DVE

Descriptor Vector Exchange

Evaluation fails when training terminates before max epoch set in config

schuhschuh opened this issue:

I've trained the SmallNet with 3D descriptors using the config file linked from the README.

For me, training stopped early, after 70 epochs instead of the configured 100:

Running validation for epoch 71
validation epoch took 00h00m13s
    epoch          : 71
    loss           : 1.8158737950681234
    val_loss       : 1.765822774887085
Val performance didn't improve for 10 epochs. Training stops.

This resulted in the following error when the train.py script attempted to load the checkpoint for epoch 100 (value from config file) for the mini-evaluation:

Loading checkpoint: saved/models/celeba-smallnet-3d-dve-2019-08-08_17-54-21/2019-09-18_17-09-50/checkpoint-epoch100.pth ...
Traceback (most recent call last):
  File "train.py", line 241, in <module>
    main(config, args.resume)
  File "train.py", line 176, in main
    evaluation(config, logger=logger)
  File "/data/aschuh/source/dve/test_matching.py", line 171, in evaluation
    checkpoint = torch.load(ckpt_path)
  File "/data/aschuh/tools/pyenv/versions/dve/lib/python3.6/site-packages/torch/serialization.py", line 384, in load
    f = f.open('rb')
  File "/data/aschuh/tools/pyenv/versions/3.6.9/lib/python3.6/pathlib.py", line 1183, in open
    opener=self._opener)
  File "/data/aschuh/tools/pyenv/versions/3.6.9/lib/python3.6/pathlib.py", line 1037, in _opener
    return self._accessor.open(self, flags, mode)
  File "/data/aschuh/tools/pyenv/versions/3.6.9/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: 'saved/models/celeba-smallnet-3d-dve-2019-08-08_17-54-21/2019-09-18_17-09-50/checkpoint-epoch100.pth'
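One way to avoid this lookup failure is to search the run directory for the newest checkpoint that actually exists, rather than building the path from the configured epoch count. The following is a minimal sketch, not code from the DVE repository: the helper name `latest_checkpoint` is hypothetical, and it assumes checkpoints follow the `checkpoint-epoch<N>.pth` naming seen in the log above.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme, taken from the path in the traceback.
CKPT_PATTERN = re.compile(r"checkpoint-epoch(\d+)\.pth$")


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if there is none."""
    best_epoch, best_path = -1, None
    for path in ckpt_dir.glob("checkpoint-epoch*.pth"):
        match = CKPT_PATTERN.search(path.name)
        # Compare epochs numerically: a plain string sort would rank
        # "epoch9" above "epoch71".
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

The evaluation step could then call `latest_checkpoint(run_dir)` and report a clear error if it returns `None`, instead of failing inside `torch.load`.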

The BaseTrainer should probably store the last completed epoch (or update self.epochs) when early stopping triggers, cf. the line that logs the message:

self.logger.info("Val performance didn\'t improve for {} epochs. "

train.py should then use that value instead of the value from the config file at:

DVE/train.py, line 173 in e6e0cdb

epoch = config["trainer"]["epochs"]
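The suggestion above can be sketched as follows. This is a toy illustration under stated assumptions, not the DVE `BaseTrainer`: the class `SketchTrainer`, its `last_epoch` attribute, and the canned `val_losses` list are all hypothetical and stand in for the real training loop.

```python
class SketchTrainer:
    """Hypothetical stand-in for BaseTrainer that records the last epoch run."""

    def __init__(self, epochs, early_stop, val_losses):
        self.epochs = epochs          # maximum epochs, like config["trainer"]["epochs"]
        self.early_stop = early_stop  # patience: epochs allowed without improvement
        self.val_losses = val_losses  # canned validation losses replacing real training
        self.last_epoch = 0           # last epoch that actually ran

    def train(self):
        best, not_improved = float("inf"), 0
        for epoch in range(1, self.epochs + 1):
            val_loss = self.val_losses[epoch - 1]
            self.last_epoch = epoch  # record progress before any early exit
            if val_loss < best:
                best, not_improved = val_loss, 0
            else:
                not_improved += 1
            if not_improved >= self.early_stop:
                break  # early stopping: last_epoch stays below self.epochs
        return self.last_epoch


# Usage: build the checkpoint name from the epoch actually reached,
# not from the configured maximum.
trainer = SketchTrainer(epochs=100, early_stop=3, val_losses=[1.0, 0.9] + [0.95] * 98)
epoch = trainer.train()  # stops at epoch 5 with these canned losses
ckpt_name = f"checkpoint-epoch{epoch}.pth"
```

Whether a checkpoint is actually written for the final epoch depends on the trainer's save policy, which this sketch does not model.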

Thanks for this! Will push a fix shortly.

Thanks for flagging this @schuhschuh. I've pushed a change which essentially disables early stopping (since we did not use it in our experiments in the paper) and closed the issue, but feel free to re-open if you hit further issues.