How to continue training?

Question

ygjwd12345 opened this issue 3 years ago

When I use a script like

CUDA_VISIBLE_DEVICES=0 python3 -u trainUDA_gta.py --config ./configs/configUDA_gta2city.json --name UDA-gta --resume /saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta/checkpoint-iter95000.pth | tee ./gta-corda.log

It would run again but the new checkpoint would be saved.

Qin Wang · Answer 1 · Sat May 08 2021 01:22:34 GMT+0800 (China Standard Time)

Hi.
The training skeleton is taken directly from DACS, and we didn't test the resume function. We trained the model uninterrupted for 250000 iterations.
For your specific use case, maybe this can help:

Change "--resume /saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta/checkpoint-iter95000.pth" to "--resume ../saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta/checkpoint-iter95000.pth", since the default save folder is one level up. The new checkpoints should then show up in ../saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta-resume/
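
To illustrate what "one level up" means here, the following is a minimal sketch of how a relative --resume path resolves against the directory the script is launched from. The launch directory /home/user/CorDA is a hypothetical placeholder, not something stated in this thread, and posixpath is used so the result does not depend on the host OS.

```python
import posixpath

# Hypothetical directory from which trainUDA_gta.py is launched (an assumption).
launch_dir = "/home/user/CorDA"

# Relative path suggested in the answer above.
resume_arg = (
    "../saved/DeepLabv2-depth-gtamono-cityscapestereo/"
    "05-03_02-13-UDA-gta/checkpoint-iter95000.pth"
)

# Join with the launch directory and collapse the ".." component.
resolved = posixpath.normpath(posixpath.join(launch_dir, resume_arg))
print(resolved)
```

With this placeholder, the "saved" folder ends up next to the launch directory rather than inside it or at the filesystem root.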

We didn't test this, and it may be easier to train from scratch for 250000 iterations to reproduce the results.
Please let me know if you have further questions.

stanley · Answer 2 · Sat May 08 2021 04:52:20 GMT+0800 (China Standard Time)

I found that the error is caused by:

    if args.resume:
        checkpoint_dir = os.path.join(*args.resume.split('/')[:-1]) + '_resume-' + start_writeable
    else:
        checkpoint_dir = os.path.join(config['utils']['checkpoint_dir'], start_writeable + '-' + args.name)

I removed:

    if args.resume:
        checkpoint_dir = os.path.join(*args.resume.split('/')[:-1]) + '_resume-' + start_writeable
    else:

and the problem was solved.
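
One possible reason the quoted expression misbehaves with an absolute --resume path is that splitting on '/' and re-joining drops the leading slash, so the directory becomes relative to the current working directory. The thread does not confirm this as the root cause. The sketch below only demonstrates the path behavior, using shortened hypothetical paths and posixpath so the output is the same on any OS; os.path.dirname is shown as one alternative that keeps the leading slash.

```python
import posixpath


def dir_via_split(resume_path):
    # Same split-and-join expression as in the quoted snippet, without the suffix.
    return posixpath.join(*resume_path.split('/')[:-1])


def dir_via_dirname(resume_path):
    # Alternative that preserves a leading slash on absolute paths.
    return posixpath.dirname(resume_path)


# Shortened, hypothetical checkpoint paths for illustration only.
absolute = "/saved/run/checkpoint-iter95000.pth"
relative = "../saved/run/checkpoint-iter95000.pth"

print(dir_via_split(absolute))    # saved/run   (leading slash lost)
print(dir_via_dirname(absolute))  # /saved/run
print(dir_via_split(relative))    # ../saved/run
```

For a relative path both approaches agree, which is consistent with the suggestion above to pass "../saved/..." instead of "/saved/...".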