XiangLi1999 / PrefixTuning

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Model save not working

jpilaul opened this issue

There are a few checkpoint_callbacks being created in lightning_base.py, and I think that the callback on line https://github.com/XiangLi1999/PrefixTuning/blob/cleaned/seq2seq/lightning_base.py#L749 does not allow us to save the model. I am rerunning the model right now without that line to verify. However, since training takes a long time, I was hoping you could help me fix model saving.
Thanks
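
For context, in pytorch-lightning 0.9.x model saving is normally driven by a single ModelCheckpoint callback handed to the Trainer. The sketch below is a minimal, hypothetical illustration of that wiring; the filepath pattern, monitored metric, and Trainer flags are assumptions for the example, not the actual configuration in lightning_base.py:

```python
# Minimal sketch (pytorch-lightning 0.9.x style); values are illustrative only.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# 0.9.x uses `filepath` (not `dirpath`) to control where checkpoints are written.
checkpoint_callback = ModelCheckpoint(
    filepath="checkpoints/{epoch}-{val_loss:.2f}",
    monitor="val_loss",
    save_top_k=1,
    mode="min",
)

trainer = pl.Trainer(
    gpus=1,
    max_epochs=3,
    checkpoint_callback=checkpoint_callback,  # the callback passed here does the saving
)
# trainer.fit(model)  # `model` would be the LightningModule built in finetune.py
```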

I wonder what the issue with model saving is. Could you be more specific? Is it not saving any models at all? If so, I think you could check the version of pytorch-lightning you installed. I think pytorch-lightning==0.8.5 should work!

Edit: it should be pytorch-lightning==0.9.0, NOT 0.8.5.

Nope, that doesn't work either. I tried pytorch-lightning==0.8.5 and reverted the changes I had made to get the code running. I am getting the following error with your current code version:

Traceback (most recent call last):
  File "/home/ubuntu/Projects/PrefixTuning/seq2seq/finetune.py", line 879, in <module>
    main(args)
  File "/home/ubuntu/Projects/PrefixTuning/seq2seq/finetune.py", line 787, in main
    logger=logger,
  File "/home/ubuntu/Projects/PrefixTuning/seq2seq/lightning_base.py", line 792, in generic_train
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 992, in fit
    results = self.spawn_ddp_children(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 462, in spawn_ddp_children
    results = self.ddp_train(local_rank, q=None, model=model, is_master=True)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 560, in ddp_train
    results = self.run_pretrain_routine(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
    self.train()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
    self.run_training_epoch()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 470, in run_training_epoch
    self.run_evaluation(test_mode=False)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 430, in run_evaluation
    self.on_validation_end()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_hook.py", line 112, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 12, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 318, in on_validation_end
    self._save_model(filepath)
TypeError: _save_model() missing 2 required positional arguments: 'trainer' and 'pl_module'
Exception ignored in: <bound method tqdm.__del__ of <tqdm.asyncio.tqdm_asyncio object at 0x7f3cc022f4a8>>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1138, in __del__
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1285, in close
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1478, in display
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1141, in __str__
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1436, in format_dict
TypeError: 'NoneType' object is not iterable
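
For readers hitting the same TypeError: this is the symptom of the caller of `_save_model` and the definition of `_save_model` coming from mismatched pytorch-lightning versions, so their signatures disagree. The toy example below only reproduces the mechanism; it is not pytorch-lightning's actual code:

```python
# Toy reproduction of the mismatch: the old-style caller passes only `filepath`,
# while the method that ends up being resolved also expects `trainer` and `pl_module`.
class OldCheckpointBase:
    def on_validation_end(self):
        self._save_model("checkpoints/epoch=0.ckpt")  # old call style: filepath only


class NewStyleCheckpoint(OldCheckpointBase):
    def _save_model(self, filepath, trainer, pl_module):  # newer, wider signature
        print(f"saving to {filepath}")


NewStyleCheckpoint().on_validation_end()
# TypeError: _save_model() missing 2 required positional arguments: 'trainer' and 'pl_module'
```

Pinning the pytorch-lightning version the repo was written against keeps both sides of that call in sync, which is why the version suggestions below matter.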

I still think this is a package version issue. Here is one solution: could you configure your virtual env using this Docker image? xlisali/xlisali:prefix3

I have the same problem

Could you try `pip install pytorch-lightning==0.9.0` and let me know if this solves the problem?
(I will edit my previous post if it solves the problem for both of you!)

(Side note: I looked into the problem and realized that I actually used version 0.9.0. I had previously run `conda env export`, but it doesn't report the pytorch-lightning version I actually used.)
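
Since `conda env export` can report a stale or wrong version, a quick way to confirm what is actually importable in the environment is a small check like this (the assertion message is just a reminder of the pin suggested above):

```python
# Verify the pytorch-lightning version that the running Python actually imports.
import pytorch_lightning

print(pytorch_lightning.__version__)
assert pytorch_lightning.__version__ == "0.9.0", \
    "unexpected version; reinstall with: pip install pytorch-lightning==0.9.0"
```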

It works! Thanks!