huggingface / jat

General multi-task deep RL Agent


Training: Error while saving checkpoint during Training (via save steps)

drdsgvo opened this issue

With transformers 4.41.0, Ubuntu 22.0

Calling the training script scripts/train_jat_tokenized.py as given (with --per_device_train_batch_size 1 and one GPU), the following error comes up when the trainer tries to save the first checkpoint:

From trainer.train(...) at the end of the above script:
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2291, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2732, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2811, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3355, in save_model
self._save(output_dir)
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3432, in _save
self.model.save_pretrained(
File "/home/km/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2574, in save_pretrained
raise RuntimeError(
RuntimeError: The weights trying to be saved contained shared tensors [{'transformer.wte.weight', 'single_discrete_encoder.weight', 'multi_discrete_encoder.0.weight'}] that are mismatching the transformers base configuration. Try saving using safe_serialization=False or remove this tensor sharing.

The error comes up both with accelerate launch and without it (just using python <script>).

I could not figure out how `safe_serialization=False` would help.
Any ideas? Thank you.
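
For reference, the safe_serialization flag named in the error is a keyword argument of model.save_pretrained(); when the checkpoint is written by the Trainer, as in the traceback above, that argument is filled in from the save_safetensors training argument. A minimal sketch of both ways to turn it off (output paths are placeholders):

```python
from transformers import TrainingArguments

# Option 1: let the Trainer pass safe_serialization=False for you.
# The Trainer calls model.save_pretrained(..., safe_serialization=args.save_safetensors),
# so disabling save_safetensors makes checkpoints fall back to pytorch_model.bin,
# which tolerates shared tensors.
args = TrainingArguments(
    output_dir="checkpoints",  # placeholder path
    save_safetensors=False,
)

# Option 2: when saving manually outside the Trainer, pass the flag directly:
# model.save_pretrained("checkpoints/final", safe_serialization=False)
```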

@qgallouedec Hey Quentin, I've just run into this as well. It seems to happen because self.single_discrete_encoder (which is self.transformer.wte) is shared with other layers such as self.multi_discrete_encoder. Then, when the transformers Trainer saves the model, it can't seem to handle these shared tensors. Any help would be greatly appreciated! Should I override the save function of the transformers Trainer in a Trainer subclass?
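
If overriding really is needed, a rough sketch of what such a subclass could look like (untested, and probably equivalent to just setting save_safetensors=False in the training arguments, since that is the value Trainer._save forwards to save_pretrained):

```python
from typing import Optional

from transformers import Trainer


class NoSafetensorsTrainer(Trainer):
    """Hypothetical subclass that disables safetensors for checkpoint saves."""

    def _save(self, output_dir: Optional[str] = None, state_dict=None):
        # Trainer._save forwards self.args.save_safetensors as the
        # safe_serialization argument of model.save_pretrained(), so flipping
        # the flag around the call is enough to avoid the shared-tensor check.
        original = self.args.save_safetensors
        self.args.save_safetensors = False
        try:
            super()._save(output_dir=output_dir, state_dict=state_dict)
        finally:
            self.args.save_safetensors = original
```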

For now, I've simply added the following arg to the training command: --save_safetensors 0
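
The longer-term fix hinted at by the error message ("remove this tensor sharing") would probably be to declare the shared weights as tied on the model class, so that save_pretrained() drops the duplicates instead of raising. A sketch of that idea, assuming the import path jat.modeling_jat.JatModel and that transformers honours _tied_weights_keys here the same way it does for tied embeddings; untested:

```python
from jat.modeling_jat import JatModel  # import path assumed


class TiedJatModel(JatModel):
    # Mark the encoder weights that alias transformer.wte.weight as known
    # duplicates, so safetensors serialization keeps a single copy of the
    # shared tensor instead of refusing to save the checkpoint.
    _tied_weights_keys = [
        "single_discrete_encoder.weight",
        "multi_discrete_encoder.0.weight",
    ]
```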