xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

Why can't I save the model?

txye opened this issue

```
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```

Could you please help me to resolve this?
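
For context, the exception is raised by safetensors itself: `safetensors.torch.save_file` only sees a bare state dict and refuses tensors that alias the same storage, which is exactly what T5-style backbones do with their tied embeddings (`shared.weight` and `encoder.embed_tokens.weight`). The `save_model` helper named in the message receives the module instead, so it can drop the duplicate alias before writing. A minimal sketch of the difference, using a plain `t5-base` as a stand-in for the INSTRUCTOR backbone:

```python
import safetensors.torch
from transformers import AutoModel

# T5 ties shared.weight and encoder.embed_tokens.weight to the same storage.
model = AutoModel.from_pretrained("t5-base")

# save_file gets a bare dict, cannot resolve the aliasing, and raises the
# RuntimeError quoted above:
# safetensors.torch.save_file(model.state_dict(), "model.safetensors")

# save_model receives the module, detects the shared storage, and writes a
# single copy of the tied weight to disk.
safetensors.torch.save_model(model, "model.safetensors")
```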

Hi, thanks a lot for your interest in the INSTRUCTOR model!

Could you provide a short script for me to reproduce the error?
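
In case it helps, a script as small as this may already reproduce it, assuming the failure is the safetensors shared-tensor check tripping over the tied T5 embeddings inside the checkpoint (the tensor names in the error, `0.auto_model.shared.weight` and `0.auto_model.encoder.embed_tokens.weight`, point in that direction):

```python
import safetensors.torch
from InstructorEmbedding import INSTRUCTOR

# The underlying T5 encoder ties shared.weight and encoder.embed_tokens.weight,
# so the state dict contains two keys backed by the same storage.
model = INSTRUCTOR("hkunlp/instructor-large")

# Saving the raw state dict through safetensors trips the shared-memory check
# and raises the RuntimeError quoted above.
safetensors.torch.save_file(model.state_dict(), "instructor.safetensors")
```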

I am getting the same error and don't know how to solve it. @hongjin-su, I hope you can help me with this:

```
Traceback (most recent call last):
  File "/ClusterLLM/perspective/2_finetune/finetune.py", line 617, in <module>
    main()
  File "/ClusterLLM/perspective/2_finetune/finetune.py", line 598, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2499, in _save_checkpoint
    self.save_model(staging_output_dir, _internal_call=True)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 3016, in save_model
    self._save(output_dir)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 3083, in _save
    safetensors.torch.save_file(
  File ".conda/envs/696ds/lib/python3.9/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File ".conda/envs/696ds/lib/python3.9/site-packages/safetensors/torch.py", line 477, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```
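
One workaround that usually gets past this check is to tell the Trainer not to serialize checkpoints with safetensors at all, so it falls back to torch.save, which copes with tied weights. This is only a sketch, not the actual ClusterLLM config; the `save_safetensors` flag exists in recent transformers releases:

```python
from transformers import TrainingArguments

# Sketch of a workaround: disable safetensors serialization so Trainer._save
# falls back to torch.save, which handles tied/shared tensors without error.
training_args = TrainingArguments(
    output_dir="output",
    save_safetensors=False,
)
```

If safetensors output is required, saving the module through `safetensors.torch.save_model` (as the error message suggests) or untying the embeddings before saving appear to be the alternatives.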