CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

Runtime error when running examples (ilql_sentiments_t5.py)

youxiho1 opened this issue

πŸ› Describe the bug

Hi, I'm running the examples provided in the official GitHub repo.
I simply ran the command "python ilql_sentiments_t5.py".

However, I encountered a runtime error:

[RANK 0] Saving intermediate optimizer & model checkpoint into ckpts/checkpoint_1000
Traceback (most recent call last):
  File "/home/user/workspace/trlx/examples/ilql_sentiments_t5.py", line 140, in <module>
    main()
  File "/home/user/workspace/trlx/examples/ilql_sentiments_t5.py", line 130, in main
    trlx.train(
  File "/home/user/workspace/trlx/trlx/trlx.py", line 142, in train
    trainer.learn()
  File "/home/user/workspace/trlx/trlx/trainer/accelerate_base_trainer.py", line 598, in learn
    self.save(directory)
  File "/home/user/workspace/trlx/trlx/trainer/accelerate_base_trainer.py", line 312, in save
    self.accelerator.save_state(dst_dir, **kwargs)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/accelerator.py", line 2708, in save_state
    save_location = save_accelerator_state(
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/checkpointing.py", line 99, in save_accelerator_state
    save(state, output_model_file, save_on_each_node=save_on_each_node, safe_serialization=safe_serialization)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/utils/other.py", line 181, in save
    save_func(obj, f)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/torch.py", line 467, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.lm_head.weight', 'base_model.shared.weight', 'base_model.decoder.embed_tokens.weight', 'base_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

It seems that saving the model checkpoint is what fails.

I have no idea how to fix this. (Maybe I should modify the corresponding part of the trlx source code?)
Thanks for your help!

Which trlX version are you using?

trlx==0.7.0

Additional system and package information

python 3.9.18, transformers 4.36.2, ubuntu 18.04
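
For readers unfamiliar with the message in the traceback: it says that several state-dict names (the T5 embedding and LM head entries) refer to the same memory, and safetensors refuses to write the same data under more than one name. The sketch below only illustrates that idea and is not the code safetensors or trlx actually runs. The helper names `storage_key`, `find_shared_groups` and `drop_aliases` are made up for this example, plain Python lists stand in for tensors, and the layer-norm entry is added purely for contrast. With real `torch.Tensor` values the helper would call `data_ptr()`, which only catches tensors that start at the same address.

```python
from collections import defaultdict


def storage_key(value):
    """Identify the memory behind a value.

    torch.Tensor exposes data_ptr(); any other object falls back to id().
    """
    data_ptr = getattr(value, "data_ptr", None)
    return data_ptr() if callable(data_ptr) else id(value)


def find_shared_groups(state_dict):
    """Return sorted groups of names whose values share the same memory."""
    groups = defaultdict(list)
    for name, value in state_dict.items():
        groups[storage_key(value)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def drop_aliases(state_dict):
    """Keep only the first name (in insertion order) for each piece of memory."""
    seen = set()
    kept = {}
    for name, value in state_dict.items():
        key = storage_key(value)
        if key not in seen:
            seen.add(key)
            kept[name] = value
    return kept


# Stand-in state dict: one list object plays the role of the tied embedding
# matrix that the error message lists under four different names.
embedding = [0.1, 0.2, 0.3]
state_dict = {
    "base_model.shared.weight": embedding,
    "base_model.encoder.embed_tokens.weight": embedding,
    "base_model.decoder.embed_tokens.weight": embedding,
    "base_model.lm_head.weight": embedding,
    # Illustrative untied entry, not taken from the traceback.
    "base_model.encoder.final_layer_norm.weight": [1.0, 1.0, 1.0],
}

print(find_shared_groups(state_dict))
print(list(drop_aliases(state_dict)))
```

Dropping aliases before saving means the ties have to be re-established when the file is loaded. The `save_model` helper that the error message recommends (`safetensors.torch.save_model`, paired with `load_model`) is the library's own way of handling this for a whole `torch.nn.Module`. Whether it can be wired into the trlx checkpoint path without modifying trlx is not something this thread establishes.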

@youxiho1 Did you solve this problem?

I have the same issue when running ppo_sentiments.py.
I have an imperfect workaround: I just don't save the optimizer and model state during training.
config.train.save_best = False
config.train.save_optimizer = False
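
A minimal sketch of where those two lines would go, assuming `config` is the configuration object the example script builds before it calls `trlx.train`. To keep the snippet runnable without trlx installed, a `SimpleNamespace` stands in for that object, and `disable_state_saving` is a hypothetical helper name. The initial `True` values are placeholders, not a statement about trlx defaults.

```python
from types import SimpleNamespace

# Stand-in for the trlx config object; only the two fields named in the
# workaround above are modelled here.
config = SimpleNamespace(
    train=SimpleNamespace(save_best=True, save_optimizer=True)
)


def disable_state_saving(config):
    """Apply the workaround: switch off both checkpoint-related options."""
    config.train.save_best = False
    config.train.save_optimizer = False
    return config


# In the real script this would happen before the config is passed to trlx.train.
config = disable_state_saving(config)
print(config.train)
```

The cost of this workaround is that no optimizer state is written, so an interrupted run cannot be resumed exactly from a checkpoint.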