CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

Runtime error when running examples (

youxiho1 opened this issue · comments

πŸ› Describe the bug

Hi, I'm running the examples provided in the official GitHub repo.
I simply ran the command "python"

However, I encountered a runtime error:

[RANK 0] Saving intermediate optimizer & model checkpoint into ckpts/checkpoint_1000
Traceback (most recent call last):
File "/home/user/workspace/trlx/examples/", line 140, in
File "/home/user/workspace/trlx/examples/", line 130, in main
File "/home/user/workspace/trlx/trlx/", line 142, in train
File "/home/user/workspace/trlx/trlx/trainer/", line 598, in learn
File "/home/user/workspace/trlx/trlx/trainer/", line 312, in save
self.accelerator.save_state(dst_dir, **kwargs)
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/", line 2708, in save_state
save_location = save_accelerator_state(
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/", line 99, in save_accelerator_state
save(state, output_model_file, save_on_each_node=save_on_each_node, safe_serialization=safe_serialization)
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/utils/", line 181, in save
save_func(obj, f)
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/", line 281, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/", line 467, in _flatten
raise RuntimeError(
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.lm_head.weight', 'base_model.shared.weight', 'base_model.decoder.embed_tokens.weight', 'base_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at

It looks like the failure happens while saving the model checkpoint.

I have no idea how to fix this. (Maybe I should modify the corresponding part of the trlx source code?)
Thanks for your help!

Which trlX version are you using?


Additional system and package information

Python 3.9.18, transformers 4.36.2, Ubuntu 18.04

@youxiho1 Did you solve this problem?

I have the same issue as well, when I am running
My imperfect workaround is to skip saving the optimizer and model during training:
config.train.save_best = False
config.train.save_optimizer = False
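
For anyone wondering what the error message is actually complaining about: the four names listed in the traceback appear to be tied weights, i.e. several parameter names pointing at one underlying buffer. Below is a toy sketch in plain Python of how such aliasing can be detected and collapsed to one entry per buffer. It is only an illustration under my own assumptions, not the safetensors or trlx implementation: `find_shared_groups`, `dedupe`, the `bytearray` stand-ins for tensors, and the `final_layer_norm` entry are all hypothetical.

```python
from collections import defaultdict


def find_shared_groups(state):
    """Return sets of names whose values are the very same object (aliased storage)."""
    groups = defaultdict(set)
    for name, buf in state.items():
        groups[id(buf)].add(name)
    return [names for names in groups.values() if len(names) > 1]


def dedupe(state):
    """Keep one name per shared buffer; map every dropped name to the kept one."""
    kept, aliases, first_name = {}, {}, {}
    for name in sorted(state):
        key = id(state[name])
        if key in first_name:
            aliases[name] = first_name[key]
        else:
            first_name[key] = name
            kept[name] = state[name]
    return kept, aliases


# Hypothetical stand-in for a state dict with tied embeddings.
# The four tied names are the ones reported in the traceback above.
embedding = bytearray(8)
state = {
    "base_model.shared.weight": embedding,
    "base_model.encoder.embed_tokens.weight": embedding,
    "base_model.decoder.embed_tokens.weight": embedding,
    "base_model.lm_head.weight": embedding,
    "base_model.final_layer_norm.weight": bytearray(4),  # made-up untied entry
}

shared = find_shared_groups(state)
kept, aliases = dedupe(state)

print(shared)        # one group containing the four tied names
print(sorted(kept))  # one representative of the tied group plus the untied entry
print(aliases)       # dropped name -> kept name
```

Writing each alias as if it were an independent tensor would duplicate the data on disk, which is what the message warns about. The error text itself points to `save_model` as a possible way to save such a model correctly.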