cannot load reward model from SFT model because of missing keys
DZ9 opened this issue
I converted a LLaMA model to NeMo; the model directory looks like this:
When I tried to load it to train a reward model, I got a missing-keys error. I load it with the default config and set load_base_model_only=True; the full loading code is:

ptl_model = load_from_nemo(
    reward_model_cls,
    cfg.model,
    trainer,
    strict=True,
    load_base_model_only=True,
    restore_path=cfg.pretrained_checkpoint.restore_from_path,
)
I then got the error below. Any advice on how to load a pretrained non-reward model as the starting point for reward-model training in NeMo?
Error executing job with overrides: []
Traceback (most recent call last):
File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 206, in load_sharded_object
loaded_obj = torch.load(load_path)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
with _open_file_like(f, 'rb') as opened_file:
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/checkpoint/binary/train_package/train_reward_model.py", line 68, in main
ptl_model = load_from_nemo(
File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 96, in load_from_nemo
model = cls.restore_from(
File "/checkpoint/binary/train_package/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
return super().restore_from(
File "/checkpoint/binary/train_package/nemo/core/classes/modelPT.py", line 450, in restore_from
instance = cls._save_restore_connector.restore_from(
File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 52, in restore_from
output = super().restore_from(*args, **kwargs)
File "/checkpoint/binary/train_package/nemo/collections/nlp/parts/nlp_overrides.py", line 1123, in restore_from
checkpoint = dist_checkpointing.load(
File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 120, in load
sharded_objects, sharded_state_dict = load_sharded_objects(
File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 221, in load_sharded_objects
return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
x[k] = dict_list_map_inplace(f, v)
File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
x[k] = dict_list_map_inplace(f, v)
File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
return f(x)
File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 218, in load_sharded_object
raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt not found
Can anybody please help with this?
Did you try with strict=False?
Do you know if this is an mcore-based model? And was it SFTed with Aligner?
You can tell it's an mcore-based model by looking at the model_weights directory: it should contain common.pt and metadata.json.
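The check described above can be sketched as a small helper. This is an illustrative snippet, not NeMo-Aligner API: the function name and the assumption that you point it at the checkpoint's model_weights directory are mine.

```python
from pathlib import Path

def looks_like_mcore(model_weights_dir: str) -> bool:
    """Heuristic: an mcore-based checkpoint's model_weights directory
    contains both common.pt and metadata.json."""
    d = Path(model_weights_dir)
    return (d / "common.pt").is_file() and (d / "metadata.json").is_file()
```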
> Did you try with strict=False?
Yes, it didn't work either.
I manually deleted all rm_head-related keys during restore and it now works fine. But I think this is a bug introduced by a change in Megatron.
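A minimal sketch of that workaround: before the sharded state dict is handed to dist_checkpointing.load, drop every entry whose key mentions "rm_head", so the loader never looks for the missing model.rm_head._extra_state shards. The function name and the recursive-dict approach are assumptions for illustration, not the actual NeMo-Aligner code.

```python
def drop_rm_head_keys(sharded_state_dict):
    """Recursively remove all keys containing 'rm_head' from a nested
    (sharded) state dict so restore skips the reward-head shards."""
    if not isinstance(sharded_state_dict, dict):
        return sharded_state_dict
    return {
        k: drop_rm_head_keys(v)
        for k, v in sharded_state_dict.items()
        if "rm_head" not in k
    }
```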
> I manually deleted all rm_head-related keys during restore and it now works fine. But I think this is a bug introduced by a change in Megatron.
Ah okay, that's good to know! Can you elaborate on the change in Megatron? Was your model SFTed in a previous container?
To elaborate, it'd be helpful if you could share the exact steps you used when you said "I converted a llama model to nemo", so that we can reproduce the issue. Which container did you use and which commands did you run?