NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment

cannot load reward model from SFT model because of missing keys

DZ9 opened this issue · comments

I converted a LLaMA model to NeMo; the model directory looks like this:
[screenshot: model directory contents]
When I tried to load it to train a reward model, I got a missing-keys error. I load it with the default config and set load_base_model_only=True; the full loading code is below:

    ptl_model = load_from_nemo(
        reward_model_cls,
        cfg.model,
        trainer,
        strict=True,
        load_base_model_only=True,
        restore_path=cfg.pretrained_checkpoint.restore_from_path,
    )

I then got the error below. Any advice on how to load a pretrained non-reward model to train as a reward model in NeMo?

Error executing job with overrides: []
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 206, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/checkpoint/binary/train_package/train_reward_model.py", line 68, in main
    ptl_model = load_from_nemo(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 96, in load_from_nemo
    model = cls.restore_from(
  File "/checkpoint/binary/train_package/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/checkpoint/binary/train_package/nemo/core/classes/modelPT.py", line 450, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 52, in restore_from
    output = super().restore_from(*args, **kwargs)
  File "/checkpoint/binary/train_package/nemo/collections/nlp/parts/nlp_overrides.py", line 1123, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 120, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 221, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 218, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt not found

Can anybody please help with this?

Did you try with strict=False?

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it's an mcore-based model by looking at the model_weights directory; it should contain common.pt and metadata.json.
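
A minimal sketch of that check, assuming the .nemo checkpoint has already been unpacked so that model_weights is a plain directory (the path below is only an example, taken from the traceback above):

    from pathlib import Path

    # Extracted .nemo checkpoint directory (example path from the traceback above).
    model_weights = Path("/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights")

    # An mcore-based (Megatron Core distributed) checkpoint keeps common objects in
    # common.pt and the sharding layout in metadata.json at the top of model_weights.
    is_mcore = (model_weights / "common.pt").is_file() and (model_weights / "metadata.json").is_file()
    print(f"mcore-based checkpoint: {is_mcore}")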

Did you try with strict=False?

Yes, it didn't work either.

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it's an mcore-based model by looking at the model_weights directory; it should contain common.pt and metadata.json.

Yes, it is an mcore-based model.
[screenshot: model_weights directory listing]

I manually deleted all rm_head-related keys during restore and it now works fine. But I think it is a bug introduced by a change in Megatron.
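
A minimal sketch of that kind of filtering, assuming the rm_head entries can be dropped from the nested sharded state dict before dist_checkpointing.load is called; the helper below is illustrative rather than the exact change:

    def drop_rm_head_entries(state_dict):
        """Recursively remove reward-model head entries so the loader does not
        look for model.rm_head.* shards that a base (non-RM) checkpoint lacks."""
        if isinstance(state_dict, dict):
            return {
                key: drop_rm_head_entries(value)
                for key, value in state_dict.items()
                if "rm_head" not in key
            }
        return state_dict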

I manually deleted all rm_head-related keys during restore and it now works fine. But I think it is a bug introduced by a change in Megatron.

Ah okay, that's good to know. Can you elaborate on the change in Megatron? Was your model SFTed in a previous container?

To elaborate: it would be helpful if you could share the exact steps you used when you said "I converted a llama model to nemo", so that we can reproduce the issue. Which container did you use, and which commands did you run?