NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment

cannot load reward model from SFT model because of missing keys

DZ9 opened this issue · comments

I converted a LLaMA model to NeMo; the model directory looks like this:
[screenshot: model directory contents]
When I tried to load it to train a reward model, I got a missing-keys error. I load it with the default config and set load_base_model_only=True; the full loading code is below:

    ptl_model = load_from_nemo(
        reward_model_cls,
        cfg.model,
        trainer,
        strict=True,
        load_base_model_only=True,
        restore_path=cfg.pretrained_checkpoint.restore_from_path,
    )

I then got the error below. Any advice on how to load a pretrained non-reward model to train as a reward model in NeMo?

Error executing job with overrides: []
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 206, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/checkpoint/binary/train_package/train_reward_model.py", line 68, in main
    ptl_model = load_from_nemo(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 96, in load_from_nemo
    model = cls.restore_from(
  File "/checkpoint/binary/train_package/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/checkpoint/binary/train_package/nemo/core/classes/modelPT.py", line 450, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 52, in restore_from
    output = super().restore_from(*args, **kwargs)
  File "/checkpoint/binary/train_package/nemo/collections/nlp/parts/nlp_overrides.py", line 1123, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 120, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 221, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 218, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt not found

Can anybody please help with this?

Did you try with strict=False?

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it's an mcore-based model by looking at the model_weights directory; it should contain common.pt and metadata.json.
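
A minimal sketch of that check, assuming the .nemo checkpoint has already been unpacked so that model_weights is a plain directory (the path below is only an example, taken from the traceback above):

    from pathlib import Path

    # Extracted .nemo checkpoint directory (example path from the traceback above).
    model_weights = Path("/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights")

    # An mcore-based (Megatron Core distributed) checkpoint keeps common objects in
    # common.pt and the sharding layout in metadata.json at the top of model_weights.
    is_mcore = (model_weights / "common.pt").is_file() and (model_weights / "metadata.json").is_file()
    print(f"mcore-based checkpoint: {is_mcore}")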

Did you try with strict=False?

Yes, it didn't work either.

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it's an mcore-based model by looking at the model_weights directory; it should contain common.pt and metadata.json.

Yes, it is an mcore-based model.
[screenshot: model_weights directory listing]

I manually deleted all rm_head-related keys during restore and it now works fine. But I think it is a bug introduced by a change in Megatron.
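
A minimal sketch of that kind of filtering, assuming the rm_head entries can be dropped from the nested sharded state dict before dist_checkpointing.load is called; the helper below is illustrative rather than the exact change:

    def drop_rm_head_entries(state_dict):
        """Recursively remove reward-model head entries so the loader does not
        look for model.rm_head.* shards that a base (non-RM) checkpoint lacks."""
        if isinstance(state_dict, dict):
            return {
                key: drop_rm_head_entries(value)
                for key, value in state_dict.items()
                if "rm_head" not in key
            }
        return state_dict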

I manually deleted all rm_head-related keys during restore and it now works fine. But I think it is a bug introduced by a change in Megatron.

Ah okay, that's good to know. Can you elaborate on the change in Megatron? Was your model SFTed in a previous container?

To elaborate: it would be helpful if you could share the exact steps you used when you said "I converted a llama model to nemo", so that we can reproduce the issue. Which container did you use, and which commands did you run?