[BUG] Pretrained GPT2 model has an incorrect size compared with the config file.
alphaGem opened this issue · comments
Describe the bug
File "example.py", line 303, in <module>
main()
File "example.py", line 93, in main
gpt = GPT2.from_pretrained("gpt2-base", config=gpt_config)
File "/home/chenyanxu/miniconda3/envs/BMTrain/lib/python3.8/site-packages/model_center/model/basemodel.py", line 33, in from_pretrained
bmt.load(model, os.path.join(path, 'pytorch_model.pt'), strict=False)
File "/home/chenyanxu/miniconda3/envs/BMTrain/lib/python3.8/site-packages/bmtrain-0.1.8-py3.8-linux-x86_64.egg/bmtrain/store.py", line 202, in load
ret = model.load_state_dict(
File "/home/chenyanxu/miniconda3/envs/BMTrain/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPT2:
size mismatch for input_embedding.weight: copying a param with shape torch.Size([50258, 768]) from checkpoint, the shape in current model is torch.Size([38597376]).
Minimal steps to reproduce
gpt_config = GPT2Config.from_pretrained("gpt2-base")
gpt = GPT2.from_pretrained("gpt2-base", config=gpt_config)
Expected behavior
Successfully loads the model.
Environment:
model-center 0.1.3, torch 1.11.0, cuda 10.2
Additional information
If I change the vocab size in my locally cached config to 50258, the checkpoint loads, but the model doesn't work correctly: the logits at the last vocab index (outputs[:,:,-1]) are all zero, which makes them significantly larger than all the other logit values. Using the slice outputs[:,:,:-1] as the real output seems to work around the problem.
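The workaround described above can be sketched as follows. This is a minimal illustration with random data, not the actual model output: the shapes and the all-zero extra vocab entry mirror what the report describes (50257 real entries plus one spurious zero row).

```python
import numpy as np

# Hypothetical logits from a model whose embedding has one extra (all-zero)
# vocab entry, giving shape (batch, seq_len, vocab_size + 1).
vocab_size = 50257
logits = np.random.randn(2, 4, vocab_size + 1).astype(np.float32)
logits[:, :, -1] = 0.0  # the spurious last entry is always zero

# Workaround: slice off the extra entry before softmax/argmax.
real_logits = logits[:, :, :-1]
assert real_logits.shape == (2, 4, vocab_size)
```

Slicing is cheap (it returns a view in both numpy and PyTorch), so applying it on every forward pass is a reasonable stopgap until the checkpoint itself is fixed.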
This is caused by the new release; you can pull the latest code and it will work as you expect. Also make sure you delete the .cache/model_center/gpt2-base/ directory, because we updated the config JSON on the cloud too.
I have tried the following actions respectively:
1. pip uninstall model-center and then pip install model-center
2. pip uninstall model-center, then clone the latest code and run python3 setup.py install in the code folder
Before each attempt to run my code, I delete the ~/.cache/model_center folder.
However, none of the above actions solves the problem.
Are you sure that the pre-trained GPT-2 base model on the cloud (the download path in utils/net_utils.py is https://openbmb.oss-cn-hongkong.aliyuncs.com/model_center/{path}, as far as I can see) has the correct vocab size of 50257 instead of 50258?
Sorry, we didn't update the checkpoint on the cloud before, so it was not compatible with the config JSON. The vocab size should be 50257, and the old checkpoint had an extra dimension with all zeros. The issue is now fixed: you can clean the .cache checkpoint directory and re-download the correct checkpoint by using the from_pretrained method.
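For anyone hitting a similar mismatch, a quick sanity check is to compare the embedding's first dimension in the downloaded state dict against the config's vocab size before loading. This is a sketch with simulated state dicts (numpy arrays standing in for the real tensors); the key name `input_embedding.weight` comes from the traceback above, but `check_vocab_size` is a hypothetical helper, not part of model-center.

```python
import numpy as np

def check_vocab_size(state_dict, config_vocab_size, key="input_embedding.weight"):
    """Return True if the checkpoint's embedding row count matches the config."""
    return state_dict[key].shape[0] == config_vocab_size

# Simulated old (broken) checkpoint: one extra all-zero vocab row (50258),
# versus the corrected checkpoint with the expected 50257 rows.
old_ckpt = {"input_embedding.weight": np.zeros((50258, 768), dtype=np.float32)}
new_ckpt = {"input_embedding.weight": np.zeros((50257, 768), dtype=np.float32)}

assert not check_vocab_size(old_ckpt, 50257)  # old checkpoint fails the check
assert check_vocab_size(new_ckpt, 50257)      # re-downloaded checkpoint passes
```

With a real PyTorch checkpoint you would obtain the state dict via `torch.load(...)` and compare against `config.vocab_size` the same way.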
This issue has been fixed