'intermediate_size' not set in tools/ckpts/convert_neox_to_hf.py for neox model architecture
jvendrow opened this issue · comments
Description
When converting neox models to HF format, the 'intermediate_size' argument in the GPTNeoXConfig is not explicitly set, so it defaults to 24576 as per:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/configuration_gpt_neox.py
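The arithmetic behind the resulting shape mismatch can be sketched as follows (the pythia-70M hidden size of 512 is taken from the traceback; the 24576 default matches the linked `GPTNeoXConfig` source):

```python
# Shape-mismatch arithmetic for pythia-70M (values from the traceback below).
hidden_size = 512            # pythia-70M hidden size
expected = 4 * hidden_size   # NeoX convention: intermediate = 4 * hidden
hf_default = 24576           # GPTNeoXConfig's default intermediate_size
print(expected, hf_default)  # 2048 24576 -- hence the [2048, 512] vs [24576, 512] error
```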
To Reproduce
Steps to reproduce the behavior:
- Train pythia-70M model
- Run the conversion script:
$ python ./tools/ckpts/convert_neox_to_hf.py --input_dir checkpoints/pythia-70M/global_step143000/ --config_file pythia-70m.yml --output_dir hf_model/pythia-70M --precision fp16 --architecture neox
[2024-05-03 11:17:41,262] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Detected 'pipe-parallel-size' of 1, assuming model is saved as PipelineModule...
> building HFTokenizer tokenizer ...
> padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)
0%| | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
File "./tools/ckpts/convert_neox_to_hf.py", line 732, in <module>
main()
File "./tools/ckpts/convert_neox_to_hf.py", line 696, in main
hf_model = convert(
File "./tools/ckpts/convert_neox_to_hf.py", line 555, in convert
hf_layer.load_state_dict(state_dict)
File "/mnt/xfs/home/jvendrow/conda_envs/pythia/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPTNeoXLayer:
size mismatch for mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([24576, 512]).
size mismatch for mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([24576]).
size mismatch for mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([512, 24576]).
Proposed solution
It seems the intermediate size for the neox architecture is, in general, 4 * hidden size. The suggested fix is to add the following for neox models:
args.update(
    {
        "intermediate_size": get_key(
            neox_config,
            "intermediate-size",
            4 * get_key(neox_config, "hidden-size"),
        ),
    }
)
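To illustrate the intended fallback behavior, here is a self-contained sketch; this `get_key` is a simplified stand-in for the helper in `convert_neox_to_hf.py` (the real one handles more config plumbing), and the config dict is a minimal pythia-70M-style example:

```python
# Simplified stand-in for the get_key helper in convert_neox_to_hf.py:
# look up a config key (accepting '-' or '_' spellings), else return a default.
def get_key(config, key, default=None):
    for k in (key, key.replace("-", "_")):
        if k in config:
            return config[k]
    return default

# pythia-70M-style config: no intermediate-size set, hidden size 512.
neox_config = {"hidden-size": 512}

intermediate = get_key(
    neox_config,
    "intermediate-size",
    4 * get_key(neox_config, "hidden-size"),
)
print(intermediate)  # 2048, matching the checkpoint's mlp.dense_h_to_4h shape
```

With this fallback in place, the HF config would get `intermediate_size=2048` instead of the 24576 default, so `load_state_dict` no longer hits the size mismatch.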
Happy to make a PR.
Ah nice catch. Yes I'd welcome this PR.
Ok great, created PR #1209.
resolved in #1209