'intermediate_size' not set in tools/ckpts/convert_neox_to_hf.py for neox model architecture
jvendrow opened this issue · comments
Description
When converting neox models to HF format, the 'intermediate_size' argument in the GPTNeoXConfig is not explicitly set, so it defaults to 24576 as per:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/configuration_gpt_neox.py
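The arithmetic behind the resulting shape mismatch can be sketched as follows (the pythia-70M hidden size of 512 is taken from the traceback; the 24576 default matches the linked `GPTNeoXConfig` source):

```python
# Shape-mismatch arithmetic for pythia-70M (values from the traceback below).
hidden_size = 512            # pythia-70M hidden size
expected = 4 * hidden_size   # NeoX convention: intermediate = 4 * hidden
hf_default = 24576           # GPTNeoXConfig's default intermediate_size
print(expected, hf_default)  # 2048 24576 -- hence the [2048, 512] vs [24576, 512] error
```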
To Reproduce
Steps to reproduce the behavior:
- Train pythia-70M model
- Run the conversion script:
$ python ./tools/ckpts/convert_neox_to_hf.py --input_dir checkpoints/pythia-70M/global_step143000/ --config_file pythia-70m.yml --output_dir hf_model/pythia-70M --precision fp16 --architecture neox
[2024-05-03 11:17:41,262] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Detected 'pipe-parallel-size' of 1, assuming model is saved as PipelineModule...
> building HFTokenizer tokenizer ...
> padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)
0%| | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
File "./tools/ckpts/convert_neox_to_hf.py", line 732, in <module>
main()
File "./tools/ckpts/convert_neox_to_hf.py", line 696, in main
hf_model = convert(
File "./tools/ckpts/convert_neox_to_hf.py", line 555, in convert
hf_layer.load_state_dict(state_dict)
File "/mnt/xfs/home/jvendrow/conda_envs/pythia/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPTNeoXLayer:
size mismatch for mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([24576, 512]).
size mismatch for mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([24576]).
size mismatch for mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([512, 24576]).
Proposed solution
It seems the intermediate size for the neox architecture is, in general, 4 * hidden size. The suggested fix is to add the following for neox models:
args.update(
    {
        "intermediate_size": get_key(
            neox_config,
            "intermediate-size",
            4 * get_key(neox_config, "hidden-size"),
        ),
    }
)
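To illustrate the intended fallback behavior, here is a self-contained sketch; this `get_key` is a simplified stand-in for the helper in `convert_neox_to_hf.py` (the real one handles more config plumbing), and the config dict is a minimal pythia-70M-style example:

```python
# Simplified stand-in for the get_key helper in convert_neox_to_hf.py:
# look up a config key (accepting '-' or '_' spellings), else return a default.
def get_key(config, key, default=None):
    for k in (key, key.replace("-", "_")):
        if k in config:
            return config[k]
    return default

# pythia-70M-style config: no intermediate-size set, hidden size 512.
neox_config = {"hidden-size": 512}

intermediate = get_key(
    neox_config,
    "intermediate-size",
    4 * get_key(neox_config, "hidden-size"),
)
print(intermediate)  # 2048, matching the checkpoint's mlp.dense_h_to_4h shape
```

With this fallback in place, the HF config would get `intermediate_size=2048` instead of the 24576 default, so `load_state_dict` no longer hits the size mismatch.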
Happy to make a PR.
Ah nice catch. Yes I'd welcome this PR.
Ok great, created PR #1209.
resolved in #1209