[Bug] Error when converting a model trained on NPU to HF format

Question

[Bug] Error when converting a model trained on NPU to HF format

forest-sys opened this issue 7 months ago · comments

Describe the bug

python transformers/convert2hf_internlm2.py --src /dev/shm/llm_ckpts/20 --tgt /dev/shm/llm_ckpts/hf/ --tokenizer ./tools/tokenizer_internlm2.model --max_pos 4096 --rotary_type origin
The error is as follows:

Environment

Python 3.9.18 (main, Sep 11 2023, 13:51:18)
[GCC 11.2.0] :: Anaconda, Inc. on linux

Additional information

No response

forest-sys · Answer 1 · Fri May 24 2024 14:52:00 GMT+0800 (China Standard Time)

The original model is 15 GB and the model produced by training is 88 GB. Before the conversion there was about 150 GB of free space.

forest-sys · Answer 2 · Fri May 24 2024 15:17:04 GMT+0800 (China Standard Time)

Parallel configuration used for training:
parallel = dict(
    zero1=dict(size=2),
    tensor=dict(size=1, mode="mtp"),
    pipeline=dict(size=4, interleaved_overlap=True),
    weight=dict(size=1, overlap=True, memory_pool=True),
)
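
With tensor size 1 and pipeline size 4, the conversion log in this thread loads four shard files, model_tp0_pp0.pt through model_tp0_pp3.pt. As a rough sanity check before converting, the sketch below lists the shard names that such a configuration would produce and reports any that are missing from the checkpoint directory. The naming pattern is inferred from the log only, and expected_shards and missing_shards are hypothetical helpers, not part of InternLM.

```python
from pathlib import Path


def expected_shards(tp_size: int, pp_size: int) -> list:
    """Shard file names for the given tensor/pipeline parallel sizes.

    Assumption: the pattern model_tp{tp}_pp{pp}.pt is inferred from the
    file names that appear in the conversion log, not from the source code.
    """
    return [
        f"model_tp{tp}_pp{pp}.pt"
        for tp in range(tp_size)
        for pp in range(pp_size)
    ]


def missing_shards(ckpt_dir: str, tp_size: int, pp_size: int) -> list:
    """Return the expected shard names that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [
        name
        for name in expected_shards(tp_size, pp_size)
        if not (root / name).is_file()
    ]


# Example call for the configuration above (tensor=1, pipeline=4):
# missing_shards("/dev/shm/llm_ckpts/20", tp_size=1, pp_size=4)
```

An empty result means every expected shard is present; it says nothing about whether the files themselves are intact.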

The screenshot is hard to read, so here is the error output as text:
-------------- Arguments --------------
Source Path: /dev/shm/llm_ckpts/20
Target Path: /dev/shm/llm_ckpts/hf/
Dtype: bfloat16
Max Shard Size: 10GB
Max Position Embedding: 4096
Tokenizer Path: ./tools/tokenizer_internlm2.model
Rotary Type: origin
Scaling Factor: 2.0

Config loading
2024-05-24 07:14:39,274 WARNING storage_manager.py:329 in try_get_storage_backend -- path: '/dev/shm/llm_ckpts/20/model_config.pt' not start with backend prefix, guess it is the backend of local.
2024-05-24 07:14:39,274 WARNING storage_manager.py:329 in try_get_storage_backend -- path: '/dev/shm/llm_ckpts/20/model_config.pt' not start with backend prefix, guess it is the backend of local.
{'checkpoint': 0, 'num_chunks': 1, 'num_attention_heads': 32, 'embed_split_hidden': True, 'vocab_size': 92544, 'embed_grad_scale': 1, 'parallel_output': False, 'hidden_size': 4096, 'num_layers': 32, 'no_bias': True, 'mlp_ratio': 3.5, 'apply_post_layer_norm': False, 'dtype': torch.float16, 'norm_type': 'rmsnorm', 'layer_norm_epsilon': 1e-05, 'num_kv_attention_heads': 8, 'use_flash_attn': False, 'mlp_layer_fusion': False}
Config loaded.
2024-05-24 07:14:39,275 WARNING storage_manager.py:329 in try_get_storage_backend -- path: '/dev/shm/llm_ckpts/20' not start with backend prefix, guess it is the backend of local.
Source Checkpoint Loading
0%| | 0/1 [00:00<?, ?it/s]
2024-05-24 07:14:39,277 WARNING storage_manager.py:329 in try_get_storage_backend -- path: '/dev/shm/llm_ckpts/20/model_tp0_pp0.pt' not start with backend prefix, guess it is the backend of local.
2024-05-24 07:14:39,586 WARNING storage_manager.py:329 in try_get_storage_backend -- path: '/dev/shm/llm_ckpts/20/model_tp0_pp1.pt' not start with backend prefix, guess it is the backend of local.
2024-05-24 07:14:39,807 WARNING storage_manager.py:329 in try_get_storage_backend -- path: '/dev/shm/llm_ckpts/20/model_tp0_pp2.pt' not start with backend prefix, guess it is the backend of local.
2024-05-24 07:14:40,025 WARNING storage_manager.py:329 in try_get_storage_backend -- path: '/dev/shm/llm_ckpts/20/model_tp0_pp3.pt' not start with backend prefix, guess it is the backend of local.
100%|██████████| 4/4 [00:01<00:00, 2.95it/s]
100%|██████████| 1/1 [00:01<00:00, 1.36s/it]
Source Checkpoint Loaded
Pipeline Merging
100%|██████████| 1/1 [00:00<00:00, 1124.18it/s]
Pipeline Merged
test----------------------------------num_shards... 1
Converting to huggingface format...
Start converting...
0%| | 0/32 [00:00<?, ?it/s]
Segmentation fault (core dumped)
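
Note: a segmentation fault ends the process without a Python traceback, so the log above does not show which line of the conversion loop was running. One possible way to get more detail is Python's standard-library faulthandler module. The sketch below assumes the two lines can be added near the top of the conversion script; running the unchanged script with `python -X faulthandler` should have the same effect.

```python
import faulthandler
import sys

# On SIGSEGV (and SIGFPE, SIGABRT, SIGBUS, SIGILL) dump the Python
# traceback of every thread to stderr before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)
```

This only shows where in the Python code the crash happened; it does not by itself identify the cause.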

Yang Gao · Answer 3 · Mon May 27 2024 15:58:32 GMT+0800 (China Standard Time)

This was most likely caused by insufficient shared memory on your machine.
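
One way to check this: /dev/shm is normally a tmpfs, so both the source checkpoint and the converted output stored there occupy RAM (or swap), on top of the memory the conversion script needs for the loaded tensors. The sketch below compares the checkpoint size with the free space on the mount. The helper names and the safety_factor parameter are hypothetical, and the factor is an arbitrary illustration, not a measured requirement of the converter.

```python
import os
import shutil


def dir_size_bytes(path: str) -> int:
    """Total size in bytes of the regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def headroom_report(ckpt_dir: str, mount: str = "/dev/shm",
                    safety_factor: float = 1.0) -> dict:
    """Compare the checkpoint size against the free space on mount.

    safety_factor is a hypothetical multiplier chosen by the caller to
    leave room for the converted output and temporary copies.
    """
    usage = shutil.disk_usage(mount)
    ckpt_bytes = dir_size_bytes(ckpt_dir)
    needed_bytes = int(ckpt_bytes * safety_factor)
    return {
        "ckpt_bytes": ckpt_bytes,
        "free_bytes": usage.free,
        "needed_bytes": needed_bytes,
        "enough": usage.free >= needed_bytes,
    }


# Example call for the paths in this issue:
# headroom_report("/dev/shm/llm_ckpts/20", mount="/dev/shm", safety_factor=2.0)
```

Free space on the tmpfs is only part of the picture: the memory available to the process itself (for example as reported by `free -g`) matters as well, because the tmpfs contents and the loaded tensors compete for the same RAM.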