[Bug] checkpoint load bug

TousenKaname opened this issue 5 months ago

Guoan Wang commented 5 months ago

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

Versions
PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7H12 64-Core Processor
Stepping: 0
CPU MHz: 2600.000
CPU max MHz: 2600.0000
CPU min MHz: 1500.0000
BogoMIPS: 5200.18
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] torch==2.1.2
[pip3] torchvision==0.16.2
[pip3] triton==2.1.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.23.4 pypi_0 pypi
[conda] pytorch 2.1.2 py3.10_cuda12.1_cudnn8.9.2_0 pytorch
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchtriton 2.1.0 py310 pytorch
[conda] torchvision 0.16.2 py310_cu121 pytorch

Reproduces the problem - code/configuration sample

from copy import deepcopy
from mmengine.config import read_base

with read_base():
    # from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    # from .datasets.agieval.agieval_gen_64afd3 import agieval_datasets
    # from .datasets.bbh.bbh_gen_5b92b0 import bbh_datasets
    # from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
    # from .datasets.math.math_evaluatorv2_gen_265cce import math_datasets
    # from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
    # from .datasets.mbpp.sanitized_mbpp_gen_1e1056 import sanitized_mbpp_datasets
    from .datasets.MedBench.medbench_gen_0b4fff import medbench_datasets

    from .models.hf_internlm.hf_internlm2_chat_7b import models as hf_internlm2_chat_7b_model
    # from .models.hf_internlm.hf_internlm2_chat_20b import models as hf_internlm2_chat_20b_model

    from .summarizers.internlm2_keyset import summarizer

work_dir = './outputs/internlm2-chat-keyset/'

_origin_datasets = sum([v for k, v in locals().items() if k.endswith("_datasets")], [])
_origin_models = sum([v for k, v in locals().items() if k.endswith("_model")], [])

_vanilla_datasets = [deepcopy(d) for d in _origin_datasets]
_vanilla_models = []
for m in _origin_models:
    m = deepcopy(m)
    if 'meta_template' in m and 'round' in m['meta_template']:
        round = m['meta_template']['round']
        if any(r['role'] == 'SYSTEM' for r in round):
            new_round = [r for r in round if r['role'] != 'SYSTEM']
            print(f'WARNING: remove SYSTEM round in meta_template for {m.get("abbr", None)}')
            m['meta_template']['round'] = new_round
    _vanilla_models.append(m)


datasets = _vanilla_datasets
models = _vanilla_models

Reproduces the problem - command or script

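The first time I run opencompass, it automatically downloads the checkpoint and the evaluation runs fine. On the second run, when it reads the cached checkpoint, it raises three errors in a row. I suspect this is a bug. I have also tested plain HF inference with the downloaded weights, and that works fine.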

Reproduces the problem - error message

01/23 16:47:46 - OpenCompass - DEBUG - An `OpenICLInferTask` instance is built from registry, and its implementation can be found in opencompass.tasks.openicl_infer                                                                                                                      
01/23 16:47:46 - OpenCompass - WARNING - Only use 1 GPUs for total 2 available GPUs in debug mode.                                                                                                                                                                                        
01/23 16:48:02 - OpenCompass - INFO - Task [internlm2-chat-7b-hf/medbench-Med-Exam_0]                                                                                                                                                                                                     
Loading checkpoint shards:   0%|          | 0/8 [00:01<?, ?it/s]                                                                                                                                                                                                                          
Traceback (most recent call last):                                                                                                                                                                                                                                                        
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/transformers/modeling_utils.py", line 533, in load_state_dict                                                                                                                                    
    return torch.load(                                                                                                                                                                                                                                                                    
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/utils/fileio.py", line 104, in load                                                                                                                                                                                      
    return load._fallback(f, *args, **kwargs)                                                                                                                                                                                                                                             
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/serialization.py", line 1002, in load                                                                                                                                                      
    raise ValueError("f must be a string filename in order to use mmap argument")                                                                                                                                                                                                         
ValueError: f must be a string filename in order to use mmap argument                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                          
During handling of the above exception, another exception occurred:                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                          
Traceback (most recent call last):                                                                                                                                                                                                                                                        
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/transformers/modeling_utils.py", line 542, in load_state_dict                                                                                                                                    
    if f.read(7) == "version":                                                                                                                                                                                                                                                            
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/codecs.py", line 322, in decode                                                                                                                                                                                
    (result, consumed) = self._buffer_decode(data, self.errors, final)                                                                                                                                                                                                                    
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                          
During handling of the above exception, another exception occurred:                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                          
Traceback (most recent call last):                                                                                                                                                                                                                                                        
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/tasks/openicl_infer.py", line 153, in <module>                                                                                                                                                                           
    inferencer.run()                                                                                                                                                                                                                                                                      
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/tasks/openicl_infer.py", line 65, in run                                                                                                                                                                                 
    self.model = build_model_from_cfg(model_cfg)                                                                                                                                                                                                                                          
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/utils/build.py", line 25, in build_model_from_cfg                                                                                                                                                                        
    return MODELS.build(model_cfg)                                                                                                                                                                                                                                                        
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build                                                                                                                                               
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/models/huggingface.py", line 126, in __init__                                                                                                                                                                            
    self._load_model(path=path,
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/models/huggingface.py", line 676, in _load_model
    self.model = AutoModelForCausalLM.from_pretrained(path, **model_kwargs)
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/utils/fileio.py", line 162, in auto_pt                                                                                                                                                                                   
    res = ori_auto_pt.__func__(cls, pretrained_model_name_or_path,                                                                           
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/utils/fileio.py", line 138, in model_pt
    res = ori_model_pt.__func__(cls, pretrained_model_name_or_path,
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4261, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)        
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/transformers/modeling_utils.py", line 554, in load_state_dict
    raise OSError(             
OSError: Unable to load weights from pytorch checkpoint file for '/mnt/petrelfs/wangguoan/.cache/huggingface/hub/models--internlm--internlm2-chat-7b/snapshots/2292b86b21cb856642782cebed0a453997453b1f/pytorch_model-00001-of-00008.bin' at '/mnt/petrelfs/wangguoan/.cache/huggingface/hub/models--internlm--internlm2-chat-7b/snapshots/2292b86b21cb856642782cebed0a453997453b1f/pytorch_model-00001-of-00008.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
[2024-01-23 16:48:16,279] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 33274) of binary: /mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/bin/python
Traceback (most recent call last):                                
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)                  
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)                   
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(        
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/petrelfs/wangguoan/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(  
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/tasks/openicl_infer.py FAILED
------------------------------------------------------------
Failures:               
  <NO_OTHER_FAILURES>    
------------------------------------------------------------
Root Cause (first observed failure):
[0]:                               
  time      : 2024-01-23_16:48:16
  host      : SH-IDCA1404-10-140-54-14
  rank      : 0 (local_rank: 0)                       
  exitcode  : 1 (pid: 33274)
  error_file: <N/A>            
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
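The three chained exceptions in the log can be reproduced in miniature without torch or transformers. The sketch below is an illustration only: fake_torch_load, patched_load, load_state_dict and demo are hypothetical stand-ins, not the real torch, opencompass or transformers functions. It assumes the patched loader hands a file object (rather than a path string) to the underlying load call, which is what the first ValueError in the traceback suggests.

```python
import os
import tempfile


def fake_torch_load(f, mmap=None):
    """Hypothetical stand-in for torch.load: mimics only the mmap argument check."""
    if mmap and not isinstance(f, str):
        raise ValueError("f must be a string filename in order to use mmap argument")
    return {"loaded_from": f}


def patched_load(path, **kwargs):
    """Hypothetical wrapper (assumption): opens the path and forwards a file object."""
    with open(path, "rb") as fh:
        return fake_torch_load(fh, **kwargs)


def load_state_dict(checkpoint_file, loader):
    """Simplified sketch of the fallback chain visible in the traceback."""
    try:
        return loader(checkpoint_file, mmap=True)
    except Exception as first_error:
        try:
            # Probe the file as text; a binary checkpoint fails to decode here.
            with open(checkpoint_file, encoding="utf-8") as f:
                if f.read(7) == "version":
                    raise OSError("file looks like a git-lfs pointer")
                raise ValueError("unrecognized checkpoint") from first_error
        except (UnicodeDecodeError, ValueError):
            raise OSError(
                f"Unable to load weights from pytorch checkpoint file for '{checkpoint_file}'"
            )


def demo():
    """Load one binary file through both loaders; return (result, error)."""
    with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
        tmp.write(b"PK\x03\x04" + b"\x80" * 256)  # binary content, not valid UTF-8
        path = tmp.name
    try:
        ok = load_state_dict(path, fake_torch_load)  # path string: mmap check passes
        try:
            load_state_dict(path, patched_load)  # file object: mmap check fails
            failed = None
        except OSError as err:
            failed = err
        return ok, failed
    finally:
        os.remove(path)
```

Calling demo() returns a successful result for the loader that receives the path string and an OSError for the wrapper that forwards a file object. The OSError hides the original mmap ValueError behind the decode failure, much as in the log above.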

Other information

No response

Haonan Li · Answer 1 · Thu Jan 25 2024 03:34:27 GMT+0800 (China Standard Time)

I got the same error.

Haonan Li · Answer 2 · Thu Jan 25 2024 03:49:54 GMT+0800 (China Standard Time)

I just solved the problem by installing torch 2.0.0. I guess this may be caused by the latest torch.

Guoan Wang · Answer 3 · Thu Jan 25 2024 09:55:37 GMT+0800 (China Standard Time)

Thanks! You saved my day!

zhulinJulia24 · Answer 4 · Mon Jan 29 2024 20:38:02 GMT+0800 (China Standard Time)

I met the same error by installing transformers==4.33.0.

chao_xlc · Answer 5 · Fri Mar 01 2024 20:03:01 GMT+0800 (China Standard Time)

I encountered the same issue. It comes from a condition check in /site-packages/transformers/modeling_utils.py, line 522:

if (
    isinstance(checkpoint_file, str)
    and map_location != "meta"
    and version.parse(torch.__version__) >= version.parse("2.1.0")
    and is_zipfile(checkpoint_file)
):
    extra_args = {"mmap": True}

which sets mmap to True. Afterwards, opencompass modifies the input model_path, so PyTorch no longer receives a plain string filename and raises an error. I am inclined to believe that this is a bug and I would appreciate a reasonable explanation. Thank you.

The model is MobiLlama-05B-Chat.
My solution is to modify line 528 to extra_args = {"mmap": False}, or alternatively, to modify the torch version check in the conditional statement.
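To make the condition easier to reason about, here is a minimal standalone sketch of the same decision using only the standard library. The names choose_extra_args, parse_version and make_zip_checkpoint are hypothetical helpers written for this illustration, and the version parsing is a simplification of what the packaging library does. It may also help explain why downgrading to torch 2.0.0 sidesteps the error: below 2.1.0 the mmap argument is never added.

```python
import os
import re
import tempfile
import zipfile


def parse_version(text):
    """Reduce '2.1.2' or '2.1.2+cu121' to a comparable tuple such as (2, 1, 2)."""
    core = text.split("+")[0]
    return tuple(int(n) for n in re.findall(r"\d+", core)[:3])


def choose_extra_args(checkpoint_file, map_location, torch_version):
    """Mirror the quoted condition: request mmap only when all four checks pass."""
    if (
        isinstance(checkpoint_file, str)
        and map_location != "meta"
        and parse_version(torch_version) >= (2, 1, 0)
        and zipfile.is_zipfile(checkpoint_file)
    ):
        return {"mmap": True}
    return {}


def make_zip_checkpoint():
    """Create a small zip file standing in for a zip-serialized checkpoint."""
    tmp = tempfile.NamedTemporaryFile(suffix=".bin", delete=False)
    tmp.close()
    with zipfile.ZipFile(tmp.name, "w") as archive:
        archive.writestr("data.pkl", b"placeholder")
    return tmp.name
```

With a zip-format file and map_location "cpu", choose_extra_args returns {"mmap": True} for version "2.1.2" and an empty dict for "2.0.0". Anything downstream that replaces the path string with a file object would then conflict with the mmap request only on the newer version.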

Rene · Answer 6 · Fri Mar 08 2024 05:20:49 GMT+0800 (China Standard Time)

Works like a charm.