intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

ChatGLM3 model conversion to sym_int4 failed

HoppeDeng opened this issue

HW: MTL laptop
SW: bigdl-llm 2.5.0b20240402
Convert command:
python ./convert.py --repo-id-or-model-path "C:\Users\wincg\ultraChat\code\ultrachat\chatglm3-6b" --low-bit sym_int4 --save-path ./chatglm3_int4
The BKM is from https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load
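
For context, the convert.py roughly follows the Save-Load example: load the checkpoint with bigdl-llm's AutoModelForCausalLM in sym_int4, save the low-bit weights, then run a short text-generation check (which is where it fails, see the traceback below). A rough sketch, assuming the bigdl-llm 2.5 API (load_in_low_bit / save_low_bit); not my exact script:

# Rough sketch of the convert flow (assumed to match the Save-Load example).
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_path = r"C:\Users\wincg\ultraChat\code\ultrachat\chatglm3-6b"
save_path = "./chatglm3_int4"

# Quantize to sym_int4 while loading the original checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit="sym_int4",
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Persist the quantized weights so later runs can skip re-quantization.
model.save_low_bit(save_path)
tokenizer.save_pretrained(save_path)

# Sanity check: generate a short completion with the quantized model.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("What is AI?")[0]["generated_text"])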

Output log:
C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\wincg\miniforge3\envs\video\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-06-06 09:40:21,112 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 7/7 [00:14<00:00, 2.08s/it]
2024-06-06 09:40:39,706 - INFO - Converting the current model to sym_int4 format......
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'ChatGLMTokenizer'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
The model 'ChatGLMForConditionalGeneration' is not supported for . Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MusicgenMelodyForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'Qwen2ForCausalLM', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'WhisperForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
C:\Users\wincg\miniforge3\envs\video\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py:377: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard.
(Triggered internally at D:/ipex-aot/compile/intel-extension-for-pytorch/csrc/gpu/jit/fusion_pass.cpp:837.)
query_layer = apply_rotary_pos_emb_chatglm(query_layer, rotary_pos_emb)
Traceback (most recent call last):
File "C:\Users\wincg\ultraChat\code\ultrachat\trunk\convert.py", line 32, in
output = pipeline(input_str)[0]["generated_text"]
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\transformers\pipelines\text_generation.py", line 240, in call
return super().call(text_inputs, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\transformers\pipelines\base.py", line 1206, in call
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\transformers\pipelines\base.py", line 1213, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\transformers\pipelines\base.py", line 1112, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\transformers\pipelines\text_generation.py", line 327, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\transformers\generation\utils.py", line 1527, in generate
result = self._greedy_search(
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\transformers\generation\utils.py", line 2411, in _greedy_search
outputs = self(
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wincg.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 937, in forward
transformer_outputs = self.transformer(
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 167, in chatglm2_model_forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wincg.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 640, in forward
layer_ret = layer(
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wincg.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 544, in forward
attention_output, kv_cache = self.self_attention(
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 191, in chatglm2_attention_forward
return forward_function(
File "C:\Users\wincg\miniforge3\envs\video\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 377, in chatglm2_attention_forward_8eb45c
query_layer = apply_rotary_pos_emb_chatglm(query_layer, rotary_pos_emb)
NotImplementedError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Could not run 'torch_ipex::mul_add' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'torch_ipex::mul_add' is only available for these backends: [XPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

XPU: registered at D:/ipex-aot/compile/intel-extension-for-pytorch/csrc/gpu/aten/operators/TripleOps.cpp:640 [kernel]
BackendSelect: fallthrough registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:153 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\functorch\DynamicLayer.cpp:498 [backend fallback]
Functionalize: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\FunctionalizeFallbackKernel.cpp:290 [backend fallback]
Named: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\native\NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:86 [backend fallback]
AutogradOther: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:53 [backend fallback]
AutogradCPU: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:57 [backend fallback]
AutogradCUDA: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:65 [backend fallback]
AutogradXLA: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:69 [backend fallback]
AutogradMPS: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:77 [backend fallback]
AutogradXPU: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:61 [backend fallback]
AutogradHPU: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:90 [backend fallback]
AutogradLazy: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:73 [backend fallback]
AutogradMeta: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\VariableFallbackKernel.cpp:81 [backend fallback]
Tracer: registered at D:\ipex-aot\compile\pytorch\torch\csrc\autograd\TraceTypeManual.cpp:296 [backend fallback]
AutocastCPU: fallthrough registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\autocast_mode.cpp:382 [backend fallback]
AutocastXPU: registered at D:/ipex-aot/compile/intel-extension-for-pytorch/csrc/gpu/aten/operators/TripleOps.cpp:640 [kernel]
AutocastCUDA: fallthrough registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\autocast_mode.cpp:249 [backend fallback]
FuncTorchBatched: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\functorch\LegacyBatchingRegistrations.cpp:710 [backend fallback]
FuncTorchVmapMode: fallthrough registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\functorch\VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\functorch\TensorWrapper.cpp:203 [backend fallback]
PythonTLSSnapshot: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:161 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\functorch\DynamicLayer.cpp:494 [backend fallback]
PreDispatch: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:165 [backend fallback]
PythonDispatcher: registered at D:\ipex-aot\compile\pytorch\aten\src\ATen\core\PythonFallbackKernel.cpp:157 [backend fallback]

Hi @HoppeDeng ,

It seems you are running model generation on the CPU in a GPU environment. Would you mind trying this GPU example instead: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load?
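
For reference, the key difference in the GPU example is that both the quantized model and the input tensors are placed on the XPU device, which is the only backend where the fused op torch_ipex::mul_add from your traceback is registered. A rough sketch (assuming the bigdl-llm load_low_bit API and the ./chatglm3_int4 save path from your command, not the exact example code):

# Rough sketch, not the exact GPU example: move both the low-bit model and
# the inputs to the XPU device before generation.
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the 'xpu' device)
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

save_path = "./chatglm3_int4"  # path produced by the convert step above

# Load the previously saved sym_int4 weights and move the model to the GPU.
model = AutoModelForCausalLM.load_low_bit(save_path, trust_remote_code=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)

# Inputs must live on the same (XPU) device as the model.
inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))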

Please let us know if you run into any further problems :)

Yes, using the GPU script works. Thanks for your quick reply!