intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

[integration]: merging bfloat16 model failed

raj-ritu17 opened this issue

base-model: Weyaxi/Dolphin2.1-OpenOrca-7B

Scenario:

(ft_Qlora) intel@imu-nex-sprx92-max1-sut:~/ritu/ipex-llm/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$ python ./export_merged_model.py --repo-id-or-model-path Weyaxi/Dolphin2.1-OpenOrca-7B --adapter_path ./out-dir-FT/tmp-checkpoint-1400/ --output_path ./out-dir-FT/tmp-checkpoint-1400-merged
/home/intel/miniconda3/envs/ft_Qlora/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2024-05-24 14:37:05,078 - INFO - intel_extension_for_pytorch auto imported
2024-05-24 14:37:05,084 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
/home/intel/miniconda3/envs/ft_Qlora/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
2024-05-24 14:37:05,713 - ERROR -

****************************Usage Error************************
Please use torch_dtype=torch.bfloat16 when setting load_in_low_bit='bf16'.
2024-05-24 14:37:05,713 - ERROR -

****************************Call Stack*************************
Failed to merge the adapter, error: Please use torch_dtype=torch.bfloat16 when setting load_in_low_bit='bf16'..
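
The error itself states the requirement: when the base model is loaded with load_in_low_bit='bf16', the loader also expects an explicit torch_dtype=torch.bfloat16. As a rough sketch of a load call that satisfies this (assuming the load goes through ipex-llm's transformers wrapper, which is what accepts load_in_low_bit; the model path is just the one from this issue):

    import torch
    from ipex_llm.transformers import AutoModelForCausalLM

    # The bf16 low-bit setting and the torch dtype must match, otherwise the
    # "Usage Error" shown above is raised.
    base_model = AutoModelForCausalLM.from_pretrained(
        "Weyaxi/Dolphin2.1-OpenOrca-7B",
        load_in_low_bit="bf16",
        torch_dtype=torch.bfloat16,
        device_map={"": "cpu"},
    )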

What else I tried:
Added 'torch_dtype=torch.bfloat16' in the utils code (in the merge_adapter function), e.g.
--> common/utils/util.py +183

    try:
        base_model = AutoModelForCausalLM.from_pretrained(
            base_model,
            #load_in_low_bit="nf4", # should load the original model
            #torch_dtype=torch.float16,
            #ritu: added for DolphinOrca-7b
            torch_dtype=torch.bfloat16,
            #end
            device_map={"": "cpu"},
        )

        lora_model = PeftModel.from_pretrained(
            base_model,
            adapter_path,
            device_map={"": "cpu"},
            #torch_dtype=torch.float16,
            torch_dtype=torch.bfloat16,
        )

This doesn't solve the issue; the merge now fails with an empty error message:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.78it/s]
2024-05-24 10:35:22,564 - INFO - Converting the current model to bf16 format......
[2024-05-24 10:35:22,912] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to xpu (auto detect)
Failed to merge the adapter, error: .
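
For reference, a minimal, generic adapter-merge flow with plain Hugging Face transformers + PEFT looks roughly like the sketch below. This is not the repo's export_merged_model.py; the paths are the ones from this issue, and the tokenizer copy at the end is just for convenience.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # Load the full-precision base weights in bf16 on CPU.
    base = AutoModelForCausalLM.from_pretrained(
        "Weyaxi/Dolphin2.1-OpenOrca-7B",
        torch_dtype=torch.bfloat16,
        device_map={"": "cpu"},
    )
    # Attach the QLoRA adapter checkpoint on top of the base model.
    lora = PeftModel.from_pretrained(
        base,
        "./out-dir-FT/tmp-checkpoint-1400/",
        device_map={"": "cpu"},
    )
    # Fold the LoRA weights back into the base model and save the result.
    merged = lora.merge_and_unload()
    merged.save_pretrained("./out-dir-FT/tmp-checkpoint-1400-merged")
    AutoTokenizer.from_pretrained("Weyaxi/Dolphin2.1-OpenOrca-7B").save_pretrained(
        "./out-dir-FT/tmp-checkpoint-1400-merged"
    )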

Hi @raj-ritu17, I have reproduced the error during model merging. We will try to fix it and update here once it is solved.

Hi @raj-ritu17,
We have fixed this bug. Please install the latest ipex-llm (2.1.0b20240527); there is no need to modify the utils code, just run this script to merge the model.

According to my local experiment, the merging process works, and you can use the merged model for inference following https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral
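
As a rough sketch of that inference step (loosely following the pattern in the linked example; the merged-model path and prompt are placeholders, and load_in_4bit plus the "xpu" device assume an Intel GPU setup like the one in this issue):

    import torch
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model_path = "./out-dir-FT/tmp-checkpoint-1400-merged"
    # Load the merged model with ipex-llm low-bit optimization and move it to the Intel GPU.
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    model = model.to("xpu")
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    prompt = "What is AI?"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")
    with torch.inference_mode():
        output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))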