intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Home Page: https://ipex-llm.readthedocs.io

[inference]: fine-tuned model fails to do inference

raj-ritu17 opened this issue · comments

Scenario:

  • completed the fine-tune of 'Weyaxi/Dolphin2.1-OpenOrca-7B' using ipex-llm on a GPU Max 1100
  • the output directory looks like the one below, with checkpoints and a config file

[image: output directory listing with checkpoints and config file]

  • made changes to the inference file and removed the training parameter from 'adapter_config.json' to do the inference
  • ran the following command to perform the inference:
    - accelerate launch -m inference lora.yml --lora_model_dir="./qlora-out/"

After submitting an instruction, the issue below occurred [it also says certain quantization is not supported on CPU, even though we are running on GPU and did the fine-tuning on GPU].


logs

(ft_llm) intel@imu-nex-sprx92-max1-sut:~/ritu/axolotl$ accelerate launch -m inference lora.yml --lora_model_dir="./qlora-out/"
/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2024-05-23 11:31:52,323 - INFO - intel_extension_for_pytorch auto imported
2024-05-23 11:31:52,341 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
                                 dP            dP   dP
                                 88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP



/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-23 11:31:55,282] [INFO] [axolotl.normalize_config:169] [PID:461596] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-05-23 11:31:55,283] [INFO] [axolotl.common.cli.load_model_and_tokenizer:49] [PID:461596] [RANK:0] loading tokenizer... Weyaxi/Dolphin2.1-OpenOrca-7B
[2024-05-23 11:31:55,704] [DEBUG] [axolotl.load_tokenizer:216] [PID:461596] [RANK:0] EOS: 2 / </s>
[2024-05-23 11:31:55,704] [DEBUG] [axolotl.load_tokenizer:217] [PID:461596] [RANK:0] BOS: 1 / <s>
[2024-05-23 11:31:55,704] [DEBUG] [axolotl.load_tokenizer:218] [PID:461596] [RANK:0] PAD: 2 / </s>
[2024-05-23 11:31:55,704] [DEBUG] [axolotl.load_tokenizer:219] [PID:461596] [RANK:0] UNK: 0 / <unk>
[2024-05-23 11:31:55,704] [INFO] [axolotl.load_tokenizer:224] [PID:461596] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-23 11:31:55,704] [INFO] [axolotl.common.cli.load_model_and_tokenizer:51] [PID:461596] [RANK:0] loading model and (optionally) peft_config...
/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/transformers/model.py:204: FutureWarning: BigDL LLM QLoRA does not support double quant now, set to False
  warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.78it/s]
[2024-05-23 11:33:26,633] [INFO] [axolotl.load_model:665] [PID:461596] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2024-05-23 11:33:26,636] [INFO] [axolotl.load_model:677] [PID:461596] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-05-23 11:33:26,850] [INFO] [axolotl.load_lora:789] [PID:461596] [RANK:0] found linear modules: ['up_proj', 'down_proj', 'o_proj', 'k_proj', 'v_proj', 'q_proj', 'gate_proj']
[2024-05-23 11:33:26,851] [DEBUG] [axolotl.load_lora:808] [PID:461596] [RANK:0] Loading pretained PEFT - LoRA
trainable params: 41,943,040 || all params: 4,012,118,016 || trainable%: 1.0454089294665454
[2024-05-23 11:33:27,535] [INFO] [axolotl.load_model:714] [PID:461596] [RANK:0] GPU memory usage after adapters: 0.000GB ()
================================================================================
Give me an instruction (Ctrl + D to submit):
hello, test

========================================
<s>hello, Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/intel/ritu/axolotl/inference.py", line 41, in <module>
    fire.Fire(do_cli)
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/ritu/axolotl/inference.py", line 37, in do_cli
    do_inference(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/home/intel/ritu/axolotl/src/axolotl/cli/__init__.py", line 153, in do_inference
    generated = model.generate(
                ^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/peft/peft_model.py", line 1190, in generate
    outputs = self.base_model.generate(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/transformers/lookup.py", line 87, in generate
    return original_generate(self,
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/transformers/speculative.py", line 109, in generate
    return original_generate(self,
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
              ^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 1044, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 929, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 654, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 255, in forward
    query_states = self.q_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/peft/tuners/lora/layer.py", line 497, in forward
    result = self.base_layer(x, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/transformers/low_bit_linear.py", line 720, in forward
    invalidInputError(self.qtype != NF3 and self.qtype != NF4 and self.qtype != FP8E4
  File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/utils/common/log4Error.py", line 32, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: NF3, NF4, FP4 and FP8 quantization are currently not supported on CPU
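
For context, the RuntimeError above comes from ipex-llm's low-bit linear layer, which only has NF4/FP4/FP8 kernels for the Intel GPU; it fires when the quantized model is still executing on the CPU. Below is a minimal sketch of explicitly placing an ipex-llm low-bit model on the XPU; it is illustrative only and assumes the base model and NF4 precision from the fine-tune, not the actual code path used by the axolotl inference CLI:

# Illustrative sketch, not taken from the original inference.py: load the base model
# with ipex-llm low-bit (NF4) weights and move it to the Intel GPU before generating.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Weyaxi/Dolphin2.1-OpenOrca-7B",
    load_in_low_bit="nf4",   # NF4/FP4/FP8 kernels are GPU-only in ipex-llm
    optimize_model=False,
    trust_remote_code=True,
)
model = model.to("xpu")      # leaving the model on the CPU triggers the error above
print(next(model.parameters()).device)   # expect: xpu:0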

inference file content:

#"""
#CLI to run inference on a trained model
#"""

# ritu - added
from ipex_llm import llm_patch
llm_patch(train=True)
#end

from pathlib import Path
import fire
import transformers

from axolotl.cli import (
    do_inference,
    do_inference_gradio,
    load_cfg,
    print_axolotl_text_art,
)
from axolotl.common.cli import TrainerCliArgs


def do_cli(config: Path = Path("examples/"), gradio=False, **kwargs):
    # pylint: disable=duplicate-code
    print_axolotl_text_art()
    parsed_cfg = load_cfg(config, **kwargs)
    parsed_cfg.sample_packing = False
    parser = transformers.HfArgumentParser((TrainerCliArgs))
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
        return_remaining_strings=True
    )
    parsed_cli_args.inference = True

    if gradio:
        do_inference_gradio(cfg=parsed_cfg, cli_args=parsed_cli_args)
    else:
        do_inference(cfg=parsed_cfg, cli_args=parsed_cli_args)


if __name__ == "__main__":
    fire.Fire(do_cli)

adapter_config

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "Weyaxi/Dolphin2.1-OpenOrca-7B",
  "bias": "none",
  "fan_in_fan_out": null,
  "inference_mode": false,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "gate_proj",
    "q_proj",
    "o_proj",
    "k_proj",
    "up_proj",
    "v_proj",
    "down_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
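
For reference, the adapter_config.json above is just the serialized PEFT LoraConfig used for the QLoRA fine-tune; reconstructed in Python (a sketch derived from the JSON, not code from the thread) it is roughly:

# Sketch of the PEFT LoraConfig that corresponds to the adapter_config.json above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "gate_proj", "q_proj", "o_proj", "k_proj",
        "up_proj", "v_proj", "down_proj",
    ],
)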

Hi, @raj-ritu17
Could you please try the latest ipex-llm (2.1.0b20240527) and merge the adapter into the original model, as we discussed in #11135? Then you could run inference with the merged model, following https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral as an example.
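
For anyone following along, here is a minimal sketch of that flow: merging via the standard Hugging Face PEFT API, then loading the merged weights with ipex-llm on the XPU as in the linked Mistral example. The ./merged-dolphin output directory is a placeholder, and #11135 covers the ipex-llm-specific merge path in more detail.

# Step 1 (sketch): merge the LoRA adapter into the base model with standard HF + PEFT APIs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Weyaxi/Dolphin2.1-OpenOrca-7B", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "./qlora-out/").merge_and_unload()
merged.save_pretrained("./merged-dolphin")                                   # placeholder dir
AutoTokenizer.from_pretrained("Weyaxi/Dolphin2.1-OpenOrca-7B").save_pretrained("./merged-dolphin")

# Step 2 (sketch): load the merged model with ipex-llm low-bit optimizations on the Intel GPU,
# following the linked Mistral example.
from ipex_llm.transformers import AutoModelForCausalLM as IpexAutoModelForCausalLM

model = IpexAutoModelForCausalLM.from_pretrained("./merged-dolphin", load_in_4bit=True)
model = model.to("xpu")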