[inference]: fine-tuned model fails to run inference
raj-ritu17 opened this issue
Scenario:
- completed fine-tuning of 'Weyaxi/Dolphin2.1-OpenOrca-7B' using ipex-llm on a GPU Max 1100
- the output directory looks as below, with checkpoints and a config file
- modified the inference file and removed the training parameter from 'adapter_config.json' in order to run inference
- ran the following command to perform inference:
- accelerate launch -m inference lora.yml --lora_model_dir="./qlora-out/"
After submitting an instruction, the issue below occurred [it also says certain quantization is not supported on CPU, while we are running on GPU and did the fine-tuning on GPU].
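For reference, a minimal check (assuming intel_extension_for_pytorch with XPU support is installed) that the GPU is visible and that the loaded model actually lives on it, since the RuntimeError in the logs below points at quantized layers executing on CPU:

# Minimal device check (a sketch): verify the XPU backend is available
# and that the model parameters were actually moved to the XPU device.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers torch.xpu)

print(torch.xpu.is_available())   # expected: True on the GPU Max 1100
print(torch.xpu.device_count())

# Inside inference.py, after the model is loaded, the parameters should report
# an XPU device rather than CPU:
# print(next(model.parameters()).device)   # expected: xpu:0, not cpu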
logs
(ft_llm) intel@imu-nex-sprx92-max1-sut:~/ritu/axolotl$ accelerate launch -m inference lora.yml --lora_model_dir="./qlora-out/"
/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-05-23 11:31:52,323 - INFO - intel_extension_for_pytorch auto imported
2024-05-23 11:31:52,341 - WARNING - The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
[2024-05-23 11:31:55,282] [INFO] [axolotl.normalize_config:169] [PID:461596] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-05-23 11:31:55,283] [INFO] [axolotl.common.cli.load_model_and_tokenizer:49] [PID:461596] [RANK:0] loading tokenizer... Weyaxi/Dolphin2.1-OpenOrca-7B
[2024-05-23 11:31:55,704] [DEBUG] [axolotl.load_tokenizer:216] [PID:461596] [RANK:0] EOS: 2 / </s>
[2024-05-23 11:31:55,704] [DEBUG] [axolotl.load_tokenizer:217] [PID:461596] [RANK:0] BOS: 1 / <s>
[2024-05-23 11:31:55,704] [DEBUG] [axolotl.load_tokenizer:218] [PID:461596] [RANK:0] PAD: 2 / </s>
[2024-05-23 11:31:55,704] [DEBUG] [axolotl.load_tokenizer:219] [PID:461596] [RANK:0] UNK: 0 / <unk>
[2024-05-23 11:31:55,704] [INFO] [axolotl.load_tokenizer:224] [PID:461596] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-23 11:31:55,704] [INFO] [axolotl.common.cli.load_model_and_tokenizer:51] [PID:461596] [RANK:0] loading model and (optionally) peft_config...
/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/transformers/model.py:204: FutureWarning: BigDL LLM QLoRA does not support double quant now, set to False
warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3.78it/s]
[2024-05-23 11:33:26,633] [INFO] [axolotl.load_model:665] [PID:461596] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2024-05-23 11:33:26,636] [INFO] [axolotl.load_model:677] [PID:461596] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-05-23 11:33:26,850] [INFO] [axolotl.load_lora:789] [PID:461596] [RANK:0] found linear modules: ['up_proj', 'down_proj', 'o_proj', 'k_proj', 'v_proj', 'q_proj', 'gate_proj']
[2024-05-23 11:33:26,851] [DEBUG] [axolotl.load_lora:808] [PID:461596] [RANK:0] Loading pretained PEFT - LoRA
trainable params: 41,943,040 || all params: 4,012,118,016 || trainable%: 1.0454089294665454
[2024-05-23 11:33:27,535] [INFO] [axolotl.load_model:714] [PID:461596] [RANK:0] GPU memory usage after adapters: 0.000GB ()
================================================================================
Give me an instruction (Ctrl + D to submit):
hello, test
========================================
<s>hello, Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/intel/ritu/axolotl/inference.py", line 41, in <module>
fire.Fire(do_cli)
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/ritu/axolotl/inference.py", line 37, in do_cli
do_inference(cfg=parsed_cfg, cli_args=parsed_cli_args)
File "/home/intel/ritu/axolotl/src/axolotl/cli/__init__.py", line 153, in do_inference
generated = model.generate(
^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/peft/peft_model.py", line 1190, in generate
outputs = self.base_model.generate(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/transformers/lookup.py", line 87, in generate
return original_generate(self,
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/transformers/speculative.py", line 109, in generate
return original_generate(self,
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/generation/utils.py", line 1764, in generate
return self.sample(
^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/generation/utils.py", line 2861, in sample
outputs = self(
^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 1044, in forward
outputs = self.model(
^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 929, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 654, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 255, in forward
query_states = self.q_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/peft/tuners/lora/layer.py", line 497, in forward
result = self.base_layer(x, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/transformers/low_bit_linear.py", line 720, in forward
invalidInputError(self.qtype != NF3 and self.qtype != NF4 and self.qtype != FP8E4
File "/home/intel/miniconda3/envs/ft_llm/lib/python3.11/site-packages/ipex_llm/utils/common/log4Error.py", line 32, in invalidInputError
raise RuntimeError(errMsg)
RuntimeError: NF3, NF4, FP4 and FP8 quantization are currently not supported on CPU
inference file content:
#"""
#CLI to run inference on a trained model
#"""
# ritu - added
from ipex_llm import llm_patch
llm_patch(train=True)
#end
from pathlib import Path
import fire
import transformers
from axolotl.cli import (
    do_inference,
    do_inference_gradio,
    load_cfg,
    print_axolotl_text_art,
)
from axolotl.common.cli import TrainerCliArgs


def do_cli(config: Path = Path("examples/"), gradio=False, **kwargs):
    # pylint: disable=duplicate-code
    print_axolotl_text_art()
    parsed_cfg = load_cfg(config, **kwargs)
    parsed_cfg.sample_packing = False
    parser = transformers.HfArgumentParser((TrainerCliArgs))
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
        return_remaining_strings=True
    )
    parsed_cli_args.inference = True

    if gradio:
        do_inference_gradio(cfg=parsed_cfg, cli_args=parsed_cli_args)
    else:
        do_inference(cfg=parsed_cfg, cli_args=parsed_cli_args)


if __name__ == "__main__":
    fire.Fire(do_cli)
adapter_config
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "Weyaxi/Dolphin2.1-OpenOrca-7B",
  "bias": "none",
  "fan_in_fan_out": null,
  "inference_mode": false,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "gate_proj",
    "q_proj",
    "o_proj",
    "k_proj",
    "up_proj",
    "v_proj",
    "down_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
Hi @raj-ritu17,
Could you please try the latest ipex-llm (2.1.0b20240527) and merge the adapter into the original model as we discussed in #11135? Then you can run inference with the merged model, following https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral for example.
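A rough sketch of that flow, using the standard PEFT merge API rather than the exact steps from #11135 (the output directory name below is just a placeholder):

# Sketch: merge the LoRA adapter into the base model with PEFT, then reload
# the merged weights with ipex-llm for inference on the Intel GPU (XPU).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Weyaxi/Dolphin2.1-OpenOrca-7B"
adapter_dir = "./qlora-out/"               # LoRA checkpoint from the fine-tune
merged_dir = "./dolphin-openorca-merged"   # placeholder output directory

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
merged.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_dir)

# Inference on XPU with ipex-llm, as in the linked mistral example.
from ipex_llm.transformers import AutoModelForCausalLM as IpexAutoModel
model = IpexAutoModel.from_pretrained(merged_dir, load_in_4bit=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(merged_dir)
inputs = tokenizer("hello, test", return_tensors="pt").to("xpu")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))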