intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.


Finetune ChatGLM3-6B using LoRA on Arc

YongZhuIntel opened this issue · comments

We are trying to finetune ChatGLM3-6B using LoRA on Arc A770 with 1 card and 2 cards, using the following commands.
1 card:

python ./alpaca_lora_finetuning.py \
    --base_model "/home/intel/models/chatglm3-6b" \
    --data_path "yahma/alpaca-cleaned" \
    --lora_target_modules '[query_key_value,dense,dense_h_to_4h,dense_4h_to_h]' \
    --output_dir "./ipex-llm-qlora-alpaca"

2 cards:

export MASTER_ADDR=127.0.0.1
export OMP_NUM_THREADS=6
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export TORCH_LLM_ALLREDUCE=0
mpirun -n 2 \
    python ./alpaca_lora_finetuning.py \
    --base_model "/home/intel/models/chatglm3-6b" \
    --data_path "yahma/alpaca-cleaned" \
    --lora_target_modules '[query_key_value,dense,dense_h_to_4h,dense_4h_to_h]' \
    --output_dir "./ipex-llm-qlora-alpaca"

Both runs failed with the same error:

  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_finetune/lib/python3.11/site-packages/ipex_llm/transformers/low_bit_linear.py", line 913, in forward
    result = F.linear(x, self.weight, self.bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Allocation is out of device memory on current platform.

Are these commands correct? Or can you provide the correct way to finetune ChatGLM3-6B using LoRA on Arc A770 with 1 card and 2 cards? Thanks.

error log:
lora_finetune_chatglm3_6b_arc_1_card.log
lora_finetune_chatglm3_6b_arc_2_card_def.log

Hi @YongZhuIntel ,

I reproduced it and got the same error, which means that the XPU memory on the platform has been exhausted.

In addition, profiling shows that the chatglm3-6b model in BF16 takes ~11GB+ after the trainer is started. During the following forward/backward passes, memory consumption gradually grows, so it easily exceeds the 16GB limit on Arc.
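
As a rough back-of-the-envelope check of that number (a sketch with approximate values; ChatGLM3-6B has roughly 6.2B parameters):

# Approximate weight memory of a ~6.2B-parameter model stored in BF16.
# Activations, gradients and optimizer state come on top of this during training.
params = 6.2e9            # rough parameter count of chatglm3-6b
bytes_per_param = 2       # BF16 uses 2 bytes per parameter
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB for the BF16 weights alone")  # about 11.5 GiB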

Also, note that multi-instance training is data-parallel: it loads the whole model on each card and therefore does not save any memory.

Two suggestions:

First, you could try QLoRA, which quantizes the base model into NF4, requiring less memory than BF16. As the base model is frozen, this will not harm tuning accuracy. Moreover, we have already validated ChatGLM with QLoRA. This is the most recommended option.
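
For reference, a minimal QLoRA loading sketch following the layout of the ipex-llm alpaca_qlora_finetuning.py example (module paths and keyword arguments are taken from that example and may differ across ipex-llm versions):

import torch
from ipex_llm.transformers import AutoModelForCausalLM
from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training, LoraConfig

base_model = "/home/intel/models/chatglm3-6b"  # same path as in the commands above

# Load the frozen base model quantized to NF4, then move it to the Arc GPU.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_low_bit="nf4",
    optimize_model=False,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = model.to("xpu")

# Attach small trainable LoRA adapters; only these are updated during finetuning.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)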

Second, hyperparameters can be tuned to decrease memory consumption. With the configurations below, I can run more than 100 steps on 2 cards. Other configurations can be tried as well:

# in alpaca_lora_finetuning.py
lora_r: int = 2,
lora_alpha: int = 4,
lora_dropout: float = 0.85,

# in .sh script
......
      python ./alpaca_lora_finetuning.py \
      --micro_batch_size 1 \
      --batch_size 2 \
......

@Uxito-Ada Thanks for your help. I have successfully run qlora_finetune_chatglm3_6b on 1 card, but when trying to run qlora_finetune_chatglm3_6b on 2 cards, I got an error at 100 steps.

2 cards script:

export MASTER_ADDR=127.0.0.1
export OMP_NUM_THREADS=6
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export TORCH_LLM_ALLREDUCE=0
mpirun -n 2 \
    python ./alpaca_qlora_finetuning.py \
    --base_model "/home/intel/models/chatglm3-6b" \
    --data_path "yahma/alpaca-cleaned" \
    --lora_target_modules '[query_key_value,dense,dense_h_to_4h,dense_4h_to_h]' \
    --output_dir "./ipex-llm-qlora-alpaca"

error message:

OSError: [Errno 39] Directory not empty: './ipex-llm-qlora-alpaca/tmp-checkpoint-100' -> './ipex-llm-qlora-alpaca/checkpoint-100'

#11099 says this issue was fixed in transformers 4.39.1.
But after I installed transformers 4.39.1:

pip install transformers==4.39.1
pip install accelerate==0.28.0

I got a new error:

Traceback (most recent call last):
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_finetune/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 759, in convert_to_tensors
    tensor = as_tensor(value)
             ^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_finetune/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 721, in as_tensor
    return torch.tensor(value)
           ^^^^^^^^^^^^^^^^^^^
ValueError: expected sequence of length 256 at dim 1 (got 255)

Is there something else that needs to be installed?

error log:
qlora_finetune_chatglm3_6b_arc_2_card_def_tmp.log

Hi @YongZhuIntel ,

I reproduced your error, and the dependencies below can solve it:

pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install transformers==4.36.1
pip install accelerate==0.23.0

@Uxito-Ada After installing those dependencies, qlora_finetune_chatglm3_6b runs successfully on 2 Arc cards.
However, LoRA finetuning still doesn't work with the default configurations, and changing the hyperparameters may affect accuracy. Is there a way to finetune ChatGLM3-6B using LoRA on Arc without affecting accuracy?

LoRA finetuning for chatglm3-6b is in #11266.

@qiyuangong Following #11266, I've added deepspeed_zero2.json (a sketch of such a config follows the script below) and the lora_finetune_chatglm3_6b_arc_2_card.sh script:

export MASTER_ADDR=127.0.0.1
export OMP_NUM_THREADS=6
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi

mpirun -n 2 \
       python -u ./alpaca_lora_finetuning.py \
       --base_model "/home/intel/models/chatglm3-6b" \
       --data_path "yahma/alpaca-cleaned" \
       --lora_target_modules '[query_key_value,dense,dense_h_to_4h,dense_4h_to_h]' \
       --output_dir "./ipex-llm-lora-alpaca" \
       --gradient_checkpointing True \
       --micro_batch_size 1 \
       --batch_size 128 \
       --deepspeed ./deepspeed_zero2.json
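
For reference, the deepspeed_zero2.json mentioned above is typically a small ZeRO-2 config along these lines (a sketch only, written out via Python for clarity; the actual file from #11266 may set different fields):

import json

# Hypothetical ZeRO-2 config: the "auto" fields are resolved by the transformers
# trainer from the command-line arguments of the finetuning script.
zero2_config = {
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

with open("deepspeed_zero2.json", "w") as f:
    json.dump(zero2_config, f, indent=2)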

But the script gets stuck here and cannot proceed further:

{'loss': 1.1932, 'learning_rate': 2.945697836416767e-05, 'epoch': 0.26}
  9%|▊         | 100/1164 [1:18:42<14:49:27, 50.16s/it]/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py:408: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard.
 (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:837.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_em
/home/intel/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py:408: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard.
 (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:837.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)

  0%|          | 0/125 [00:00<?, ?it/s]

error log:
lora_finetune_chatglm3_6b_arc_2_card_ds_def.log

It seems one of the 2 workers stopped for an unexpected reason, most likely OOM.

Please reduce cutoff_len to 64 if possible. Add the following parameter to the training script.

--cutoff_len 64
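
cutoff_len caps the tokenized sequence length, which directly bounds the per-step activation memory. A tiny illustration of the truncation involved (assuming a standard Hugging Face tokenizer call; the example script applies this inside its own tokenize function):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/home/intel/models/chatglm3-6b", trust_remote_code=True)
encoded = tokenizer(
    "Below is an instruction that describes a task. Write a response that completes the request.",
    truncation=True,
    max_length=64,   # corresponds to --cutoff_len 64
)
print(len(encoded["input_ids"]))  # at most 64 tokens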

@qiyuangong I added --cutoff_len 64, but got a new error:

    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/transformers/trainer.py", line 1914, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/transformers/trainer.py", line 2359, in _save_checkpoint
    self._save_optimizer_and_scheduler(staging_output_dir)
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/transformers/trainer.py", line 2453, in _save_optimizer_and_scheduler
    self.model_wrapped.save_checkpoint(output_dir)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'save_checkpoint'

error log:
lora_finetune_chatglm3_6b_arc_2_card_ds_def2.log


This error is related to checkpoint saving. The good news is that training is almost successful. We will fix this issue in that PR.

We will try to replace this method in transformers and make it runnable on Arc.
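
For illustration, the traceback above suggests that _save_optimizer_and_scheduler calls save_checkpoint() on self.model_wrapped, which only exists on a DeepSpeed engine, not on the DistributedDataParallel wrapper that is actually in place. A hypothetical guard (not the fix that was merged) would look like:

def save_engine_checkpoint(model_wrapped, output_dir):
    # Only a DeepSpeed engine exposes save_checkpoint(); a plain
    # torch.nn.parallel.DistributedDataParallel wrapper does not,
    # which is exactly what the AttributeError above complains about.
    if hasattr(model_wrapped, "save_checkpoint"):
        model_wrapped.save_checkpoint(output_dir)
    # Otherwise, skip the DeepSpeed-specific path (or disable checkpoint
    # saving entirely, as the later script does with --save_checkpoint False).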

Hi @YongZhuIntel, we are providing fine-tuning of ChatGLM3-6B with DeepSpeed ZeRO-3, which partitions and distributes the model across the XPU cards. You can take a look here.
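
Concretely, ZeRO-3 builds the model inside a deepspeed.zero.Init context so that each rank only materializes its own shard of the parameters. A minimal sketch (the config path here is a placeholder; the actual example wires this through alpaca_lora_finetuning.py, as the traceback below shows):

import deepspeed
from transformers import AutoModel

# With ZeRO-3, parameters are partitioned across ranks as they are created,
# so no single card has to hold the whole chatglm3-6b model.
with deepspeed.zero.Init(config_dict_or_path="./deepspeed_zero3.json"):
    model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)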

@Uxito-Ada I've tried lora_deepspeed_zero3_finetune_chatglm3_6b_arc_2_card.sh, but got an error:

Traceback (most recent call last):
  File "/home/intel/zhuyong/ipex-llm/python/llm/example/GPU/LLM-Finetuning/LoRA/./alpaca_lora_finetuning.py", line 298, in <module>
    fire.Fire(train)
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/zhuyong/ipex-llm/python/llm/example/GPU/LLM-Finetuning/LoRA/./alpaca_lora_finetuning.py", line 187, in train
    with ds.zero.Init(config_dict_or_path=deepspeed):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 848, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 774, in __init__
    self._configure_train_batch_size()
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 950, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/intel/miniconda3/envs/llm_ipex2.1.10_python3.11_transformers4.36.1/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 898, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 2 != 1 * 1 * 1

Hi @YongZhuIntel, this is an unsolved issue in the DeepSpeed community; glad to hear that our workaround solution works well for you.

Moreover, for future users, we support LoRA by disabling the ZeRO-3 context manager and passing control entirely to the transformers trainer, as shown in #11346.

@Uxito-Ada I tried several times, but always encountered a hang:

{'loss': 0.8672, 'learning_rate': 2.8307299710381738e-05, 'epoch': 0.46}
{'loss': 1.0703, 'learning_rate': 2.8307008346933854e-05, 'epoch': 0.46}
{'loss': 0.9648, 'learning_rate': 2.8306716959911775e-05, 'epoch': 0.46}
{'loss': 1.0859, 'learning_rate': 2.8306425549316014e-05, 'epoch': 0.46}
{'loss': 1.0312, 'learning_rate': 2.8306134115147088e-05, 'epoch': 0.46}
 15%|█▌        | 11495/74640 [4:40:40<9567:52:39, 545.48s/it]

or an OOM error:

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dg2/miniconda3/envs/llm_ipex2.1.10_python3.11_finetune/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dg2/miniconda3/envs/llm_ipex2.1.10_python3.11_finetune/lib/python3.11/site-packages/torch/nn/modules/loss.py", line 1179, in forward
    return F.cross_entropy(input, target, weight=self.weight,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dg2/miniconda3/envs/llm_ipex2.1.10_python3.11_finetune/lib/python3.11/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Allocation is out of device memory on current platform.

Is there anything we need to set?

Hi @YongZhuIntel, sorry, I cannot reproduce the issue. Please check whether your dependencies match those in my environment below:

CPU: i9 14900K
GPU: arc A770 * 2
Memory: 64GB
XPU memory: 16G

| 1. GPU Core Temperature    | Status: OK                                                          |
|                            | Description: All temperature sensors are healthy.                   |
|                            | Throttle Threshold: 105 Celsius Degree                              |
|                            | Shutdown Threshold: 130 Celsius Degree                              |
+----------------------------+---------------------------------------------------------------------+
| 3. GPU Power               | Status: OK                                                          |
|                            | Description: All power domains are healthy.                         |
|                            | Throttle Threshold: 300 watts                                       |
+----------------------------+---------------------------------------------------------------------+
| 6. GPU Frequency           | Status: OK                                                          |
|                            | Description: The device frequency not throttled


deepspeed==0.11.2+78c518ed
transformers==4.36.0
intel_extension_for_deepspeed==0.9.4+ec33277
ipex-llm==2.1.0b20240623
torch==2.1.0a0+cxx11.abi

Hi @YongZhuIntel, as your platform does not have enough CPU memory, I suggest trying the script below, where the backend is switched from MPI to torchrun and only the QKV layers are tuned in order to save memory:

lora_deepspeed_zero3_finetune_chatglm3_6b_arc_2_card.sh:

#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29503
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets

basekit_root=/opt/intel/oneapi
source $basekit_root/setvars.sh --force
source $basekit_root/ccl/latest/env/vars.sh --force

NUM_GPUS=2 # number of used GPU
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0 # Different from PVC

CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
    ./alpaca_lora_finetuning.py \
       --base_model "THUDM/chatglm3-6b" \
       --data_path "yahma/alpaca-cleaned" \
       --output_dir "./ipex-llm-lora-alpaca" \
       --gradient_checkpointing True \
       --lora_target_modules "['query_key_value']" \
       --micro_batch_size 1 \
       --batch_size 2 \
       --save_checkpoint False \
       --deepspeed_zero3 True

With this script, ~28.5GB of CPU memory and ~15.4GB of XPU memory per card are used, which should fit your platform.

@Uxito-Ada I tried the memory-saving script, but still encountered the hang issue:

{'loss': 1.2109, 'learning_rate': 2.5541672068979266e-05, 'epoch': 0.76}
{'loss': 1.293, 'learning_rate': 2.554122291222198e-05, 'epoch': 0.76}
{'loss': 0.9395, 'learning_rate': 2.5540773736790265e-05, 'epoch': 0.76}
{'loss': 0.8633, 'learning_rate': 2.5540324542684904e-05, 'epoch': 0.76}
{'loss': 0.9258, 'learning_rate': 2.5539875329906703e-05, 'epoch': 0.76}
{'loss': 1.0039, 'learning_rate': 2.553942609845645e-05, 'epoch': 0.76}
{'loss': 0.9785, 'learning_rate': 2.5538976848334947e-05, 'epoch': 0.76}
{'loss': 0.793, 'learning_rate': 2.553852757954299e-05, 'epoch': 0.76}
 25%|██▌       | 18813/74640 [8:05:58<655:14:55, 42.25s/it][2024-06-27 03:58:45,354] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 107169 closing signal SIGTERM
[2024-06-27 03:59:15,382] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 107169 via 15, forcefully exiting via 9