rmihaylov / falcontune

Tune any FALCON in 4-bit

ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA devices

chintan-donda opened this issue

I'm getting the error below when trying to finetune the model.

Converted as Half.
trainable params: 8355840 || all params: 1075691520 || trainable%: 0.7767877541695225
Found cached dataset json (/home/users/users/.cache/huggingface/datasets/json/default-7089e4ef944c023b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.48it/s]
Loading cached split indices for dataset at /home/users/users/.cache/huggingface/datasets/json/default-7089e4ef944c023b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-a03d095090258b35.arrow and /home/users/users/.cache/huggingface/datasets/json/default-7089e4ef944c023b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-f83f741993333274.arrow
Run eval every 6 steps                                                                                                                                                  
Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
PyTorch: setting up devices

Traceback (most recent call last):
  File "/home/users/users/falcontune/venv_falcontune/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/home/users/users/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/run.py", line 87, in main
    args.func(args)
  File "/home/users/users/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/finetune.py", line 116, in finetune
    training_arguments = transformers.TrainingArguments(
  File "<string>", line 111, in __init__
  File "/home/users/users/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/training_args.py", line 1338, in __post_init__
    raise ValueError(
ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA devices.
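
For context on where this check lives: TrainingArguments.__post_init__ in transformers 4.29 raises this ValueError whenever fp16 (or fp16_full_eval) is requested but the training device it resolves is not a CUDA device. A minimal sketch of the failing condition (output_dir here is illustrative, not falcontune's actual argument):

import transformers

# On a process where no CUDA device is visible, requesting fp16 reproduces
# the exact error above during TrainingArguments construction.
args = transformers.TrainingArguments(
    output_dir="./out",   # illustrative path
    fp16=True,            # the flag named in the ValueError
)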

Experimental setup details:
OS: Ubuntu 18.04.5 LTS
GPU: Tesla V100-SXM2-32GB
Libs:

bitsandbytes==0.39.0
transformers==4.29.2
triton==2.0.0
sentencepiece==0.1.99
datasets==2.12.0
peft==0.3.0
torch==2.0.1+cu118
accelerate==0.19.0
safetensors==0.3.1
einops==0.6.1
wandb==0.15.3
scipy==1.10.1

Finetuning command:

falcontune finetune \
    --model="falcon-40b-instruct-4bit" \
    --weights="./gptq_model-4bit--1g.safetensors" \
    --dataset="./alpaca_cleaned.json" \
    --data_type="alpaca" \
    --lora_out_dir="./falcon-40b-instruct-4bit-alpaca/" \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=$epochs \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --target_modules='["query_key_value"]' \
    --backend="triton"
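
For reference, a quick sanity check (purely a diagnostic suggestion; a healthy CUDA build of torch should print True and a non-zero device count):

# run inside the same venv_falcontune environment
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())

A CPU-only torch wheel, a driver mismatch, or an empty CUDA_VISIBLE_DEVICES would all make transformers resolve the training device to CPU and raise the fp16 ValueError above, even though a V100 is physically present.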

Any help please?

To train on a V100, we need to set enable_mem_efficient=True; otherwise the above error is shown.

--- a/falcontune/model/falcon/model.py
+++ b/falcontune/model/falcon/model.py
@@ -523,7 +523,7 @@ class Attention40B(nn.Module):
             key_layer_ = key_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)
             value_layer_ = value_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)

-            with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
+            with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True):
                 attn_output = F.scaled_dot_product_attention(
                     query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
                 )
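
For background (an interpretation of the comment above, not something stated in the thread): PyTorch's FlashAttention backend for scaled_dot_product_attention does not support Volta GPUs such as the V100 (compute capability 7.0), so with both enable_math and enable_mem_efficient set to False the context manager leaves SDPA with no usable kernel. Re-enabling the memory-efficient kernel gives it a fallback. A standalone sketch of the patched behaviour:

import torch
import torch.nn.functional as F

# A V100 reports compute capability (7, 0); the flash kernel needs newer hardware.
print(torch.cuda.get_device_capability(0))

# Dummy query/key/value tensors shaped (batch, heads, seq_len, head_dim).
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# With enable_mem_efficient=True, SDPA can fall back to the memory-efficient
# kernel when the flash kernel is unavailable on this GPU.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, None, 0.0, is_causal=True)

print(out.shape)  # torch.Size([1, 8, 128, 64])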