rmihaylov / falcontune

Tune any FALCON in 4-bit

RuntimeError: No available kernel. Aborting execution.

RealCalumPlays opened this issue

Any ideas? Full log below:

Traceback (most recent call last):
  File "/home/cosmos/miniconda3/envs/ftune/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 87, in main
    args.func(args)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/finetune.py", line 162, in finetune
    trainer.train()
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1070, in forward
    transformer_outputs = self.transformer(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 965, in forward
    outputs = block(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 698, in forward
    attn_outputs = self.self_attention(
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 337, in forward
    attn_output = F.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.

EDIT: CUDA is installed in the kernel modules, on the system, and in the environment, just to rule that out. Using Python 3.10.6.

Same error here on a Tesla V100-SXM2-32GB.

There is a choice of three kernels:

torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)

Currently, only flash attention is enabled. Try enabling the other backends as well.
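
To see which kernels are actually available on a given GPU, a minimal standalone probe like the one below can help (this is my own sketch, not falcontune code; the tensor shapes are arbitrary):

import torch
import torch.nn.functional as F

# Dummy half-precision tensors shaped (batch, heads, seq_len, head_dim).
q = torch.randn(1, 8, 128, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 128, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 128, 64, dtype=torch.float16, device="cuda")

backends = [
    ("flash", dict(enable_flash=True, enable_math=False, enable_mem_efficient=False)),
    ("mem_efficient", dict(enable_flash=False, enable_math=False, enable_mem_efficient=True)),
    ("math", dict(enable_flash=False, enable_math=True, enable_mem_efficient=False)),
]

for name, flags in backends:
    try:
        # Restrict SDPA to a single backend and see whether it can run.
        with torch.backends.cuda.sdp_kernel(**flags):
            F.scaled_dot_product_attention(q, k, v, None, 0.0, is_causal=True)
        print(f"{name}: OK")
    except RuntimeError as err:
        print(f"{name}: {err}")

If only "mem_efficient" and "math" print OK, the flash-only context manager in model.py will raise exactly this "No available kernel" error.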

same error here on Tesla V100-SXM2-32GB

Same issue for me as well on the same machine, with the details below:
OS: Ubuntu 18.04.5 LTS
Libs:

bitsandbytes==0.39.0
transformers==4.29.2
triton==2.0.0
sentencepiece==0.1.99
datasets==2.12.0
peft==0.3.0
torch==2.0.1+cu118
accelerate==0.19.0
safetensors==0.3.1
einops==0.6.1
wandb==0.15.3
scipy==1.10.1

There is a choice of three kernels:

torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False)

Currently, only flash attention is enabled. Try enabling the other backends as well.

Doing this gives the error below:

Traceback (most recent call last):
  File "falcontune/run.py", line 93, in <module>
    main()
  File "falcontune/run.py", line 89, in main 
    args.func(args)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/finetune.py", line 162, in fin
etune
    trainer.train()
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)  
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 1070, in forward
    transformer_outputs = self.transformer(  
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 965, in forward
    outputs = block(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 634, in forward
    attn_outputs = self.self_attention(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 486, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/lora.py", line 54, in forward
    result = self.quant_class.forward(self, x)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/triton/quantlinear.py", line 13, in forward
    out = AutogradMatmul.apply(
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/triton/autograd.py", line 11, in forward
    output = tu.triton_matmul(x, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/triton/triton_utils.py", line 246, in triton_matmul
    matmul_248_kernel[grid](input, qweight, output,
  File "/home/users/user/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/triton/custom_autotune.py", line 110, in run
    return self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
  File "<string>", line 24, in matmul_248_kernel
ValueError: Pointer argument (at 1) cannot be accessed from Triton (cpu tensor?)
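
That ValueError usually means one of the tensors passed to the Triton kernel (argument 1 is the quantized weight) is still on the CPU, for example because accelerate offloaded part of the model. As a quick check before training, something like this generic PyTorch sketch (my own, not part of falcontune) lists everything that did not end up on a CUDA device:

import torch

def report_non_cuda_tensors(model):
    # Print any parameter or buffer that is not on a CUDA device;
    # the Triton matmul cannot read CPU-resident qweight/scales/qzeros buffers.
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        if tensor.device.type != "cuda":
            print(name, tensor.device)

# Usage, assuming `model` is the loaded falcontune model:
# report_non_cuda_tensors(model)

If the quantized layers show up in that list, keeping the whole model on a single GPU (rather than letting accelerate offload to CPU) should avoid the Triton pointer error.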

I was having this same issue on a Google Colab V100; switching to an A100 fixed it for me.

Any fix for this? I'm still getting this issue.

On the V100, we need to enable the mem_efficient mode; it doesn't support native flash attention.

--- a/falcontune/model/falcon/model.py
+++ b/falcontune/model/falcon/model.py
@@ -523,7 +523,7 @@ class Attention40B(nn.Module):
             key_layer_ = key_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)
             value_layer_ = value_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)

-            with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
+            with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True):
                 attn_output = F.scaled_dot_product_attention(
                     query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
                 )
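
A slightly more general variant (my own sketch, not from the repo) is to pick the backends from the GPU's compute capability, since PyTorch's fused flash kernels generally require Ampere (SM 8.0) or newer, while the V100 is SM 7.0. Reusing the module's existing imports and the query_layer_/key_layer_/value_layer_ tensors from the diff context above, the block could look like:

            # Flash SDPA generally needs SM 8.0+ (A100 and newer); on older cards
            # such as the V100 (SM 7.0), fall back to the memory-efficient kernel.
            major, _ = torch.cuda.get_device_capability()
            use_flash = major >= 8

            with torch.backends.cuda.sdp_kernel(
                enable_flash=use_flash,
                enable_math=False,
                enable_mem_efficient=not use_flash,
            ):
                attn_output = F.scaled_dot_product_attention(
                    query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
                )

Leaving enable_math=True as an additional fallback is also an option if neither fused kernel applies on a particular setup.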