Generation takes forever
Kira-Pgr opened this issue
KawaiiPGR commented
Env
- Python 3.9.18
- NVIDIA GeForce RTX 4060 Laptop GPU
- pytorch 2.1.1
- airllm 2.8.3
- Build cuda_12.2.r12.2/compiler.32965470_0
Model used
https://huggingface.co/152334H/miqu-1-70b-sf
Code
from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("/mnt/d/miqu-1-70b-sf", compression='4bit')

input_text = [
    "[INST] eloquent high camp prose about a cute catgirl [/INST]",
]

model.tokenizer.pad_token = model.tokenizer.eos_token
input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=True)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=False,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
Problem
Generation keeps repeating the running layers(self.running_device) pass over and over:
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████| 83/83 [27:57<00:00, 20.22s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████| 83/83 [30:15<00:00, 21.87s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|████████████████████████████████████████████| 83/83 [1:04:38<00:00, 46.73s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|████████████████████████████████████████████| 83/83 [1:13:57<00:00, 53.47s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 23%|██████████▌ | 19/83 [11:01<37:06, 34.79s/it]
Loading the model didn't give errors, but it prints this:
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
not support prefetching for compression for now. loading with no prepetching mode.
The solution from #107 didn't work.
Ahmed Breem commented
any updates?
shailin1 commented
Getting the same issue on a 13700K, RTX 4090, and 32 GB RAM. Was this resolved?
leedahae340 commented
This isn't a bug; it's related to max_new_tokens=20. If it's 20, the layer pass has to run 20 times; if it's 200, it runs 200 times...
It's just slow.
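The explanation above matches the logs: AirLLM streams every layer shard through the GPU once per generated token, so total runtime scales linearly with max_new_tokens. A back-of-envelope sketch, using the fastest per-layer time from the progress bars (~20.2 s/it, an assumption read off the logs, not a measured API value):

```python
# Rough cost model for AirLLM layer-streaming generation (illustrative only).
layers = 83               # layer passes per token, per the progress bars above
sec_per_layer = 20.2      # fastest per-iteration time seen in the logs
max_new_tokens = 20       # as in the reproduction code

total_seconds = layers * sec_per_layer * max_new_tokens
print(f"~{total_seconds / 3600:.1f} hours for {max_new_tokens} tokens")
```

That estimate is on the order of the multi-hour runtimes shown in the logs, and raising max_new_tokens to 200 would multiply it tenfold — so the slowness is expected behavior for disk-streamed 70B inference, not an error.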