lyogavin / Anima

33B Chinese LLM, DPO QLORA, 100K context, AirLLM 70B inference with single 4GB GPU

Generation takes forever

Kira-Pgr opened this issue

Env

  • Python 3.9.18
  • NVIDIA GeForce RTX 4060 Laptop GPU
  • PyTorch 2.1.1
  • airllm 2.8.3
  • CUDA build cuda_12.2.r12.2/compiler.32965470_0

Model used

https://huggingface.co/152334H/miqu-1-70b-sf

Code

from airllm import AutoModel

MAX_LENGTH = 128
# load the 70B model with 4-bit compression; AirLLM streams the weights layer by layer
model = AutoModel.from_pretrained("/mnt/d/miqu-1-70b-sf", compression='4bit')
input_text = [
    "[INST] eloquent high camp prose about a cute catgirl [/INST]",
]
# the tokenizer ships without a pad token, so reuse EOS for padding
model.tokenizer.pad_token = model.tokenizer.eos_token
input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=False,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)
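
For reference, the wall time per generated token can be measured directly. A minimal sketch, reusing the model and input_tokens objects from the snippet above:

import time

start = time.time()
out = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=1,  # even one new token requires a full pass over all layers
    use_cache=False,
    return_dict_in_generate=True)
print(f"one token took {time.time() - start:.0f}s")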

Problem

It keeps re-running the running layers(self.running_device) pass over and over:

new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████| 83/83 [27:57<00:00, 20.22s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████| 83/83 [30:15<00:00, 21.87s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|████████████████████████████████████████████| 83/83 [1:04:38<00:00, 46.73s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|████████████████████████████████████████████| 83/83 [1:13:57<00:00, 53.47s/it]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device):  23%|██████████▌                                   | 19/83 [11:01<37:06, 34.79s/it]

Loading the model didn't raise any errors, but it prints this:

new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
not support prefetching for compression for now. loading with no prepetching mode.

The solution from #107 didn't work.

Any updates?

I'm getting the same issue on a 13700K, a 4090, and 32 GB RAM. Was this resolved?

This isn't a bug; it's related to max_new_tokens=20. If it's 20, generation has to run 20 passes; if it's 200, it has to run 200 passes...
It's a bit slow.
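
In other words, with layer-by-layer offloading every generated token costs one full pass over all 83 layers, so the total time is roughly layers × seconds-per-layer × max_new_tokens. A back-of-the-envelope estimate using the per-layer times read off the progress bars above (a sketch, not a measurement):

layers = 83             # layer count shown in the progress bars
sec_per_layer = 20.22   # fastest per-layer time in the logs
max_new_tokens = 20     # value used in the repro script
total_seconds = layers * sec_per_layer * max_new_tokens
print(f"~{total_seconds / 3600:.1f} hours")  # ~9.3 hours for 20 tokens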