OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

Simple question about the generate function at inference time

erjui opened this issue · comments

Hi, thanks for the nice work first of all.

As far as I understand, at inference time we forward all of the previous tokens step by step to generate the next token.
However, in the generate function in llama_adapter.py, the model is only given the single previous token when generating the next one, which seems unnatural to me.

Could you explain why only the previous token is used to generate the next token?

The line in question is prev_pos = cur_pos at line 260:
https://github.com/OpenGVLab/LLaMA-Adapter/blob/main/llama_adapter_v2_multimodal7b/llama/llama_adapter.py#L260
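
To illustrate what I mean, here is a minimal, runnable sketch of the loop structure as I read it (dummy_model, the vocabulary size, and the lengths are just placeholders, not the repo's code):

import torch

def dummy_model(token_slice, start_pos):
    # Stand-in for the real forward pass: returns logits for the newest position only.
    return torch.randn(token_slice.shape[0], 32000)

tokens = torch.zeros(1, 16, dtype=torch.long)
start_pos, total_len = 8, 16          # 8 prompt tokens, generate 8 more
prev_pos = 0
for cur_pos in range(start_pos, total_len):
    # First iteration: the whole prompt tokens[:, 0:8] is forwarded.
    # Every later iteration: only the single newest token tokens[:, cur_pos-1:cur_pos].
    logits = dummy_model(tokens[:, prev_pos:cur_pos], prev_pos)
    tokens[:, cur_pos] = torch.argmax(logits, dim=-1)
    prev_pos = cur_pos                # the line asked about: skip tokens already forwarded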

I really appreciate any help you can provide.

Hi @erjui, we follow LLaMA's official repo and use a token (key/value) cache to speed up inference: the attention layers cache the keys and values of all previous positions, so only the newest token needs to be forwarded at each step.

if not self.training:
    # Inference: move the cache to the query's device/dtype, write the new keys/values
    # at positions [start_pos, start_pos + seqlen), then attend over everything cached so far.
    self.cache_k = self.cache_k.to(xq)
    self.cache_v = self.cache_v.to(xq)
    self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
    self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv
    keys = self.cache_k[:bsz, : start_pos + seqlen]
    values = self.cache_v[:bsz, : start_pos + seqlen]
else:
    # Training: the full sequence is always passed in, so no cache is needed.
    assert start_pos == 0
    keys = xk
    values = xv
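
To make this concrete, here is a minimal, self-contained sketch (not the repo's code; toy_attention and the shapes are illustrative) showing why attending only the newest token's query over the cached keys/values gives the same output as re-forwarding the whole sequence:

import torch

def toy_attention(q, k, v):
    # Single-head attention of query rows q over all key/value rows k, v.
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
x = torch.randn(5, 8)                      # embeddings of 5 tokens (projections omitted)

# Full-sequence pass: forward all 5 tokens, keep the last position's output.
# (A causal mask changes nothing for the last position, which may attend to every token.)
full_out = toy_attention(x, x, x)[-1]

# Cached pass: only the newest token is forwarded; earlier keys/values come from the cache.
cache_k, cache_v = x[:4].clone(), x[:4].clone()
new_token = x[4:5]
cache_k = torch.cat([cache_k, new_token])  # write the new key/value into the cache
cache_v = torch.cat([cache_v, new_token])
cached_out = toy_attention(new_token, cache_k, cache_v)[0]

assert torch.allclose(full_out, cached_out, atol=1e-6)

Since the keys and values of earlier positions never change, caching them means each decoding step only has to run attention for a single query row, which is why the generate loop can pass just the newest token.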

Thanks a lot for the answer!