Simple question about the generate function during inference
erjui opened this issue
Hi, thanks for the nice work first of all.
As far as I understand, during inference we forward all of the previous tokens step by step to generate the next token.
However, in the generate function in llama_adapter.py, the model only receives the single previous token when generating the next one, which seems unnatural to me.
Could you explain why only the previous token is used to generate the next token?
`prev_pos = cur_pos`
at line 260 of
https://github.com/OpenGVLab/LLaMA-Adapter/blob/main/llama_adapter_v2_multimodal7b/llama/llama_adapter.py#L260
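For context, the surrounding loop looks roughly like this (a simplified, hypothetical sketch, not the verbatim repo code; `model(tokens, start_pos)` is assumed to return per-position logits):

```python
import torch

def generate_sketch(model, tokens: torch.Tensor, start_pos: int, total_len: int):
    prev_pos = 0
    for cur_pos in range(start_pos, total_len):
        # The first iteration forwards the whole prompt, tokens[:, 0:start_pos].
        # Every later iteration forwards exactly one token, because prev_pos
        # was set to cur_pos - 1 at the end of the previous step.
        logits = model(tokens[:, prev_pos:cur_pos], prev_pos)
        tokens[:, cur_pos] = torch.argmax(logits[:, -1], dim=-1)
        prev_pos = cur_pos  # the line at L260 the question is about
    return tokens
```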
I really appreciate any help you can provide.
Hi @erjui, we follow LLaMA's official repo and use a token (key/value) cache to speed up inference: only the newest token is forwarded through the model, while the keys and values of all earlier tokens are read back from the cache inside each attention layer.
See LLaMA-Adapter/llama_adapter_v2_multimodal7b/llama/llama.py, lines 163 to 175 (at commit 5e8c8b6).
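The idea in those lines, sketched as a minimal single-head attention with a cache (a hypothetical simplification; the real module is multi-head and applies rotary embeddings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedAttention(nn.Module):
    """Single-head attention with a key/value cache (illustrative sketch)."""

    def __init__(self, dim: int, max_batch: int = 1, max_seq_len: int = 512):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        # The cache stores keys/values for every position seen so far.
        self.register_buffer("cache_k", torch.zeros(max_batch, max_seq_len, dim))
        self.register_buffer("cache_v", torch.zeros(max_batch, max_seq_len, dim))

    def forward(self, x: torch.Tensor, start_pos: int) -> torch.Tensor:
        # x covers only positions [start_pos, start_pos + seqlen); during
        # decoding, seqlen is 1, i.e. just the newest token.
        bsz, seqlen, dim = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        # Write the new keys/values into the cache at their absolute positions.
        self.cache_k[:bsz, start_pos:start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos:start_pos + seqlen] = xv

        # Attend over everything cached so far, not just the new token, so
        # forwarding one token still sees the full history.
        keys = self.cache_k[:bsz, :start_pos + seqlen]
        values = self.cache_v[:bsz, :start_pos + seqlen]

        scores = xq @ keys.transpose(1, 2) / (dim ** 0.5)
        return F.softmax(scores, dim=-1) @ values
```

So a call like `attn(new_token_embedding, start_pos=cur_pos)` with a length-1 input is equivalent to re-attending over the whole prefix, just without recomputing it.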
Thanks a lot for the answer!