OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

Simple question about the generate function at inference time

erjui opened this issue · comments

Hi, thanks for the nice work first of all.

As far as I understand, at inference time we forward all of the previous tokens step by step to generate the next token.
However, in the generate function in llama_adapter.py, the model is only given the single previous token when generating the next one, which seems unnatural to me.

Could you explain why only the previous token is used to generate the next token?

The line in question is prev_pos = cur_pos at line 260:
https://github.com/OpenGVLab/LLaMA-Adapter/blob/main/llama_adapter_v2_multimodal7b/llama/llama_adapter.py#L260
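
To illustrate what I mean, here is a minimal, runnable sketch of the loop structure as I read it (dummy_model, the vocabulary size, and the lengths are just placeholders, not the repo's code):

import torch

def dummy_model(token_slice, start_pos):
    # Stand-in for the real forward pass: returns logits for the newest position only.
    return torch.randn(token_slice.shape[0], 32000)

tokens = torch.zeros(1, 16, dtype=torch.long)
start_pos, total_len = 8, 16          # 8 prompt tokens, generate 8 more
prev_pos = 0
for cur_pos in range(start_pos, total_len):
    # First iteration: the whole prompt tokens[:, 0:8] is forwarded.
    # Every later iteration: only the single newest token tokens[:, cur_pos-1:cur_pos].
    logits = dummy_model(tokens[:, prev_pos:cur_pos], prev_pos)
    tokens[:, cur_pos] = torch.argmax(logits, dim=-1)
    prev_pos = cur_pos                # the line asked about: skip tokens already forwarded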

I really appreciate any help you can provide.

Hi @erjui, we follow LLaMA's official repo and use a token (key/value) cache to speed up inference: the attention layers cache the keys and values of all previous positions, so only the newest token needs to be forwarded at each step.

if not self.training:
    # Inference: move the cache to the query's device/dtype, write the new keys/values
    # at positions [start_pos, start_pos + seqlen), then attend over everything cached so far.
    self.cache_k = self.cache_k.to(xq)
    self.cache_v = self.cache_v.to(xq)
    self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
    self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv
    keys = self.cache_k[:bsz, : start_pos + seqlen]
    values = self.cache_v[:bsz, : start_pos + seqlen]
else:
    # Training: the full sequence is always passed in, so no cache is needed.
    assert start_pos == 0
    keys = xk
    values = xv
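
To make this concrete, here is a minimal, self-contained sketch (not the repo's code; toy_attention and the shapes are illustrative) showing why attending only the newest token's query over the cached keys/values gives the same output as re-forwarding the whole sequence:

import torch

def toy_attention(q, k, v):
    # Single-head attention of query rows q over all key/value rows k, v.
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
x = torch.randn(5, 8)                      # embeddings of 5 tokens (projections omitted)

# Full-sequence pass: forward all 5 tokens, keep the last position's output.
# (A causal mask changes nothing for the last position, which may attend to every token.)
full_out = toy_attention(x, x, x)[-1]

# Cached pass: only the newest token is forwarded; earlier keys/values come from the cache.
cache_k, cache_v = x[:4].clone(), x[:4].clone()
new_token = x[4:5]
cache_k = torch.cat([cache_k, new_token])  # write the new key/value into the cache
cache_v = torch.cat([cache_v, new_token])
cached_out = toy_attention(new_token, cache_k, cache_v)[0]

assert torch.allclose(full_out, cached_out, atol=1e-6)

Since the keys and values of earlier positions never change, caching them means each decoding step only has to run attention for a single query row, which is why the generate loop can pass just the newest token.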

Thanks a lot for the answer!