AkariAsai / self-rag

This repository includes the original implementation of SELF-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.

Home Page: https://selfrag.github.io/

The result of direct inference without using vLLM is wrong; is it a problem with the model?

lizhongv opened this issue

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, GenerationConfig
# from vllm import LLM, SamplingParams
import torch
device = torch.device(0)


def load_tokenizer_and_model():
  tokenizer = AutoTokenizer.from_pretrained('/root/autodl-tmp/selfrag_llama2_7b')
  config = AutoConfig.from_pretrained('/root/autodl-tmp/selfrag_llama2_7b')
  model = AutoModelForCausalLM.from_pretrained(
    '/root/autodl-tmp/selfrag_llama2_7b',
    torch_dtype=torch.float16,
    config=config
  )

  model.to(device)
  model.eval()
  return tokenizer, model

def format_prompt(input, paragraph=None):
  prompt = "### Instruction:\n{0}\n\n### Response:\n".format(input)
  if paragraph is not None:
    prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
  return prompt

if __name__ == "__main__":
  query_1 = "Leave odd one out: twitter, instagram, whatsapp."
  query_2 = "Can you tell me the difference between llamas and alpacas?"
  queries = [query_1, query_2]
  tokenizer, model = load_tokenizer_and_model()

  for q in queries:
    # inputs = tokenizer([format_prompt(query) for query in queries], return_tensors='pt')
    inputs = tokenizer(format_prompt(q), return_tensors='pt')
    input_ids = inputs['input_ids'].to(device)

    generation_config = GenerationConfig(
      temperature=0.0,
      top_p=1.0,
      max_tokens=100
    )
    with torch.no_grad():
      generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        repetition_penalty=1.2,
      )
    output = generation_output.sequences[0]
    output = tokenizer.decode(output, skip_special_tokens=True)
    print(output)

"""
'### Instruction:
Leave odd one out: twitter, instagram, whatsapp.

### Response:
Tw'


'### Instruction:
Can you tell me the difference between llamas and alpacas?

### Response:
S'
"""

Thank you for reporting! Did the model work okay with vLLM? If so, the issue might come from the libraries.
When we were working on earlier versions of Self-RAG back in June, I ran into multiple issues related to inconsistent predictions between vLLM and transformers (e.g., transformers batch decoding with Llama 2 had some issues, or vLLM predictions weren't exactly the same as transformers' when they should match). For those cases, it might be better to check some open issues in vllm or transformers.
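
For a quick sanity check on the vLLM side, the repository README runs the same prompts roughly as follows (a sketch, not the exact README code; it reuses the local model path, `format_prompt`, and `queries` from the snippet above, and `dtype="half"` is an assumption):

  from vllm import LLM, SamplingParams

  # Load the same checkpoint through vLLM instead of transformers.
  model = LLM("/root/autodl-tmp/selfrag_llama2_7b", dtype="half")
  sampling_params = SamplingParams(
    temperature=0.0, top_p=1.0, max_tokens=100, skip_special_tokens=False
  )

  preds = model.generate([format_prompt(q) for q in queries], sampling_params)
  for pred in preds:
    # With vLLM, the full continuation should appear, including reflection
    # tokens such as [Retrieval] and [No Retrieval].
    print("Model prediction: {0}".format(pred.outputs[0].text))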

I ran into the same problem, and it is very strange. Is there any idea for a solution?