AkariAsai / self-rag

This repository includes the original implementation of SELF-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.

Home Page: https://selfrag.github.io/

The result of direct inference without using vLLM is wrong; is it a problem with the model?

lizhongv opened this issue

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, GenerationConfig
# from vllm import LLM, SamplingParams
import torch
device = torch.device(0)


def load_tokenizer_and_model():
  tokenizer = AutoTokenizer.from_pretrained('/root/autodl-tmp/selfrag_llama2_7b')
  config = AutoConfig.from_pretrained('/root/autodl-tmp/selfrag_llama2_7b')
  model = AutoModelForCausalLM.from_pretrained(
    '/root/autodl-tmp/selfrag_llama2_7b',
    torch_dtype=torch.float16,
    config=config
  )

  model.to(device)
  model.eval()
  return tokenizer, model

def format_prompt(input, paragraph=None):
  prompt = "### Instruction:\n{0}\n\n### Response:\n".format(input)
  if paragraph is not None:
    prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
  return prompt

if __name__ == "__main__":
  query_1 = "Leave odd one out: twitter, instagram, whatsapp."
  query_2 = "Can you tell me the difference between llamas and alpacas?"
  queries = [query_1, query_2]
  tokenizer, model = load_tokenizer_and_model()

  for q in queries:
    # inputs = tokenizer([format_prompt(query) for query in queries], return_tensors='pt')
    inputs = tokenizer(format_prompt(q), return_tensors='pt')
    input_ids = inputs['input_ids'].to(device)

    generation_config = GenerationConfig(
      temperature=0.0,
      top_p=1.0,
      max_tokens=100
    )
    with torch.no_grad():
      generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        repetition_penalty=1.2,
      )
    output = generation_output.sequences[0]
    output = tokenizer.decode(output, skip_special_tokens=True)
    print(output)

"""
'### Instruction:
Leave odd one out: twitter, instagram, whatsapp.

### Response:
Tw'


'### Instruction:
Can you tell me the difference between llamas and alpacas?

### Response:
S'
"""

Thank you for reporting! Did the model work okay with vLLM? If so, the issue might come from the libraries.
When we were working on earlier versions of Self-RAG back in June, I ran into multiple issues related to inconsistent predictions between vLLM and transformers (e.g., transformers batch decoding with Llama 2 had some issues, or vLLM predictions weren't exactly the same as transformers' when they should match). For those cases, it might be better to check some open issues in vllm or transformers.
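
For a quick sanity check on the vLLM side, the repository README runs the same prompts roughly as follows (a sketch, not the exact README code; it reuses the local model path, `format_prompt`, and `queries` from the snippet above, and `dtype="half"` is an assumption):

  from vllm import LLM, SamplingParams

  # Load the same checkpoint through vLLM instead of transformers.
  model = LLM("/root/autodl-tmp/selfrag_llama2_7b", dtype="half")
  sampling_params = SamplingParams(
    temperature=0.0, top_p=1.0, max_tokens=100, skip_special_tokens=False
  )

  preds = model.generate([format_prompt(q) for q in queries], sampling_params)
  for pred in preds:
    # With vLLM, the full continuation should appear, including reflection
    # tokens such as [Retrieval] and [No Retrieval].
    print("Model prediction: {0}".format(pred.outputs[0].text))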

I ran into the same problem, and it is very strange. Is there any idea for a solution?