nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Batch inference

RyanChen1997 opened this issue · comments

Sorry, I am new to this.
Following the code in inference_wizardcoder.py, I created a service and ran a benchmark test. The result: with a concurrency of 5, each request takes about 35s on average.
I want to reduce the latency and increase concurrency.
Right now each request is processed one at a time (the model is called with a single input per call). Is there a way to batch multiple requests into one forward pass?

for num, line in enumerate(input_data):
    one_data = line
    id = one_data["idx"]
    instruction = one_data["Instruction"]
    print(instruction)
    _output = evaluate(instruction, tokenizer, model)  # call the model with one input at a time
    final_output = _output[0].split("### Response:")[1].strip()
    new_data = {
        "id": id,
        "instruction": instruction,
        "wizardcoder": final_output
    }
    output_data.write(new_data)

I want to change this logic so it can do batch inference.
Thanks a lot!
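
One way to restructure that loop is to slice the inputs into fixed-size chunks and call the model once per chunk. The sketch below is only an illustration under assumptions: evaluate_batch is a hypothetical helper that takes a list of instructions and returns one decoded string per instruction (a batched generate method like the one later in this thread could fill that role), and BATCH_SIZE is a knob you would tune to fit GPU memory.

BATCH_SIZE = 8  # assumption: pick based on available GPU memory

# Collect (id, instruction) pairs first so they can be sliced into chunks.
items = [(d["idx"], d["Instruction"]) for d in input_data]

for start in range(0, len(items), BATCH_SIZE):
    chunk = items[start:start + BATCH_SIZE]
    instructions = [instr for _, instr in chunk]

    # evaluate_batch is hypothetical: it should tokenize the whole list with
    # padding and call model.generate once for the entire batch.
    outputs = evaluate_batch(instructions, tokenizer, model)

    for (idx, instruction), raw in zip(chunk, outputs):
        final_output = raw.split("### Response:")[1].strip()
        output_data.write({
            "id": idx,
            "instruction": instruction,
            "wizardcoder": final_output,
        })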

@ChiYeungLaw @nlpxucan
I am trying to run batch inference by making a small change on this line. However, since different inputs may not be the same length, the shorter inputs need left-side padding.

My question is: which padding token should be used? The default padding token (i.e. tokenizer.pad_token) is '[PAD]'. However, I have seen examples online (such as this and this) which explicitly set the padding token to tokenizer.eos_token, i.e. '<|endoftext|>'.

What is the correct padding token to use?
Thanks.
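
For decoder-only models, a common pattern (not specific to this repo, so treat it as an assumption rather than the authors' recommendation) is to reuse the EOS token as the padding token and to pad on the left, so generation continues directly from the real prompt tokens:

from transformers import AutoTokenizer

# assumption: base_model points at the WizardCoder checkpoint you are using
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.padding_side = "left"  # pad on the left so generation starts right after the prompt
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as PAD; pad positions should be masked out anyway

inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=256)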

def generate(self, batch_data):
    if isinstance(batch_data, list):
        prompts = []
        for data in batch_data:
            prompts.append(self._generate_prompt(data))
    else:
        prompts = self._generate_prompt(batch_data)
    inputs = self.tokenizer(
        prompts, return_tensors="pt", max_length=256, truncation=True, padding=True
    )
    input_ids = inputs["input_ids"].to(self.device)
    with torch.no_grad():
        generation_output = self.model.generate(
            input_ids=input_ids,
            generation_config=self.generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=self.max_new_tokens,
        )
    s = generation_output.sequences
    output = self.tokenizer.batch_decode(s, skip_special_tokens=True)
    return output

It works.

@RyanChen1997 Can you also provide the definition of self._generate_prompt?
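
Not the original poster's code, but judging from the Alpaca-style prompt in inference_wizardcoder.py (and the "### Response:" split in the loop earlier in this thread), _generate_prompt most likely looks something like this:

def _generate_prompt(self, instruction):
    # Alpaca-style prompt used by WizardCoder; the exact wording is an assumption
    # taken from the repo's single-input inference script.
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:"
    )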

@RyanChen1997 Thank you.
How can I load WizardLM on multiple GPUs? Will simple DDP work?
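
Not an official answer, but for inference (as opposed to training) the usual approach is to let Accelerate shard the model across the visible GPUs with device_map="auto", rather than DDP, which replicates the full model on every GPU. A minimal sketch, assuming the checkpoint fits across your cards in fp16 (the model id below is only an example, substitute the one you actually use):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "WizardLM/WizardLM-13B-V1.2"  # assumption: replace with your checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the `accelerate` package; splits layers across all visible GPUs
)
model.eval()

Inputs then go to the first device (e.g. input_ids.to("cuda:0")) and generate works as before.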

@RyanChen1997: Shouldn't you also pass inputs["attention_mask"] to the generate fn when doing batch inference? If not, the default attention_mask will be all 1s, i.e. attending even to the pad tokens (cf. https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1572).
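
For anyone copying the snippet above, the relevant part of that generate method adjusted along the lines of this comment would look roughly like this (my own sketch, not the original poster's code):

inputs = self.tokenizer(
    prompts, return_tensors="pt", max_length=256, truncation=True, padding=True
)
input_ids = inputs["input_ids"].to(self.device)
attention_mask = inputs["attention_mask"].to(self.device)  # 0 at pad positions, 1 elsewhere
with torch.no_grad():
    generation_output = self.model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,  # so the model ignores the left-padding
        generation_config=self.generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=self.max_new_tokens,
    )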