nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Batch inference

RyanChen1997 opened this issue · comments

Sorry, I am new to this.
Following the code in inference_wizardcoder.py, I created a service and ran a benchmark test. The result: with a concurrency of 5, each request takes about 35s on average.
I want to reduce the latency and increase concurrency.
Right now each request is processed one at a time (the model is called with a single input per call). Is there a way to batch multiple requests into one forward pass?

for num, line in enumerate(input_data):
    one_data = line
    id = one_data["idx"]
    instruction = one_data["Instruction"]
    print(instruction)
    _output = evaluate(instruction, tokenizer, model)  # call the model with one input at a time
    final_output = _output[0].split("### Response:")[1].strip()
    new_data = {
        "id": id,
        "instruction": instruction,
        "wizardcoder": final_output
    }
    output_data.write(new_data)

I want to change this logic so it can do batch inference.
Thanks a lot!
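
One way to restructure that loop is to slice the inputs into fixed-size chunks and call the model once per chunk. The sketch below is only an illustration under assumptions: evaluate_batch is a hypothetical helper that takes a list of instructions and returns one decoded string per instruction (a batched generate method like the one later in this thread could fill that role), and BATCH_SIZE is a knob you would tune to fit GPU memory.

BATCH_SIZE = 8  # assumption: pick based on available GPU memory

# Collect (id, instruction) pairs first so they can be sliced into chunks.
items = [(d["idx"], d["Instruction"]) for d in input_data]

for start in range(0, len(items), BATCH_SIZE):
    chunk = items[start:start + BATCH_SIZE]
    instructions = [instr for _, instr in chunk]

    # evaluate_batch is hypothetical: it should tokenize the whole list with
    # padding and call model.generate once for the entire batch.
    outputs = evaluate_batch(instructions, tokenizer, model)

    for (idx, instruction), raw in zip(chunk, outputs):
        final_output = raw.split("### Response:")[1].strip()
        output_data.write({
            "id": idx,
            "instruction": instruction,
            "wizardcoder": final_output,
        })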

@ChiYeungLaw @nlpxucan
I am trying to run batch inference by making a small change on this line. However, since different inputs may not be the same length, the shorter inputs need left-side padding.

My question is: which padding token should be used? The default padding token (i.e. tokenizer.pad_token) is '[PAD]'. However, I have seen examples online (such as this and this) which explicitly set the padding token to tokenizer.eos_token, i.e. '<|endoftext|>'.

What is the correct padding token to use?
Thanks.
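
For decoder-only models, a common pattern (not specific to this repo, so treat it as an assumption rather than the authors' recommendation) is to reuse the EOS token as the padding token and to pad on the left, so generation continues directly from the real prompt tokens:

from transformers import AutoTokenizer

# assumption: base_model points at the WizardCoder checkpoint you are using
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.padding_side = "left"  # pad on the left so generation starts right after the prompt
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as PAD; pad positions should be masked out anyway

inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True, max_length=256)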

def generate(self, batch_data):
    if isinstance(batch_data, list):
        prompts = []
        for data in batch_data:
            prompts.append(self._generate_prompt(data))
    else:
        prompts = self._generate_prompt(batch_data)
    inputs = self.tokenizer(
        prompts, return_tensors="pt", max_length=256, truncation=True, padding=True
    )
    input_ids = inputs["input_ids"].to(self.device)
    with torch.no_grad():
        generation_output = self.model.generate(
            input_ids=input_ids,
            generation_config=self.generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=self.max_new_tokens,
        )
    s = generation_output.sequences
    output = self.tokenizer.batch_decode(s, skip_special_tokens=True)
    return output

It works.

@RyanChen1997 Can you also provide the definition of self._generate_prompt?
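
Not the original poster's code, but judging from the Alpaca-style prompt in inference_wizardcoder.py (and the "### Response:" split in the loop earlier in this thread), _generate_prompt most likely looks something like this:

def _generate_prompt(self, instruction):
    # Alpaca-style prompt used by WizardCoder; the exact wording is an assumption
    # taken from the repo's single-input inference script.
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:"
    )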

@RyanChen1997 Thank you.
How can I load WizardLM on multiple GPUs? Will simple DDP work?
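
Not an official answer, but for inference (as opposed to training) the usual approach is to let Accelerate shard the model across the visible GPUs with device_map="auto", rather than DDP, which replicates the full model on every GPU. A minimal sketch, assuming the checkpoint fits across your cards in fp16 (the model id below is only an example, substitute the one you actually use):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "WizardLM/WizardLM-13B-V1.2"  # assumption: replace with your checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the `accelerate` package; splits layers across all visible GPUs
)
model.eval()

Inputs then go to the first device (e.g. input_ids.to("cuda:0")) and generate works as before.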

@RyanChen1997: Shouldn't you also pass inputs["attention_mask"] to the generate fn when doing batch inference? If not, the default attention_mask will be all 1s, i.e. attending even to the pad tokens (cf. https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1572).
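
For anyone copying the snippet above, the relevant part of that generate method adjusted along the lines of this comment would look roughly like this (my own sketch, not the original poster's code):

inputs = self.tokenizer(
    prompts, return_tensors="pt", max_length=256, truncation=True, padding=True
)
input_ids = inputs["input_ids"].to(self.device)
attention_mask = inputs["attention_mask"].to(self.device)  # 0 at pad positions, 1 elsewhere
with torch.no_grad():
    generation_output = self.model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,  # so the model ignores the left-padding
        generation_config=self.generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=self.max_new_tokens,
    )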