MILVLG / imp

a family of highly capable yet efficient large multimodal models

Quantized model latency issue.

abhijitherekar opened this issue · comments

Hi, I am trying to see if I can load a quantized version of this model.

When I load in 4-bit, the model size is smaller but the latency significantly increases.

Not sure if any changes need to be made to support quantization.

Please let me know.

I can also help by creating an MR to improve the quantized model.

Thanks

Hi, thanks for your interest and response!

Do you mean inference in a quantized format, or training with QLoRA, which is not yet supported for Imp?

If you mean inference, check that use_cache is enabled and that the model stops generating after a </s> token is produced.
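For reference, a rough sketch of how to check this (it assumes input_ids and image_tensor are prepared as in the usual Imp example code, and that the returned ids include the prompt tokens):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1-3b", trust_remote_code=True)

output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True)[0]

# Strip the prompt tokens and inspect what was generated
new_tokens = output_ids[input_ids.shape[1]:]
print("generated tokens:", new_tokens.shape[0])  # close to max_new_tokens suggests no early stop
print("ends with EOS:", new_tokens[-1].item() == tokenizer.eos_token_id)
print(tokenizer.decode(new_tokens, skip_special_tokens=False))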

Hi @ParadoxZW , thanks for the reply.

I am currently using it for inference.
Here is what I have tried:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Create model
model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True)
and then while calling the forward pass, I set `use_cache` to true:

output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True)[0]

When I load the 4-bit model, the latency is about 1.2 seconds.

But if I load the normal model, it takes 0.2 seconds for the same image.
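
For reference, this is roughly how I am timing it; the time_generate helper below is just a sketch I put together for this comparison (it assumes a CUDA device), not part of the Imp code:

import time
import torch

def time_generate(model, input_ids, image_tensor, n_runs=3):
    # one warmup run so CUDA kernels and caches are initialized
    model.generate(input_ids, max_new_tokens=100, images=image_tensor, use_cache=True)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(input_ids, max_new_tokens=100, images=image_tensor, use_cache=True)
    torch.cuda.synchronize()
    # average latency per call in seconds
    return (time.perf_counter() - start) / n_runs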

I will check further on the token generation.

If you can point me to the file I need to check, I can give fixing it a shot if you think it's an issue.

I am open to helping and contributing.

Thanks

I checked for the EOS token; I don't see it generated in the output.

So, I am not sure why a 4-bit quantized model would take more time for inference than the base model.

Are we missing anything? @ParadoxZW, what else can I check?

You can check the output to see whether the quantized model generates much longer outputs than the base model.
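
For example, something along these lines (outputs_fp16 and outputs_4bit are placeholder names for the output_ids returned by the two model.generate calls, assuming the returned ids include the prompt tokens):

# Compare how many new tokens each model actually generated
len_fp16 = outputs_fp16.shape[-1] - input_ids.shape[1]
len_4bit = outputs_4bit.shape[-1] - input_ids.shape[1]
print(f"fp16 generated {len_fp16} tokens, 4-bit generated {len_4bit} tokens")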

@abhijitherekar We have investigated the quantization methods recently. According to existing studies, using int8 or other 4-bit quantization strategies will indeed slow down inference. More explanation can be found in the paper "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale".

If you really find some bugs in our code, please let us know.

Hi @MIL-VLG , thanks for the response.

To summarize what I understand:
So this int8/4-bit quantization makes slower models even slower, but this doesn't happen with larger models like LLaVA. Is that right?

Please share your thoughts.
Thanks