predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai

Why are QLoRA (4-bit) and LoRA (16-bit) adapter file sizes the same?

codybum opened this issue · comments

This is not a LoRAX issue, but this community may have some insight into the question.

When I train a QLoRA adapter (4-bit), it clearly uses fewer resources and trains much faster. However, the saved adapter is no smaller than a similarly trained LoRA (16-bit) adapter. For small models this is not a problem, but for larger models and higher ranks the size starts to become an issue.
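For context on where the bytes go, here is a minimal sketch (the adapter path is just an example) that inspects the dtypes stored in a saved adapter file:

```python
# Inspect a saved adapter to see which dtypes it actually stores.
# The path below is an example; point it at your own adapter directory.
from safetensors import safe_open

adapter_path = "my-adapter/adapter_model.safetensors"

total_bytes = 0
with safe_open(adapter_path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        total_bytes += tensor.numel() * tensor.element_size()
        print(name, tensor.dtype, tuple(tensor.shape))

# If the tensors come out as float32 (or float16), the file size is dominated
# by that precision regardless of how the base model was quantized.
print(f"tensor data: {total_bytes / 1e6:.1f} MB")
```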

There are numerous quantization methods and formats for full models, but I can't find much information on saving an adapter in a 4-bit format vs. 16-bit when it has been trained in 4-bit. Loading and saving 4-bit formats is mentioned here (bitsandbytes-foundation/bitsandbytes#753), but I don't know the current state of that work.

Any thoughts?

In a different thread, the response was: "The quantization is only applied to the pre-trained weights, and the trainable adapter weights remain as float32 precision. Thus whatever the quantization setting you have chosen, the adapter weights always have the same size."
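For illustration, here is a minimal sketch of that behaviour, assuming a bitsandbytes 4-bit base model and a standard PEFT LoRA config (the model name and hyperparameters are only examples): the base weights get quantized, but the injected LoRA matrices are created, trained, and later saved in full precision.

```python
# Sketch: 4-bit base model via bitsandbytes, LoRA adapter via PEFT.
# Model name and LoRA hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_config)

# Only the LoRA A/B matrices are trainable, and they are kept in full precision
# (typically float32), which is why the saved adapter is the same size either way.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.dtype)
```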

It seems like there should be a way to serialize adapter weights the way we serialize pre-trained models.
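Until something like the linked bitsandbytes work lands, one stopgap (assuming a small precision loss is acceptable) would be to down-cast the saved adapter tensors before distributing them, which roughly halves the size relative to float32. A rough sketch with hypothetical paths:

```python
# Down-cast a saved adapter from float32 to bfloat16 and write it back out.
# Paths are hypothetical; adjust to your own adapter directory.
import os

import torch
from safetensors.torch import load_file, save_file

state = load_file("my-adapter/adapter_model.safetensors")
state = {name: tensor.to(torch.bfloat16) for name, tensor in state.items()}

os.makedirs("my-adapter-bf16", exist_ok=True)
save_file(state, "my-adapter-bf16/adapter_model.safetensors")
```

The adapter_config.json from the original adapter directory would still need to be copied alongside the rewritten file for it to load as a normal PEFT adapter.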