Merge HF LoRA adapter with a quantized GPT-J model using ggml
webpolis opened this issue · comments
Nicolas Iglesias commented
Hello!
I have fine-tuned a GPT-J base model (loaded in 4 bits) using HF + LoRA. I quantized the same base model with ggml to q4_0, and it loads perfectly fine with the built examples/gpt-j binaries. Since it's not yet possible to save a 4-bit model together with its adapters from an HF 4-bit loaded model, I need to find a different way to accomplish this.
I want to "merge" the LoRA adapters (convert them to ggml first?) into this q4_0 version so I can run inference on the CPU.
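For context, "merging" a LoRA adapter means folding its low-rank update into the base weights once, so the adapter is no longer needed at inference time. Here is a toy NumPy sketch of the standard LoRA formulation (W' = W + (alpha / r) · B A) with made-up sizes, not code from this repo; the usual workaround is to do this merge on an unquantized (fp16) copy of the base model and only then quantize the merged result to q4_0, since merging directly into already-quantized weights loses precision:

```python
import numpy as np

# Toy illustration of merging a LoRA adapter into a base weight matrix.
# All sizes and values are hypothetical; this only shows the math.
rng = np.random.default_rng(0)
d, r = 8, 2              # model dim and LoRA rank (toy sizes)
alpha = 4.0              # LoRA scaling hyperparameter
W = rng.standard_normal((d, d))   # base weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection

# Fold the low-rank update into the base matrix.
W_merged = W + (alpha / r) * (B @ A)

# The merged matrix reproduces base output plus the LoRA path exactly.
x = rng.standard_normal(d)
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

After a merge like this on the fp16 model (e.g. via PEFT's `merge_and_unload()`), the saved checkpoint can be converted to ggml and quantized to q4_0 as usual.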
Any hints?