Merge HF LoRA adapter with a quantized GPT-J model using ggml
webpolis opened this issue · comments
Nicolas Iglesias commented
Hello!
I have fine-tuned a GPT-J base model (loaded in 4 bits) using HF + LoRA. I quantized the same base model with ggml to q4_0, and it loads perfectly fine with the built examples/gpt-j binaries. Since it's not yet possible to save a 4-bit model together with its adapters from an HF 4-bit loaded model, I need to find a different way to accomplish this.
I want to "merge" the LoRA adapters (convert them to ggml first?) into this q4_0 version so I can run inference on the CPU.
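For context, "merging" a LoRA adapter means folding its low-rank update into the base weights once, so the adapter is no longer needed at inference time. Here is a toy NumPy sketch of the standard LoRA formulation (W' = W + (alpha / r) · B A) with made-up sizes, not code from this repo; the usual workaround is to do this merge on an unquantized (fp16) copy of the base model and only then quantize the merged result to q4_0, since merging directly into already-quantized weights loses precision:

```python
import numpy as np

# Toy illustration of merging a LoRA adapter into a base weight matrix.
# All sizes and values are hypothetical; this only shows the math.
rng = np.random.default_rng(0)
d, r = 8, 2              # model dim and LoRA rank (toy sizes)
alpha = 4.0              # LoRA scaling hyperparameter
W = rng.standard_normal((d, d))   # base weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection

# Fold the low-rank update into the base matrix.
W_merged = W + (alpha / r) * (B @ A)

# The merged matrix reproduces base output plus the LoRA path exactly.
x = rng.standard_normal(d)
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

After a merge like this on the fp16 model (e.g. via PEFT's `merge_and_unload()`), the saved checkpoint can be converted to ggml and quantized to q4_0 as usual.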
Any hints?