arielnlee / Platypus

Code for fine-tuning Platypus fam LLMs using LoRA


Merge LLM

0three opened this issue

Hi, glad to see your models are at the top of the Open LLM leaderboard!

Could you please share your method of merging LLMs?

Is it just a simple mixture of weights, like https://github.com/donaldafeith/Pytorch_Merge?

Yes, I have this question too. Do you simply merge the adapter weights from your fine-tuning by averaging them with other base/instruction-fine-tuned models? Or do you do a weighted average, with the weights tuned on a validation set? Also, did you try merging multiple LoRAs from different fine-tuned models, and does that improve or degrade performance?

Seems like they use:

model = model.merge_and_unload()

which is based on simple additive merging (from the code here):
https://github.com/huggingface/peft/blob/a916465ad0970944f3241305071d9b79fae55b59/src/peft/tuners/lora.py#L794-L802
Could you please confirm this?
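For reference, here is a minimal sketch of what that additive merge does for a single linear layer, assuming standard LoRA (the `scaling = lora_alpha / r` convention follows the linked PEFT code; all names and shapes here are illustrative, not the authors' code):

```python
import torch

def merge_lora_into_linear(W, A, B, scaling):
    # W: (out, in) frozen base weight; A: (r, in) down-projection;
    # B: (out, r) up-projection; scaling = lora_alpha / r.
    # The merged weight is simply the base plus the low-rank update.
    return W + scaling * (B @ A)

# Illustrative shapes for a rank-16 adapter on a 4096x4096 projection.
W = torch.randn(4096, 4096)
A = torch.randn(16, 4096) * 0.01
B = torch.zeros(4096, 16)  # LoRA initializes B to zero, so the initial update is zero
W_merged = merge_lora_into_linear(W, A, B, scaling=32 / 16)
```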

Thanks for your interest. That is correct: it is a simple linear merge (for now...). We played around with different types of LoRA modules, how the training data affects the outcome of the merge, how merging fine-tunes that used different LoRA modules works, etc.

From our experience, the outcome of merging two (or more) LoRA-based models is very much dependent on 1) the LoRA modules both merged models were fine-tuned with (i.e., did one model use up/down/gate proj and the other k/v/q/o proj?), 2) the training data, 3) the performance of both original models on whatever benchmarks you're using, and 4) (I think, but am still working on quantitative tests to explore this) the order of the LoRA merge. I believe the order of the merge also affects the "expertise" of the model.
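For concreteness, a simple linear merge of two same-architecture checkpoints can be sketched as below. This is an illustration of the general technique, not the authors' exact script; the checkpoint paths and the mixing coefficient `weight_a` are placeholders.

```python
import torch

def linear_merge(state_a, state_b, weight_a=0.5):
    # Weighted average of two state dicts with identical keys/shapes,
    # e.g. two fine-tunes of the same base model (after any LoRA
    # adapters have already been folded in via merge_and_unload()).
    merged = {}
    for name, tensor_a in state_a.items():
        merged[name] = weight_a * tensor_a + (1.0 - weight_a) * state_b[name]
    return merged

state_a = torch.load("model_a/pytorch_model.bin")  # placeholder paths
state_b = torch.load("model_b/pytorch_model.bin")
torch.save(linear_merge(state_a, state_b, weight_a=0.5), "merged/pytorch_model.bin")
```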

Thanks for the prompt response. It is interesting that the order of the merge seems to play a role. I wouldn't have guessed that, since additive merging seems permutation invariant (or maybe I misunderstood something). Do you have an intuitive justification for why order seems to play a role? I would be very curious to know more about the quantitative results too!

That was my thought too, initially (that order wouldn't matter, which is why it is not discussed in the paper we recently released). I only started looking into it because when we originally merged the Platypus-70B model with Dolphin, it was the only merge we had at the time that actually did worse than its original counterpart (the rest of our merges were better than both originals). Thanks again for your interest; follow up with me in a week and hopefully I'll have additional insight and experiments to share! ☺️

Thanks for your great work! I am also a little confused about the merging procedure: is it merging the LoRA modules (i.e., merging the low-rank decomposition matrices B and A separately) or merging the entire two fine-tuned LLMs?
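For what it's worth, the distinction matters mathematically: averaging B and A separately is not equivalent to averaging the resulting weight updates, because the update BA is bilinear in its factors. A quick check with illustrative shapes:

```python
import torch

r, d = 4, 8
B1, A1 = torch.randn(d, r), torch.randn(r, d)
B2, A2 = torch.randn(d, r), torch.randn(r, d)

avg_of_updates = 0.5 * (B1 @ A1 + B2 @ A2)           # average the full low-rank updates
update_of_avgs = ((B1 + B2) / 2) @ ((A1 + A2) / 2)   # average B and A separately, then multiply

# The second expression picks up cross terms B1 @ A2 and B2 @ A1,
# so in general the two are not equal and this prints False.
print(torch.allclose(avg_of_updates, update_of_avgs))
```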

Sorry, I can't see where the function peft.lora.merge() is called in this repo. Am I missing anything?

They call the peft wrapper function here:

model = model.merge_and_unload()

This then calls the merge function linked above internally, I think!
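For anyone wanting to reproduce the step being discussed, the typical PEFT flow looks roughly like this (the base model name and adapter path below are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model, attach the trained LoRA adapter,
# fold the adapter into the base weights, and save a plain model.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder adapter path
model = model.merge_and_unload()  # the additive merge linked above
model.save_pretrained("merged-model")
```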

That's just the normal merge() operation for LoRA, which is used to merge the learned LoRA module into the original model. In that case, there doesn't seem to be anything novel about the merging itself.

Right, I agree with you that it is the typical merging strategy. However, I'm not sure I fully get the novelty point: I did not get the impression from the paper that they used a novel merging strategy, rather that merging with already instruction-fine-tuned models brought them the gains they see. I might be mistaken though, happy to hear your perspective on this! Maybe @arielnlee could pitch in too.

Really cool paper! Regarding the merging, maybe the procedure/method from LoraHub can give some inspiration: https://github.com/sail-sg/lorahub

First of all - I love this model! Great work from your team :)

I've got a dumb question about merging models and I'm wondering if someone would be able to help me.

How do you merge models when you have a LoRA adapter for one model (e.g., an adapter trained on the Platypus dataset using frozen Llama 2 weights) and only the base weights of a second model (e.g., OpenOrca)? While I understand mixing two LoRA adapters, wouldn't the relationship between the weights and the outputs that the adapter learned no longer hold when you apply it to another fine-tuned model (like OpenOrca) whose weights may be quite different from Llama 2's?
To the best of my knowledge, OpenOrca is not trained using LoRA but by directly updating the weights, so won't the weights of that model interfere with the projection that the LoRA adapters have learned? Or is the assumption that, even after fine-tuning, the weights of OpenOrca are similar enough to Llama 2 to allow the adapter to work well?
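(Mechanically, applying an adapter to a different full fine-tune with the same architecture is straightforward in PEFT, since it only needs matching layer names and shapes; whether the learned update still "fits" the new weights is exactly the question above. A sketch with hypothetical model/adapter names:)

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load a full fine-tune of Llama 2 (hypothetical name), then attach a
# LoRA adapter that was trained against the original Llama 2 weights.
# PEFT will apply it as long as the layer names/shapes match.
orca = AutoModelForCausalLM.from_pretrained("some-org/openorca-13b")      # hypothetical
model = PeftModel.from_pretrained(orca, "path/to/platypus-lora-adapter")  # hypothetical
model = model.merge_and_unload()
```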

Your model is clearly excellent, I just want to understand how.

As a secondary side question: can you merge the weights of models without using LoRA adapters and get good results? I'd love to be able to merge Stable Platypus 2 with a checkpoint of Llama 2 that has been extensively trained on Japanese, so that it could potentially become as smart as Stable Platypus 2 but in Japanese instead of English. I know Stable Platypus 2 is already pretty damn good at Japanese; I'd just like to make it even better.

Thanks again!

Hi, thanks for the great work!

I have some tiny questions about the approaches of the paper.

  1. If I do not misunderstand the paper, after fine-tuning the base model (e.g., LLaMA-v2) with LoRA, we can directly merge the adapter with another instruction-tuned model (e.g., OpenOrcaxOpenChat) to improve performance. But why not fine-tune the instruction-tuned model (e.g., OpenOrcaxOpenChat) on the proposed dataset directly? Do you have any performance comparisons between these two approaches (merging with another tuned model vs. directly fine-tuning another tuned model), of course under the same training budget? Or experimental results on the performance gain from merging more than two different instruction-tuned models?
  2. Are there any performance gaps between merging entire model weights and merging the adapter only?

Please let me know if I've misunderstood anything.
Thanks for the great work again!