sail-sg / lorahub

[COLM 2024] LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition

Discussing LoraHub: Exploration, Implementation, and Potential Improvements

ChuxiJ opened this issue · comments

LoraHub is a really great idea, and it is similar to a few ideas I was thinking about yesterday.

  1. Unlike MoE, which trains many separate domain experts, it trains multiple LoRA modules on top of one large base model.
  2. At inference time, a router mechanism selects which LoRA weights to combine (see the sketch after this list), so only one base model is needed for deployment. Much like a chain or tree of calls, running inference several times can achieve better performance.
  3. The LoRA training parameters and data could be scaled up more aggressively. For example, take a 65B base model and separately train eight 1B LoRA modules on high-quality data from eight different domains. Has anyone compared whether this performs better or worse than MoE?
  4. It is not yet clear which base models the paper uses, what the training hyperparameters were, how the LoRA modules are merged for inference, and many other details. I am waiting for the code release to learn more.
  5. How to design the router mechanism cleverly is also worth researching and discussing. Are there any related materials you would recommend?
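For concreteness, here is a minimal, hypothetical sketch of what "combining LoRA weights" could look like for a single linear layer: each candidate LoRA contributes its low-rank update `B @ A` scaled by a coefficient, and the weighted sum is added to the frozen base weight. The function and tensor names are illustrative assumptions, not the repo's actual API.

```python
# Hypothetical sketch of weighted LoRA composition for one linear layer.
# Each candidate LoRA i contributes w_i * scaling * (B_i @ A_i) on top of
# the frozen base weight; the merged weight is then used for inference.
import torch

def compose_loras(base_weight, lora_As, lora_Bs, weights, scaling=1.0):
    """base_weight: (d_out, d_in) frozen weight of one linear layer.
    lora_As: list of (r, d_in) tensors; lora_Bs: list of (d_out, r) tensors.
    weights: one scalar coefficient per candidate LoRA."""
    delta = torch.zeros_like(base_weight)
    for w, A, B in zip(weights, lora_As, lora_Bs):
        delta += w * scaling * (B @ A)   # weighted low-rank update
    return base_weight + delta           # merged weight for deployment

# Toy usage: three rank-4 LoRAs on a 16x16 layer.
d, r, k = 16, 4, 3
W = torch.randn(d, d)
As = [torch.randn(r, d) for _ in range(k)]
Bs = [torch.randn(d, r) for _ in range(k)]
merged = compose_loras(W, As, Bs, weights=[0.5, 0.3, -0.1])
```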
commented

Thanks for your questions @ChuxiJ, I'll answer them here:

  1. Yes, this is exactly what LoraHub aims to do. We also discuss the relationship between LoraHub and MoE in the related work section.
  2. I'm not sure the scalar weights can be called a router, since the router mechanism in MoE usually involves many learned router weights.
  3. 😂 It's a little expensive for our lab to train models at that scale, but it's worth trying.
  4. We clearly state that Flan-T5-large is the base model. You can check out the first section and the experimental results for details; all of these details are already in the paper.
  5. The gradient-free method used in LoraHub may be a good solution; a rough sketch is below.
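As a rough illustration of the gradient-free idea, the composition coefficients can be searched with a black-box optimizer such as nevergrad, using a few-shot validation loss as the objective. This is a minimal sketch, not the repo's implementation: `evaluate` is a placeholder you would replace with "merge LoRAs with these weights, run the merged model, return the loss", and the bounds and budget are arbitrary choices, not the paper's settings.

```python
# Hedged sketch: gradient-free search over LoRA composition weights.
import numpy as np
import nevergrad as ng

def evaluate(weights: np.ndarray) -> float:
    # Placeholder objective: in practice, merge the candidate LoRAs with
    # these weights (e.g. via compose_loras above), run the merged model on
    # the few-shot examples, and return the resulting loss.
    return float(np.sum((weights - 0.2) ** 2))  # dummy objective for the demo

k = 3  # number of candidate LoRA modules
param = ng.p.Array(shape=(k,)).set_bounds(-1.5, 1.5)  # bounded coefficients
optimizer = ng.optimizers.NGOpt(parametrization=param, budget=100)

for _ in range(optimizer.budget):
    cand = optimizer.ask()          # propose a weight vector
    loss = evaluate(cand.value)     # score it on the few-shot set
    optimizer.tell(cand, loss)      # feed the loss back (no gradients needed)

best = optimizer.provide_recommendation().value
print("learned composition weights:", best)
```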