About the weight and bias of Conv_in
Tianhao-Qi opened this issue · comments
@zengyh1900 @hellock @eltociear As your paper mentions, the first 4 channels of the conv_in layer (along the input-channel dimension) are kept frozen. However, if you compare the conv_in weights and biases of SD v1.5 with those of the PIA checkpoint you provide, they are actually not the same!
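For reference, the comparison described above can be sketched like this: slice the first four input channels out of the expanded conv_in weight and check them against the original. This is a minimal numpy sketch with toy tensors standing in for the real state dicts; the shapes, the number of extra channels, and the helper name `first_channels_match` are all assumptions for illustration, not PIA's actual code.

```python
import numpy as np

def first_channels_match(w_sd, w_pia, n=4):
    """Return True if the first n input channels of the expanded
    conv_in weight match the original weight (within float tolerance)."""
    return np.allclose(w_pia[:, :n], w_sd)

# Toy stand-ins for the real checkpoints (hypothetical shapes:
# SD v1.5 conv_in weight is (out_ch, 4, 3, 3); the expanded layer
# here adds 5 extra input channels as an example).
rng = np.random.default_rng(0)
w_sd = rng.standard_normal((320, 4, 3, 3)).astype(np.float32)

# Case 1: first 4 channels copied over and left frozen.
w_frozen = np.concatenate(
    [w_sd, rng.standard_normal((320, 5, 3, 3)).astype(np.float32)], axis=1
)
# Case 2: first 4 channels drifted, as after fine-tuning.
w_finetuned = w_frozen.copy()
w_finetuned[:, :4] += 1e-3

print(first_channels_match(w_sd, w_frozen))     # True
print(first_channels_match(w_sd, w_finetuned))  # False
```

If the slice comparison returns False, the first 4 channels were not simply frozen copies of the SD v1.5 weights, which is the discrepancy raised here.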
Hey @Tianhao-Qi, before video training, we fine-tune the image UNet on the WebVid dataset. The first 4 channels come from the fine-tuned UNet, which makes them different from the original SD v1.5 ones.
Thanks for your reply. What's the benefit of fine-tuning the image UNet on the WebVid dataset? I haven't seen any mention of it in your paper.
@Tianhao-Qi, we introduce our training method in Section 3.3.
Following the training strategy of AnimateDiff, we first train a domain adapter on WebVid. As AnimateDiff has not released the weights for the LoRA version of its domain adapter, we directly fine-tune the entire UNet, turning it into a 'domain adapter' for WebVid.
@LeoXing1996 If you use the fine-tuned UNet, doesn't that mean the generated videos inherit the low visual quality of the video dataset?