About the weight and bias of Conv_in
Tianhao-Qi opened this issue · comments
@zengyh1900 @hellock @eltociear As your paper mentions, the first 4 channels of the conv_in layer (along the input-channel dimension) are kept frozen. However, if you compare the conv_in weights and biases of SD v1.5 with those of the PIA checkpoint you provide, they are actually not the same!
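For reference, the comparison described above can be sketched like this: slice the first four input channels out of the expanded conv_in weight and check them against the original. This is a minimal numpy sketch with toy tensors standing in for the real state dicts; the shapes, the number of extra channels, and the helper name `first_channels_match` are all assumptions for illustration, not PIA's actual code.

```python
import numpy as np

def first_channels_match(w_sd, w_pia, n=4):
    """Return True if the first n input channels of the expanded
    conv_in weight match the original weight (within float tolerance)."""
    return np.allclose(w_pia[:, :n], w_sd)

# Toy stand-ins for the real checkpoints (hypothetical shapes:
# SD v1.5 conv_in weight is (out_ch, 4, 3, 3); the expanded layer
# here adds 5 extra input channels as an example).
rng = np.random.default_rng(0)
w_sd = rng.standard_normal((320, 4, 3, 3)).astype(np.float32)

# Case 1: first 4 channels copied over and left frozen.
w_frozen = np.concatenate(
    [w_sd, rng.standard_normal((320, 5, 3, 3)).astype(np.float32)], axis=1
)
# Case 2: first 4 channels drifted, as after fine-tuning.
w_finetuned = w_frozen.copy()
w_finetuned[:, :4] += 1e-3

print(first_channels_match(w_sd, w_frozen))     # True
print(first_channels_match(w_sd, w_finetuned))  # False
```

If the slice comparison returns False, the first 4 channels were not simply frozen copies of the SD v1.5 weights, which is the discrepancy raised here.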
Hey @Tianhao-Qi, before video training, we fine-tune the image UNet on the WebVid dataset. The first 4 channels come from the fine-tuned UNet, which makes them different from the original SD v1.5 ones.
Thanks for your reply. What's the benefit of fine-tuning the image UNet on the WebVid dataset? I haven't seen any mention of it in your paper.
@Tianhao-Qi, we introduce our training method in Section 3.3.
Following the training strategy of AnimateDiff, we first train a domain adapter on WebVid. As AnimateDiff has not released the weights for the LoRA version of its domain adapter, we directly fine-tune the entire UNet, turning it into a 'domain adapter' for WebVid.
@LeoXing1996 If you use the fine-tuned UNet, doesn't that mean the generated videos inherit the low visual quality of the video dataset?