rinnakk / japanese-clip

Japanese CLIP by rinna Co., Ltd.

Home Page: https://huggingface.co/rinna

How to initialize the vision encoder?

kaisugi opened this issue

First of all, great work!!
I strongly believe this model has made a big contribution to the Vision-and-Language community in Japan.

I couldn't find any description of how the vision encoder is initialized in CLIP/CLOOB.
Did you use pre-trained weights available on Hugging Face, or did you initialize it randomly and train it from scratch?

Hi,
Thank you for your interest!
Regarding your question about the vision encoder: we use Google's pre-trained ViT-base-patch16 model. (link)
I have also added this description to the README so that everyone can see it now.
https://huggingface.co/rinna/japanese-clip-vit-b-16#model-architecture
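
For readers who want to try this kind of initialization themselves, here is a minimal sketch of loading Google's pre-trained ViT-B/16 weights with Hugging Face `transformers` so they can serve as the starting point for a CLIP/CLOOB vision tower. The checkpoint name `google/vit-base-patch16-224` and this loading pattern are assumptions for illustration, not code taken from this repository.

```python
import torch
from transformers import ViTModel

# Assumption: one common ViT-B/16 checkpoint on the Hugging Face Hub.
# The exact checkpoint used by rinna's japanese-clip may differ.
vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

# The pre-trained weights can then be used to initialize the CLIP/CLOOB
# vision tower instead of training it from random initialization.
pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch of one image
outputs = vision_encoder(pixel_values=pixel_values)
print(outputs.last_hidden_state.shape)  # (1, 197, 768) for ViT-B/16 at 224px
```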

Thanks!