rinnakk / japanese-clip

Japanese CLIP by rinna Co., Ltd.

Home Page: https://huggingface.co/rinna

How to initialize the vision encoder?

kaisugi opened this issue

First of all, great work!!
I strongly believe this model has made a big contribution to the Vision-and-Language community in Japan.

I could not find any description of how the vision encoder in CLIP/CLOOB is initialized.
Did you use pre-trained weights available on HuggingFace, or did you initialize it randomly and train it from scratch?

Thank you for your interest!
To answer your question about the vision encoder: we use Google's pre-trained ViT-base-patch16 model. (link)
I have also added this description to the README so that everyone can find it.
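
For readers wondering what the two options in the question look like in practice, here is a minimal sketch, not the authors' actual training code. It assumes the Hugging Face `transformers` library and PyTorch, and it assumes the checkpoint name `google/vit-base-patch16-224`; the reply only says "ViT-base-patch16", so the exact checkpoint is a guess. The helper names `build_vision_encoder` and `encode_images` are hypothetical.

```python
import torch
from transformers import ViTConfig, ViTModel

# Assumption: the reply only names "ViT-base-patch16"; this exact
# checkpoint identifier is a guess, not confirmed by the thread.
ASSUMED_CHECKPOINT = "google/vit-base-patch16-224"


def build_vision_encoder(pretrained: bool = True,
                         checkpoint: str = ASSUMED_CHECKPOINT) -> ViTModel:
    """Hypothetical helper: return a ViT encoder, pre-trained or random."""
    if pretrained:
        # Downloads the weights from the Hugging Face Hub on first use.
        return ViTModel.from_pretrained(checkpoint)
    # ViTConfig() defaults describe a base-size ViT with 16x16 patches;
    # constructing the model from a config gives random weights.
    return ViTModel(ViTConfig())


def encode_images(encoder: ViTModel, pixel_values: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return the [CLS] token embedding per image."""
    encoder.eval()
    with torch.no_grad():
        outputs = encoder(pixel_values=pixel_values)
    # Position 0 of the sequence is the [CLS] token.
    return outputs.last_hidden_state[:, 0]
```

Calling `build_vision_encoder(pretrained=True)` corresponds to starting from pre-trained weights, while `pretrained=False` corresponds to training from scratch. How the encoder output is then projected into the shared image-text space is not covered in this thread and is left out of the sketch.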
