rinnakk / japanese-clip

Japanese CLIP by rinna Co., Ltd.

Home Page: https://huggingface.co/rinna

How to initialize the vision encoder?

kaisugi opened this issue

First of all, great work!!
I strongly believe this model has made a big contribution to the Vision-and-Language community in Japan.

I couldn't find any description of how the vision encoder is initialized in CLIP/CLOOB.
Did you use pre-trained weights available on Hugging Face, or did you initialize it randomly and train it from scratch?

Hi,
Thank you for your interest!
Regarding your question about the vision encoder: we use Google's pre-trained ViT-base-patch16 model. (link)
I have also added this description to the README so that everyone can see it now.
https://huggingface.co/rinna/japanese-clip-vit-b-16#model-architecture
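
For readers who want to try this kind of initialization themselves, here is a minimal sketch of loading Google's pre-trained ViT-B/16 weights with Hugging Face `transformers` so they can serve as the starting point for a CLIP/CLOOB vision tower. The checkpoint name `google/vit-base-patch16-224` and this loading pattern are assumptions for illustration, not code taken from this repository.

```python
import torch
from transformers import ViTModel

# Assumption: one common ViT-B/16 checkpoint on the Hugging Face Hub.
# The exact checkpoint used by rinna's japanese-clip may differ.
vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

# The pre-trained weights can then be used to initialize the CLIP/CLOOB
# vision tower instead of training it from random initialization.
pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch of one image
outputs = vision_encoder(pixel_values=pixel_values)
print(outputs.last_hidden_state.shape)  # (1, 197, 768) for ViT-B/16 at 224px
```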

Thanks!