zengyan-97 / X-VLM

X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)


Performance of different vision encoders

AI-in-Health opened this issue

Thanks for sharing this great work.

else: # deit, worse than clip-vit/swin...

As shown above, you mention in the code that initializing the vision encoder with deit is worse than initializing it with clip-vit or swin.

Do you have any supporting results? For example, the performance on Image-Text Retrieval with deit or swin?

Hi,

Sorry for my late reply.

Yes, I tried it, and deit was the worst, but I can't find my preliminary experiment results.
FYI, clip-vit-base performs similarly to swin-base or beit2-vit-base, and beit2-vit-large is better than swin-large.