Performance of different vision encoders
AI-in-Health opened this issue
AI4H Group commented
Thanks for sharing this great work.
Line 121 in e7b9602
As shown above, you mention in the code that initializing the vision encoder with deit is worse than initializing it with clip-vit or swin.
Do you have any supporting results? For example, the performance on Image-Text Retrieval with deit or swin?
Yan Zeng commented
Hi,
Sorry for the late reply.
Yes, I tried it. deit is the worst, but I can't find my preliminary experiment results.
FYI, clip-vit-base performs similarly to swin-base or beit2-vit-base. beit2-vit-large is better than swin-large.