zengyan-97 / X-VLM

Thanks for sharing this great work.

Line 121 in e7b9602

else: # deit, worse than clip-vit/swin...

As shown above, you mention in the code that initializing the vision encoder with deit is worse than with clip-vit or swin.

Do you have any supporting results? For example, the performance on Image-Text Retrieval with deit or swin?

Hi,

Sorry for my late reply.

Yes, I tried it, and deit is the worst, but I can't find my preliminary experiment results.
FYI, clip-vit-base performs similarly to swin-base or beit2-vit-base, and beit2-vit-large is better than swin-large.

Performance of different vision encoders