dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

About finetuning on f30k.

GuoBruce opened this issue

Hi, I am very interested in your work! I am wondering why you use 15 texts as negative samples instead of 1 text during the finetuning stage. Also, what do you think about training the model from scratch using only the flickr30k dataset?

Hi @GuoBruce,

We didn't test training on flickr30k from scratch, but I believe the result would be much worse.
The number 15 is totally arbitrary (though Pixel-BERT similarly used 20 negative samples for IR/TR).
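
For context, the sampling described above turns retrieval fine-tuning into a 16-way ranking problem per positive pair (1 matched caption plus 15 captions drawn from other images). Below is a minimal sketch of that negative-sampling step; the function name `build_retrieval_batch`, the flat `captions` list, and the batch layout are illustrative assumptions, not the actual ViLT code.

```python
import random

def build_retrieval_batch(pair_index, captions, num_negatives=15):
    """For one positive (image, caption) pair, draw `num_negatives`
    captions belonging to other images as negatives.

    Assumes captions[i] is the caption of image i; this indexing scheme
    is a simplification for illustration only.
    """
    negative_pool = [i for i in range(len(captions)) if i != pair_index]
    negative_ids = random.sample(negative_pool, num_negatives)
    # One positive text followed by 15 negatives -> a 16-way ranking problem
    texts = [captions[pair_index]] + [captions[i] for i in negative_ids]
    labels = [1] + [0] * num_negatives
    return texts, labels

# Toy usage: 100 images, each with one caption
captions = [f"caption for image {i}" for i in range(100)]
texts, labels = build_retrieval_batch(0, captions, num_negatives=15)
assert len(texts) == 16 and sum(labels) == 1
```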

Thanks for your reply! I gave it a try and found that it is indeed much worse than I hoped: the Recall@1 is between 10 and 20.