Yui010206 / SeViLA

[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering

Home Page: https://arxiv.org/abs/2305.06988


Fine-tuning results are very poor when using my pretrained checkpoints on QVHighlights.

fake-warrior8 opened this issue · comments

commented

Hi, I used your pretrained SeViLA localizer checkpoint (trained on QVHighlights) to fine-tune on NExT-QA and got results similar to your paper (73.2 vs. 73.8). However, when I used your script to first pretrain a SeViLA localizer myself and then fine-tune on NExT-QA with my own pretrained localizer checkpoint, I got only 45% accuracy in the first epoch (vs. 71% with the checkpoint you provided). I also noticed that your checkpoint is 815 MB while my pretrained localizer checkpoint is 1.4 GB. Is there any post-processing applied to the pretrained SeViLA localizer checkpoint?

Hi,

Did you solve this problem? I am in the same situation; my pretrained SeViLA localizer on QVHighlights is also 1.4 GB.

Best


The pretrained checkpoint includes the BLIP-2 Q-former localizer parameters and some T5 parameters, while the downstream fine-tuning stage requires only the BLIP-2 Q-former localizer plus the original BLIP-2 Q-former answerer parameters. You should combine the pretrained checkpoint with the original BLIP-2 parameters to produce a new checkpoint for downstream fine-tuning.
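The merging step above can be sketched as follows. This is a minimal illustration, not SeViLA's actual conversion code: the key names and the `"loc"` marker are assumptions, so print your own checkpoints' keys first to find the real localizer prefixes.

```python
# Sketch: build a fine-tuning checkpoint by overlaying the pretrained
# localizer parameters onto the original BLIP-2 (answerer) weights.
# Key names below are illustrative, not the actual SeViLA keys.
import torch

def build_finetune_ckpt(pretrained_loc_sd, blip2_sd, loc_marker="loc"):
    """Start from the original BLIP-2 state dict and overwrite only the
    localizer entries with values from the pretrained checkpoint."""
    merged = dict(blip2_sd)
    for key, value in pretrained_loc_sd.items():
        if loc_marker in key:  # assumed naming convention for localizer params
            merged[key] = value
    return merged

# Toy demonstration with made-up key names:
pretrained_loc_sd = {
    "Qformer_loc.weight": torch.zeros(2),  # localizer Q-former (kept)
    "t5_model.weight": torch.ones(2),      # extra T5 params (not copied)
}
blip2_sd = {
    "Qformer.weight": torch.ones(2),       # answerer Q-former (kept)
    "Qformer_loc.weight": torch.ones(2),   # placeholder, overwritten above
}
merged = build_finetune_ckpt(pretrained_loc_sd, blip2_sd)
# torch.save({"model": merged}, "sevila_for_finetuning.pth")
```

Dropping the T5 parameters that are only used during pretraining would also explain the size gap between the two checkpoints (1.4 GB vs. 815 MB).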

Thanks for the reply. Yes, I figured it out by printing the checkpoint keys, and I directly replaced the localizer-related weights in the downloaded checkpoint with my own.
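For anyone hitting the same issue, the replacement described here can be sketched like this; again the key names and the `"loc"` marker are illustrative assumptions, not SeViLA's real keys.

```python
# Sketch: keep the released checkpoint as-is, but overwrite its
# localizer-related entries with your own pretrained values.
import torch

def replace_loc_weights(released_sd, my_loc_sd, marker="loc"):
    """Copy your pretrained localizer weights over the released
    checkpoint's matching localizer entries; leave everything else."""
    out = dict(released_sd)
    for key in released_sd:
        if marker in key and key in my_loc_sd:  # assumed localizer naming
            out[key] = my_loc_sd[key]
    return out

# Toy demonstration with made-up key names:
released_sd = {
    "Qformer_loc.weight": torch.ones(3),  # replaced below
    "Qformer.weight": torch.ones(3),      # answerer, untouched
}
my_loc_sd = {"Qformer_loc.weight": torch.zeros(3)}
patched = replace_loc_weights(released_sd, my_loc_sd)
```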