magic-research / PLLaVA

Official repository for the paper PLLaVA

Why are the results in the SOTA table not consistent with ablation studies?

takfate opened this issue · comments

commented

Hello, thanks for your great work.
I read your paper, but I am confused about some of the results.
In your ablation studies, none of the VCG scores is higher than 3.0, but the 7B model is reported at 3.12. Could you help me understand the difference?

Hi,

Thanks for your interest. To save computation, in the ablation on the impact of the pooling operation, we test the model under the zero-shot setting: that is to say, the model is not trained on video datasets. We have verified that the zero-shot testing results are good indicators of the trained model's performance.

I hope this clarifies your question.

Best regards,
DQ

commented

Thank you for your response. I've also tried adapting LLaVA to the video domain, but in my experiments, the performance on open-ended QA is significantly lower than PLLaVA's. Could you share some tips or tricks? I trained the model for just one epoch, and I am wondering whether the lower performance is related to the number of training epochs or whether other factors are involved.

commented

My other point of confusion is Figure 9, on training LoRA with video samples. In this figure, the best result of the 7B model on VCG is also not higher than 3.0. Could you clear up my confusion?