Why are the results in the SOTA table not consistent with ablation studies?

Question

takfate opened this issue a month ago

Hello, thanks for your great work.
I read your paper, but I am confused about some of the results.
The VCG scores in your ablation studies never exceed 3.0, yet the reported performance of the 7B model is 3.12. Could you help me understand the difference?

zhoudaquan · Answer 1 · Mon May 06 2024 22:48:11 GMT+0800 (China Standard Time)

Hi,

Thanks for your interest. To save computation, we ran the ablation on the impact of the pooling operation in a zero-shot setting; that is, the model is not trained on any video dataset. We have verified that the zero-shot results are good indicators of the trained model's performance.

I hope this clarifies your question.

Best regards,
DQ

cgoe · Answer 2 · Tue May 07 2024 02:31:44 GMT+0800 (China Standard Time)

Thank you for your response. I've also tried adapting LLaVA to the video domain, but in my experiments, the performance in open-ended QA is significantly lower compared to PLLaVA. Could you share some tips or tricks? I trained the model for just one epoch and am wondering if the lower performance is related to the number of training epochs or if there are other factors involved?

cgoe · Answer 3 · Tue May 07 2024 04:31:53 GMT+0800 (China Standard Time)

My other point of confusion concerns Figure 9, on training LoRA with video samples. In this figure, the best result of the 7B model on VCG is also not higher than 3.0. Could you clear this up?