vid2seq | the way to evaluate the model on paragraph captioning
PKUCSS opened this issue
Thanks for the great work! I have a question about the way to evaluate the model on paragraph captioning: do you fine-tune the pre-trained checkpoint on the paragraph captioning task, or just remove the event boundary predictions from the outputs of the dense captioning model for the evaluation on paragraph captioning?
@antoyang @a-nagrani Dear authors, thanks again for your great work. Could you please answer the question above, so that others can fairly compare against your work on video paragraph captioning?
If I recall correctly, I removed the event boundary predictions from the outputs of the dense captioning model. But fine-tuning the pretrained model without time tokens should also work well, given Vid2Seq's performance on video clip captioning benchmarks.
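For reference, removing the boundary predictions amounts to a simple post-processing step on the generated sequence. A minimal sketch, assuming time tokens are serialized in the output string as `<time=K>` (the exact token format is hypothetical here and should be adapted to the actual Vid2Seq vocabulary):

```python
import re

# Assumed token format "<time=K>" -- adjust the pattern to match the
# real special tokens used by the model's tokenizer.
TIME_TOKEN = re.compile(r"<time=\d+>")

def dense_to_paragraph(dense_output: str) -> str:
    """Strip event-boundary (time) tokens from a dense video
    captioning output and collapse whitespace, leaving only the
    event captions concatenated into a single paragraph."""
    text = TIME_TOKEN.sub(" ", dense_output)
    return " ".join(text.split())

# Example: two predicted events, each preceded by start/end time tokens.
pred = "<time=0> <time=17> A man walks in. <time=17> <time=42> He sits down."
print(dense_to_paragraph(pred))
# -> A man walks in. He sits down.
```

The resulting paragraph can then be scored with the standard captioning metrics (e.g. CIDEr, METEOR) against the reference paragraphs.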
@antoyang Thanks for the quick response!