vid2seq | the way to evaluate the model on paragraph captioning
PKUCSS opened this issue
Thanks for the great work! I have a question about the way to evaluate the model on paragraph captioning: do you fine-tune the pre-trained checkpoint on the paragraph captioning task, or just remove the event boundary predictions from the outputs of the dense captioning model for the evaluation on paragraph captioning?
@antoyang @a-nagrani Dear authors, thanks again for your great work. Could you please answer the question above, so that others can fairly compare against your work on video paragraph captioning?
If I recall correctly, I removed the event boundary predictions from the outputs of the dense captioning model. But fine-tuning the pretrained model without time tokens should also work well, given Vid2Seq's performance on video clip captioning benchmarks.
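For reference, removing the boundary predictions amounts to a simple post-processing step on the generated sequence. A minimal sketch, assuming time tokens are serialized in the output string as `<time=K>` (the exact token format is hypothetical here and should be adapted to the actual Vid2Seq vocabulary):

```python
import re

# Assumed token format "<time=K>" -- adjust the pattern to match the
# real special tokens used by the model's tokenizer.
TIME_TOKEN = re.compile(r"<time=\d+>")

def dense_to_paragraph(dense_output: str) -> str:
    """Strip event-boundary (time) tokens from a dense video
    captioning output and collapse whitespace, leaving only the
    event captions concatenated into a single paragraph."""
    text = TIME_TOKEN.sub(" ", dense_output)
    return " ".join(text.split())

# Example: two predicted events, each preceded by start/end time tokens.
pred = "<time=0> <time=17> A man walks in. <time=17> <time=42> He sits down."
print(dense_to_paragraph(pred))
# -> A man walks in. He sits down.
```

The resulting paragraph can then be scored with the standard captioning metrics (e.g. CIDEr, METEOR) against the reference paragraphs.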
@antoyang Thanks for the quick response!