Yui010206 / SeViLA

[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering

Home Page: https://arxiv.org/abs/2305.06988


There is a writing error in the paper

zchuz opened this issue · comments

The batch size in the pre-training and refinement phases should be 16 per GPU instead of 64 per GPU.

For pre-training and self-refinement, we sample 4 frames from each video sample and calculate gradients on the reshaped batch. The batch size on each GPU is therefore 4×16 = 64, consistent with the specification in the paper. Thanks for your suggestion; we will clarify this later.
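A minimal sketch of the batch-size arithmetic described above (variable names are illustrative, not from the SeViLA codebase): each GPU batch holds 16 videos with 4 sampled frames each, and flattening the frame dimension into the batch dimension gives the 64 items per GPU stated in the paper.

```python
# Illustrative sketch (assumed shapes, not the authors' code).
videos_per_gpu = 16   # videos in each per-GPU batch
frames_per_video = 4  # frames sampled from each video

# Conceptual batch shape before reshaping: (videos, frames, ...).
# Gradients are computed on the flattened (videos * frames, ...) batch.
reshaped_batch_size = videos_per_gpu * frames_per_video

print(reshaped_batch_size)  # 64 items per GPU, as stated in the paper
```

So both numbers are consistent: 16 videos per GPU at the video level, 64 frame-level items per GPU after reshaping.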

Thank you for your reply. I did not take into account that four frames are sampled from each video.