PKU-YuanGroup / Chat-UniVi

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Paper: https://arxiv.org/abs/2311.08046

about temporal merging

xxtars opened this issue · comments

Thank you very much for making your work open source!

I have a question after reading the paper: how do you ensure that the frames $f^m$ within an event are temporally contiguous after clustering the frame-level features? Is there any algorithmic constraint enforcing this? I couldn't find a related description in the paper or the code.

Looking forward to your reply!

Thank you for raising this issue. Our algorithm does not strictly require the frames within an event to be adjacent. However, this flexibility can indeed disrupt the video's temporal order. Do you have any suggestions on how we could address this?
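To illustrate the point being discussed: when frames are clustered purely by feature similarity, a single cluster can contain frames from different parts of the video. A minimal sketch of one possible post-processing step (this is a hypothetical illustration, not the repository's actual code) would split each cluster's assignments into maximal runs of consecutive frame indices, so that every resulting event is temporally contiguous:

```python
def split_into_contiguous_events(labels):
    """Given per-frame cluster labels (labels[i] = cluster id of frame i),
    return a list of (label, frame_indices) events where each event is a
    maximal run of consecutive frames sharing the same label."""
    events = []
    for i, lab in enumerate(labels):
        # Extend the current event only if the label matches and the
        # frame index is consecutive; otherwise start a new event.
        if events and events[-1][0] == lab and events[-1][1][-1] == i - 1:
            events[-1][1].append(i)
        else:
            events.append((lab, [i]))
    return events

# Frames 0-1 and 4-5 share cluster 0 but are separated in time,
# so they become two distinct events.
labels = [0, 0, 1, 1, 0, 0, 2]
print(split_into_contiguous_events(labels))
# → [(0, [0, 1]), (1, [2, 3]), (0, [4, 5]), (2, [6])]
```

The trade-off is that splitting increases the number of events (and hence tokens), which is presumably why a purely similarity-based clustering without an adjacency constraint was chosen.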