yuezih / Movie101

Narrative movie understanding benchmark

Home Page: https://movie101-dataset.github.io
Excellent work. I would like to ask about the method and process for merging adjacent narration segments into paragraph-level narration. Thank you very much.

changqinyao opened this issue

Thanks for your interest.

We merged raw ASR outputs based on the temporal intervals between them.

For example:
100s-108s, 109s-113s, 118s-124s → 100s-113s, 118s-124s

The interval threshold ranges from 0 to 3 seconds and is adjusted according to the clip length, to avoid creating excessively long paragraphs.
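
For illustration, here is a minimal Python sketch of this gap-based merging (not the authors' actual code). It assumes segments are given as (start, end) pairs in seconds, sorted by start time, and uses a single fixed `max_gap` threshold:

```python
# Minimal sketch of the gap-based merging rule described above.
# Assumptions: segments are (start, end) tuples in seconds, sorted by start
# time; `max_gap` is a fixed threshold (the authors adjust it between
# 0 and 3 seconds depending on clip length).

def merge_segments(segments, max_gap=3.0):
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap:
            # Gap is small enough: extend the previous paragraph.
            merged[-1] = (merged[-1][0], end)
        else:
            # Gap is too large: start a new paragraph.
            merged.append((start, end))
    return merged

print(merge_segments([(100, 108), (109, 113), (118, 124)]))
# -> [(100, 113), (118, 124)]
```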

Feel free to ask if you need further details or have any more questions :)

Is the clip-level narration generated by feeding ASR output into an LLM, with the paragraph-level narration being the merged clips? Can the narration be generated using only ASR, without any video content information?

No. The narration data is obtained from the barrier-free channel of ixigua, where ground-truth audio descriptions of movies are provided. For example:
夏洛特烦恼（无障碍版） (Goodbye Mr. Loser, barrier-free version)

Hello, I would like to ask about the current upper limit of this field. How big is the gap between the narration produced by GPT-4V with your method and the annotated narration? For example, for a new movie with a narration written by the author, can each sentence of that narration be correctly matched with the narration generated by GPT-4V? Is GPT-4V's performance still far from meeting the requirements of the L1 and L2 levels?

Current models are still a long way from being deployable. Generally, they are only capable of recognizing basic visual concepts in videos, perhaps partially achieving L1, but they are still far from L2. We provide some qualitative results of the baseline models in Fig. 11 and Fig. 14, which offer a glimpse into their overall performance. Current LVLMs may be limited to describing simple visual events with distinct features, and incapable of handling the more delicate and complex "visual stories" in movies.

Thanks for your reply. I want to know how to design the prompt after adding the actors' portrait photos before the video frames. Is it necessary to indicate that the first few pictures are portraits and that the video starts afterwards, or is the prompt the same as before?

Yes. We indicate the portraits, character names, and video frames through prompting. In some cases, due to model limitations (e.g., GPT-4V), we overlay the character names on the corresponding input portraits (i.e., the model reads the character names via OCR). We plan to include more details about the baseline implementation in our revised paper.
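
As an illustration of that workaround, here is a sketch (not the repo's actual code) that burns a character name into a portrait with Pillow before sending it to the model; the file names are hypothetical:

```python
# Sketch: draw a character name onto a portrait image so a vision-language
# model (e.g., GPT-4V) can pick the name up via OCR. Not the authors' code.
from PIL import Image, ImageDraw, ImageFont

def label_portrait(portrait_path: str, character_name: str, out_path: str) -> None:
    img = Image.open(portrait_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()                            # swap in a TTF font for larger text
    draw.rectangle([(0, 0), (img.width, 24)], fill="black")    # banner so the text stays legible
    draw.text((4, 4), character_name, fill="white", font=font)
    img.save(out_path)

# Hypothetical usage:
# label_portrait("portraits/character1.jpg", "character 1", "portraits/character1_named.jpg")
```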

Is there a specific example? Take VideoChat2, for instance: would the input prompt be something like, "I input a series of pictures; the first n frames are portraits of people's faces, followed by continuous video frames; the portraits are in the order (character 1, character 2, ...); please answer my question."? Is this right?

You are right.

Our prompt for VideoChat-2: "你是一个电影解说员,请结合前5张图中的角色肖像(character 1, character 2, ...),观察后续16帧电影片段截图,解说该片段。" (Translation: "You are a movie narrator. Referring to the character portraits in the first 5 images (character 1, character 2, ...), observe the following 16 movie-clip frames and narrate the clip.")
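
For reference, here is a hedged sketch of how the same ordering (portraits first, then clip frames, plus the prompt text) could be packed into an OpenAI-style vision request such as the GPT-4V setup mentioned earlier. This is an illustration under assumptions, not the authors' pipeline, and `portrait_paths` / `frame_paths` are hypothetical lists of image files:

```python
# Sketch only: build a chat message that presents 5 character portraits
# followed by 16 clip frames, with the narration prompt as text.
# Assumes OpenAI-style image_url content parts; not the repo's code.
import base64

PROMPT = ("你是一个电影解说员,请结合前5张图中的角色肖像(character 1, character 2, ...),"
          "观察后续16帧电影片段截图,解说该片段。")

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def build_messages(portrait_paths, frame_paths):
    assert len(portrait_paths) == 5 and len(frame_paths) == 16
    images = list(portrait_paths) + list(frame_paths)   # portraits first, then frames
    content = [{"type": "text", "text": PROMPT}]
    content += [{"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in images]
    return [{"role": "user", "content": content}]

# The returned messages could then be passed to a vision-capable chat API,
# e.g. client.chat.completions.create(model=..., messages=build_messages(...)).
```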