ControlNet / LAV-DF

[CVIU] Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization

Home Page: https://www.sciencedirect.com/science/article/pii/S1077314223001984

Frame-level processing

javadmozaffari opened this issue

Hello,

In this Temporal Forgery Localization model, the entire video is used as input. The model proposed by the authors shows promise in achieving accurate results, but using the whole video as input may pose a challenge in terms of memory consumption, especially for large datasets or videos with high-resolution frames. Would it be possible to modify the Temporal Forgery Localization model to accept individual frames instead of the entire video? This would reduce the amount of RAM required.

Hi, sorry for the late reply.

I think it might be hard, because the boundary matching mechanism requires all frames as input. But to save memory, you can try two ways to reduce the temporal size (see the sketch after the list).

  1. Sample frames with a stride for each video. For example, use only the 1st, 3rd, 5th, 7th, ... frames, so you end up with fewer frames per video.
  2. Interpolate the temporal axis to a fixed length for each video. For example, regardless of how long the video is, resize it to 100 frames.
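
Both options can be done as a simple preprocessing step. Here is a minimal sketch of the two ideas, assuming PyTorch and a video tensor in `(T, C, H, W)` layout; the tensor shapes and variable names are illustrative, not taken from the repository:

```python
import torch
import torch.nn.functional as F

# Illustrative video tensor: 300 frames of 3 x 224 x 224 (T, C, H, W).
video = torch.randn(300, 3, 224, 224)

# Option 1: strided sampling -- keep the 1st, 3rd, 5th, ... frames.
strided = video[::2]  # (150, 3, 224, 224)

# Option 2: interpolate the temporal axis to a fixed length (e.g. 100).
# F.interpolate with mode="trilinear" expects (N, C, T, H, W), so
# rearrange to that layout first and restore (T, C, H, W) afterwards.
fixed_len = 100
resized = F.interpolate(
    video.permute(1, 0, 2, 3).unsqueeze(0),  # (1, C, T, H, W)
    size=(fixed_len, video.shape[2], video.shape[3]),
    mode="trilinear",
    align_corners=False,
).squeeze(0).permute(1, 0, 2, 3)  # (100, 3, 224, 224)

print(strided.shape, resized.shape)
```

Either output can then be fed to the model in place of the full-length frame sequence.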

However, the pretrained model was not trained with this preprocessing, so it might not perform well if you evaluate it this way.