ControlNet / LAV-DF

[CVIU] Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization

Home Page: https://www.sciencedirect.com/science/article/pii/S1077314223001984

Frame-level processing

javadmozaffari opened this issue

Hello,

In this Temporal Forgery Localization model, the entire video is used as input. The model proposed by the authors shows promise in achieving accurate results, but using the whole video as input may pose a challenge in terms of memory consumption, especially for large datasets or videos with high-resolution frames. Would it be possible to modify the Temporal Forgery Localization model to accept individual frames instead of the entire video? This would reduce the amount of RAM required.

Hi, sorry for the late reply.

I think it might be hard, because the boundary matching mechanism requires all frames as input. But to save memory, you can try two ways to reduce the temporal size (see the sketch after the list).

  1. Sample frames with a stride for each video. For example, use only the 1st, 3rd, 5th, 7th, ... frames, so you end up with fewer frames per video.
  2. Interpolate the temporal axis to a fixed length for each video. For example, regardless of how long the video is, resize it to 100 frames.
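
Both options can be done as a simple preprocessing step. Here is a minimal sketch of the two ideas, assuming PyTorch and a video tensor in `(T, C, H, W)` layout; the tensor shapes and variable names are illustrative, not taken from the repository:

```python
import torch
import torch.nn.functional as F

# Illustrative video tensor: 300 frames of 3 x 224 x 224 (T, C, H, W).
video = torch.randn(300, 3, 224, 224)

# Option 1: strided sampling -- keep the 1st, 3rd, 5th, ... frames.
strided = video[::2]  # (150, 3, 224, 224)

# Option 2: interpolate the temporal axis to a fixed length (e.g. 100).
# F.interpolate with mode="trilinear" expects (N, C, T, H, W), so
# rearrange to that layout first and restore (T, C, H, W) afterwards.
fixed_len = 100
resized = F.interpolate(
    video.permute(1, 0, 2, 3).unsqueeze(0),  # (1, C, T, H, W)
    size=(fixed_len, video.shape[2], video.shape[3]),
    mode="trilinear",
    align_corners=False,
).squeeze(0).permute(1, 0, 2, 3)  # (100, 3, 224, 224)

print(strided.shape, resized.shape)
```

Either output can then be fed to the model in place of the full-length frame sequence.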

However, the pretrained model was not trained with this preprocessing, so it might not perform well if you evaluate it this way.