e-apostolidis / CA-SUM

A PyTorch Implementation of CA-SUM from "Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames", Proc. ACM ICMR 2022

How should I preprocess the video frames? Can you let me know?

thswodnjs3 opened this issue · comments

Hello, I'm a student interested in the Video Summarization task.

Firstly, I'm really impressed by your paper! I think it is an excellent paper!

However, I've read some other papers related to the Video Summarization task, but I haven't been able to find out whether any preprocessing steps are needed before feeding the video frames into GoogleNet.

Should I just feed each frame into GoogleNet as it is, or are there other preprocessing steps needed first? (such as resizing, center-cropping, or normalization with specific mean and std values)

I want to study the Video Summarization task further, but this is the most difficult part for me, because I get a different F-score when I use features extracted by myself, compared to the F-score reported in the paper (not only for the CA-SUM model, but also for other video summarization models from other papers).

After reading issue #1, I thought the performance drop was caused by my GPU.

However, I also get different F-scores when I test different preprocessing methods
(with center-crop and normalization I got 49.7989 on SumMe, but
with resize, center-crop and normalization I got 44.4729 on SumMe). So I think there must be specific preprocessing steps to apply before feeding frames into GoogleNet.

I hope you will answer my question ...

I'm not good at English, so if there is something you can't understand, please let me know.
Thanks

Hi,

thanks for your interest in our work. Most works in the literature are based on a set of deep features extracted using the pool5 layer of GoogleNet. These features were extracted by Ke Zhang and Wei-Lun Chao and are the ones stored in the h5 files that are available here: https://github.com/e-apostolidis/CA-SUM/tree/main/data

However, different versions/instances of GoogleNet (e.g. for PyTorch, Tensorflow, Keras) would result in different sets of feature vectors, and the GoogleNet model that was initially used for feature extraction is not known. So, to allow fair comparisons with existing methods, most works utilize the feature vectors that are available in the aforementioned h5 files.
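
As a side note, here is a minimal sketch of how these pre-extracted features can be inspected with h5py; the file name and the per-video 'features' key are assumptions based on the commonly used eccv16_dataset_*_google_pool5.h5 files, so please double-check the keys against the actual files:

    import h5py

    # Minimal sketch, assuming each video is stored as a group containing a
    # 'features' dataset with the pre-extracted pool5 GoogleNet vectors.
    with h5py.File("eccv16_dataset_summe_google_pool5.h5", "r") as hdf:
        for video_name in hdf.keys():
            features = hdf[video_name]["features"][...]
            print(video_name, features.shape)  # e.g. (n_sampled_frames, 1024)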

If you need to extract new feature vectors using GoogleNet, then you should:

  • load the video
    cap = cv2.VideoCapture(video)
  • perform an iterative process that gets the video frames (keep in mind that, most commonly in the summarization domain, the video frames are sampled in order to keep 2 fps)
    while cap.isOpened():
        success, frame = cap.read()
    • apply a color transformation to meet the specifications of GoogleNet
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    • apply frame resizing to meet the specifications of GoogleNet (desired_size = (224, 224))
        frame = cv2.resize(frame, desired_size)
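
Putting these steps together, a minimal end-to-end sketch could look like the one below (the function name and the way the 2 fps sampling is implemented, i.e. keeping every k-th frame based on the video's FPS, are illustrative assumptions, not the exact code we used):

    import cv2

    def extract_sampled_frames(video_path, target_fps=2, desired_size=(224, 224)):
        """Read a video, keep roughly 2 frames per second, and prepare them for GoogleNet."""
        cap = cv2.VideoCapture(video_path)
        video_fps = cap.get(cv2.CAP_PROP_FPS)
        # Keep every k-th frame so that approximately target_fps frames per second remain.
        step = max(int(round(video_fps / target_fps)), 1)

        frames, frame_idx = [], 0
        while cap.isOpened():
            success, frame = cap.read()
            if not success:
                break
            if frame_idx % step == 0:
                # OpenCV reads frames in BGR order; GoogleNet expects RGB input.
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                # Resize to the input resolution expected by GoogleNet.
                frame = cv2.resize(frame, desired_size)
                frames.append(frame)
            frame_idx += 1
        cap.release()
        return frames

Each returned frame is a 224x224x3 RGB array that can then be normalized in whatever way your GoogleNet implementation expects, and passed through the network to get the pool5 feature vectors.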

Finally, from our experience, many different factors might affect the model's training and the summarization performance (related to both the software and hardware specifications). So, when aiming to reproduce some experimental results, try to use the experimental settings (e.g. data splits, initialization methods, seed values) and the software components (e.g. PyTorch 1.8) reported in each work.
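
For instance, a minimal sketch for fixing the seed values in a PyTorch pipeline could look like the one below (the specific seed value and the cuDNN flags are illustrative assumptions, not our exact configuration):

    import random

    import numpy as np
    import torch

    def set_seed(seed=12345):
        """Fix the random seeds of the libraries typically involved in training."""
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Make cuDNN deterministic, trading some speed for reproducibility.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False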

Thank you so much for your kind reply!
Your answer really helps me understand Video Summarization more deeply.
Following your advice, I will keep researching Video Summarization.
Thanks again, and have a good day!