mpalaourg / CA-SUM

A PyTorch Implementation of CA-SUM from "Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames", Proc. ACM ICMR 2022

Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames

PyTorch Implementation of CA-SUM [Paper] [DOI] [Cite]

  • From "Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames".
  • Written by Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris and Ioannis Patras.
  • This software can be used for training a deep learning architecture that estimates the frames' importance by integrating a concentrated attention mechanism and using information about the frames' uniqueness and diversity. The integrated mechanism focuses on non-overlapping blocks in the main diagonal of the attention matrix and makes better estimates about the significance of different parts of the video by considering the uniqueness and diversity of the associated frames. Training is performed in an unsupervised manner, without any ground-truth data. After being trained on a collection of videos, the CA-SUM model can produce summaries for unseen videos according to a user-specified time budget for the summary duration.
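To make the idea of "non-overlapping blocks in the main diagonal of the attention matrix" concrete, here is a minimal sketch that builds a binary mask marking those blocks for a video with a given number of frames. It is only an illustration of the block layout, not the repository's implementation: the function name `block_diagonal_mask` is hypothetical, and the `block_size` argument is assumed to play the role of the `--block_size` option.

```python
def block_diagonal_mask(n_frames, block_size):
    """Return an n_frames x n_frames 0/1 mask with ones inside the
    non-overlapping blocks that lie along the main diagonal."""
    if n_frames <= 0 or block_size <= 0:
        raise ValueError("n_frames and block_size must be positive")
    # Frames i and j share a block when they fall in the same
    # consecutive group of block_size frames.
    return [
        [1 if i // block_size == j // block_size else 0 for j in range(n_frames)]
        for i in range(n_frames)
    ]


# Toy example: 5 frames, blocks of 2 frames (the last block is shorter).
mask = block_diagonal_mask(n_frames=5, block_size=2)
for row in mask:
    print(row)
```

In this toy case the attention entries kept by the mask are those between frames {0, 1}, {2, 3} and {4}; everything outside these diagonal blocks is zero.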

Main dependencies

Developed, checked and verified on an Ubuntu 20.04.3 PC with an NVIDIA RTX 2080Ti GPU and an i5-11600K CPU. Main packages required:

Python    PyTorch   CUDA Version   cuDNN Version   TensorBoard   TensorFlow   NumPy    H5py
3.8(.8)   1.7.1     11.0           8005            2.4.0         2.4.1        1.20.2   2.10.0


Structured h5 files with the video features and annotations of the SumMe and TVSum datasets are available within the data folder. The GoogleNet features of the video frames were extracted by Ke Zhang and Wei-Lun Chao and the h5 files were obtained from Kaiyang Zhou. These files have the following structure:

    /features                 2D-array with shape (n_steps, feature-dimension)
    /gtscore                  1D-array with shape (n_steps), stores ground truth importance score (used for training, e.g. regression loss)
    /user_summary             2D-array with shape (num_users, n_frames), each row is a binary vector (used for test)
    /change_points            2D-array with shape (num_segments, 2), each row stores indices of a segment
    /n_frame_per_seg          1D-array with shape (num_segments), indicates number of frames in each segment
    /n_frames                 number of frames in original video
    /picks                    positions of sub-sampled frames in original video
    /n_steps                  number of sub-sampled frames
    /gtsummary                1D-array with shape (n_steps), ground truth summary provided by user (used for training, e.g. maximum likelihood)
    /video_name (optional)    original video name, only available for SumMe dataset
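The fields above are related to each other, which can be checked with a few lines of Python. The sketch below works on a plain dictionary that stands in for one video entry; when reading a real file with h5py, each field would first have to be loaded and converted to Python lists or numbers. Two points are assumptions rather than facts stated here: that each row of `change_points` holds inclusive `[start, end]` frame indices, and that the segments cover the whole video. The helper names and the toy values are hypothetical.

```python
def segment_lengths(change_points):
    """Frames per segment, assuming inclusive [start, end] indices."""
    return [end - start + 1 for start, end in change_points]


def check_video_entry(video):
    """Sanity-check one video entry given as a mapping: field name -> value."""
    lengths = segment_lengths(video["change_points"])
    assert lengths == list(video["n_frame_per_seg"]), "segment lengths disagree"
    assert sum(lengths) == video["n_frames"], "segments do not cover the video"
    assert len(video["picks"]) == video["n_steps"], "picks and n_steps disagree"
    assert len(video["features"]) == video["n_steps"], "features and n_steps disagree"
    assert len(video["gtscore"]) == video["n_steps"], "gtscore and n_steps disagree"
    for row in video["user_summary"]:
        assert len(row) == video["n_frames"], "user summary has wrong length"
    return True


# Toy entry (made-up values): 10 original frames, 2 sub-sampled frames.
toy_video = {
    "features": [[0.1, 0.2], [0.3, 0.4]],      # (n_steps, feature-dimension)
    "gtscore": [0.5, 0.7],                     # (n_steps)
    "user_summary": [
        [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],        # one binary row per user
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    ],
    "change_points": [[0, 3], [4, 9]],         # (num_segments, 2)
    "n_frame_per_seg": [4, 6],                 # (num_segments)
    "n_frames": 10,
    "picks": [0, 5],
    "n_steps": 2,
}
print(check_video_entry(toy_video))
```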

Original videos and annotations for each dataset are also available in the dataset providers' webpages:


Setup for the training process:

  • In, specify the path to the h5 file of the used dataset, and the path to the JSON file containing data about the utilized data splits.
  • In, define the directory where the analysis results will be saved to.

Arguments in

Parameter name  Description                                                     Default Value  Options
--mode          Mode for the configuration.                                     'train'        'train', 'test'
--verbose       Whether to print training messages.                             'false'        'true', 'false'
--video_type    Dataset used for training the model.                            'SumMe'        'SumMe', 'TVSum'
--input_size    Size of the input feature vectors.                              1024           int > 0
--block_size    Size of the blocks utilized inside the attention matrix.        60             0 < int ≤ 60
--init_type     Weight initialization method.                                   'xavier'       None, 'xavier', 'normal', 'kaiming', 'orthogonal'
--init_gain     Scaling factor for the initialization methods.                  √2             None, float
--n_epochs      Number of training epochs.                                      400            int > 0
--batch_size    Size of the training batch, 20 for 'SumMe' and 40 for 'TVSum'.  20             0 < int ≤ len(Dataset)
--seed          Chosen number for generating reproducible random numbers.       12345          None, int
--clip          Gradient norm clipping parameter.                               5              float
--lr            Value of the adopted learning rate.                             5e-4           float
--l2_req        Value of the weight regularization factor.                      1e-5           float
--reg_factor    Value of the length regularization factor.                      0.6            0 < float ≤ 1
--split_index   Index of the utilized data split.                               0              0 ≤ int ≤ 4
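As an illustration of how the options listed above could be declared, here is a minimal `argparse` sketch that mirrors the parameter names and default values. It is an assumption about how such a parser might look, not a copy of the repository's configuration code; the helper names `str2bool` and `build_parser` are hypothetical.

```python
import argparse
import math


def str2bool(value):
    """Interpret 'true'/'false' strings (case-insensitive) as booleans."""
    lowered = str(value).lower()
    if lowered in ("true", "yes", "1"):
        return True
    if lowered in ("false", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected 'true' or 'false', got {value!r}")


def build_parser():
    """Hypothetical parser mirroring the documented names and defaults."""
    parser = argparse.ArgumentParser(description="Illustrative CA-SUM options")
    parser.add_argument("--mode", default="train", choices=["train", "test"])
    parser.add_argument("--verbose", type=str2bool, default=False)
    parser.add_argument("--video_type", default="SumMe", choices=["SumMe", "TVSum"])
    parser.add_argument("--input_size", type=int, default=1024)
    parser.add_argument("--block_size", type=int, default=60)
    parser.add_argument("--init_type", default="xavier")
    parser.add_argument("--init_gain", type=float, default=math.sqrt(2.0))
    parser.add_argument("--n_epochs", type=int, default=400)
    parser.add_argument("--batch_size", type=int, default=20)
    parser.add_argument("--seed", type=int, default=12345)
    parser.add_argument("--clip", type=float, default=5.0)
    parser.add_argument("--lr", type=float, default=5e-4)
    parser.add_argument("--l2_req", type=float, default=1e-5)
    parser.add_argument("--reg_factor", type=float, default=0.6)
    parser.add_argument("--split_index", type=int, default=0, choices=range(5))
    return parser


# Parse an explicit argument list instead of sys.argv, e.g. a TVSum run.
args = build_parser().parse_args(
    ["--video_type", "TVSum", "--batch_size", "40", "--reg_factor", "0.7"]
)
print(args.video_type, args.batch_size, args.reg_factor, args.n_epochs)
```

Options that are not passed keep their documented defaults, so only the dataset-specific values (such as the batch size of 40 for TVSum) need to be given explicitly.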


To train the model using one of the aforementioned datasets and for a number of randomly created splits of the dataset (where in each split 80% of the data is used for training and 20% for testing) use the corresponding JSON file that is included in the data/splits directory. This file contains the 5 randomly-generated splits that were utilized in our experiments.
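For readers who want to see what "5 randomly-generated splits with 80% of the data for training and 20% for testing" amounts to, the following sketch generates such splits and prints one of them as JSON. It does not reproduce the splits shipped in data/splits: the key names `train_keys` and `test_keys`, the `video_N` naming, and the seed are assumptions made for illustration. The toy collection has 25 videos, the size of SumMe.

```python
import json
import random


def make_splits(video_keys, n_splits=5, train_fraction=0.8, seed=12345):
    """Create n_splits random train/test partitions of the given video keys."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    n_train = int(round(train_fraction * len(video_keys)))
    splits = []
    for _ in range(n_splits):
        shuffled = list(video_keys)
        rng.shuffle(shuffled)
        splits.append({
            "train_keys": sorted(shuffled[:n_train]),
            "test_keys": sorted(shuffled[n_train:]),
        })
    return splits


# Toy collection of 25 videos (hypothetical names).
keys = [f"video_{i}" for i in range(1, 26)]
splits = make_splits(keys)
print(json.dumps(splits[0], indent=2))
```

Each split is drawn independently, so the test sets of different splits may overlap; within a split, the training and test sets are disjoint.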

For training the model using a single split, run:

for sigma in $(seq 0.5 0.1 0.9); do
    python model/ --split_index N --n_epochs E --batch_size B --video_type 'dataset_name' --reg_factor "$sigma"
done

where N refers to the index of the used data split, E to the number of training epochs, B to the batch size, dataset_name to the name of the used dataset, and $sigma to the length regularization factor, a hyper-parameter of our method that relates to the length of the generated summary.

Alternatively, to train the model for all 5 splits, use the and/or script and do the following:

chmod +x model/    # Makes the script executable.
chmod +x model/    # Makes the script executable.
./model/           # Runs the script. 
./model/           # Runs the script.  

Please note that after each training epoch the algorithm performs an evaluation step, using the trained model to compute the importance scores for the frames of each video of the test set. These scores are then used by the provided evaluation scripts to assess the overall performance of the model.

The progress of the training can be monitored via the TensorBoard platform by:

  • opening a command line (cmd) and running: tensorboard --logdir=/path/to/log-directory --host=localhost
  • opening a browser and pasting the returned URL from cmd.

Model Selection and Evaluation

The selection of a well-trained model is based on a two-step process. First, we keep one trained model per considered value for the length regularization factor sigma, by selecting the model (i.e., the epoch) that minimizes the training loss. Then, we choose the best-performing model (i.e., the sigma value) for a given data split through a mechanism that involves a fully-untrained model of the architecture and is based on transductive inference. More details about this assessment can be found in Section 4.2 of our work. To evaluate the trained models of the architecture and automatically select a well-trained one, define:
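The first step of this process (keeping, for each sigma value, the epoch with the lowest training loss) can be sketched as follows. The second step, which relies on a fully-untrained model and transductive inference, is not covered here. The function name `select_epoch_per_sigma` and the loss values are hypothetical, and epoch indices are zero-based in this sketch.

```python
def select_epoch_per_sigma(training_losses):
    """Map each sigma to the (epoch index, loss) pair with the lowest training loss.

    training_losses: dict mapping a sigma value to the list of per-epoch losses.
    """
    selected = {}
    for sigma, losses in training_losses.items():
        best_epoch = min(range(len(losses)), key=losses.__getitem__)
        selected[sigma] = (best_epoch, losses[best_epoch])
    return selected


# Made-up training losses for three sigma values and four epochs each.
toy_losses = {
    0.5: [0.90, 0.40, 0.35, 0.37],
    0.6: [0.80, 0.30, 0.32, 0.31],
    0.7: [0.85, 0.50, 0.45, 0.44],
}
selected = select_epoch_per_sigma(toy_losses)
for sigma, (epoch, loss) in sorted(selected.items()):
    print(f"sigma={sigma}: epoch {epoch} (training loss {loss})")
```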

and run via

sh evaluation/ "$exp_num" "$dataset" "$eval_method"

where $exp_num is the number of the experiment being evaluated, $dataset refers to the dataset being used, and $eval_method describes the approach used for computing the overall F-Score after comparing the generated summary with all the available user summaries (i.e., 'max' for SumMe and 'avg' for TVSum).
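The difference between the 'max' and 'avg' evaluation methods can be illustrated with a short sketch. It assumes that summaries are binary frame-level vectors, like the rows of `user_summary` in the h5 files, and that the F-Score is the usual harmonic mean of precision and recall computed on the overlapping frames. The function names are hypothetical and this is not the repository's evaluation script.

```python
def f_score(machine_summary, user_summary):
    """F-score (in [0, 1]) between two binary frame-level summaries."""
    if len(machine_summary) != len(user_summary):
        raise ValueError("summaries must have the same length")
    overlap = sum(m * u for m, u in zip(machine_summary, user_summary))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(machine_summary)
    recall = overlap / sum(user_summary)
    return 2 * precision * recall / (precision + recall)


def overall_f_score(machine_summary, user_summaries, eval_method):
    """Aggregate the per-user F-scores with 'max' or 'avg'."""
    scores = [f_score(machine_summary, user) for user in user_summaries]
    if eval_method == "max":
        return max(scores)
    if eval_method == "avg":
        return sum(scores) / len(scores)
    raise ValueError("eval_method must be 'max' or 'avg'")


# Toy example: one generated summary compared with two user summaries.
machine = [1, 1, 0, 0, 1, 0]
users = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
]
print(overall_f_score(machine, users, "max"))
print(overall_f_score(machine, users, "avg"))
```

With 'max', a generated summary is rewarded for matching at least one user well; with 'avg', it has to agree with the users on average.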

For further details about the adopted structure of directories in our implementation, please check line #7 and line #13 of

Trained models and Inference

We have released the trained models for our proposed method. The inference script lets you evaluate the reported trained models for our 5 randomly created data splits. First, download the trained models with the following script:

sudo apt-get install unzip wget
wget "" -O
unzip -d inference
rm -f

Then, specify the paths for the model, the split_file, the dataset and the annotations about the frames' importance in use. Finally, run the script with the following syntax:

python inference/ --dataset 'dataset_name'

where dataset_name refers to the name of the used dataset.


If you find our work, code or pretrained models useful in your work, please cite the following publication:

E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, "Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames", Proc. of the 2022 Int. Conf. on Multimedia Retrieval (ICMR ’22), June 2022, Newark, NJ, USA.


author = {Apostolidis, Evlampios and Balaouras, Georgios and Mezaris, Vasileios and Patras, Ioannis},
title = {Summarizing Videos Using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames},
year = {2022},
isbn = {9781450392389},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {},
doi = {10.1145/3512527.3531404},
pages = {407-415},
numpages = {9},
keywords = {frame diversity, frame uniqueness, concentrated attention, unsupervised learning, video summarization},
location = {Newark, NJ, USA},
series = {ICMR '22}


Copyright (c) 2022, Evlampios Apostolidis, Georgios Balaouras, Vasileios Mezaris, Ioannis Patras / CERTH-ITI. All rights reserved. This code is provided for academic, non-commercial use only. Redistribution and use in source and binary forms, with or without modification, are permitted for academic non-commercial use provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation provided with the distribution.

This software is provided by the authors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the authors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.


This work was supported by the EU Horizon 2020 programme under grant agreements H2020-832921 MIRROR and H2020-951911 AI4Media.

