Referring Attention Learning for Weakly Supervised Temporal Grounding

Weakly supervised learning for temporal grounding, including action/activity localization and moment localization in videos. Top-down signal guided attention, or referring attention is novel; Losses should be novel (still developing); the use of Gumbel in the training is novel in term of binary attention.

STPN: Weakly Supervised Action Localization by Sparse Temporal Pooling Network

This is a re-implement of the paper Weakly Supervised Action Localization by Sparse Temporal Pooling Network, from Phuc, Google, CVPR 2018.

I3D features:

Create a feature directory and pour in I3D features provided from Sujoy Paul, UC, Riverside I3D features on Thumos 2014 (Optical flow obtained in 10 fps, I3D model pre-trained on kinetics)

Precision Updates on Thumos 2014 Challenges

Our baseline is not strong enough as people care more about IoU=0.5

map	IoU=0.1	0.2	0.3	0.4	0.5
TCAM-paper	52.00	44.70	35.50	25.80	16.9
Ours-TCAM	40.96	37.65	26.27	16.98	10.39
Ours_Margin_Loss	47.91	39.08	28.66	20.14	13.26

Other Implementation and Results

map	IoU=0.1	0.2	0.3	0.4	0.5
Demian Zhang-PT	40.80	34.00	26.90	20.50	14.40
JaeYoo Park-TF	52.10	44.20	34.70	26.10	17.70

Files

Training/Testing.ipynb: Files for T-CAM implementation and validation.

Training_LSTM.ipynb: Ours margin loss model based on T-CAM.

Training_Gumbel_LSTM.ipynb: In a mess and still working on it, please ignore it. :)

jacobswan1 / Weakly-Supervised-Action-Localization-by-Sparse-Temporal-Pooling-Network