
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Home Page: https://feint6k.github.io


Feint6K Dataset

Feint6K dataset for video-text understanding, introduced in the following paper:

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu Wang, Christian Häne, Alan Yuille

Meta Reality Labs   Johns Hopkins University   Meta AI

ECCV 2024     Project Page     arXiv

We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset, to better assess the capabilities of current video-text models and understand their limitations. To succeed on our new task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented data and are far behind human-level performance.

From our experiments on RCAD, we identify a key limitation of current contrastive approaches on video-text data and introduce LLM-teacher, a more effective approach that learns action semantics by leveraging knowledge from a pretrained large language model.

Data Preparation

  1. Download Feint6K data (.csv files with counterfactually augmented captions) from here.

  2. Download video data for MSR-VTT and VATEX to a video data folder, e.g., ./videos (a quick layout check is sketched after the directory tree below):

    ./videos
      |- msrvttvideo
      |   |- *.mp4
      |- vatexvideo
          |- *.mp4
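
A quick sanity check of this layout can catch problems before a long evaluation run. Below is a minimal sketch assuming the exact folder names above; check_videos.py is a hypothetical helper name, not a script shipped with this repo.

    # check_videos.py -- hypothetical helper, not part of this repo.
    # Verifies that the expected video folders exist and contain .mp4 files.
    from pathlib import Path
    import sys

    VIDEO_ROOT = Path("./videos")  # the folder later passed via --video_path

    def main() -> None:
        ok = True
        for folder in ("msrvttvideo", "vatexvideo"):
            path = VIDEO_ROOT / folder
            n_videos = len(list(path.glob("*.mp4"))) if path.is_dir() else 0
            print(f"{path}: {n_videos} .mp4 files")
            ok = ok and n_videos > 0
        if not ok:
            sys.exit("Missing videos -- see Data Preparation above.")

    if __name__ == "__main__":
        main()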
    

Example RCAD Evaluation on Feint6K Dataset

  1. Compute the video-text similarity matrix, e.g., with LanguageBind. Similarity matrices will be saved to sim_mat_msrvtt.npy and sim_mat_vatex.npy for RCAD on MSR-VTT and VATEX, respectively.

    # install and activate conda environment for LanguageBind
    # see: https://github.com/PKU-YuanGroup/LanguageBind?tab=readme-ov-file#%EF%B8%8F-requirements-and-installation
    conda activate languagebind
    
    python3 compute_sim_mat_languagebind.py --video_path videos
  2. Compute RCAD metrics from the saved similarity matrix for any video-text model (a minimal sketch of the metric computation follows this list):

    python3 eval_rcad.py

    The RCAD results will be printed to the console, e.g.,

    RCAD on msrvtt: R@1=41.7 R@3=76.5 meanR=2.4 medianR=2.0
    RCAD on vatex: R@1=43.2 R@3=77.2 meanR=2.3 medianR=2.0
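
To make these numbers concrete, here is a minimal sketch of how the RCAD rank statistics can be computed from a saved similarity matrix. The assumed layout (one row per video, one column per candidate caption, with the original caption in column 0 and the counterfactual captions after it) is an illustration only and may not match what eval_rcad.py actually expects.

    # rcad_metrics_sketch.py -- illustrative only; assumes each row holds one
    # video's similarities to its caption candidates, true caption in column 0.
    import numpy as np

    def rcad_metrics(sim_mat: np.ndarray) -> dict:
        # Sort candidates by similarity (descending), then find the position
        # of the true caption (index 0) in that ordering; rank 1 = best.
        order = np.argsort(-sim_mat, axis=1)
        ranks = 1 + np.argmax(order == 0, axis=1)
        return {
            "R@1": 100.0 * np.mean(ranks <= 1),
            "R@3": 100.0 * np.mean(ranks <= 3),
            "meanR": float(np.mean(ranks)),
            "medianR": float(np.median(ranks)),
        }

    sim_mat = np.load("sim_mat_msrvtt.npy")  # produced in step 1
    metrics = rcad_metrics(sim_mat)
    print("RCAD on msrvtt: " + " ".join(f"{k}={v:.1f}" for k, v in metrics.items()))

Here R@k is the percentage of videos whose original caption ranks in the top k among all candidates, and meanR/medianR are the mean and median rank of the original caption (lower is better).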
    

Statements

All data collection and experiments in this work were conducted at JHU.

Ethics. We follow the ethics guidelines of ECCV 2024 and obtained Institutional Review Board (IRB) approvals prior to the start of our work. We described potential risks to the annotators, such as exposure to inappropriate videos from public video datasets, and explained the purpose of the study and how the collected data would be used. All annotators joined this project voluntarily and were paid fairly, as required by our institution.

Citation

If you find this dataset helpful, please cite:

@inproceedings{ma2024rethinking,
  title={Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data},
  author={Ma, Wufei and Li, Kai and Jiang, Zhongshi and Meshry, Moustafa and Liu, Qihao and Wang, Huiyu and H{\"a}ne, Christian and Yuille, Alan},
  booktitle={European Conference on Computer Vision},
  year={2024},
  organization={Springer}
}

License

Feint6K is CC-BY-NC 4.0 licensed, as found in the LICENSE file.

