Video-aided Unsupervised Grammar Induction

We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video. We explore rich features (e.g., action, object, scene, audio, face, OCR and speech) extracted from videos, taking the recent Compound PCFG model as our baseline. We further propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities.

PDF
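For intuition about the multi-modal part, here is a minimal, illustrative PyTorch sketch of projecting per-modality video features into a shared space and combining them into one video vector. The class name, feature dimensions, and gating scheme are assumptions for illustration only; the released code implements its own aggregation.

# Illustrative sketch (not the released implementation): project per-modality
# video features into a shared space and combine them with learned gates.
# All dimensions and names below are made up for this example.
import torch
import torch.nn as nn

MODALITY_DIMS = {"action": 2048, "object": 2048, "scene": 2208,
                 "audio": 128, "face": 512, "ocr": 300, "speech": 300}

class MultiModalAggregator(nn.Module):
    def __init__(self, dims, out_dim=512):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, out_dim) for m, d in dims.items()})
        self.gate = nn.ParameterDict({m: nn.Parameter(torch.zeros(1)) for m in dims})

    def forward(self, feats):
        # feats: {modality name: (batch, dim) pooled features for a video batch}
        names = list(feats)
        projected = torch.stack([self.proj[m](feats[m]) for m in names])   # (M, B, D)
        weights = torch.softmax(torch.cat([self.gate[m] for m in names]), dim=0)
        return (weights.view(-1, 1, 1) * projected).sum(dim=0)             # (B, D)

feats = {m: torch.randn(4, d) for m, d in MODALITY_DIMS.items()}
print(MultiModalAggregator(MODALITY_DIMS)(feats).shape)  # torch.Size([4, 512])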

News

  • 🍻 Our follow-up work is accepted by EMNLP 2022. [PDF] [Code]
  • 🏆 Our paper wins the best long paper award at NAACL 2021.
  • ☀️ Our paper is accepted by NAACL 2021.

Prerequisites

  • pytorch 1.5.0
  • python 3.7
  • easydict
  • terminaltables
  • tqdm
  • numpy
  • Torch-Struct (a forked version)
  • benepar (optional)

Quick Start

Please download the data from Dropbox or Box and save it to the data folder. Preprocessing details are described here.

Training

Run the following commands for training:

sh scripts/didemo/train.sh 
sh scripts/youcook2/train.sh 
sh scripts/msrvtt/train.sh 

For VC-PCFG and MMC-PCFG, each run takes approximately 2 hours on DiDeMo, 1 hour on YouCook2, and 10 hours on MSRVTT with a single GTX 1080 Ti or RTX 2080 Ti.
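As background on what these runs optimize, the sketch below shows the kind of joint objective used by VC-PCFG-style models: a PCFG language-modeling term plus a contrastive video-text matching term. The function names, margin, and exact loss form here are illustrative assumptions; the paper and the training code define the actual objective.

# Illustrative only: a hinge-based contrastive matching loss between paired
# text and video embeddings, added to the (compound) PCFG negative
# log-likelihood. The real objective is defined in the training code.
import torch
import torch.nn.functional as F

def matching_loss(text_emb, video_emb, margin=0.2):
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    scores = text_emb @ video_emb.t()                   # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)                    # scores of matched pairs
    cost_t = (margin + scores - pos).clamp(min=0)       # mismatched videos per text
    cost_v = (margin + scores - pos.t()).clamp(min=0)   # mismatched texts per video
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool)
    return cost_t[off_diag].sum() + cost_v[off_diag].sum()

pcfg_nll = torch.tensor(0.0)                            # placeholder for the PCFG term
loss = pcfg_nll + matching_loss(torch.randn(4, 512), torch.randn(4, 512))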

Testing

Our trained models are provided on Dropbox and Box. Please download them to the final_checkpoints folder.

Then, run the following commands for evaluation:

sh scripts/didemo/test.sh 
sh scripts/youcook2/test.sh 
sh scripts/msrvtt/test.sh 

Preprocessing Details

Download the data from the CVPR 2020 Video Pentathlon challenge and save it to the data folder. Then preprocess the sentences with the following scripts:

python tools/preprocess_captions.py
python tools/compute_gold_trees.py
python tools/generate_vocabularies.py
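The benepar dependency listed above can be used to produce the reference (gold) constituency trees for evaluation. Below is an illustrative sketch of that step; the file paths and output format are assumptions, the model name may differ across benepar versions, and tools/compute_gold_trees.py is the script that actually performs it.

# Illustrative sketch of computing gold constituency trees with benepar.
# Paths and the model name are assumptions; tools/compute_gold_trees.py
# implements the actual preprocessing.
import benepar
import nltk

benepar.download("benepar_en3")     # one-time model download
nltk.download("punkt")              # tokenizer used below

parser = benepar.Parser("benepar_en3")

with open("data/captions.txt") as f_in, open("data/gold_trees.txt", "w") as f_out:
    for line in f_in:
        tokens = nltk.word_tokenize(line.strip())
        tree = parser.parse(tokens)                       # returns an nltk.Tree
        f_out.write(tree.pformat(margin=100000) + "\n")   # one bracketed tree per line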

Acknowledgements

This repo is developed based on vpcfg and detr.

Citation

If any part of our paper or code is helpful to your work, please cite:

@InProceedings{zhang2021video,
author = {Zhang, Songyang and Song, Linfeng and Jin, Lifeng and Xu, Kun and Yu, Dong and Luo, Jiebo},
title = {Video-aided Unsupervised Grammar Induction},
booktitle = {NAACL},
year = {2021}
} 

License

MIT License

