
Code for ECIR 2023 paper "Dialogue-to-Video Retrieval"


Dialogue-to-Video Retrieval ๐ŸŽฅ๐Ÿ’ฌ

Chenyang Lyu, Manh-Duy Nguyen, Van-Tu Ninhโ€ , Liting Zhou, Cathal Gurrin, Jennifer Foster

School of Computing, Dublin City University, Dublin, Ireland ๐Ÿซ

โ€  The first three authors contributed equally. ๐Ÿค

This repository contains the code for the ECIR 2023 paper "Dialogue-to-Video Retrieval", which proposes a novel approach to retrieving videos based on dialogue queries. The system incorporates structured conversational information to improve retrieval performance. 💡

Table of Contents 📑

1. Introduction
2. Dataset
3. Pre-processing
4. Training
5. Usage
6. Dependencies

1. Introduction ๐Ÿ“

Recent years have witnessed an increasing amount of dialogue/conversation on the web, especially on social media. This has inspired the development of dialogue-based retrieval systems. In the case of dialogue-to-video retrieval, videos are retrieved based on user-generated dialogue queries. This approach utilizes structured conversational information to improve the accuracy of video recommendations. ๐ŸŒ

This repository presents a novel dialogue-to-video retrieval system that incorporates structured conversational information. Experimental results on the AVSD dataset demonstrate the superiority of our approach over previous models, achieving significant improvements in retrieval performance. ๐Ÿ“ˆ

2. Dataset ๐Ÿ“š

To run the system, you need to download the AVSD dataset. The dataset is available at the following links:

In addition, you also need to download the original videos from the Charades dataset. The videos can be downloaded from this link.

Please put the downloaded videos and dataset into the directory "data/avsd/". ๐Ÿ“‚
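For reference, a plausible layout after downloading might look like the tree below (the file names are illustrative; keep whatever names the dataset release uses):

```
data/avsd/
    videos/        (raw Charades videos, .mp4)
    train.json     (AVSD dialogue annotations; names illustrative)
    val.json
    test.json
```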

3. Pre-processing ๐Ÿ”„

Before training the dialogue-to-video retrieval model, you need to pre-process the AVSD dataset. To do this, run the following command:

python data_preprocess.py

This script extracts video frames and audio tracks from the AVSD videos and processes the dataset into a tensor dataset. 🎞️🔉
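The extraction details live in data_preprocess.py; for orientation, here is a minimal sketch of the same idea, assuming OpenCV and a system ffmpeg install (the function names are illustrative, not the script's API):

```python
import subprocess
from pathlib import Path

import cv2  # pip install opencv-python
import numpy as np

def extract_frames(video_path: str, out_dir: str, n_frames: int = 12) -> None:
    """Sample n_frames evenly spaced frames from a video and save them as JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames, dtype=int)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(str(Path(out_dir) / f"frame_{i:02d}.jpg"), frame)
    cap.release()

def extract_audio(video_path: str, wav_path: str) -> None:
    """Dump the audio track to 16 kHz mono WAV using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
```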

4. Training ๐Ÿš€

To train the dialogue-to-video retrieval model, use the provided script. Note that it is expected to run on a server with at least 4 GPUs (ideally NVIDIA A100). The total batch size should be 16, so on a 4-GPU server the per-GPU batch size is 4 (hence --train_batch_size 4 below).

python run_dialogue_to_video_retrieval.py --do_train --do_eval --num_train_epochs 5 --n_frames 12 --learning_rate 1e-5 --train_batch_size 4 --eval_batch_size 16 --attention_heads 8 --eval_steps 100000 --n_gpu 4 --image_dir data/avsd/frames/ --clip_model_name openai/clip-vit-base-patch16 --clip_processor_name ViT-B/16
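For intuition about what training optimizes, below is a minimal sketch of a CLIP-based symmetric contrastive step of the kind commonly used for text-video retrieval. This is an illustration under assumptions, not the repository's implementation (see run_dialogue_to_video_retrieval.py for that); mean-pooling frame features and the symmetric InfoNCE form are standard choices, and all names here are hypothetical:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def contrastive_step(dialogues, video_frames, temperature=0.07):
    """dialogues: list[str]; video_frames: (B, n_frames, 3, 224, 224) preprocessed pixels."""
    B, n_frames = video_frames.shape[:2]
    text_inputs = processor(text=dialogues, return_tensors="pt",
                            padding=True, truncation=True)
    text_emb = model.get_text_features(**text_inputs)           # (B, D)
    frame_emb = model.get_image_features(pixel_values=video_frames.flatten(0, 1))
    video_emb = frame_emb.view(B, n_frames, -1).mean(dim=1)     # mean-pool frames -> (B, D)
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(B)
    # Symmetric InfoNCE: dialogue->video and video->dialogue directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```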

5. Usage ๐Ÿ’ป

Once the model is trained, you can use it for dialogue-to-video retrieval. Provide a dialogue query, and the system will retrieve the most relevant videos based on the query. ๐Ÿ”Ž
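As a sketch of how such retrieval could work at inference time (again illustrative, assuming CLIP-style encoders and video embeddings precomputed offline; `retrieve` and its arguments are hypothetical):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

@torch.no_grad()
def retrieve(query: str, video_embs: torch.Tensor, video_ids: list, top_k: int = 5):
    """Rank videos by cosine similarity to the dialogue query.

    video_embs: (N, D) L2-normalized video embeddings, e.g. mean-pooled
    CLIP frame features computed offline for the whole collection.
    """
    inputs = processor(text=[query], return_tensors="pt", padding=True, truncation=True)
    q = F.normalize(model.get_text_features(**inputs), dim=-1)  # (1, D)
    scores = (q @ video_embs.t()).squeeze(0)                    # (N,)
    best = scores.topk(min(top_k, len(video_ids))).indices.tolist()
    return [(video_ids[i], scores[i].item()) for i in best]
```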

6. Dependencies ๐Ÿ› ๏ธ

  • Python (>=3.8) ๐Ÿ
  • PyTorch (>=2.0) 🔥
  • NumPy ๐Ÿงฎ
  • Pandas ๐Ÿผ

Please make sure to install the required dependencies before running the code. โš™๏ธ

Citation ๐Ÿ“„

If you find our paper useful, please cite it using the BibTeX below:

@inproceedings{lyu2023dialogue,
  title={Dialogue-to-Video Retrieval},
  author={Lyu, Chenyang and Nguyen, Manh-Duy and Ninh, Van-Tu and Zhou, Liting and Gurrin, Cathal and Foster, Jennifer},
  booktitle={Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2--6, 2023, Proceedings, Part II},
  pages={493--501},
  year={2023},
  organization={Springer}
}


License: Apache License 2.0

