Jiamian-Wang / T-MASS-text-video-retrieval

Official implementation of "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval (CVPR 2024 Highlight)"


Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval (CVPR 2024 Highlight)

Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao, "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval".

[Paper] [Pretrained Models]


Abstract: The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may be too limited in expressiveness to capture the video embedding and empower retrieval. In this study, we propose a new stochastic text modeling method, T-MASS, i.e., text is modeled as a stochastic embedding, to enrich the text embedding with a flexible and resilient semantic range, yielding a text mass. Specifically, we introduce a similarity-aware radius module to adapt the scale of the text mass to the given text-video pairs. In addition, we design a support text regularization to further control the text mass during training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over the baseline (3%–6.3% in R@1). T-MASS also achieves state-of-the-art performance on five benchmark datasets, including MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.


Content

  1. Dependencies
  2. Dataset
  3. Evaluation
  4. Training
  5. Citation
  6. Acknowledgement
  7. Contact

Dependencies

  • PyTorch 1.12.1
  • OpenCV 4.7.0
  • transformers 4.30.2
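For reference, a minimal environment sketch. Only the version numbers above come from this README; the pip package names (torch, opencv-python, transformers) and the Python version are assumptions:

```bash
# Sketch: package names and Python version are assumptions, not from this README.
conda create -n t-mass python=3.8 -y
conda activate t-mass
pip install torch==1.12.1 opencv-python==4.7.0.72 transformers==4.30.2
```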

Dataset

To download MSRVTT, LSMDC, and DiDeMo, please follow CLIP4Clip.

You will need to request permission from MPII to download and use the Standard LSMDC data.

For LSMDC, download the data split csv files into ./data/LSMDC/.

For DiDeMo, we recommend using gdrive to download the video data:

  • Set up gdrive by following its "Getting started" guide.
  • Download the video data with gdrive files download --recursive FOLDER_ID_FROM_URL (a sketch of this flow follows the list).
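The sketch below assumes the gdrive 3.x CLI; the account add subcommand is an assumption taken from that version's setup flow, so defer to the "Getting started" guide if it differs:

```bash
# Assumes gdrive 3.x; authentication details follow the tool's "Getting started" guide.
gdrive account add                                    # one-time authentication (assumed subcommand)
gdrive files download --recursive FOLDER_ID_FROM_URL  # ID taken from the shared folder's URL
```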

| Dataset | Video Clips | Text-Video Pairs | Scale | Link |
|---------|-------------|------------------|-------|------|
| MSR-VTT | 10K | one-to-twenty | 6.7 GB | link |
| LSMDC | 118,081 | one-to-one | 1.3 TB | link |
| DiDeMo | 10,464 | one-to-many | 581 GB | link |

Evaluation

Download the checkpoints into ./outputs/{Dataset}/{FOLDER_NAME_UNDER_*Dataset*}.
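For illustration only, an assumed on-disk layout after downloading; the {FOLDER_NAME_UNDER_*Dataset*} placeholders are whatever folder names the checkpoint archives unpack to, and the file name shown is hypothetical:

```text
outputs/
├── MSR-VTT-9k/
│   └── {FOLDER_NAME_UNDER_MSR-VTT-9k}/
│       └── checkpoint_epoch0.pth   # hypothetical file name
├── LSMDC/
│   └── {FOLDER_NAME_UNDER_LSMDC}/
└── DiDeMo/
    └── {FOLDER_NAME_UNDER_DiDeMo}/
```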

Repeating the testing process over --stochasic_trials incurs extra time or memory overhead; the sequential strategy provided here is more memory-friendly. We adopt --seed=24 and --stochasic_trials=20 for all methods. One may specify --save_memory_mode for larger datasets or computationally constrained platforms at evaluation time. As in XPool, evaluation defaults to text-to-video retrieval (i.e., --metric=t2v); for video-to-text retrieval, specify --metric=v2t. To apply DSL post-processing to the evaluation results, specify --DSL.

In each command, replace {VIDEO_DIR} (the value of --videos_dir) with the path to the dataset.

| Dataset | Command | Checkpoint File | t2v R@1 |
|---------|---------|-----------------|---------|
| MSR-VTT-9k | `python test.py --datetime={FOLDER_NAME_UNDER_MSR-VTT-9k} --arch=clip_stochastic --videos_dir={VIDEO_DIR} --batch_size=32 --noclip_lr=3e-5 --transformer_dropout=0.3 --dataset_name=MSRVTT --msrvtt_train_file=9k --stochasic_trials=20 --gpu='0' --load_epoch=0 --exp_name=MSR-VTT-9k` | Link | 50.2 |
| LSMDC | `python test.py --arch=clip_stochastic --exp_name=LSMDC --videos_dir={VIDEO_DIR} --batch_size=32 --noclip_lr=1e-5 --transformer_dropout=0.3 --dataset_name=LSMDC --stochasic_trials=20 --gpu='0' --num_epochs=5 --stochastic_prior=normal --stochastic_prior_std=3e-3 --load_epoch=0 --datetime={FOLDER_NAME_UNDER_LSMDC}` | Link | 28.9 |
| DiDeMo | `python test.py --num_frame=12 --raw_video --arch=clip_stochastic --exp_name=DiDeMo --videos_dir={VIDEO_DIR} --batch_size=32 --noclip_lr=1e-5 --transformer_dropout=0.4 --dataset_name=DiDeMo --stochasic_trials=20 --gpu='0' --num_epochs=5 --load_epoch=0 --datetime={FOLDER_NAME_UNDER_DiDeMo}` | Link | 50.9 |
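For instance, video-to-text evaluation with DSL post-processing on MSR-VTT-9k reuses the table's command with the two flags described above appended (a sketch, not a separately verified configuration):

```bash
# MSR-VTT-9k evaluation command from the table, with --metric=v2t and --DSL appended (sketch).
python test.py --datetime={FOLDER_NAME_UNDER_MSR-VTT-9k} --arch=clip_stochastic \
  --videos_dir={VIDEO_DIR} --batch_size=32 --noclip_lr=3e-5 --transformer_dropout=0.3 \
  --dataset_name=MSRVTT --msrvtt_train_file=9k --stochasic_trials=20 --gpu='0' \
  --load_epoch=0 --exp_name=MSR-VTT-9k --metric=v2t --DSL
```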

Training

Run the following training commands to reproduce the results above. Taking MSRVTT as an example, one may enable support text regularization by specifying --support_loss_weight, and --evals_per_epoch can be increased to select a better checkpoint. The CLIP backbone defaults to --clip_arch=ViT-B/32; to train on a larger backbone, specify --clip_arch=ViT-B/16 (see the sketch after the table). If your copy of the dataset is incomplete, enlarging --num_epochs by one or two may yield better performance.

| Dataset | Command |
|---------|---------|
| MSR-VTT-9k | `python train.py --arch=clip_stochastic --exp_name=MSR-VTT-9k --videos_dir={VIDEO_DIR} --batch_size=32 --noclip_lr=3e-5 --transformer_dropout=0.3 --dataset_name=MSRVTT --msrvtt_train_file=9k --stochasic_trials=20 --gpu='0' --num_epochs=5 --support_loss_weight=0.8` |
| LSMDC | `python train.py --arch=clip_stochastic --exp_name=LSMDC --videos_dir={VIDEO_DIR} --batch_size=32 --noclip_lr=1e-5 --transformer_dropout=0.3 --dataset_name=LSMDC --stochasic_trials=20 --gpu='0' --num_epochs=5 --stochastic_prior=normal --stochastic_prior_std=3e-3` |
| DiDeMo | `python train.py --num_frame=12 --raw_video --arch=clip_stochastic --exp_name=DiDeMo --videos_dir={VIDEO_DIR} --batch_size=32 --noclip_lr=1e-5 --transformer_dropout=0.4 --dataset_name=DiDeMo --stochasic_trials=20 --gpu='0' --num_epochs=5` |
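As noted above, a ViT-B/16 run is the same command with a different --clip_arch; here is a sketch based on the MSR-VTT-9k row:

```bash
# MSR-VTT-9k training command from the table, switched to the larger CLIP backbone (sketch).
python train.py --arch=clip_stochastic --exp_name=MSR-VTT-9k --videos_dir={VIDEO_DIR} \
  --batch_size=32 --noclip_lr=3e-5 --transformer_dropout=0.3 --dataset_name=MSRVTT \
  --msrvtt_train_file=9k --stochasic_trials=20 --gpu='0' --num_epochs=5 \
  --support_loss_weight=0.8 --clip_arch=ViT-B/16
```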

Citation

If you find this work valuable for your research, we kindly request that you cite the following paper:

@inproceedings{wang2024text,
  title={Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval}, 
  author={Wang, Jiamian and Sun, Guohao and Wang, Pichao and Liu, Dongfang and Dianat, Sohail and Rabbani, Majid and Rao, Raghuveer and Tao, Zhiqiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

Acknowledgement

This code is built on XPool. Many thanks to its authors!

Contact

For discussions, please feel free to submit an issue or contact me via email at jiamiansc@gmail.com.
