
EgoVLP: Egocentric Video-Language Pretraining

Project page | arXiv: https://arxiv.org/pdf/2206.01670.pdf

TL;DR: We pioneer egocentric video-language pretraining across three fronts: a pretraining dataset, a model, and a development benchmark; the resulting pretrained model exhibits strong performance on six downstream tasks across three egocentric datasets.


📒 News

📝 Preparation

You may skip this step if pretraining is not required.

Ego4D videos and metadata

  1. Follow the guideline here and download the following to {PATH_TO_EGO4D}:

    • Ego4D source videos (nearly 7 TB).
    • Ego4D video metadata manifest.csv and benchmark metadata, e.g., nlq_train.json for NLQ.
    • Create the dir ./dataset and add a soft link by ln -s {PATH_TO_EGO4D} ./dataset/ego4d.
  2. For efficient pretraining, we compress the videos as follows (see the sketch after this list):

    • Resize the source videos so that the short side equals 256, using the script ./utils/video_resize.py.
    • Chunk the resized videos into segments of up to 600 seconds, using the script ./utils/video_chunk.py.
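
For reference, the sketch below shows roughly what these two steps do with plain ffmpeg (assumed to be available on PATH). The function names are illustrative only; the repo's actual logic lives in ./utils/video_resize.py and ./utils/video_chunk.py.

    import subprocess

    def resize_short_side(src, dst, short=256):
        # Scale so the shorter side becomes `short`, keeping the aspect ratio
        # (-2 keeps the other dimension divisible by 2 for the encoder).
        vf = f"scale='if(lt(iw,ih),{short},-2)':'if(lt(iw,ih),-2,{short})'"
        subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", vf, dst], check=True)

    def chunk_video(src, dst_pattern, seg_sec=600):
        # Split into segments of at most `seg_sec` seconds without re-encoding.
        subprocess.run(["ffmpeg", "-y", "-i", src, "-c", "copy", "-f", "segment",
                        "-segment_time", str(seg_sec), "-reset_timestamps", "1",
                        dst_pattern], check=True)

    # e.g. resize_short_side("video.mp4", "video_256.mp4")
    #      chunk_video("video_256.mp4", "video_256_%03d.mp4")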

EgoClip

  • Download the EgoClip metadata from here and place it at ./dataset/egoclip.csv.

  • For the usage of EgoClip, please refer to ./data_loader/EgoClip_EgoMCQ_dataset.py. The data format of EgoClip is:

    import pandas as pd

    # EgoClip metadata is tab-separated; skip malformed rows.
    # (For pandas < 1.3, use error_bad_lines=False instead of on_bad_lines.)
    metadata = pd.read_csv('./dataset/egoclip.csv', sep='\t', on_bad_lines='skip')
    print(metadata.shape[0])
    print(metadata.iloc[0])
    
    # Out:
    3847723                                                         # Num of clips for EgoClip
    
    clip_idx                                                     0  # the idx of clip
    video_uid                 001e3e4e-2743-47fc-8564-d5efd11f9e90  # the uid of source video
    video_dur                                           128.033333  # the duration of source video
    narration_source                              narration_pass_1  # the source of annotator
    narration_ind                                                0  # the idx of narration
    narration_time                                          3.3445  # the narration timestamp
    clip_start                                            2.967651  # the start timestamp of clip
    clip_end                                              3.721266  # the end timestamp of clip
    clip_text           #C C picks a bag of clothes from the floor  # the narration of clip
    tag_verb                                                  [93]  # the verb idx of the narration
    tag_noun                                        [192, 115, 12]  # the noun idx of the narration
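
Note that pandas reads tag_verb and tag_noun back as strings such as "[192, 115, 12]". Continuing the snippet above, one way to turn them into Python lists (a convenience sketch, not part of the repo):

    import ast

    # Convert the string-encoded index lists into real Python lists.
    metadata['tag_verb'] = metadata['tag_verb'].apply(ast.literal_eval)
    metadata['tag_noun'] = metadata['tag_noun'].apply(ast.literal_eval)
    print(metadata.iloc[0]['tag_noun'])  # e.g. [192, 115, 12]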

EgoMCQ

  • Download the EgoMCQ metadata from here and place it at ./dataset/egomcq.json.
  • For the usage of EgoMCQ, please refer to ./data_loader/EgoClip_EgoMCQ_dataset.py (a quick inspection snippet follows below).
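
The exact schema read from egomcq.json is defined by the data loader above; to simply peek at the file, a minimal sketch (not repo code):

    import json

    with open('./dataset/egomcq.json') as f:
        egomcq = json.load(f)

    # Number of multiple-choice questions.
    print(type(egomcq), len(egomcq))
    # Inspect one entry to see the fields consumed by the data loader
    # (the top level may be a list or a dict, so handle both).
    first = egomcq[0] if isinstance(egomcq, list) else next(iter(egomcq.values()))
    print(first)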

πŸ‹οΈβ€οΈ Pretraining

We pretrain EgoVLP on 4 nodes, each with 8 A100 GPUs (10 epochs in about two days). Pretraining uses the EgoNCE objective; a generic contrastive-loss sketch is given after the commands below for context.

  • Train on EgoClip: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_egoclip.py --config ./configs/pt/egoclip.json

  • Test on EgoMCQ: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_egoclip.py --config ./configs/eval/egomcq.json

  • Monitor the EgoMCQ performance during pretraining: tensorboard --logdir ./results --bind_all
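
For context, EgoNCE is the paper's egocentric-aware contrastive objective: it extends a standard video-text InfoNCE loss with action-aware positive sampling and scene-aware negative sampling (see the paper for details). As a reference point only, and not the repo's implementation, a plain video-text InfoNCE skeleton looks like this:

    import torch
    import torch.nn.functional as F

    def info_nce(video_emb, text_emb, temperature=0.05):
        # video_emb, text_emb: (B, D) L2-normalized embeddings of paired clips/narrations.
        sim = video_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
        labels = torch.arange(sim.size(0), device=sim.device)  # matched pairs sit on the diagonal
        # Symmetric cross-entropy over the video->text and text->video directions.
        return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))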

🗄 Pretrained Weights

  • We have released our pretrained EgoVLP model (EgoClip w/ EgoNCE) in Google Drive.

🔧 Downstream Tasks

EPIC-Kitchens MIR

  • Results:
| Model | Mode | # Frames | Video-Text PT | Pretrained Weight | mAP (V2T) | mAP (T2V) | mAP (Avg) | nDCG (V2T) | nDCG (T2V) | nDCG (Avg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EgoVLP | Zero-shot | 4 | EgoClip w/ EgoNCE | Google Drive | 19.4 | 13.9 | 16.6 | 24.1 | 22.0 | 23.1 |
| EgoVLP | Fine-tuning w/ MI-MM | 16 | EgoClip w/ EgoNCE | Google Drive | 49.9 | 40.5 | 45.0 | 60.9 | 57.9 | 59.4 |
| EgoVLP* | Fine-tuning w/ Adaptive MI-MM | 16 | EgoClip w/ EgoNCE | Google Drive | 52.3 | 40.1 | 46.2 | 62.2 | 58.6 | 60.4 |
| EgoVLP* | ⬆️ w/ Dual-softmax | 16 | EgoClip w/ EgoNCE | ⬆️ | 53.8 | 40.9 | 47.4 | 63.3 | 59.6 | 61.4 |

(EgoVLP* denotes our submission to Multi-Instance Retrieval@EPIC-Kitchens Challenge 2022; a dual-softmax sketch follows the Train/Test commands below.)

  • Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_epic.py --config ./configs/ft/epic.json

  • Test: python3 ./run/test_epic.py
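
The last row of the EPIC-Kitchens table applies dual-softmax re-ranking at inference time. A common formulation from the video-text retrieval literature re-weights the text-video similarity matrix with a softmax prior over the opposite axis; the sketch below illustrates that idea, and the exact variant used for the challenge submission may differ.

    import torch
    import torch.nn.functional as F

    def dual_softmax_rerank(sim, temperature=0.01):
        # sim: (num_text, num_video) similarity matrix from the retrieval model.
        prior_over_texts = F.softmax(sim / temperature, dim=0)   # per video, a distribution over texts
        prior_over_videos = F.softmax(sim / temperature, dim=1)  # per text, a distribution over videos
        t2v_scores = sim * prior_over_texts    # used to rank videos for each text
        v2t_scores = sim * prior_over_videos   # used to rank texts for each video
        return t2v_scores, v2t_scores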

Charades-Ego

  • Results:
| Model | Mode | # Frames | Video-Text PT | Pretrained Weight | mAP |
| --- | --- | --- | --- | --- | --- |
| EgoVLP | Zero-shot | 16 | EgoClip w/ EgoNCE | Google Drive | 25.0 |
| EgoVLP | Fine-tuning | 16 | EgoClip w/ EgoNCE | Google Drive | 32.1 |

  • Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_epic.py --config ./configs/ft/charades.json

  • Test: python3 ./run/test_charades.py

NLQ

  • Extract video features: python3 ./run/test_nlq.py --subsample 'text'.
  • Extract text features: python3 ./run/test_nlq.py --subsample 'video'.
  • Fine-tune the VSLNet by replacing its input features.

MQ

  • Extract video features: python3 ./run/test_mq.py.
  • Fine-tune the VSGN by replacing its input features.

OSCC

  • Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_oscc.py --config ./configs/ft/oscc.json

PNR

  • Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_pnr.py --config ./configs/ft/pnr.json

🎓 Citation

If you find our work helpful, please cite our paper.

@article{kevin2022egovlp,
	title={Egocentric Video-Language Pretraining},
	author={Kevin Qinghong Lin and Alex Jinpeng Wang and Mattia Soldan and Michael Wray and Rui Yan and Eric Zhongcong Xu and Difei Gao and Rongcheng Tu and Wenzhe Zhao and Weijie Kong and Chengfei Cai and Hongfa Wang and Dima Damen and Bernard Ghanem and Wei Liu and Mike Zheng Shou},
	journal={arXiv preprint arXiv:2206.01670},
	year={2022}
}

✉️ Contact

This repo is maintained by Kevin. Questions and discussions are welcome via kevin.qh.lin@gmail.com.

We are happy to merge results and code if you transfer EgoVLP to other egocentric tasks or datasets.

🙏 Acknowledgements

This codebase is based on Frozen.

Thanks to Alex for the help with the Distributed Data Parallel implementation and Mattia for the help with the NLQ and MQ benchmarks.

LICENSE

MIT
