ScaleVLN

Official implementation of the ICCV 2023 paper:
Scaling Data Generation in Vision-and-Language Navigation
Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao

Paper & Appendices

Abstract

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which applies 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully-accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.

Updates

2023/07/31🔥: We release the ScaleVLN dataset, models and training codes for R2R.
2023/07/14: ScaleVLN is accpeted by ICCV2023! 🎉

TODOs

ScaleVLN dataset
Code, data and trained models for R2R
Code, data and trained models for other downstream tasks
Graph Construction Code
Speaker Training Code

Prerequisites

Installation

Install Matterport3D simulators: follow instructions here. We use the latest version instead of v0.1.

export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH

Install requirements:

conda create --name vlnde python=3.9
conda activate vlnde
cd VLN-DUET
pip install -r requirements.txt

R2R

Download the required data from here and unzip it to VLN-DUET/datasets/R2R. It should include three folders annotations, connectivity, connectivity_mp3d.
Download the CLIP and EnvEdit features from here and unzip it to VLN-DUET/datasets/R2R. It should include one folder features.
(Optional) Download the trained models from here and unzip it to VLN-DUET/datasets/R2R. It should include one folder trained_models.
Download pretrained lxmert from here and place it at VLN-DUET/datasets/pretrained/LXMERT.

Running

Pre-training

We use Two NVDIA A100 GPUs for pre-training agents on ScaleVLN.

bash run_r2r_b14.sh "0,1" 45008
bash run_r2r_h14.sh "0,1" 45009

Fine-tuning

We use one NVDIA A100 GPU for fine-tuning our agents.

bash scripts/r2r_b16_mix.sh 0
bash scripts/r2r_h14_envedit_mix.sh 0
...

Citation

Please cite our paper:

@InProceedings{wang2023scalevln,
    author    = {Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao},
    title     = {Scaling Data Generation in Vision-and-Language Navigation},
    booktitle = {ICCV 2023},
    year      = {2023}
}

Acknowledgement

We thank the developers of DUET, EnvDrop, Co-Mod GAN, Discrete-Continuous VLN, HAMT for their public code release.

wz0919 / ScaleVLN