MrZihan / Sim2Real-VLN-3DFF

Official implementation of Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation.

Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu and Shuqiang Jiang

Vision-and-language navigation (VLN) enables an agent to navigate to a remote location in a 3D environment by following a natural language instruction. In this field, agents are usually trained and evaluated in navigation simulators, and effective approaches for sim-to-real transfer are lacking. VLN agents equipped with only a monocular camera exhibit extremely limited performance, while mainstream VLN models trained with panoramic observations perform better but are difficult to deploy on most monocular robots. To address this, we propose a sim-to-real transfer approach that endows monocular robots with panoramic traversability perception and panoramic semantic understanding, thus smoothly transferring high-performance panoramic VLN models to common monocular robots. In this work, a semantic traversable map is proposed to predict agent-centric navigable waypoints, and novel view representations of these waypoints are predicted through 3D feature fields. These methods broaden the limited field of view of monocular robots and significantly improve navigation performance in the real world. Our VLN system outperforms previous SOTA monocular VLN methods on the R2R-CE and RxR-CE benchmarks in simulation and is also validated in real-world environments, providing a practical and high-performance solution for real-world VLN.

Figure 1. VLN models equipped with only a monocular camera have limited navigation success rates of less than 39% on the R2R-CE Val Unseen split. Most VLN models are trained and evaluated in the simulator [6] with panoramic observations, achieving navigation success rates of over 57%, but they are hard to deploy on real-world robots.
Figure 2. The sim-to-real transfer framework via semantic traversable map and 3D feature fields for vision-and-language navigation.

Requirements

  1. Install Habitat simulator: follow instructions from ETPNav and VLN-CE.
  2. (Optional) Download MP3D Scene Semantic Pclouds for pre-training the semantic and occupancy map predictor, following CM2.
  3. (Optional) Download GT annotation of waypoints for pre-training the traversable map predictor, following CWP.
  4. Install torch_kdtree for K-nearest feature search, following HNR-VLN (a minimal usage sketch is shown after this list).
    git clone https://github.com/thomgrand/torch_kdtree
    cd torch_kdtree
    git submodule init
    git submodule update
    pip3 install .
    
  5. Install tinycudann for faster multi-layer perceptrons (MLPs) from tiny-cuda-nn, following HNR-VLN (see the sketch after this list).
    pip3 install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
    
  6. Download the preprocessed data and checkpoints from BaiduNetDisk or TeraBox.
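
As a quick sanity check for step 4, the short sketch below builds a GPU kd-tree over a random point cloud and queries the K nearest neighbors. It assumes the build_kd_tree/query interface described in the torch_kdtree README; the tensor shapes and K here are illustrative, not the values used in the feature fields code.

    import torch
    from torch_kdtree import build_kd_tree  # assumed interface from the torch_kdtree README

    points = torch.rand(10000, 3, device="cuda")    # reference points, e.g. stored feature locations
    queries = torch.rand(128, 3, device="cuda")     # query positions, e.g. points sampled along rays

    kdtree = build_kd_tree(points)                  # build the kd-tree on the GPU
    dists, inds = kdtree.query(queries, nr_nns_searches=8)   # K=8 nearest neighbors per query
    print(dists.shape, inds.shape)                  # expected: (128, 8) distances and indices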

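For step 5, the following sketch instantiates a small fully fused MLP with tinycudann, purely to confirm the install; the network configuration here is an arbitrary example and not the MLP used by HNR-VLN.

    import torch
    import tinycudann as tcnn

    # Arbitrary example config; FullyFusedMLP supports n_neurons in {16, 32, 64, 128}.
    mlp = tcnn.Network(
        n_input_dims=3,
        n_output_dims=16,
        network_config={
            "otype": "FullyFusedMLP",
            "activation": "ReLU",
            "output_activation": "None",
            "n_neurons": 64,
            "n_hidden_layers": 2,
        },
    )
    x = torch.rand(1024, 3, device="cuda")
    y = mlp(x)            # runs the fused MLP on the GPU
    print(y.shape)        # expected: (1024, 16)
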
(Optional) Pre-train the Semantic Traversable Map

cd Traversable_Map
bash run_r2r/main.bash train 2341

(Optional) Pre-train the 3D Feature Fields

Follow the HNR-VLN and use the CLIP-ViT-B/16 as the visual encoder.
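
A minimal sketch of loading CLIP-ViT-B/16 with the openai/CLIP package is shown below, only to illustrate the encoder choice; how HNR-VLN actually extracts and aggregates the visual features for the 3D feature fields is defined in its own codebase.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)

    image = preprocess(Image.open("example_rgb.png")).unsqueeze(0).to(device)  # hypothetical input frame
    with torch.no_grad():
        feat = model.encode_image(image)    # (1, 512) global CLIP image embedding
    print(feat.shape)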

(Optional) Pre-train the ETPNav without depth feature

Download the pretraining datasets and precomputed features from the pretrain_src folder in BaiduNetDisk or TeraBox.

cd ETPNav_without_depth
bash pretrain_src/run_pt/run_r2r.bash 2342

(Optional) Finetune the ETPNav without depth feature

Following ETPNav, for R2R-CE:

cd ETPNav_without_depth
bash run_r2r/main.bash train 2343

Following ETPNav, for RxR-CE:

cd ETPNav_without_depth
bash run_rxr/main.bash train 2343

Train and evaluate the monocular ETPNav with 3D Feature Fields

For R2R-CE:

cd VLN_3DFF
bash run_r2r/main.bash train 2344  # training
bash run_r2r/main.bash eval 2344   # evaluation
bash run_r2r/main.bash inter 2344  # inference

For RxR-CE:

cd VLN_3DFF
bash run_rxr/main.bash train 2344  # training
bash run_rxr/main.bash eval 2344   # evaluation
bash run_rxr/main.bash inter 2344  # inference

(Optional) Run on the Interbotix LoCoBot WX250 for real-world VLN

Ensure the robot and the server are on the same local area network (LAN).

Fill in the Server_IP and Robot_IP correctly in Server_Code/run.py and Robot_Code/robot.py.

Run the VLN model in the server:

cd Server_Code
python3 run.py

Run the control code in the robot:

cd Robot_Code
python3 robot.py
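
The actual message protocol between Server_Code/run.py and Robot_Code/robot.py is defined by those scripts. Purely to illustrate the LAN setup, the hypothetical sketch below shows a server listening on Server_IP and a robot client that sends an observation and receives an action; the port, pickle-based message format, and helper names are assumptions of this sketch, not the repository's protocol.

    # Hypothetical illustration of the LAN setup; the real protocol lives in
    # Server_Code/run.py and Robot_Code/robot.py. IP, port, and the pickle-based
    # message format below are assumptions of this sketch only.
    import pickle
    import socket
    import struct

    SERVER_IP = "192.168.1.100"   # placeholder for Server_IP
    PORT = 9000                   # placeholder port

    def send_msg(sock, obj):
        data = pickle.dumps(obj)
        sock.sendall(struct.pack(">I", len(data)) + data)   # length-prefixed payload

    def recv_msg(sock):
        size = struct.unpack(">I", sock.recv(4))[0]
        buf = b""
        while len(buf) < size:
            buf += sock.recv(size - len(buf))
        return pickle.loads(buf)

    def serve_once():                                   # server side (GPU machine)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.bind((SERVER_IP, PORT))
            srv.listen(1)
            conn, _ = srv.accept()
            obs = recv_msg(conn)                        # e.g., an RGB frame from the robot
            send_msg(conn, {"action": "forward"})       # e.g., the predicted action
            conn.close()

    def robot_step(obs):                                # robot side (LoCoBot)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.connect((SERVER_IP, PORT))
            send_msg(sock, obs)
            return recv_msg(sock)["action"]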

Issues

For the training process and training speed, see Issue#2.

Citation

@article{wang2024sim,
  title={Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation},
  author={Wang, Zihan and Li, Xiangyang and Yang, Jiahao and Liu, Yeqi and Jiang, Shuqiang},
  journal={arXiv preprint arXiv:2406.09798},
  year={2024}
}

Acknowledgments

Our code is based on ETPNav, HNR-VLN and CM2. Thanks for their great work!
