tpxbps/visionary

Overview

This repository contains the implementation of the paper "VISIONARY: Vision-aware enhancement with reminding scenes generated by captions via multimodal transformer for embodied referring expression".

Authors: Zhengwu Yuan, Peixian Tang, Xinguang Sang, Fan Zhang, Zheqi Zhang

Requirements

Follow the setup instructions of the baseline works DUET and LAD to prepare the environment and download the related data.
Then download the additional generated content for VISIONARY from here and put it in the datasets directory.

Training

After configuring the training strategy, run the following script to train:

cd training_src
sh scripts/final_frt_gd_finetuning_stable.sh

Evaluation

Replace resumedir in eval.sh, then run the script below to evaluate the model. The resulting file can be submitted to the online leaderboard to obtain the test performance.

cd training_src
sh scripts/eval.sh
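
When evaluating several checkpoints, editing resumedir by hand gets tedious. The following is a minimal sketch of a helper that rewrites it for you. It is not part of this repository: the function name set_resumedir is hypothetical, and it assumes that eval.sh assigns the variable on a line starting with `resumedir=`. Check your copy of the script before relying on it.

```shell
# Hypothetical helper, not part of the repository.
# ASSUMPTION: the eval script contains a line that starts with "resumedir=".
# Paths containing "|" or "&" are not supported by this simple sed pattern.
set_resumedir() {
    script="$1"      # path to the eval script, e.g. scripts/eval.sh
    ckpt_dir="$2"    # checkpoint directory to evaluate
    tmp="$(mktemp)" || return 1
    # Rewrite only the assignment line; every other line passes through unchanged.
    if sed "s|^resumedir=.*|resumedir=${ckpt_dir}|" "$script" > "$tmp"; then
        # Copy the content back instead of moving, so file permissions are kept.
        cat "$tmp" > "$script"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

After sourcing the function inside training_src, a call such as `set_resumedir scripts/eval.sh path/to/checkpoint_dir` (with your own checkpoint path) followed by `sh scripts/eval.sh` would evaluate that checkpoint, provided the assumption about the variable layout holds.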

P.S. The final checkpoints of the VISIONARY model can be found here.

Acknowledgement

The code is mainly based on LAD and DUET, and this work is inspired by PanoGen and KERM. Thanks for their awesome work!
