This is the official repository of 'Multi-Stage Synergistic Aggregation Network for Remote Sensing Visual Grounding'.
This project provides a method that uses cross-attention and query channel broadcasting as two fusion kernels, involving both queries, in the Multi-Stage Synergistic Aggregation Module (MSAM), combined with a Swin Transformer and a GPT-like generative approach.
The best models and the ablation-study models are available on Google Drive. The ablation-study code branches will be open-sourced gradually in this repo.
docker pull waynamigo/msam:py38t1.9
- install pytorch and torchvision (use either the conda or the pip command)
conda install pytorch==1.9.1 torchvision==0.10.1 -c pytorch
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
- install other requirements
pip install -r requirements.txt
- install mmcv-full (v1.7.1) with openmim
mim install mmcv-full==1.7.1
- if an unexpected error occurs (for example, one involving FormatCode), try downgrading or upgrading the dependencies.
e.g. TypeError: FormatCode() got an unexpected keyword argument 'verify'
fix: downgrade yapf from 0.40.2 to 0.40.1
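The version pins in the steps above can be checked before training. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_versions` are hypothetical, and treating yapf 0.40.1 as the expected version is an assumption taken from the troubleshooting note above.

```python
from importlib import metadata

# Versions named in the installation steps above.
# yapf 0.40.1 is assumed from the troubleshooting note, not from requirements.txt.
EXPECTED = {
    "torch": "1.9.1",
    "torchvision": "0.10.1",
    "mmcv-full": "1.7.1",
    "yapf": "0.40.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED, lookup=installed_version):
    """Map each mismatching package to a (found, wanted) pair."""
    problems = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Ignore local build tags such as "+cu111" when comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            problems[package] = (found, wanted)
    return problems


if __name__ == "__main__":
    for package, (found, wanted) in check_versions().items():
        print(f"{package}: found {found}, expected {wanted}")
```

Running the script inside the environment prints one line per package whose installed version differs from the pinned one, and prints nothing when everything matches.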
- Prepare the DIOR-RSVG dataset.
- Run
python scripts/xml2instances.py
to convert the annotations into the format used by our dataset preparation. The prepared data directory tree looks like this:
├── annotations (original xml dataset from DIOR-RSVG)
│   └── rsvgd
│       ├── instances.json
│       ├── ix_to_token.pkl
│       ├── token_to_ix.pkl
│       └── word_emb.npz
├── images
│   └── rsvgd
└── weights
    ├── darknet.weights
    ├── yolov3.weights
    └── detr-r50.pth
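The layout above can be verified with a short script before training starts. This is a sketch under assumptions: the root directory name `data` and the helper `missing_entries` are hypothetical, and the README does not say which of the listed weight files a given config actually needs, so the script simply reports every entry of the tree that is absent.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "annotations/rsvgd/instances.json",
    "annotations/rsvgd/ix_to_token.pkl",
    "annotations/rsvgd/token_to_ix.pkl",
    "annotations/rsvgd/word_emb.npz",
    "weights/darknet.weights",
    "weights/yolov3.weights",
    "weights/detr-r50.pth",
]
EXPECTED_DIRS = ["images/rsvgd"]


def missing_entries(data_root):
    """Return the expected files and directories that are absent under data_root."""
    root = Path(data_root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    return missing


if __name__ == "__main__":
    # "data" is a hypothetical root directory; point it at your own layout.
    for entry in missing_entries("data"):
        print("missing:", entry)
```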
The following is an example of model training on the DIOR-RSVG dataset.
python tools/train.py configs/msam/detection/msam_rsvgd.py --cfg-options ema=True
We train the model on an RTX 3090 with a total batch size of 16 for 80 epochs, which requires at least 18 GB of VRAM.
Run the following script to evaluate the trained model with a single GPU.
python tools/test.py <config-path> --load-from <model-path>
python tools/test.py models/20230520_120410_qb_ca_mixed/20230520_120410_qb_ca_mixed.py --load-from work_dir/20230520_120410_qb_ca_mixed/latest.pth
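The evaluation command above follows a pattern in which the run name appears in both the config path and the checkpoint path. A small helper can assemble it, as sketched below. The function `build_test_command` and its default directory names are hypothetical, inferred only from the example command; they are not part of the repository.

```python
import shlex


def build_test_command(run_name, config_root="models", ckpt_root="work_dir",
                       checkpoint="latest.pth"):
    """Assemble the tools/test.py argument list for a named training run."""
    # Directory names default to those in the example command (an assumption).
    config_path = f"{config_root}/{run_name}/{run_name}.py"
    model_path = f"{ckpt_root}/{run_name}/{checkpoint}"
    return ["python", "tools/test.py", config_path, "--load-from", model_path]


if __name__ == "__main__":
    cmd = build_test_command("20230520_120410_qb_ca_mixed")
    # Print a copy-pasteable command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.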
Part of our code is based on the previous works Swin Transformer, SeqTR, and SKNet.