Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement

This repository is the official implementation of Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement (NeurIPS 2020). It is designed for the semi-supervised video object segmentation (VOS) task.

[NeurIPS Page] [Paper] [Supplementary]

Paper correction: The feature map generated by the encoders has 1024 channels and is 1/16 of the original image size.
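
As a quick sanity check on these numbers, the sketch below computes the expected encoder output shape for a given frame size. It reads "1/16 of the original image size" as 1/16 along each spatial dimension. The function name and the rounding up of sizes that are not multiples of 16 are assumptions made for illustration, not taken from the repository.

```python
import math


def encoder_feature_shape(height, width, channels=1024, stride=16):
    """Return (channels, h, w) of the encoder feature map for one frame.

    Assumption: sizes that are not multiples of `stride` round up;
    the actual code may pad or crop differently.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return (channels, math.ceil(height / stride), math.ceil(width / stride))


# A 480x864 frame, which is a multiple of 16 in both dimensions.
print(encoder_feature_shape(480, 864))
```

For a 480x864 frame this gives a 30x54 map with 1024 channels.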

1. Requirements

We built and tested the repository on Python 3.6.9 and Ubuntu 18.04 with one NVIDIA 1080Ti card (11GB memory). Running on Windows or Mac is possible with minor modifications. An NVIDIA GPU and a CUDA environment are required. To install the requirements, run:

pip3 install -r requirements.txt

Install the torch_scatter package by following its official instructions. We used version 2.0.4.
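
After installation, it can help to confirm which versions were actually picked up. The following is a minimal sketch using only the standard library; the helper name check_environment is hypothetical and not part of this repository.

```python
import importlib
import importlib.util
import sys


def check_environment(modules=("torch", "torch_scatter")):
    """Map each module name to its version string, or None if it is missing."""
    report = {"python": sys.version.split()[0]}
    for name in modules:
        if importlib.util.find_spec(name) is None:
            report[name] = None  # not installed
            continue
        try:
            module = importlib.import_module(name)
        except ImportError as exc:
            # Installed but unusable, e.g. built against another CUDA version.
            report[name] = "import failed: {}".format(exc)
            continue
        report[name] = getattr(module, "__version__", "unknown")
    return report


for name, version in check_environment().items():
    print("{}: {}".format(name, version if version else "NOT INSTALLED"))
```

If torch is reported as installed, torch.cuda.is_available() can then be used to confirm that the GPU is visible.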

2. Evaluation

DAVIS17-TrainVal

Download and extract the DAVIS17-TrainVal dataset.
Download the pretrained DAVIS17 checkpoint.
Run:

python3 eval.py --level 1 --resume /path/to/checkpoint.pth/ --dataset /path/to/dir/

To reproduce the segmentation scores, you can use the official evaluation tool from the DAVIS benchmark.

YouTube-VOS18

Download and extract the YouTube-VOS18 dataset.
Download the pretrained YouTube-VOS18 checkpoint.
Run:

python3 eval.py --level 2 --resume /path/to/checkpoint.pth/ --dataset /path/to/dir/ --update-rate 0.05

Attention: Directly submitting our results to the YouTube-VOS CodaLab for evaluation will pollute the leaderboard. We encourage you to submit your own results.

Long Videos

Download and extract the Long Videos dataset.
Download the pretrained YouTube-VOS18 checkpoint above.
Run:

python3 eval.py --level 3 --resume /path/to/checkpoint.pth/ --dataset /path/to/dir/ --update-rate 0.05

To reproduce the segmentation scores, you can use the same tool from the DAVIS benchmark.

Your Own Video

Prepare your video frames and the first-frame annotation following the data structure shown on the Long Videos page. You can inspect the data structure without downloading the dataset, and you only need to provide the first-frame annotation.

Run eval.py with the same parameters as in the Long Videos setting.
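
Before running on your own video, a quick layout check can catch a missing annotation early. The sketch below is illustrative only. The directory names JPEGImages and Annotations, the .jpg/.png extensions, and the rule that the mask shares the first frame's file stem are assumptions modeled on the DAVIS convention; the authoritative structure is the one shown on the Long Videos page, so adjust the parameters to match it. The function name is hypothetical.

```python
from pathlib import Path


def check_video_layout(root, video, frames_dir="JPEGImages", masks_dir="Annotations"):
    """Return a list of problems; an empty list means the layout looks usable.

    Assumption: frames live in <root>/<frames_dir>/<video>/*.jpg and the
    first-frame mask in <root>/<masks_dir>/<video>/<first frame stem>.png.
    """
    problems = []
    frame_folder = Path(root) / frames_dir / video
    frames = sorted(frame_folder.glob("*.jpg")) if frame_folder.is_dir() else []
    if not frames:
        problems.append("no .jpg frames under {}/{}".format(frames_dir, video))
        return problems
    # Only the first frame needs an annotation.
    first_mask = Path(root) / masks_dir / video / (frames[0].stem + ".png")
    if not first_mask.is_file():
        problems.append("missing first-frame annotation {}".format(first_mask.name))
    return problems
```

For example, check_video_layout("/path/to/dir", "my_video") returns an empty list when both the frames and the first-frame mask are found under the assumed layout.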

Options for Evaluation

--gpu: GPU id to run (default: 0).
--viz: Enable output overlays along with the estimated masks (default: False).
--budget: The number of features that can be stored in total (default: 300000 for 1080Ti).

By default, the segmentation results will be saved in ./output.
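
When evaluating several datasets in a row, the commands above can be assembled programmatically. This is a minimal sketch that uses only the flags documented in this section; the helper name build_eval_command is hypothetical, and treating --viz as a switch that takes no value is an assumption.

```python
import shlex


def build_eval_command(level, checkpoint, dataset, update_rate=None,
                       gpu=None, budget=None, viz=False):
    """Assemble an eval.py command line as a list of arguments."""
    cmd = ["python3", "eval.py", "--level", str(level),
           "--resume", checkpoint, "--dataset", dataset]
    if update_rate is not None:
        cmd += ["--update-rate", str(update_rate)]
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    if budget is not None:
        cmd += ["--budget", str(budget)]
    if viz:
        # Assumption: --viz is a boolean switch that takes no value.
        cmd.append("--viz")
    return cmd


# YouTube-VOS18 setting with an explicit feature budget.
cmd = build_eval_command(2, "/path/to/checkpoint.pth", "/path/to/dir/",
                         update_rate=0.05, budget=300000)
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues in paths.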

3. Training

Pre-training on Static Images

Download the following datasets (COCO is the largest one). You don't have to download all of them; by default, our pre-training code skips datasets that don't exist.
Run unify_pretrain_dataset.py to convert them into a uniform format (following DAVIS).

python3 unify_pretrain_dataset.py --name NAME --src /path/to/dataset/dir/ --dst /path/to/output

MSRA10K: --name MSRA10K
ECSSD: --name ECSSD
PASCAL-S: --name PASCAl-s
PASCAL VOC2012: --name PASCALVOC2012
COCO: --name COCO. API pycocotools is required.

You may need to make minor modifications to the dataset paths. Useful options:

--palette: Path to the palette image. We provide a template in assets/mask_palette.png, which follows the DAVIS17 format.
--worker: The number of parallel threads used to speed up the conversion (default: 20).
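
To convert every dataset you downloaded in one pass, the per-dataset command can be generated in a loop. The sketch below assumes, purely for illustration, that each raw dataset was extracted to a subdirectory of a common folder named after its --name value; the helper name and that directory convention are hypothetical. The names are copied verbatim from the list above.

```python
from pathlib import Path

# Values accepted by --name, copied verbatim from the list above.
DATASET_NAMES = ["MSRA10K", "ECSSD", "PASCAl-s", "PASCALVOC2012", "COCO"]


def build_convert_commands(src_root, dst, names=DATASET_NAMES):
    """Build one unify_pretrain_dataset.py command per dataset found on disk.

    Assumption: each dataset lives in <src_root>/<name>.
    """
    commands = []
    for name in names:
        src = Path(src_root) / name
        if not src.is_dir():
            continue  # skip datasets that were not downloaded
        commands.append(["python3", "unify_pretrain_dataset.py",
                         "--name", name, "--src", str(src), "--dst", str(dst)])
    return commands


for command in build_convert_commands("/path/to/dataset/dir", "/path/to/output"):
    print(" ".join(command))
```

Each command list can be run with subprocess.run(command, check=True) from the repository root.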

After the conversion process, you can start pre-training the model:

python3 train.py --level 0 --dataset /path/to/pretrain/ --lr 1e-5 --scheduler-step 3 --total-epoch 12 --log

Pre-training may take days to weeks; you can download our checkpoint to save time.

Training on DAVIS17

Download the semi-supervised TrainVal 480p data from the DAVIS website. Then run:

python3 train.py --level 1 --new --resume /path/to/PreTrain/checkpoint.pth --dataset /path/to/DAVIS17/ --lr 4e-6 --scheduler-step 200 --total-epoch 1000 --log

Training on YouTube-VOS

Download the training set of the YouTube-VOS dataset. Then run:

python3 train.py --level 2 --new --resume /path/to/PreTrain/checkpoint.pth --dataset /path/to/YouTubeVOS/train --lr 4e-6 --scheduler-step 30 --total-epoch 150 --log
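
The --lr, --scheduler-step and --total-epoch values in the training commands suggest a stepwise learning-rate schedule. The sketch below shows how such a schedule would evolve, assuming --scheduler-step is the number of epochs between decays, in the style of PyTorch's StepLR. The decay factor is not stated in this README, so gamma=0.5 is a placeholder, and the function name is hypothetical.

```python
def step_decay_lr(base_lr, scheduler_step, epoch, gamma=0.5):
    """Learning rate at a 0-based epoch under step decay.

    Assumption: the rate is multiplied by `gamma` once every
    `scheduler_step` epochs; gamma=0.5 is a placeholder value.
    """
    if scheduler_step <= 0:
        raise ValueError("scheduler_step must be positive")
    return base_lr * gamma ** (epoch // scheduler_step)


# Pre-training setting: --lr 1e-5 --scheduler-step 3 --total-epoch 12
for epoch in range(0, 12, 3):
    print(epoch, step_decay_lr(1e-5, 3, epoch))
```

Under these assumptions the pre-training run would decay three times over its 12 epochs, while the DAVIS17 and YouTube-VOS runs would each decay four times (every 200 of 1000 epochs and every 30 of 150 epochs, respectively).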

4. License

This repository is released for academic use only. If you want to use our code in commercial products, please contact xinli@cct.lsu.edu in advance. If you use our code, please cite our paper:

@inproceedings{NEURIPS2020_liangVOS,
 author = {Liang, Yongqing and Li, Xin and Jafari, Navid and Chen, Jim},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
 pages = {3430--3441},
 publisher = {Curran Associates, Inc.},
 title = {Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement},
 url = {https://proceedings.neurips.cc/paper/2020/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf},
 volume = {33},
 year = {2020}
}

5. Update Logs

2022/04/24: Updated the evaluation script for the Long Videos benchmark.
