A preview version of the paper is available at: https://arxiv.org/abs/2112.02853
Qualitative results and comparisons with previous SOTAs are available at: https://youtu.be/X6BsS3t3wnc
This repo is a preview version. More details will be added later. Stars ⭐, comments, and collaboration are welcome!
- 2021.12.25: Precomputed results on YouTube-VOS18/19 and DAVIS17 Val/Test-dev are available on both Google Drive and Baidu Disk!
- 2021.12.14: Due to company policy, the preview code, which had not yet been reviewed, has been withdrawn for now. Stay tuned; it will be released again after review!
Error propagation is a general but crucial problem in online semi-supervised video object segmentation. We aim to suppress error propagation through a correction mechanism with high reliability.
The key insight is to disentangle the correction from the conventional mask propagation process with reliable cues.
We introduce two modulators, propagation and correction modulators, to separately perform channel-wise re-calibration on the target frame embeddings according to local temporal correlations and reliable references respectively. Specifically, we assemble the modulators with a cascaded propagation-correction scheme. This avoids overriding the effects of the reliable correction modulator by the propagation modulator.
Although the reference frame with the ground truth label provides reliable cues, it could be very different from the target frame and introduce uncertain or incomplete correlations. We augment the reference cues by supplementing reliable feature patches to a maintained pool, thus offering more comprehensive and expressive object representations to the modulators. In addition, a reliability filter is designed to retrieve reliable patches and pass them in subsequent frames.
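The pool-and-filter idea can be illustrated with a minimal sketch. This is not the released implementation: the `Patch` and `ReliablePatchPool` names, the uncertainty threshold, and the pool capacity are all hypothetical, and each patch is assumed to arrive with an uncertainty score in [0, 1].

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Patch:
    frame_index: int
    feature: Sequence[float]
    uncertainty: float  # assumed to lie in [0, 1]; lower means more reliable


@dataclass
class ReliablePatchPool:
    max_uncertainty: float = 0.2  # hypothetical reliability threshold
    capacity: int = 4             # hypothetical pool size
    patches: List[Patch] = field(default_factory=list)

    def update(self, candidates: Sequence[Patch]) -> None:
        """Admit only reliable candidates, then keep the most reliable entries."""
        self.patches.extend(
            p for p in candidates if p.uncertainty <= self.max_uncertainty
        )
        self.patches.sort(key=lambda p: p.uncertainty)
        del self.patches[self.capacity:]


pool = ReliablePatchPool(max_uncertainty=0.2, capacity=2)
# Frame 0 stands in for the annotated reference frame, treated as fully reliable.
pool.update([Patch(0, [1.0, 0.0], 0.0)])
# One candidate from frame 1 passes the filter, the other is rejected.
pool.update([Patch(1, [0.9, 0.1], 0.15), Patch(1, [0.2, 0.8], 0.6)])
# A more reliable patch from frame 2 displaces the frame-1 patch.
pool.update([Patch(2, [0.8, 0.2], 0.05)])
kept_frames = [p.frame_index for p in pool.patches]
print(kept_frames)
```

Sorting by uncertainty before trimming means the reference patch is never evicted in this sketch, which mirrors the role of the ground-truth frame as the most reliable cue.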
Our model achieves state-of-the-art performance on YouTube-VOS18/19 and DAVIS17-Val/Test benchmarks. Extensive experiments demonstrate that the correction mechanism provides considerable performance gain by fully utilizing reliable guidance.
- Python3
- pytorch >= 1.4.0
- torchvision
- opencv-python
- Pillow
You can also use the Docker image below to set up your environment directly. However, this image may contain some redundant packages. A more lightweight one will be added later.
docker image: xxiaoh/vos:10.1-cudnn7-torch1.4_v3
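If you install the packages yourself, a quick check along the following lines may help confirm the environment before training. It is only a sketch: the helper names `missing_packages` and `version_tuple` are made up for illustration, and the mapping uses the usual import names for the listed packages (`torch`, `torchvision`, `cv2`, `PIL`).

```python
import importlib.util
import sys

# Package name as listed above -> module name used at import time.
REQUIRED_MODULES = {
    "pytorch": "torch",
    "torchvision": "torchvision",
    "opencv-python": "cv2",
    "Pillow": "PIL",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the listed packages whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


def version_tuple(version):
    """Parse the leading numeric part of a version such as '1.4.0+cu101'."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


print("Python 3:", sys.version_info.major == 3)
missing = missing_packages()
print("missing packages:", missing or "none")
if "pytorch" not in missing:
    import torch

    print("pytorch >= 1.4.0:", version_tuple(torch.__version__) >= (1, 4, 0))
```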
- Datasets
  - YouTube-VOS: a commonly used large-scale VOS dataset.
    - datasets/YTB/2019: version 2019, download link. The train split is required for training. The valid (6fps) and valid_all_frames (30fps, optional) splits are used for evaluation.
    - datasets/YTB/2018: version 2018, download link. Only the valid (6fps) and valid_all_frames (30fps, optional) splits are required for this project; they are used for evaluation.
  - DAVIS: a commonly used small-scale VOS dataset.
    - datasets/DAVIS: TrainVal (480p) contains both the training and validation splits. Test-Dev (480p) contains the Test-dev split. The full-resolution version is also supported for training and evaluation but not required.
- Pretrained weights for the backbone
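Before launching training or evaluation, it can be useful to verify that the dataset folders are where the scripts expect them. The sketch below assumes that train, valid, and valid_all_frames are subfolders of the corresponding datasets/YTB/<year> directory, which is an interpretation of the layout described above rather than something the repo confirms; `check_layout` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List

# Relative path -> whether it is required (False marks the optional 30fps splits).
# The subfolder structure is an assumption based on the dataset list above.
EXPECTED_LAYOUT: Dict[str, bool] = {
    "YTB/2019/train": True,
    "YTB/2019/valid": True,
    "YTB/2019/valid_all_frames": False,
    "YTB/2018/valid": True,
    "YTB/2018/valid_all_frames": False,
    "DAVIS": True,
}


def check_layout(root: Path) -> Dict[str, List[str]]:
    """Sort the expected folders into found, missing-required, and missing-optional."""
    report: Dict[str, List[str]] = {
        "found": [],
        "missing_required": [],
        "missing_optional": [],
    }
    for relative, required in EXPECTED_LAYOUT.items():
        if (root / relative).is_dir():
            report["found"].append(relative)
        elif required:
            report["missing_required"].append(relative)
        else:
            report["missing_optional"].append(relative)
    return report


report = check_layout(Path("datasets"))
for status, paths in report.items():
    print(f"{status}: {paths}")
```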
Training for YouTube-VOS:
sh ../scripts/ytb_train.sh
- Note that some training parameters need to be changed according to your hardware environment, such as the interval for saving checkpoints.
- More details will be added soon.
Using reliable object proxy augmentation (RPA):
sh ../scripts/ytb_eval_with_RPA.sh
Without using reliable object proxy augmentation (RPA):
sh ../scripts/ytb_eval_without_RPA.sh
- For evaluation, please use the official YouTube-VOS servers (2018 server and 2019 server), the official DAVIS toolkit (for Val), and the official DAVIS server (for Test-dev).
- More details will be added soon.
Precomputed results on both YouTube-VOS18/19 and DAVIS17 Val/Test-dev are available on Google Drive and Baidu Disk (Baidu Disk password: 6666).
Although the scores on some semi-supervised VOS benchmarks are already very high, many problems remain open for further exploration.
Those who take a look at this repo are likely researching segmentation, tracking, or a related field.
So I would like to share some directions to explore in VOS from my point of view. Hopefully, I will see some nice solutions in the near future!
- What about leveraging the propagation-then-correction mechanism in other tracking tasks such as MOT and pose tracking?
- How about using a learning-based method to measure the prediction uncertainty?
(to be continued...)
If you find this work useful for your research, please consider citing:
@misc{xu2021reliable,
title={Reliable Propagation-Correction Modulation for Video Object Segmentation},
author={Xiaohao Xu and Jinglu Wang and Xiao Li and Yan Lu},
year={2021},
eprint={2112.02853},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
CFBI: https://github.com/z-x-yang/CFBI
Deeplab: https://github.com/VainF/DeepLabV3Plus-Pytorch
GCT: https://github.com/z-x-yang/GCT
Semi-supervised video object segmentation repos:
AOT [NeurIPS 2021]: https://github.com/z-x-yang/AOT
STCN [NeurIPS 2021]: https://github.com/hkchengrex/STCN
JOINT [ICCV 2021]: https://github.com/maoyunyao/JOINT
HMMN [ICCV 2021]: https://github.com/Hongje/HMMN
DMN-AOA [ICCV 2021]: https://github.com/liang4sx/DMN-AOA
MiVOS [CVPR 2021]: https://github.com/hkchengrex/MiVOS
SSTVOS [CVPR 2021]: https://github.com/dukebw/SSTVOS
GraphMemVOS [ECCV 2020]: https://github.com/carrierlxk/GraphMemVOS
AFB-URR [NeurIPS 2020]: https://github.com/xmlyqing00/AFB-URR
CFBI [ECCV 2020]: https://github.com/z-x-yang/CFBI
FRTM-VOS [CVPR 2020]: https://github.com/andr345/frtm-vos
STM [ICCV 2019]: https://github.com/seoungwugoh/STM
FEELVOS [CVPR 2019]: https://github.com/kim-younghan/FEELVOS
(The list may be incomplete. Feel free to contact me by opening an issue and I'll add the missing ones!)
The 1st Large-scale Video Object Segmentation Challenge: https://competitions.codalab.org/competitions/19544#learn_the_details
The 2nd Large-scale Video Object Segmentation Challenge - Track 1: Video Object Segmentation: https://competitions.codalab.org/competitions/20127#learn_the_details
The Semi-Supervised DAVIS Challenge on Video Object Segmentation @ CVPR 2020: https://competitions.codalab.org/competitions/20516#participate-submit_results
DAVIS: https://davischallenge.org/
YouTube-VOS: https://youtube-vos.org/
Papers with code for Semi-VOS: https://paperswithcode.com/task/semi-supervised-video-object-segmentation
Some Q&As about the project from the readers are listed as follows.
Q1: I have noticed that the performance on YouTube-VOS is very good. What do you think might be the reason?
Error propagation is a critical problem for most models in VOS as well as in other tracking-related fields. The improvement of our model mainly comes from designs that suppress error propagation. Specifically, we propose an assembly of propagation and correction modulators to fully leverage the reference guidance during propagation. Apart from the reliable guidance from the reference, we also leverage reliable cues from the historical predictions. To be specific, we use Shannon entropy as a measure of prediction uncertainty to further augment the reliable object cues.
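To make the entropy measure concrete, here is a minimal sketch of Shannon entropy computed for a single pixel's class probabilities. The normalization by log(number of classes) and the function name `normalized_entropy` are choices made for this illustration and are not taken from the released code.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one pixel's class distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalize defensively, and skip zero entries since 0 * log(0) is taken as 0.
    entropy = -sum((p / total) * math.log(p / total) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# A confident prediction has low entropy; an ambiguous one is close to 1.
confident = normalized_entropy([0.98, 0.01, 0.01])
ambiguous = normalized_entropy([0.34, 0.33, 0.33])
print(f"confident: {confident:.3f}, ambiguous: {ambiguous:.3f}")
```

Pixels or patches whose entropy stays below some chosen threshold could then be treated as reliable cues, while high-entropy regions are left out.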
Q2: When you were training, did you randomly crop the images to 465x465, consistent with CFBI?
Yes. We mainly follow the training protocol used in CFBI. (Based on some observations, I think certain data augmentation methods may introduce bias into the training samples, which may in turn lead to a gap between training and inference. However, I haven't verified this rigorously.)
First, the author would like to thank Rex for his insightful viewpoints about VOS during our e-mail discussion! This work is also built upon CFBI. Thanks to the author of CFBI for releasing such a wonderful code repo for further work to build upon!
Xiaohao Xu: xxh11102019@outlook.com
This project is released under the MIT license. See LICENSE for additional details.