Ali2500 / BURST-benchmark

BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video (WACV'23)

Ali Athar, Jonathon Luiten, Paul Voigtlaender, Tarasha Khurana, Achal Dave, Bastian Leibe, Deva Ramanan

PDF | Bibtex | Dataset Images | Dataset Annotations | Leaderboard

TL;DR

BURST is a dataset/benchmark for object segmentation in video. It contains a total of 2,914 videos with pixel-precise segmentation masks for 16,089 unique object tracks (600,000 per-frame masks) spanning 482 object classes. It is based on the existing TAO dataset, whose box-level annotations we extended to pixel-precise masks.

Updates

  • 09-08-2023: Eval code improved: cleaner output format and no dependency on TrackEval. The results have not changed.
  • 15-04-2023: Leaderboard is up on paperswithcode.
  • 03-04-2023: We're organizing a workshop at CVPR'23 based on the BURST open-world tracking task! See the workshop page for more details.
  • 27-03-2023: STCN tracker baseline is available.
  • 24-11-2022: Evaluation code is now available.
  • 24-09-2022: Dataset annotations are public.

Task Taxonomy

BURST defines six tasks that combine object tracking and segmentation, grouped into class-guided, exemplar-guided and open-world settings (these correspond to the --task options of the evaluation script below); see the paper for the full task definitions.

Paper Abstract

Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g., J&F, mAP, sMOTSA). As a result, published works usually target a particular benchmark and are not easily comparable to each other. We believe that the development of generalized methods that can tackle multiple tasks requires greater cohesion among these research sub-communities. In this paper, we aim to facilitate this by proposing BURST, a dataset which contains thousands of diverse videos with high-quality object masks, and an associated benchmark with six tasks involving object tracking and segmentation in video. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison and, hence, more effectively pool knowledge from different methods across different tasks. Additionally, we demonstrate several baselines for all tasks and show that approaches for one task can be applied to another with a quantifiable and explainable performance difference.

NOTE: The annotations in this dataset are not exhaustive, i.e., not every object belonging to the dataset's class set is annotated. We do, however, provide two fields per video which convey (1) which classes are present but not exhaustively annotated, and (2) which classes are definitely not present in the video. This follows the format of the LVIS dataset.
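As a quick illustration, the snippet below is a minimal sketch of reading this metadata per video. The field names (sequences, seq_name, neg_category_ids, not_exhaustive_category_ids) follow the LVIS/TAO convention and are assumptions here; verify them against the downloaded JSON.

import json

# Minimal sketch: inspect the exhaustiveness metadata for each video.
# Field names (sequences, seq_name, neg_category_ids,
# not_exhaustive_category_ids) are assumptions based on the LVIS/TAO
# convention; check them against the downloaded annotation files.
with open("val/all_classes.json") as f:
    annotations = json.load(f)

for seq in annotations["sequences"]:
    absent = seq.get("neg_category_ids", [])              # classes definitely not present
    partial = seq.get("not_exhaustive_category_ids", [])  # present but not exhaustively annotated
    print(seq.get("seq_name", "?"), "| absent:", absent, "| not exhaustive:", partial)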

Dataset Download

The annotations are organized in the following directory structure:

- train:
  - all_classes.json
- val:
  - all_classes.json
  - common_classes.json
  - uncommon_classes.json
  - first_frame_annotations.json
- test:
  - all_classes.json
  - common_classes.json
  - uncommon_classes.json
  - first_frame_annotations.json
- info:
  - categories.json
  - class_split.json

For each split, all_classes.json is the primary file containing all mask annotations; the other files are derived from it. common_classes.json and uncommon_classes.json contain only the object tracks belonging to the corresponding class split (see class_split.json). first_frame_annotations.json is relevant only for the exemplar-guided tasks: for each object track, it contains the annotation for only the first frame in which the track occurs. This can easily be deduced from the primary annotations file as well (see the sketch below), but we provide it separately for ease of use.
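For illustration, the sketch below shows one way such a reduction could be done. It assumes each sequence stores a per-frame list named segmentations whose entries map track IDs to masks; this layout is illustrative, so defer to the actual files and burstapi/dataset.py for the exact format.

import json

# Sketch: derive first-frame-only annotations from the primary file by
# keeping, for every track, only the first annotated frame it appears in.
# The keys "sequences" and "segmentations" are illustrative assumptions
# about the JSON layout; verify them against the downloaded annotations.
with open("val/all_classes.json") as f:
    data = json.load(f)

for seq in data["sequences"]:
    seen = set()
    reduced = []
    for frame_segs in seq["segmentations"]:
        kept = {tid: seg for tid, seg in frame_segs.items() if tid not in seen}
        seen.update(kept)
        reduced.append(kept)
    seq["segmentations"] = reduced

with open("first_frame_annotations_derived.json", "w") as f:
    json.dump(data, f)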

NOTE: In contrast to other datasets, we have decided to make the test set annotations public. Remember though: with great power comes great responsibility. Please use the test set fairly when reporting scores for your methods.

Setup

We tested the API code with Python 3.7. To install the dependencies, run:

pip install -r requirements.txt

This will install OpenCV and Pillow, which are used by the demo code for visualization. If you only want to run the evaluation, the lighter set of dependencies in requirements_eval.txt is sufficient.

Parsing and Visualization

Please refer to burstapi/dataset.py for example code to parse the annotations.

Assuming the images and annotation files are downloaded, you can visualize the masks by running the following:

python burstapi/demo.py --images_base_dir /path/to/dataset/images --annotations_file /path/to/any/of/the/annotation/files
  • --images_base_dir should contain three sub-folders, train, val and test, with the images for each of those splits.
  • When running the demo script on one of the first_frame_annotations files, also pass the additional --first_frame_annotations argument; the demo script will then also show the first-frame exemplar point.
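Separately from the demo script, the sketch below shows the core of such a visualization: decoding one mask and blending it over its frame. It assumes the masks are COCO-style RLE (decoded here with pycocotools, which may not be in requirements.txt); treat burstapi/demo.py and burstapi/dataset.py as the authoritative reference.

import cv2
import numpy as np
from pycocotools import mask as mask_utils  # assumption: COCO-style RLE masks

# Sketch: overlay a single decoded mask on its frame. The RLE dict would
# come from a sequence's per-frame segmentation entry; the exact field
# layout should be taken from burstapi/dataset.py, not this illustration.
def overlay_mask(image_path, rle, color=(0, 0, 255), alpha=0.5):
    image = cv2.imread(image_path)
    binary = mask_utils.decode(rle).astype(bool)  # (H, W) boolean mask
    tinted = image.copy()
    tinted[binary] = color
    return cv2.addWeighted(tinted, alpha, image, 1.0 - alpha, 0)

# Example usage: cv2.imwrite("overlay.png", overlay_mask("/path/to/frame.jpg", rle))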

Evaluation

Your results should be in a single JSON file in the same format as the ground-truth (see annotation format). Then run the eval script as follows:

python burstapi/eval/run.py --pred /path/to/your/predictions.json --gt /path/to/directory/with/gt_annotations --task {class_guided,exemplar_guided,open_world}

You can also write out the metrics to disk by giving an additional argument: --output /path/to/output.json

Important: If you predict a null mask for any object in a given frame, you should completely omit the per-frame entry for that object ID. Having RLE-encoded null masks in your predictions will negatively affect the score (we will fix this in the future).
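One way to guard against this is sketched below: drop any per-frame entry whose mask area is zero before writing the results JSON. It assumes per-frame predictions are dicts mapping track IDs to COCO-style RLE dicts, which is an illustrative structure; match whatever format your ground-truth files use.

from pycocotools import mask as mask_utils  # assumption: COCO-style RLE dicts

# Sketch: remove empty (null) masks so their entries are omitted entirely
# from the submitted predictions. "frame_preds" maps track IDs to RLE dicts;
# this structure is illustrative, not the official result format.
def drop_empty_masks(frame_preds):
    return {tid: rle for tid, rle in frame_preds.items() if mask_utils.area(rle) > 0}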

Frame-rate: The val and test sets are evaluated at 1 FPS. The eval code can handle result files with arbitrary frame rates (the predicted masks for unannotated frames are simply ignored).

Acknowledgements: The eval code is largely copy/pasted from Jonathon Luiten's TrackEval repo.

Leaderboard

The leaderboard is available on paperswithcode. There is no dedicated eval server since the test set annotations are public. We have listed all 6 BURST tasks on paperswithcode, with validation and test sets listed separately, i.e., there are a total of 6 x 2 = 12 leaderboards.

Baselines

See baselines/README.md

Cite

@inproceedings{athar2023burst,
  title={BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video},
  author={Athar, Ali and Luiten, Jonathon and Voigtlaender, Paul and Khurana, Tarasha and Dave, Achal and Leibe, Bastian and Ramanan, Deva},
  booktitle={WACV},
  year={2023}
}
