Benchmark results not consistent for Mask R-CNN SpineNet-49
Hi,
I am trying to reproduce the AP scores reported for the Mask R-CNN models with SpineNet-49 and SpineNet-143 backbones. When I run main.py in eval mode, the results I get are lower than the ones reported in MODEL ZOO. Here are the results I got for the SpineNet-49 backbone:
```
Evaluate annotation type *bbox*
DONE (t=54.10s).
Accumulating evaluation results...
DONE (t=8.79s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.392
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.571
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.427
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.162
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.438
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.608
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.337
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.529
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.565
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.370
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.603
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.710
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=59.38s).
Accumulating evaluation results...
DONE (t=8.46s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.350
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.375
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.131
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.396
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.560
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.311
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.477
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.305
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.552
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.674
```
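These tables come from pycocotools' COCOeval. As a cross-check that the gap is not introduced by the eval harness itself, the exported detections can be re-scored directly with pycocotools. A minimal sketch, where `predictions.json` is a hypothetical COCO-format detections file dumped by the eval job:

```python
# Re-score a detections file directly with pycocotools, bypassing
# the TPU detection eval harness. `predictions.json` is a hypothetical
# path to per-image detection results in COCO format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('instances_val2017.json')       # ground-truth annotations
coco_dt = coco_gt.loadRes('predictions.json')  # detections to score

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')  # 'segm' for masks
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints an AP/AR table like the one above
```

If the numbers match the harness output, the discrepancy is upstream of scoring (i.e., in the checkpoint or the inference config), not in the metric computation.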
In MODEL ZOO, the reported box AP and mask AP are 42.8 and 37.8; the results I got are 39.2 and 35.0. Here is the command I used to run the evaluation:
```bash
MODEL_DIR=<path to directory with checkpoint files>
DATA_DIR=<path to directory with tfrecord files for COCO 2017 validation set>
EVAL_DIR=<path to instances_val2017.json>
MODE="eval"
CONFIG_PATH=<path to tpu directory>/tpu/models/official/detection/configs/spinenet/spinenet49_mrcnn.yaml
CHECKPOINT_PATH=<path to directory with checkpoint files>/model.ckpt
BATCH_SIZE=8
export PYTHONPATH=$PYTHONPATH:<path to tpu directory>/tpu/models/official/efficientnet
export CUDA_VISIBLE_DEVICES=0

python3 official/detection/main.py \
  --model="mask_rcnn" \
  --model_dir=$MODEL_DIR \
  --checkpoint_path=$CHECKPOINT_PATH \
  --mode=$MODE \
  --use_tpu=False \
  --export_to_tpu=False \
  --config_file=$CONFIG_PATH \
  --params_override="{train: {train_batch_size: $BATCH_SIZE}, eval: {val_json_file: $EVAL_DIR, eval_file_pattern: $DATA_DIR, eval_batch_size: $BATCH_SIZE, eval_samples: 5000}}"
```
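Before digging further, it may be worth confirming that the checkpoint restores cleanly, since a missing or renamed variable can silently lower AP. A minimal sketch for listing the checkpoint's variables (the `model.ckpt` prefix is the same placeholder as above):

```python
# Sanity check: list the variables stored in the downloaded checkpoint
# to confirm they match the SpineNet Mask R-CNN graph being evaluated.
import tensorflow as tf

ckpt_prefix = '<path to directory with checkpoint files>/model.ckpt'
for name, shape in tf.train.list_variables(ckpt_prefix):
    print(name, shape)
```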
I ran the evaluation code on my local machine. My specs are:
- GPU: NVIDIA RTX 2080-TI
- TensorFlow version: 2.4.1
- CUDA version: 11.0
- Branch: r2.4
I used the checkpoints given in MODEL ZOO.
There is also a problem with the SpineNet-143 backbone. The results I got are lower than those for the SpineNet-49 backbone and inconsistent with the scores reported in MODEL ZOO. In addition, the supplied tar file is missing the "checkpoint" file (a workaround sketch follows the results below). The results for the SpineNet-143 backbone are as follows:
```
Evaluate annotation type *bbox*
DONE (t=57.48s).
Accumulating evaluation results...
DONE (t=8.53s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.384
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.547
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.415
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.469
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.646
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.351
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.551
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.599
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.423
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.627
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.743
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=61.33s).
Accumulating evaluation results...
DONE (t=8.39s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.345
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.531
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.374
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.146
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.422
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.595
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.541
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.364
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.570
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.691
```
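Regarding the missing "checkpoint" file: it is just a small plain-text proto recording the checkpoint prefix, so it can be regenerated by hand, after which tf.train.latest_checkpoint() can locate the weights. A sketch, assuming the extracted directory contains model.ckpt.index and the model.ckpt.data-* shards:

```python
# Recreate the missing 'checkpoint' file; it is a plain-text proto that
# points TensorFlow at the checkpoint prefix inside this directory.
import os
import tensorflow as tf

ckpt_dir = '<path to extracted SpineNet-143 checkpoint files>'  # placeholder
with open(os.path.join(ckpt_dir, 'checkpoint'), 'w') as f:
    f.write('model_checkpoint_path: "model.ckpt"\n')
    f.write('all_model_checkpoint_paths: "model.ckpt"\n')

# Should now resolve to <ckpt_dir>/model.ckpt.
print(tf.train.latest_checkpoint(ckpt_dir))
```

Note that passing --checkpoint_path directly, as in the command above, may sidestep this file, so the missing file by itself would not explain the lower AP.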
Is there a problem with the checkpoints provided? Any help would be appreciated. Thanks!