Benchmark results not consistent for Mask R-CNN SpineNet-49
Hi,
I am trying to reproduce the AP scores reported for the Mask R-CNN models with SpineNet-49 and SpineNet-143 backbones. When I run main.py in eval mode, the results I get are lower than the ones reported in MODEL ZOO. Here are the results I got for the SpineNet-49 backbone:
```
Evaluate annotation type *bbox*
DONE (t=54.10s).
Accumulating evaluation results...
DONE (t=8.79s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.392
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.571
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.427
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.162
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.438
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.608
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.337
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.529
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.565
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.370
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.603
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.710
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=59.38s).
Accumulating evaluation results...
DONE (t=8.46s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.350
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.375
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.131
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.396
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.560
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.311
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.477
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.305
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.552
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.674
```
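These tables come from pycocotools' COCOeval. As a cross-check that the gap is not introduced by the eval harness itself, the exported detections can be re-scored directly with pycocotools. A minimal sketch, where `predictions.json` is a hypothetical COCO-format detections file dumped by the eval job:

```python
# Re-score a detections file directly with pycocotools, bypassing
# the TPU detection eval harness. `predictions.json` is a hypothetical
# path to per-image detection results in COCO format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('instances_val2017.json')       # ground-truth annotations
coco_dt = coco_gt.loadRes('predictions.json')  # detections to score

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')  # 'segm' for masks
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints an AP/AR table like the one above
```

If the numbers match the harness output, the discrepancy is upstream of scoring (i.e., in the checkpoint or the inference config), not in the metric computation.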
In MODEL ZOO, the reported box AP and mask AP are 42.8 and 37.8; the results I got are 39.2 and 35.0. Here is the command I used to run the evaluation:
```bash
MODEL_DIR=<path to directory with checkpoint files>
DATA_DIR=<path to directory with tfrecord files for COCO 2017 validation set>
EVAL_DIR=<path to instances_val2017.json>
MODE="eval"
CONFIG_PATH=<path to tpu directory>/tpu/models/official/detection/configs/spinenet/spinenet49_mrcnn.yaml
CHECKPOINT_PATH=<path to directory with checkpoint files>/model.ckpt
BATCH_SIZE=8
export PYTHONPATH=$PYTHONPATH:<path to tpu directory>/tpu/models/official/efficientnet
export CUDA_VISIBLE_DEVICES=0

python3 official/detection/main.py \
  --model="mask_rcnn" \
  --model_dir=$MODEL_DIR \
  --checkpoint_path=$CHECKPOINT_PATH \
  --mode=$MODE \
  --use_tpu=False \
  --export_to_tpu=False \
  --config_file=$CONFIG_PATH \
  --params_override="{train: {train_batch_size: $BATCH_SIZE}, eval: {val_json_file: $EVAL_DIR, eval_file_pattern: $DATA_DIR, eval_batch_size: $BATCH_SIZE, eval_samples: 5000}}"
```
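Before digging further, it may be worth confirming that the checkpoint restores cleanly, since a missing or renamed variable can silently lower AP. A minimal sketch for listing the checkpoint's variables (the `model.ckpt` prefix is the same placeholder as above):

```python
# Sanity check: list the variables stored in the downloaded checkpoint
# to confirm they match the SpineNet Mask R-CNN graph being evaluated.
import tensorflow as tf

ckpt_prefix = '<path to directory with checkpoint files>/model.ckpt'
for name, shape in tf.train.list_variables(ckpt_prefix):
    print(name, shape)
```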
I ran the evaluation code on my local machine. My specs are:
- GPU: NVIDIA RTX 2080-TI
- TensorFlow version: 2.4.1
- CUDA version: 11.0
- Branch: r2.4
I used the checkpoints given in MODEL ZOO.
There is also a problem with the SpineNet-143 backbone. The results I got are lower than those for the SpineNet-49 backbone and inconsistent with the scores reported in MODEL ZOO. In addition, the supplied tar file is missing the "checkpoint" file (a workaround sketch follows the results below). The results for the SpineNet-143 backbone are as follows:
```
Evaluate annotation type *bbox*
DONE (t=57.48s).
Accumulating evaluation results...
DONE (t=8.53s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.384
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.547
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.415
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.469
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.646
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.351
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.551
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.599
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.423
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.627
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.743
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=61.33s).
Accumulating evaluation results...
DONE (t=8.39s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.345
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.531
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.374
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.146
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.422
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.595
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.541
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.364
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.570
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.691
```
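Regarding the missing "checkpoint" file: it is just a small plain-text proto recording the checkpoint prefix, so it can be regenerated by hand, after which tf.train.latest_checkpoint() can locate the weights. A sketch, assuming the extracted directory contains model.ckpt.index and the model.ckpt.data-* shards:

```python
# Recreate the missing 'checkpoint' file; it is a plain-text proto that
# points TensorFlow at the checkpoint prefix inside this directory.
import os
import tensorflow as tf

ckpt_dir = '<path to extracted SpineNet-143 checkpoint files>'  # placeholder
with open(os.path.join(ckpt_dir, 'checkpoint'), 'w') as f:
    f.write('model_checkpoint_path: "model.ckpt"\n')
    f.write('all_model_checkpoint_paths: "model.ckpt"\n')

# Should now resolve to <ckpt_dir>/model.ckpt.
print(tf.train.latest_checkpoint(ckpt_dir))
```

Note that passing --checkpoint_path directly, as in the command above, may sidestep this file, so the missing file by itself would not explain the lower AP.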
Is there a problem with the checkpoints provided? Any help would be appreciated. Thanks!