ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite

Home Page: https://docs.ultralytics.com

Regarding COCO2014 and COCO2017 evaluation.

Hello,

First of all I want to say great work on this implementation of YoloV3.

I have a question regarding results for the COCO evaluation.

The tables reported in the README are obtained by evaluating on coco2014 and are consistent with the original YoloV3 paper, e.g. 0.33 mAP for YoloV3 at the 608 scale.

But in the paper this is supposed to be coco2017.

I've also been comparing your method with the current presumed SOTA, EfficientDet. The implementation I used is unofficial, but it reproduces the results from the paper, e.g. 0.48 mAP for EfficientDet-D4 on the coco2017 validation set. The first table in that paper also compares against YoloV3, the 33 mAP mentioned above.

But if I evaluate on your coco2017 split, I get much higher results (0.407 mAP with YoloV3).

So to summarize:

  • Is coco2014 actually the coco2017 referred to in the paper?
  • Is this coco2017 different from the one used by other repositories? Do the image and label lists differ?

Thank you in advance.

Hello @bnbhehe, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

@bnbhehe we report 44.0 mAP on COCO 2014 here:
https://github.com/ultralytics/yolov3#map

See get_coco2014.sh and get_coco2017.sh in data/

@bnbhehe also, EfficientDet is much slower than YOLOv3. Our inference time at 44 mAP is about 13 ms, compared to 50-100+ ms for most of the comparable EfficientDet models.
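For context, inference time is straightforward to measure yourself; below is a minimal PyTorch timing sketch (the tiny placeholder network, image size, and iteration counts are assumptions for illustration, not the repo's benchmarking code):

import time
import torch
import torch.nn as nn

# Placeholder network standing in for a detector; the timing harness is the point here.
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 32, 3, padding=1)).eval()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

img = torch.zeros(1, 3, 608, 608, device=device)  # one 608x608 input

with torch.no_grad():
    for _ in range(10):  # warm-up iterations
        model(img)
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for queued GPU kernels before starting the clock
    t0, n = time.time(), 100
    for _ in range(n):
        model(img)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    print(f'{(time.time() - t0) / n * 1000:.1f} ms per image')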

@bnbhehe we report 44.0 mAP on COCO 2014 here:
https://github.com/ultralytics/yolov3#map

See get_coco2014.sh and get_coco2017.sh in data/

Yes, I am aware of these results; I have already tested them myself for all pretrained YOLO checkpoints (normal, spp, spp-ultralytics) at the 608 image scale, and I can reproduce the reported numbers when evaluating on the 2014 data.

My question is more about the coco2017 validation set that I get through your bash script.
Is it the official one? Does it correspond to what other papers report as the current validation set? If I use your bash scripts to download the 2017 version and evaluate, I get some really high results that surpass most slower state-of-the-art detectors.

@bnbhehe also, EfficientDet is much slower than YOLOv3. Our inference time at 44 mAP is about 13 ms, compared to 50-100+ ms for most of the comparable EfficientDet models.

I am aware of the speed estimates. I've verified them myself and can confirm it is indeed much faster :), so the mAP comparisons make sense.

As I mentioned above, my question is more about the results being reported: EfficientDet reports these as 2017 results, while the YOLO paper and your codebase report them as 2014 results.

@bnbhehe all of the models here are trained on COCO2014, so you can only test them on COCO 2014.

And yes, of course the COCO data returned by the scripts is correct. If you look at the script, it is downloaded from the official site!

#!/bin/bash
# Zip coco folder
# zip -r coco.zip coco
# tar -czvf coco.tar.gz coco

# Download labels from Google Drive, accepting presented query
filename="coco2017labels.zip"
fileid="1cXZR_ckHki6nddOmcysCuuJFM--T-Q6L"
curl -c ./cookie -s -L "https://drive.google.com/uc?export=download&id=${fileid}" > /dev/null
curl -Lb ./cookie "https://drive.google.com/uc?export=download&confirm=`awk '/download/ {print $NF}' ./cookie`&id=${fileid}" -o ${filename}
rm ./cookie

# Unzip labels
unzip -q ${filename}  # for coco.zip
# tar -xzf ${filename}  # for coco.tar.gz
rm ${filename}

# Download and unzip images
cd coco/images
f="train2017.zip" && curl http://images.cocodataset.org/zips/$f -o $f && unzip -q $f && rm $f
f="val2017.zip" && curl http://images.cocodataset.org/zips/$f -o $f && unzip -q $f && rm $f

# cd out
cd ../..

@bnbhehe all of the models here are trained on COCO2014, so you can only test them on COCO 2014.

So if I follow the README and try to reproduce, the default parameters indicate that I am training on 2017. Is this a mismatch in the documentation, or will I obtain the same results as with 2014?

And yes, of course the COCO data returned by the scripts is correct. If you look at the script, it is downloaded from the official site!

#!/bin/bash
# Zip coco folder
# zip -r coco.zip coco
# tar -czvf coco.tar.gz coco

# Download labels from Google Drive, accepting presented query
filename="coco2017labels.zip"
fileid="1cXZR_ckHki6nddOmcysCuuJFM--T-Q6L"
curl -c ./cookie -s -L "https://drive.google.com/uc?export=download&id=${fileid}" > /dev/null
curl -Lb ./cookie "https://drive.google.com/uc?export=download&confirm=`awk '/download/ {print $NF}' ./cookie`&id=${fileid}" -o ${filename}
rm ./cookie

# Unzip labels
unzip -q ${filename}  # for coco.zip
# tar -xzf ${filename}  # for coco.tar.gz
rm ${filename}

# Download and unzip images
cd coco/images
f="train2017.zip" && curl http://images.cocodataset.org/zips/$f -o $f && unzip -q $f && rm $f
f="val2017.zip" && curl http://images.cocodataset.org/zips/$f -o $f && unzip -q $f && rm $f

# cd out
cd ../..

Yes, but the labels come from another location (Google Drive). I suppose this is just a manual step to extract the labels from the official instances_train/val2017.json files, so that they can be loaded with your current dataset loader?

@bnbhehe the Google Drive file contains the JSON labels parsed into text format.

2014 and 2017 are the same images, just divided differently.
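
If the archive really is just the official JSON annotations flattened into per-image text files, the conversion would look roughly like the sketch below (pycocotools-based; the paths, output layout, and class-index mapping here are assumptions):

import os
from pycocotools.coco import COCO

ann_file = 'annotations/instances_val2017.json'  # assumed path to the official annotations
out_dir = 'coco/labels/val2017'                  # assumed output layout
os.makedirs(out_dir, exist_ok=True)

coco = COCO(ann_file)
# Map the non-contiguous COCO category ids to contiguous 0-79 class indices.
cat_to_class = {c: i for i, c in enumerate(sorted(coco.getCatIds()))}

for img_id in coco.getImgIds():
    img = coco.loadImgs(img_id)[0]
    w, h = img['width'], img['height']
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False))
    lines = []
    for a in anns:
        x, y, bw, bh = a['bbox']  # COCO boxes: top-left x, y, width, height in pixels
        # YOLO text format: class x_center y_center width height, normalized to [0, 1]
        lines.append(f"{cat_to_class[a['category_id']]} "
                     f"{(x + bw / 2) / w:.6f} {(y + bh / 2) / h:.6f} {bw / w:.6f} {bh / h:.6f}")
    if lines:
        name = os.path.splitext(img['file_name'])[0] + '.txt'
        with open(os.path.join(out_dir, name), 'w') as f:
            f.write('\n'.join(lines) + '\n')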

@bnbhehe I think we may need to revise the reported mAPs a bit lower, from 0.44 to maybe 0.423.

It looks like a recent update of ours, combined with a bug in pycocotools, produced artificially higher mAPs (!). I'm still looking into it.

Well if it's any help, by using the following command:

python test.py --img-size 608 --cfg cfg/yolov3-spp.cfg --weights weights/yolov3-spp-ultralytics.pt --data data/coco2017.data

I get the following table

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.641
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.878
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.707
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.429
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.634
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.812
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.295
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.681
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.733
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.555
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.732
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.878

which is an insanely high result for COCO, even if 2017 is a different and easier split.

My pip package dependencies are:

pycocotools==2.0.0
torch==1.3.1

Thank you for your feedback.
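
For reference, a table like the one above comes from pycocotools' COCOeval; a minimal sketch of driving it directly, assuming the detections have already been saved to a COCO-format results JSON (the file names here are placeholders):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

ann_file = 'annotations/instances_val2017.json'  # ground-truth annotations (assumed path)
res_file = 'detections.json'                     # detections in COCO results format (assumed path)

coco_gt = COCO(ann_file)
coco_dt = coco_gt.loadRes(res_file)

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
# Optionally restrict evaluation to the image ids that were actually tested:
# coco_eval.params.imgIds = sorted(coco_gt.getImgIds())
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints the 12-line AP/AR table shown above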

@bnbhehe yes, this is expected behavior. This is happening because yolov3-spp-ultralytics.pt is trained on COCO2014, which has a different train/test split than COCO2017, so it is 'cheating' since it has seen some of the 2017 test images during training, and thus does a better job on them.

A more accurate mAP is about 42/62, which is what you should get when testing on COCO2014.

Of course, if you train from scratch on COCO2017, then you should also get about the same mAP when testing on COCO2017.
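
Since the 2014 and 2017 releases share image ids, the overlap is easy to verify yourself; a quick sketch (the annotation file paths are assumptions):

from pycocotools.coco import COCO

# Assumed paths to the official annotation files for both releases.
train2014 = COCO('annotations/instances_train2014.json')
val2017 = COCO('annotations/instances_val2017.json')

train_ids = set(train2014.getImgIds())
val_ids = set(val2017.getImgIds())

overlap = train_ids & val_ids
print(f'{len(overlap)} of {len(val_ids)} COCO2017 val images appear in the COCO2014 train set')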

This is the correct result I believe, after fixing an unrelated bug I mentioned before.

$ python3 test.py --data coco2014.data --img 608 --iou 0.7

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.424
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.614
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.462
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.252
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.538
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.569
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.628
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.464
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.675
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.755

@bnbhehe yes, this is expected behavior. This is happening because yolov3-spp-ultralytics.pt is trained on COCO2014, which has a different train/test split than COCO2017, so it is 'cheating' since it has seen some of the 2017 test images during training, and thus does a better job on them.

A more accurate mAP is about 42/62, which is what you should get when testing on COCO2014.

Of course, if you train from scratch on COCO2017, then you should also get about the same mAP when testing on COCO2017.

OK, so it seems that when comparing with the literature I have to evaluate on coco2014 in order to compare with other papers. I think I understand what you mean about the train/test split; I've seen it on the official COCO site, if I recall correctly.

This is the correct result I believe, after fixing an unrelated bug I mentioned before.

$ python3 test.py --data coco2014.data --batch 32 --img 608 --iou 0.7

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.424
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.614
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.462
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.252
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.538
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.569
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.628
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.464
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.675
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.755

Thank you for going through the effort. Is the bug going to be fixed here or do I have to update my dependencies?

I think my issue has been answered nevertheless. Thank you for your help!

Yes, I'm going to fix the bug today on our end (and update the mAP section with corrections), and I'll raise the issue over on the pycocotools repo; hopefully they will fix it there.

Yes, to compare to the literature you should use COCO2014.

BTW, if you turn on the --augment flag then test-time augmentation is performed, which also helps a bit (often while still staying faster than EfficientDet); a rough sketch of the flip-augmentation idea follows the results below. I would say that if you need the very best mAP, EfficientDet is the way to go, but for real-world use, like production systems where you have to analyze multiple video streams in real time, YOLOv3 is still better, simply because it is much faster.

$ python3 test.py --data coco2014.data --img 608 --iou 0.7 --augment
Speed: 19.9/3.1/23.0 ms inference/NMS/total per 608x608 image at batch-size 16

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.447
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.628
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.490
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.255
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.491
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.602
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.354
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.590
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.665
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.486
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.717
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.806
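
For illustration, the core idea behind flip-based test-time augmentation looks roughly like the sketch below; this is not the actual --augment implementation in test.py, and it assumes a model callable that returns boxes in xyxy pixel coordinates plus confidence scores:

import torch
from torchvision.ops import nms

def flip_tta(model, img, iou_thres=0.5):
    # Horizontal-flip test-time augmentation for one image tensor of shape (1, 3, H, W).
    with torch.no_grad():
        boxes1, scores1 = model(img)                   # pass 1: original image
        boxes2, scores2 = model(torch.flip(img, [3]))  # pass 2: horizontally flipped image

    # Un-flip the boxes from pass 2: x' = W - x (swap x1/x2 so that x1 < x2 still holds).
    w = img.shape[3]
    boxes2 = boxes2.clone()
    boxes2[:, [0, 2]] = w - boxes2[:, [2, 0]]

    # Merge both passes and remove duplicate detections with NMS.
    boxes = torch.cat([boxes1, boxes2])
    scores = torch.cat([scores1, scores2])
    keep = nms(boxes, scores, iou_thres)
    return boxes[keep], scores[keep]

In practice NMS would be run per class and a confidence threshold applied first; this sketch only shows the flip-and-merge step.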

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

Well if it's any help, by using the following command:

python test.py --img-size 608 --cfg cfg/yolov3-spp.cfg --weights weights/yolov3-spp-ultralytics.pt --data data/coco2017.data

I get the following table

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.641
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.878
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.707
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.429
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.634
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.812
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.295
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.681
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.733
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.555
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.732
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.878

which is an insanely high result for COCO, even if 2017 is a different and easier split.

My pip package dependencies are:

pycocotools==2.0.0
torch==1.3.1

Thank you for your feedback.

I also ran into this problem, but @glenn-jocher has answered it, thanks @glenn-jocher. However, I get a lower mAP of 52.5 and I don't know which result is right.

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.735
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.584
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.338
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.591
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.676
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.384
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.637
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.698
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.546
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.758
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.837

@ZHUXUHAN thank you for your response. There may be some confusion about the expected results here. When testing the YOLOv3 models from this repo on COCO2017, I'd recommend referring to the COCO test-dev2017 split for reporting results, and verifying your testing environment and procedure (dataset split, image size, weights) so the comparison with other systems is consistent.

As an additional point, keep in mind that different hardware or software configurations may also affect results. Feel free to reach out if you have any further questions or need assistance.